Hello, Greg, Ralf, Paul, James, Alexandre and Huacai
Many years ago when I was still in the middle scheool, I got a Loongson
Yeeloong laptop to explore the world of non-x86 world, as Geert Uytterhoeven
once said, there's lots of Linux beyond IA-32. At that time I've noticed there
was no platform driver for the laptop. I just assumed that nobody bothered
mainlining it and everyone were using Alexandre Oliva's tree. I've diffed
his tree, got the patches, discovered and fixed two minor bugs and submitted
to Alexandre, and called it a day.
Years later, I stared at my machine, and found Alexandre Oliva has stopped
maintaining his tree, yet there is still no platform driver in the mainline.
So I decided that submitting the Yeeloong driver to the mainline kernel is
the my current personal project of this year. I only work for myself for fun,
not for profit. I see Loongson as an interesting non-x86 platform, but I do
not work for Loongson, do not explictly support them, nor supported by them
in any way.
Reviewing the previous E-mails and patches at Linux/MIPS mailing list, I've
noticed that, from 2009 to 2019, in a 10-year span, there have been at least
3 attempts by 3 people to submit the platform driver for Yeeloong laptop to
the Linux/MIPS tree, all implicitly rejected. This is not meant to be a
criticism for maintainers' hard work and high quality standard, but rather,
I think it shows a serious challenge on development of platform drivers that
worth discussing. since there's even more unsubmitted platform drivers for
more devices, especially Loongson3. We need to find a solution both for
short-term and long-term further development of these platform drivers.
Here, I give a quick summary about the situation, and see what can I/we
do about it. Digging into the mailing list, in the hindsight, it's clear
to see the fate of this platform driver. In December 2009, Wu Zhangin
submitted a refined version of the initial platform driver to the mailing
list.
His intention was to create a MIPS-equivalent version of the x86 platform
drivers, as he said,
> I like the folks did under drivers/platform/x86/ ;) which
> will be better to maintain. for all of these drivers are really only
> YeeLoong platform specific ;)
However, Ralf Baechle commented, that
> You split old, big driver into several individual drivers - good.
[...]
> Experience has shown that drivers for a particular subsystem are best
> combined in a single menu, in a single directory. Otherwise any changes
> to subsystem's internal APIs will become a major pain.
Dmitry Torokhov gave another response,
> Umm, it still mixes up bunch of stuff not directly related to input.
> I'd vote for drivers/platform/mips (since we have a few of kitchen-sink
> style drivers for x86-based laptops in drivers/platform/x86).
This discussion is frozen without any progress in the following years.
For example, when the driver was resubmitted to James Hogan in late 2018,
he noted, possibly without knowledge of the discussion in 2009,
> I can't help thinking it would be better to separate this into separate
> drivers for each part (backlight, power supply etc), and move them into
> the appropriate driver directories (drivers/power/supply,
> drivers/video/backlight etc). That way each part would get proper review
> from the appropriate maintainers (or at least they should be Cc'd).
>
> Is there a particular reason for it to be a single driver?
So I think it's the time to restart this discussion. Should the driver be
submitted as a platform driver, or be splitted and merged into various
subsystems? I've considered reasons from both sides.
============================
Argument FOR platform driver
============================
* Unlike other drivers, platform drivers are only written for a specific
machine, or a specfic family of machines. They are usually useless for
other computers. A central point-of-maintenance is easier to manage than
running between different subsystems. Also, many drivers only have trivial
code for each device, splitting it creates more boilerplates.
* Platform drivers are not universal, architecture-wide, instead, they are
tightly-coupled to a specfic hardware controller and/or BIOS firmware. On
laptops, the Embedded Controller is responsible for controlling the battery,
keyboard, backlights, and other opeartions. In this way, it's better to see
a platform drivers as a driver for a single controller chip, instead of a
series of drivers for a keyboard, a battery, or a screen. These kitchen-sink
drivers don't really fit into a single category.
* Since a platform driver talk to the same controller, shares the same header
and I/O functions, it's difficult to split different operations on the same
chipset. First, code-sharing of the underlying EC operations can be a problem.
Also, modification to a single driver often touches the underlying EC logic,
which need to touch the core architecture code, and it also affects on drivers
in unrelated-subsystems, thus creating many troubles on coordination. Even worse,
they are only designed to interact on each other. For example, closing the lid
may deactivate the screen and changes the behavior of a light sensor, a power
button, and a 3G modem and their respective LED indicators. From a EC driver
perspective, it's easy to manage, but harder to do so independent drivers.
* Some computers have different hardware functionatilies, e.g. different
screen sizes, types of batteries, or keymap, but often share a single family
of the EC chipset and they can be controlled in a similar way, or have similar
characteristics. Using single platform drivers is easier to add support of a
new machise, or to reuse code among different computers - simply extend the
existing platform driver is enough. In our case, It's expected to support more
Loongson PCs of the same family, such as Yeeloong laptops, MengLoong machines
and FuLoong Mini PCs.
The platform drivers in drivers/platform/x86 is representative for the points
above. For example, Greg Kroah-Hartman's samsung-laptop.c.
================================
Argument AGAINST platform driver
================================
* Loongson is only a subfamily of MIPS systems. Unlike x86, where there is a
massive number of platforms which need custom platform drivers, the vast
majority of MIPS are servers/workstations or embedded systems, which does not
use platform drivers. Introducing platform drivers will encourage unrelated
drivers to be submitted to the Linux/MIPS mailing list, creating more pressure
to a small team like Linux/MIPS.
* Linux/MIPS is specialized in maintaining core MIPS code. The developers
should not be forced into reviewing and maintaining unrelated subsystems, such
as battery, LEDs, or hotkeys, which they are not responsible to. They should
be reviewed by the maintainers of their respective subsystem, this also
encourages higher-quality drivers.
* x86 platform drivers are maintained independently from core x86 code. in a
separate tree, and has its own community, [email protected].
However, there is nothing is this in Linux/MIPS.
* The code reuse and coordination problem is simply an excuse, it can be
solved by providing a generic EC interface in arch/mips/loongson64 for each
driver in its respective subsystem. Monolithic Kitchen-sink platform drivers
are anti-patterns and they shouldn't be developed.
In conclusion, in my opinion, although having a central location for Loongson
platform driver is preferable, but consider the current circumstances, it's a
bad idea, and in particular, bad for Linux/MIPS maintainers. I think the best
way of doing it under the current system is splitting the drivers to their
respective subsystems, and put the EC-related code in arch/mips/loongson64
and export the symbols. If there's consensus on it, I think I can starting
working on it immediately.
Nevertheless, I think a longterm solution is needed to facilitate the
development of lots more platform drivers of Loongson mainches. I've noticed
that Google has a similar problem developing their own platform drivers
for Chromebooks, their solution is creating drivers/platform/chrome, so
they can work on i2c, lightbar, LED backlight, etc, drivers tightly-coupled
with the Embedded Controller in a central location, with its dedicated tree
chrome-platform.git, without disturbing the ARM developers by sending unrelated
drivers. I think we should adopt this model, having a drivers/platform/loongson
should make the development easier.
Unsolved problems still remains, however. First, how to solve the coordination
problem? Linux/MIPS is a relatively low-volume and slow list, what to do if the
unmerged MIPS patches becomes a blocker for other drivers? Second, we already
have a hwmon driver under drivers/platform/mips, should we put a warning that
says new drivers under this directory shouldn't be written? Third, if we agree
to create drivers/platform/loongson in the future, how can we ask the maintainers
of other subsystems to review the drivers in it? I think Google bypassed the
problem because they have lots of kernel developers, but not us. Does anyone
know more about Chromebook's development procedures? Also, if we want
drivers/platform/loongson and its own tree, what are the procedures to get them
created?
All comments, suggestions, criticisms, etc, are appreciated.
Thanks,
Tom Li
Beijing GNU/Linux User Group
[1] 2009/12: https://lore.kernel.org/linux-mips/1259679137.12571.4.camel@falcon/
[2] 2010/05: https://lore.kernel.org/linux-mips/[email protected]/
[3] 2017/11: https://lore.kernel.org/linux-mips/[email protected]/
[4] https://www.spinics.net/lists/platform-driver-x86/
Hi Tom,
On Fri, Feb 08, 2019 at 04:30:39PM +0800, Tom Li wrote:
> Hello, Greg, Ralf, Paul, James, Alexandre and Huacai
>
> Many years ago when I was still in the middle scheool, I got a Loongson
> Yeeloong laptop to explore the world of non-x86 world, as Geert Uytterhoeven
> once said, there's lots of Linux beyond IA-32. At that time I've noticed there
> was no platform driver for the laptop. I just assumed that nobody bothered
> mainlining it and everyone were using Alexandre Oliva's tree. I've diffed
> his tree, got the patches, discovered and fixed two minor bugs and submitted
> to Alexandre, and called it a day.
>
> Years later, I stared at my machine, and found Alexandre Oliva has stopped
> maintaining his tree, yet there is still no platform driver in the mainline.
> So I decided that submitting the Yeeloong driver to the mainline kernel is
> the my current personal project of this year. I only work for myself for fun,
> not for profit. I see Loongson as an interesting non-x86 platform, but I do
> not work for Loongson, do not explictly support them, nor supported by them
> in any way.
>
> Reviewing the previous E-mails and patches at Linux/MIPS mailing list, I've
> noticed that, from 2009 to 2019, in a 10-year span, there have been at least
> 3 attempts by 3 people to submit the platform driver for Yeeloong laptop to
> the Linux/MIPS tree, all implicitly rejected. This is not meant to be a
> criticism for maintainers' hard work and high quality standard, but rather,
> I think it shows a serious challenge on development of platform drivers that
> worth discussing. since there's even more unsubmitted platform drivers for
> more devices, especially Loongson3. We need to find a solution both for
> short-term and long-term further development of these platform drivers.
>
> Here, I give a quick summary about the situation, and see what can I/we
> do about it. Digging into the mailing list, in the hindsight, it's clear
> to see the fate of this platform driver. In December 2009, Wu Zhangin
> submitted a refined version of the initial platform driver to the mailing
> list.
>
> His intention was to create a MIPS-equivalent version of the x86 platform
> drivers, as he said,
>
> > I like the folks did under drivers/platform/x86/ ;) which
> > will be better to maintain. for all of these drivers are really only
> > YeeLoong platform specific ;)
>
> However, Ralf Baechle commented, that
>
> > You split old, big driver into several individual drivers - good.
> [...]
> > Experience has shown that drivers for a particular subsystem are best
> > combined in a single menu, in a single directory. Otherwise any changes
> > to subsystem's internal APIs will become a major pain.
>
> Dmitry Torokhov gave another response,
>
> > Umm, it still mixes up bunch of stuff not directly related to input.
> > I'd vote for drivers/platform/mips (since we have a few of kitchen-sink
> > style drivers for x86-based laptops in drivers/platform/x86).
>
> This discussion is frozen without any progress in the following years.
>
> For example, when the driver was resubmitted to James Hogan in late 2018,
> he noted, possibly without knowledge of the discussion in 2009,
>
> > I can't help thinking it would be better to separate this into separate
> > drivers for each part (backlight, power supply etc), and move them into
> > the appropriate driver directories (drivers/power/supply,
> > drivers/video/backlight etc). That way each part would get proper review
> > from the appropriate maintainers (or at least they should be Cc'd).
> >
> > Is there a particular reason for it to be a single driver?
>
> So I think it's the time to restart this discussion. Should the driver be
> submitted as a platform driver, or be splitted and merged into various
> subsystems? I've considered reasons from both sides.
Right off the bat, let me express my whole-hearted agreement with James
& the question he asked in that quote. If there is a good reason for
some things to be one large driver then we can discuss that, but so far I
don't believe anybody has given one - nobody replied to James' mail that
you quote [1].
It looks to me like in 2009 someone did the work to split the driver up,
which shows that it's possible, but the drivers were placed under
arch/mips which is not where they belong. They should individually be
placed at the right points under drivers/, and submitted to the relevant
subsystem maintainers.
To address the particular quote you give from Dmitry Torokhov on the
yeeloong_hotkey driver - just because the driver as-is includes a bunch
of non-input related things doesn't mean that it should or has to. From
a look at the 2009 submission it seems to mix a bunch of policy into the
kernel which really ought to be elsewhere. Generally the input driver
reports that a key was pressed & something in userland decides what to
do with it, whereas this driver seems to attempt to bypass that & prod
at unrelated hardware all by itself.
> Unsolved problems still remains, however. First, how to solve the coordination
> problem? Linux/MIPS is a relatively low-volume and slow list, what to do if the
> unmerged MIPS patches becomes a blocker for other drivers?
Anyone watching will have noticed that Ralf was only intermittently
active for quite a while, and has more recently he's been absent due to
ill health. Correspondingly there was a time when patches sent to
linux-mips didn't go anywhere quickly. Towards the end of 2017 James
Hogan stepped up & things started moving again, and more recently I've
been maintaining arch/mips/. The number of patches being applied has
increased, the frequency of pull requests has increased & stopped being
so last minute, and hopefully responses on the mailing list have become
more regular.
So if unmerged arch/mips/ patches are holding you up, ping me, but
preferrably make sure code being added actually belongs under arch/mips/
first.
On Loongson patches in particular there's a recurring theme in which
patches are submitted, review comments are provided, they're ignored &
the patches are later submitted again without addressing the review
comments. That's not universally true of all Loongson-related patches,
but there are plenty of examples of it. Of course when resubmitted the
patches still don't get merged, because they still have unaddressed
problems.
The seeming lack of documentation for particular Loongson-based
machines, lack of English documentation for the CPU cores & lack of
detailed errata information all contribute to difficulties too. For a
recent example see the loongson_llsc_mb() workaround currently in the
mips-fixes branch - we got there in the end but it took a surprising
amount of back & forth effort to obtain enough information about what
the CPU bug actually is.
I want to be clear that I absolutely want Linux to have good support for
Loongson hardware, but that doesn't mean I'll merge things before
they're of sufficient quality & it doesn't mean I'll try to merge things
that really belong under the care of other maintainers.
> Second, we already have a hwmon driver under drivers/platform/mips,
> should we put a warning that says new drivers under this directory
> shouldn't be written?
Personally I see no reason for that hwmon driver to live under
drivers/platform/mips rather than drivers/hwmon. All that does is bypass
the attention of the drivers/hwmon/ maintainers who would be best placed
to offer input & ensure the driver is actually any good.
> Third, if we agree to create drivers/platform/loongson in the future,
> how can we ask the maintainers of other subsystems to review the
> drivers in it?
I think that question should prompt another - if we have maintainers for
various driver subsystems, why not place the drivers under their care in
the already established directories?
> I think Google bypassed the problem because they have lots of kernel
> developers, but not us. Does anyone know more about Chromebook's
> development procedures? Also, if we want drivers/platform/loongson and
> its own tree, what are the procedures to get them created?
I can't speak to Google's intentions or methods, but would expect the
procedure to be approximately:
1) Have a generally agreed upon good reason for the new directory.
2) Have someone willing & able to maintain it, preferrably with a good
track record of contributions to the kernel.
3) Ideally submit it for inclusion in linux-next.
4) Send a pull request to Linus.
5) Wait & see if he accepts or says no.
I think this particular case falls down at step 1.
Thanks,
Paul
[1] https://lore.kernel.org/linux-mips/20180124115833.GC5446@saruman/
> To address the particular quote you give from Dmitry Torokhov on the
> yeeloong_hotkey driver - just because the driver as-is includes a bunch
> of non-input related things doesn't mean that it should or has to. From
> a look at the 2009 submission it seems to mix a bunch of policy into the
> kernel which really ought to be elsewhere. Generally the input driver
> reports that a key was pressed & something in userland decides what to
> do with it, whereas this driver seems to attempt to bypass that & prod
> at unrelated hardware all by itself.
Sure, the hotkey driver has some problems in its current shape. I think
the existing code makes some hotkeys on the keyboard behave like a hardware
switch to order to implement rfkill hardblock, and also controls the video
output switch. I think I need to investigate it further.
> Personally I see no reason for that hwmon driver to live under
> drivers/platform/mips rather than drivers/hwmon. All that does is bypass
> the attention of the drivers/hwmon/ maintainers who would be best placed
> to offer input & ensure the driver is actually any good.
Yes. I think if our conclusion is "drivers/platform/mips is not a good idea",
it should be moved to the appropriate category in the future.
> I think that question should prompt another - if we have maintainers for
> various driver subsystems, why not place the drivers under their care in
> the already established directories?
>
> Thanks,
> Paul
Thanks for your reply, I'm fully agree with your comments about platform
drivers.
I find reorganization of the current Yeeloong platform driver is relatively
easy, since it only involves one machine, I'm already working on it.
If future developers find it's difficult to support new machines, we can simply
have more discussion, reorganize the existing hierarchy further, and make
incremental changes.
Cheers,
Tom Li
Beijing GNU/Linux User Group.
Hi Tom,
On Sat, Feb 09, 2019 at 06:11:33PM +0800, Tom Li wrote:
> > To address the particular quote you give from Dmitry Torokhov on the
> > yeeloong_hotkey driver - just because the driver as-is includes a bunch
> > of non-input related things doesn't mean that it should or has to. From
> > a look at the 2009 submission it seems to mix a bunch of policy into the
> > kernel which really ought to be elsewhere. Generally the input driver
> > reports that a key was pressed & something in userland decides what to
> > do with it, whereas this driver seems to attempt to bypass that & prod
> > at unrelated hardware all by itself.
>
> Sure, the hotkey driver has some problems in its current shape. I think
> the existing code makes some hotkeys on the keyboard behave like a hardware
> switch to order to implement rfkill hardblock, and also controls the video
> output switch. I think I need to investigate it further.
>%
> I find reorganization of the current Yeeloong platform driver is relatively
> easy, since it only involves one machine, I'm already working on it.
>
> If future developers find it's difficult to support new machines, we can simply
> have more discussion, reorganize the existing hierarchy further, and make
> incremental changes.
Thanks for working on it, I look forward to seeing your patches :)
Paul
On Feb 8, 2019, Tom Li <[email protected]> wrote:
> found Alexandre Oliva has stopped maintaining his tree
?!?
I still merge and tag every one of Torvalds' and Greg KH's releases into
the loongson-community tree, resolving trivial conflicts and trying to
verify that it at least builds and passes a smoke test on actual
hardware (after Linux-libre deblobbing, but still).
Occasionally I notice problems and I try to investigate through
bisecting, but, not being a real Linux developer, I seldom get much
further than that.
Two issues that bother me a bit are frequent failure to reboot/poweroff,
that's been around since around 4.10, and, more recently, a very slow
system overall, that's been present since 4.20. I haven't checked
whether they're caused by changes in the loongson-community tree that I
still carry, or if they're already present in upstream releases.
I've already tried to bisect the former, but since the issue is
intermittent, that proved to be too tricky and time-consuming for me.
I haven't tried to bisect the latter yet.
I've considered leaving the loongson-community tree behind and using
upstream releases, but every time I was about to do that, something else
came up that led me to keep at it. I think the last such occurrence was
the removal of the video driver (later reintroduced as a new driver).
So I've kept at it ;-)
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Mon, Feb 11, 2019 at 10:13:00AM -0200, Alexandre Oliva wrote:
> On Feb 8, 2019, Tom Li <[email protected]> wrote:
>
> > found Alexandre Oliva has stopped maintaining his tree
>
> ?!?
>
> I still merge and tag every one of Torvalds' and Greg KH's releases into
> the loongson-community tree, resolving trivial conflicts and trying to
> verify that it at least builds and passes a smoke test on actual
> hardware (after Linux-libre deblobbing, but still).
What is the current link to your repository? The original one from Lemote
is now a 404, and the GitHub one has been "archived".
> Two issues that bother me a bit are frequent failure to reboot/poweroff,
> that's been around since around 4.10, and, more recently, a very slow
> system overall, that's been present since 4.20. I haven't checked
> whether they're caused by changes in the loongson-community tree that I
> still carry, or if they're already present in upstream releases.
>
> I've already tried to bisect the former, but since the issue is
> intermittent, that proved to be too tricky and time-consuming for me.
>
> I haven't tried to bisect the latter yet.
You cannot bisect it. Because I've tried it for 24 hours without sleep, but
failed. ;-)
We've just identified and confirmed the source of the shutdown problem a
few days ago on this mailing list. It's related to a PREEMPT race condition
in the cpufreq framework that triggers a timing-dependent, probabilistic
lockup in the i8259 PIC driver. And it was first noticed by Ville Syrj?l?
on a x86 system, my debugging rediscovered his findings.
You can read the full investigation at here:
https://lkml.org/lkml/2018/11/13/857
You can pick up the patch from:
https://lore.kernel.org/lkml/[email protected]/
A patch has been authored and submitted by Aaro Koskinen, but currently it
seems stuck in the mailing list because the maintainer worries about regression
and asks for testing on more MIPS systems, although we believe it's a trivial
patch.
In addition, I've discovered and fixed another bug preventing the machine from
shutting down, this patch has already merged into the Linus tree.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a96669d77897ff3613157bf43f875739205d66d
> and, more recently, a very slow system overall, that's been present since
> 4.20.
The mainline framebuffer driver doesn't have any hardware drawing, printing
even a single line on the console is required a full screen redraw via memory-
mapped I/O over the slow PCI interface. I have rebased the 2D acceleration
code against the latest upstream driver and sent the patch to the fbdev mailing
list on February 2nd.
But the maintainers seem to be working on other tasks and the patches (including
mine and others') on the list didn't receive any response.
It's avaliable from
https://patchwork.kernel.org/project/linux-fbdev/list/?series=75029
> I've considered leaving the loongson-community tree behind and using
> upstream releases, but every time I was about to do that, something else
> came up that led me to keep at it. I think the last such occurrence was
> the removal of the video driver (later reintroduced as a new driver).
> So I've kept at it ;-)
I'll continue working on upstreaming these out-of-tree drivers as my personal
project. I hope you'll be able to use a fully-functional machine with the mainline
kernel soon, my current target is Linux 5.3.
Cheers,
Tom Li
On Feb 11, 2019, Tom Li <[email protected]> wrote:
> What is the current link to your repository?
git://dev.lemote.com/linux-loongson-community.git
I use a different git URL to push stuff, so I wouldn't have noticed if
the above was gone, but I've just tried to git clone form the above and
it seems to remain functional.
> We've just identified and confirmed the source of the shutdown problem a
> few days ago on this mailing list. It's related to a PREEMPT race condition
> in the cpufreq framework that triggers a timing-dependent, probabilistic
> lockup in the i8259 PIC driver. And it was first noticed by Ville Syrjälä
> on a x86 system, my debugging rediscovered his findings.
Oooh, thanks for the links.
>> and, more recently, a very slow system overall, that's been present since
>> 4.20.
> The mainline framebuffer driver doesn't have any hardware drawing, printing
> even a single line on the console is required a full screen redraw via memory-
That doesn't seem to explain even a quiet boot up taking several times
longer than 4.19, and package installation over an ethernet connection
(thus not using the console) also taking several times longer.
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Hi,
On Mon, Feb 11, 2019 at 08:38:00PM -0200, Alexandre Oliva wrote:
> On Feb 11, 2019, Tom Li <[email protected]> wrote:
> >> and, more recently, a very slow system overall, that's been present since
> >> 4.20.
>
> > The mainline framebuffer driver doesn't have any hardware drawing, printing
> > even a single line on the console is required a full screen redraw via memory-
>
> That doesn't seem to explain even a quiet boot up taking several times
> longer than 4.19, and package installation over an ethernet connection
> (thus not using the console) also taking several times longer.
Maybe it's a slow disk?
ATA (libata) CS5536 driver is having issues with spurious IRQs and often
disables IRQs completely during the boot. You should see a warning
in dmesg. This was the reason for slowness on my FuLoong mini-PC. A
workaround is to switch to old IDE driver.
Discussion: https://lore.kernel.org/linux-mips/[email protected]/
A.
> >
> > That doesn't seem to explain even a quiet boot up taking several times
> > longer than 4.19, and package installation over an ethernet connection
> > (thus not using the console) also taking several times longer.
>
> Maybe it's a slow disk?
>
> ATA (libata) CS5536 driver is having issues with spurious IRQs and often
> disables IRQs completely during the boot. You should see a warning
> in dmesg. This was the reason for slowness on my FuLoong mini-PC. A
> workaround is to switch to old IDE driver.
>
> Discussion: https://lore.kernel.org/linux-mips/[email protected]/
>
> A.
I find that this problem is odd. I've been using libata personally since 2014 and
never had any IRQ problem at all. How can the problem even occur at all? Perhaps
related to a rarely used kernel option that trigger it, or a bug in the PMON boot
firmware? I don't know...
But I think some versions of the PMON boot firmware have odd bugs. I'm currently
using the PMON firmware built by an OpenBSD developer, but I cannot find the link
anymore.
Tom Li
On Feb 11, 2019, Aaro Koskinen <[email protected]> wrote:
> ATA (libata) CS5536 driver is having issues with spurious IRQs and often
> disables IRQs completely during the boot. You should see a warning
> in dmesg.
Yup, thanks, it shows up first thing during boot. I hadn't seen that
one in a while. Thanks.
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Feb 11, 2019, Tom Li <[email protected]> wrote:
> We've just identified and confirmed the source of the shutdown problem a
> few days ago on this mailing list.
> You can pick up the patch from:
> https://lore.kernel.org/lkml/[email protected]/
> A patch has been authored and submitted by Aaro Koskinen, but currently it
> seems stuck in the mailing list because the maintainer worries about regression
> and asks for testing on more MIPS systems, although we believe it's a trivial
> patch.
Thanks. I've just brought it into my local copies of loongson-community
branches for 4.4, 4.9, 4.14, 4.19, and 4.20, and master, after verifying
that it enables at least 4.19, 4.20 and master to reboot without a power
cycle, and I'll include it in upcoming freeloong builds, except for 3.16
and 3.18, where it doesn't apply.
> In addition, I've discovered and fixed another bug preventing the machine from
> shutting down, this patch has already merged into the Linus tree.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a96669d77897ff3613157bf43f875739205d66d
Hmm, I don't think I've ever hit this one. Do you happen to know how
far back it might be needed?
> I'll continue working on upstreaming these out-of-tree drivers as my personal
> project. I hope you'll be able to use a fully-functional machine with the mainline
> kernel soon, my current target is Linux 5.3.
Thanks!
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Feb 11, 2019, Aaro Koskinen <[email protected]> wrote:
> ATA (libata) CS5536 driver is having issues with spurious IRQs and often
> disables IRQs completely during the boot. You should see a warning
> in dmesg. This was the reason for slowness on my FuLoong mini-PC. A
> workaround is to switch to old IDE driver.
Thanks. I see a NIEN quirk in ide-iops.c that's enabled for the hard
drive model I've got on my yeeloong, but that's not even compiled in my
freeloong builds. I don't see any changes in libata between 4.19 and
4.20 that could explain the regression either.
I'm afraid there's no observable change in behavior after installing the
proposed patch at
https://lore.kernel.org/linux-mips/[email protected]/
The kernel still disables irq14 early on, and then runs slow.
Tom, why do you say bisecting this is impossible? I realize you wrote
you did so for 24 hours non-stop, but... I'm curious as to what
obstacles you ran into. It's such a reproducible problem for me that I
can't see how bisecting it might be difficult.
Or were by any chance you talking about the reboot/shutdown problem
then?
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Sun, Feb 17, 2019 at 01:59:26AM -0300, Alexandre Oliva wrote:
> On Feb 11, 2019, Aaro Koskinen <[email protected]> wrote:
>
> > ATA (libata) CS5536 driver is having issues with spurious IRQs and often
> > disables IRQs completely during the boot. You should see a warning
> > in dmesg. This was the reason for slowness on my FuLoong mini-PC. A
> > workaround is to switch to old IDE driver.
>
> Thanks. I see a NIEN quirk in ide-iops.c that's enabled for the hard
> drive model I've got on my yeeloong, but that's not even compiled in my
> freeloong builds. I don't see any changes in libata between 4.19 and
> 4.20 that could explain the regression either.
I tested few older kernels, and it seems that the spurious IRQ issue has
been always there after switching to libata (commit 7ff7a5b1bfff). It has
been unnoticed as the 100000 irq limit wasn't reached during boot. But
since libata probe is asynchronous some other kernel thread may delay it,
and apparently some change in recent kernels adds enough delay to make the
IRQ count to hit the limit.
> I'm afraid there's no observable change in behavior after installing the
> proposed patch at
> https://lore.kernel.org/linux-mips/[email protected]/
>
> The kernel still disables irq14 early on, and then runs slow.
This hack works only for CONFIG_PATA_CS5536. You are probably using PATA_AMD.
A.
On Feb 17, 2019, Aaro Koskinen <[email protected]> wrote:
> I tested few older kernels, and it seems that the spurious IRQ issue has
> been always there after switching to libata (commit 7ff7a5b1bfff). It has
> been unnoticed as the 100000 irq limit wasn't reached during boot.
I see, thanks. That would probably make it hard to bisect indeed.
>> The kernel still disables irq14 early on, and then runs slow.
> This hack works only for CONFIG_PATA_CS5536. You are probably using PATA_AMD.
That's a reasonable guess, but I don't think so. I do have PATA_AMD
enabled as a module indeed, but it's not even loaded, much as I can
tell, whereas PATA_CS5536 is built into the kernel image, and dmesg
says:
[ 4.460000] scsi host0: pata_cs5536
[ 4.464000] scsi host1: pata_cs5536
[ 4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14
[ 4.464000] ata2: DUMMY
[ 4.464000] pcnet32: [...]
[ 4.644000] random: [...]
[ 5.908000] irq 14: nobody cared (try booting with the "irqpoll" option)
Just to be sure, I added some printks to cs5536_noop_freeze, and here's
what I got in dmesg instead:
[ 4.452000] pata_cs5536: freeze: checking status...
[ 4.452000] pata_cs5536: freeze: checked, clearing...
[ 4.452000] pata_cs5536: freeze: cleared
[ 4.460000] scsi host0: pata_cs5536
[ 4.464000] scsi host1: pata_cs5536
[ 4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14
[ 4.464000] ata2: DUMMY
[ 4.464000] pcnet32: [...]
[ 4.468000] pata_cs5536: freeze: checking status...
[ 4.468000] pata_cs5536: freeze: checked, clearing...
[ 4.468000] pata_cs5536: freeze: cleared
[ 4.652000] random: [...]
[ 5.920000] irq 14: nobody cared (try booting with the "irqpoll" option)
now, maybe I just don't understand what effects the patch was supposed
to have.
The system still feels very slow, but I haven't timed anything; could it
be that it had the effect of keeping the IRQ functional after all, but
5.0-rc6 is slower than earlier kernels for other reasons? Like,
/proc/irq/14/pata_cs5536/ is there, but I haven't checked whether it was
there before the patch.
Do you suggest any way to tell whether it had the intended effect?
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Sun, 17 Feb 2019, Alexandre Oliva wrote:
> That's a reasonable guess, but I don't think so. I do have PATA_AMD
> enabled as a module indeed, but it's not even loaded, much as I can
> tell, whereas PATA_CS5536 is built into the kernel image, and dmesg
> says:
>
> [ 4.460000] scsi host0: pata_cs5536
> [ 4.464000] scsi host1: pata_cs5536
> [ 4.464000] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x4c60 irq 14
> [ 4.464000] ata2: DUMMY
> [ 4.464000] pcnet32: [...]
> [ 4.644000] random: [...]
> [ 5.908000] irq 14: nobody cared (try booting with the "irqpoll" option)
Is there an MMIO completion barrier missing there somewhere by any chance
causing an IRQ that has been handled already to be redelivered because an
MMIO write meant to clear the IRQ at its origin at handler's completion
has not reached its destination before interrupts have been reenabled in
the issuing CPU? Just a thought.
Maciej
On Feb 17, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
> Is there an MMIO completion barrier missing there somewhere by any chance
> causing an IRQ that has been handled already to be redelivered because an
> MMIO write meant to clear the IRQ at its origin at handler's completion
> has not reached its destination before interrupts have been reenabled in
> the issuing CPU? Just a thought.
I've finally got a chance to bisect the IRQ14 (nobody cared) regression
on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO
accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f).
I've only just started trying to figure out what exactly in the change
leads to problems. So far, I've determined that changing both uses of
__BUILD_IOPORT_SINGLE so that barrier is passed as 0 rather than 1
removes the undesirable effects, both on top of that patch, and on top
of v5.0:
#define __BUILD_IOPORT_PFX(bus, bwlq, type) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p)
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0,) \
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0, _p)
Since the barriers didn't seem to be a problem for __BUILD_MEMORY_PFX, I
figured I'd try to enable barriers in the __mem_ variants, but leave
them alone for io, and that worked (without hitting the IRQ14 issue) at
least on the yeeloong:
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 845fbbc7a2e34..0a3a327d4e764 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -467,13 +467,13 @@ BUILDIO_MEM(w, u16)
BUILDIO_MEM(l, u32)
BUILDIO_MEM(q, u64)
-#define __BUILD_IOPORT_PFX(bus, bwlq, type) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p)
+#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p)
#define BUILDIO_IOPORT(bwlq, type) \
- __BUILD_IOPORT_PFX(, bwlq, type) \
- __BUILD_IOPORT_PFX(__mem_, bwlq, type)
+ __BUILD_IOPORT_PFX(, bwlq, type, 0) \
+ __BUILD_IOPORT_PFX(__mem_, bwlq, type, 1)
BUILDIO_IOPORT(b, u8)
BUILDIO_IOPORT(w, u16)
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
Hi Alexandre,
I'm away on holiday and also connectivity is so-so here, so just a quick
reply.
> > Is there an MMIO completion barrier missing there somewhere by any chance
> > causing an IRQ that has been handled already to be redelivered because an
> > MMIO write meant to clear the IRQ at its origin at handler's completion
> > has not reached its destination before interrupts have been reenabled in
> > the issuing CPU? Just a thought.
>
> I've finally got a chance to bisect the IRQ14 (nobody cared) regression
> on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO
> accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f).
Thanks for looking into it.
> I've only just started trying to figure out what exactly in the change
> leads to problems. So far, I've determined that changing both uses of
> __BUILD_IOPORT_SINGLE so that barrier is passed as 0 rather than 1
> removes the undesirable effects, both on top of that patch, and on top
> of v5.0:
>
> #define __BUILD_IOPORT_PFX(bus, bwlq, type) \
> - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
> - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p)
> + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0,) \
> + __BUILD_IOPORT_SINGLE(bus, bwlq, type, 0, 0, _p)
So this seems backwards to me, port I/O is supposed to be strongly
ordered, so if removing the ordering guarantee "fixes" your problem, then
there must be a second bottom here. Offhand either there is a race
condition somewhere which the lack of ordering here covers somehow, or
there is a silicon erratum of some sort somewhere that the SYNC
instruction triggers.
A further investigation is required I'm afraid. Does your platform use
`war_io_reorder_wmb'?
Maciej
Hi,
On Thu, Mar 07, 2019 at 03:41:01AM -0300, Alexandre Oliva wrote:
> On Feb 17, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
>
> > Is there an MMIO completion barrier missing there somewhere by any chance
> > causing an IRQ that has been handled already to be redelivered because an
> > MMIO write meant to clear the IRQ at its origin at handler's completion
> > has not reached its destination before interrupts have been reenabled in
> > the issuing CPU? Just a thought.
>
> I've finally got a chance to bisect the IRQ14 (nobody cared) regression
> on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO
> accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f).
This is interesting, thanks for the research, but I'm afraid it doesn't seem
to be the root cause.
While your patch seems to help also on Mini-PC (only briefly tested with
a dozen of reboots), there's still excessive amount of interrupts during
the boot - I'm getting something like 50000-70000, while with the old
IDE driver it's around 2500.
~ # cat /proc/irq/14/spurious
count 57805
unhandled 57799
last_unhandled 4294673092 ms
~ # cat /proc/interrupts
CPU0
2: 0 XT-PIC 2 cascade
3: 174 XT-PIC 3 ttyS0
5: 14653 XT-PIC 5 timer
11: 0 XT-PIC 11 ehci_hcd:usb1, ohci_hcd:usb2
14: 57805 XT-PIC 14 pata_amd
15: 0 XT-PIC 15 pata_amd
18: 0 MIPS 2 cascade
22: 0 MIPS 6 cascade
ERR: 0
Could you check your /proc/interrupts counters after the boot with
your change?
A.
On Mar 7, 2019, Aaro Koskinen <[email protected]> wrote:
> Hi,
> On Thu, Mar 07, 2019 at 03:41:01AM -0300, Alexandre Oliva wrote:
>> On Feb 17, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
>>
>> > Is there an MMIO completion barrier missing there somewhere by any chance
>> > causing an IRQ that has been handled already to be redelivered because an
>> > MMIO write meant to clear the IRQ at its origin at handler's completion
>> > has not reached its destination before interrupts have been reenabled in
>> > the issuing CPU? Just a thought.
>>
>> I've finally got a chance to bisect the IRQ14 (nobody cared) regression
>> on my yeeloong. It took me to MIPS: Enforce strong ordering for MMIO
>> accessors (commit 3d474dacae72ac0f28228b328cfa953b05484b7f).
> This is interesting, thanks for the research, but I'm afraid it doesn't seem
> to be the root cause.
Oh, I didn't mean to offer anything definitive, just to report the
findings of my investigation so far, since I had no clue when I'd find
another significant time slot to get back onto it.
> Could you check your /proc/interrupts counters after the boot with
> your change?
16k to 18k interrupts after booting up into multi-user mode, including
some idle time (and possibly anacron jobs) in the case that got more
interrupts. That's not unlike what I get with 4.19.26-gnu.
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Mar 7, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
> So this seems backwards to me, port I/O is supposed to be strongly
> ordered, so if removing the ordering guarantee "fixes" your problem, then
> there must be a second bottom here.
Well, it partially restores the earlier state, so the actual bug goes
back to latent. We just have to find out what it is. My plan, in the
next time slot I can devote to this investigation, is to try to bisect
source files that use these interfaces to identify a minimal set that
can switch to the new barriers without breaking anything, then report
back after failing to make sense of it myself ;-) So, yeah, several
more bottoms to dig through.
> Does your platform use `war_io_reorder_wmb'?
Err... I'm not sure I understand your question.
It uses it in __BUILD_IOPORT_SINGLE within the expanded out function,
given !barrier, but you already knew that.
Did you mean to ask what war_io_reorder_wmb expand to, or whether there
are other uses of war_io_reorder_wmb, or what?
--
Alexandre Oliva, freedom fighter https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás-GNUChe
On Thu, 7 Mar 2019, Alexandre Oliva wrote:
> > Does your platform use `war_io_reorder_wmb'?
>
> Err... I'm not sure I understand your question.
>
> It uses it in __BUILD_IOPORT_SINGLE within the expanded out function,
> given !barrier, but you already knew that.
>
> Did you mean to ask what war_io_reorder_wmb expand to, or whether there
> are other uses of war_io_reorder_wmb, or what?
Umm, my question was ill-formed, sorry. There's a convoluted history
recorded for that macro in the repo and it could be that it can be removed
now that we have proper barriers in place, except possibly to avoid the
Octeon Errata Core-301 (whatever that is).
Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system?
Maciej
On Mar 8, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
> Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system?
No, it expands to `barrier' on the yeeloong:
CONFIG_CPU_LOONGSON2F=y
CONFIG_CPU_LOONGSON2F_WORKAROUNDS=y
CONFIG_CPU_LOONGSON2=y
CONFIG_SYS_HAS_CPU_LOONGSON2F=y
I've finally managed to do the bisection on object files I mentioned I'd
do to try to pinpoint where __BUILT_IOPORT_PFX with barrier rather than
!barrier regressed.
I found that forcing barrier off for drivers/irqchip/irq-i8259 was
enough to avoid the problem.
(I further narrowed it down to byte I/O, which is no surprise
considering irq-i8259 doesn't seem to use any non-byte I/O.)
Then I narrowed it down to output only.
A Loongson2F kernel built with the patch below works at normal speed.
I've also keyed the -1 barrier selector to compiling the irq-i8259
driver only, and that got me a functional kernel, but I'm not confident
that the same issues that affect the interrupt controller, preventing it
from initializing correctly, is not also affecting other drivers, just
in less visible ways, so the patch conservatively reverts to the older
barriers for all I/O (i.e., non-mem) out primitives.
I've tested this on a yeeloong on top of v5.1.5.
I'm tempted to start using this patch in my Freeloong builds of GNU
Linux-libre for gnewsense/yeeloong of 5.0 and 5.1 stable releases, to
try to make them usable. Can anyone suggest any reason why it might be
risky to do so, moving on as much as I could to the new barriers,
sticking to the 4.19-one only for non-mem out? As in, could mixing the
barriers be riskier than reverting to the 4.19 barriers everywhere?
Thanks in advance for any insights and recommendations,
diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 845fbbc7a2e3..04be4758d4ff 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -416,7 +416,7 @@ static inline void pfx##out##bwlq##p(type val, unsigned long port) \
volatile type *__addr; \
type __val; \
\
- if (barrier) \
+ if (barrier > 0) \
iobarrier_rw(); \
else \
war_io_reorder_wmb(); \
@@ -467,13 +467,22 @@ BUILDIO_MEM(w, u16)
BUILDIO_MEM(l, u32)
BUILDIO_MEM(q, u64)
-#define __BUILD_IOPORT_PFX(bus, bwlq, type) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
- __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p)
+#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \
+ __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p)
+
+/* Choose the kind of barrier used for out in __BUILD_IOPORT_SINGLE in
+ non-__mem_ variants. On Loongson2F, irq-i8259 fails to initialize
+ when this is defined to 1. */
+#if defined(CONFIG_CPU_LOONGSON2)
+#define USE_IO_BARRIER_FOR_NON_MEM_OUT -1
+#else
+#define USE_IO_BARRIER_FOR_NON_MEM_OUT 1
+#endif
#define BUILDIO_IOPORT(bwlq, type) \
- __BUILD_IOPORT_PFX(, bwlq, type) \
- __BUILD_IOPORT_PFX(__mem_, bwlq, type)
+ __BUILD_IOPORT_PFX(, bwlq, type, USE_IO_BARRIER_FOR_NON_MEM_OUT) \
+ __BUILD_IOPORT_PFX(__mem_, bwlq, type, 2)
BUILDIO_IOPORT(b, u8)
BUILDIO_IOPORT(w, u16)
--
Alexandre Oliva, freedom fighter he/him https://FSFLA.org/blogs/lxo
Be the change, be Free! FSF Latin America board member
GNU Toolchain Engineer Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
Hi,
On Sun, May 26, 2019 at 06:19:00AM -0300, Alexandre Oliva wrote:
> On Mar 8, 2019, "Maciej W. Rozycki" <[email protected]> wrote:
>
> > Anyway I meant: does `war_io_reorder_wmb' expand to `wmb' on your system?
>
> No, it expands to `barrier' on the yeeloong:
>
> CONFIG_CPU_LOONGSON2F=y
> CONFIG_CPU_LOONGSON2F_WORKAROUNDS=y
> CONFIG_CPU_LOONGSON2=y
> CONFIG_SYS_HAS_CPU_LOONGSON2F=y
>
>
> I've finally managed to do the bisection on object files I mentioned I'd
> do to try to pinpoint where __BUILT_IOPORT_PFX with barrier rather than
> !barrier regressed.
>
> I found that forcing barrier off for drivers/irqchip/irq-i8259 was
> enough to avoid the problem.
>
> (I further narrowed it down to byte I/O, which is no surprise
> considering irq-i8259 doesn't seem to use any non-byte I/O.)
>
> Then I narrowed it down to output only.
>
> A Loongson2F kernel built with the patch below works at normal speed.
> I've also keyed the -1 barrier selector to compiling the irq-i8259
> driver only, and that got me a functional kernel, but I'm not confident
> that the same issues that affect the interrupt controller, preventing it
> from initializing correctly, is not also affecting other drivers, just
> in less visible ways, so the patch conservatively reverts to the older
> barriers for all I/O (i.e., non-mem) out primitives.
>
> I've tested this on a yeeloong on top of v5.1.5.
>
> I'm tempted to start using this patch in my Freeloong builds of GNU
> Linux-libre for gnewsense/yeeloong of 5.0 and 5.1 stable releases, to
> try to make them usable. Can anyone suggest any reason why it might be
> risky to do so, moving on as much as I could to the new barriers,
> sticking to the 4.19-one only for non-mem out? As in, could mixing the
> barriers be riskier than reverting to the 4.19 barriers everywhere?
>
> Thanks in advance for any insights and recommendations,
Thanks for this work! I'm not yet sure if this completely solves the
issue, but it's surely an improvement.
Testing multiple reboots on Loongson Mini-PC with the libata driver,
/proc/irq/14/spurious still shows tens of thousands spurious interrupts,
e.g.
count 79453
unhandled 76998
last_unhandled 4294673096 ms
where as it with the legacy IDE driver it stays reliably at 0 and the total
count is well under 10000.
However, with your patch the "nobody cared" is never reached so all is
good. I tried 10 boots with the patch and all were successful. Without
the patch 8 out of 10 failed with the "nobody cared" warning.
A.
> diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
> index 845fbbc7a2e3..04be4758d4ff 100644
> --- a/arch/mips/include/asm/io.h
> +++ b/arch/mips/include/asm/io.h
> @@ -416,7 +416,7 @@ static inline void pfx##out##bwlq##p(type val, unsigned long port) \
> volatile type *__addr; \
> type __val; \
> \
> - if (barrier) \
> + if (barrier > 0) \
> iobarrier_rw(); \
> else \
> war_io_reorder_wmb(); \
> @@ -467,13 +467,22 @@ BUILDIO_MEM(w, u16)
> BUILDIO_MEM(l, u32)
> BUILDIO_MEM(q, u64)
>
> -#define __BUILD_IOPORT_PFX(bus, bwlq, type) \
> - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,) \
> - __BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0, _p)
> +#define __BUILD_IOPORT_PFX(bus, bwlq, type, barrier) \
> + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0,) \
> + __BUILD_IOPORT_SINGLE(bus, bwlq, type, barrier, 0, _p)
> +
> +/* Choose the kind of barrier used for out in __BUILD_IOPORT_SINGLE in
> + non-__mem_ variants. On Loongson2F, irq-i8259 fails to initialize
> + when this is defined to 1. */
> +#if defined(CONFIG_CPU_LOONGSON2)
> +#define USE_IO_BARRIER_FOR_NON_MEM_OUT -1
> +#else
> +#define USE_IO_BARRIER_FOR_NON_MEM_OUT 1
> +#endif
>
> #define BUILDIO_IOPORT(bwlq, type) \
> - __BUILD_IOPORT_PFX(, bwlq, type) \
> - __BUILD_IOPORT_PFX(__mem_, bwlq, type)
> + __BUILD_IOPORT_PFX(, bwlq, type, USE_IO_BARRIER_FOR_NON_MEM_OUT) \
> + __BUILD_IOPORT_PFX(__mem_, bwlq, type, 2)
>
> BUILDIO_IOPORT(b, u8)
> BUILDIO_IOPORT(w, u16)
>
>
> --
> Alexandre Oliva, freedom fighter he/him https://FSFLA.org/blogs/lxo
> Be the change, be Free! FSF Latin America board member
> GNU Toolchain Engineer Free Software Evangelist
> Hay que enGNUrecerse, pero sin perder la terGNUra jam?s - Che GNUevara
On Tue, 11 Jun 2019, Aaro Koskinen wrote:
> However, with your patch the "nobody cared" is never reached so all is
> good. I tried 10 boots with the patch and all were successful. Without
> the patch 8 out of 10 failed with the "nobody cared" warning.
I wouldn't call it "good", just less obvious or painful. This is still
causing wasted CPU cycles that are used for taking the phantom interrupts.
There is clearly a completion barrier missing somewhere that causes the
interrupt request to linger beyond the point interrupts are reenabled at
the CPU.
One way to attempt to narrow it down might be taking a backtrace from
where IRQ 14 is found to be spurious. This would indicate the offending
interrupt unmask action. E.g. I see no explicit completion barrier
between the final `outb' in `mask_and_ack_8259A' and the following call to
`raw_spin_unlock_irqrestore', which are obviously otherwise unordered WRT
each other (because `outb' is I/O or MMIO and `raw_spin_unlock_irqrestore'
is contained within the CPU on UP). I can see provisions however for
issuing an architecture-specific barrier in `do_raw_spin_unlock', which is
the workhorse for `raw_spin_unlock_irqrestore', so maybe this is the place
to look into?
Also how's IRQ 14 registered as indicated by /proc/interrupts?
Maciej
Hi,
On Wed, Jun 12, 2019 at 06:55:28AM +0100, Maciej W. Rozycki wrote:
> On Tue, 11 Jun 2019, Aaro Koskinen wrote:
>
> > However, with your patch the "nobody cared" is never reached so all is
> > good. I tried 10 boots with the patch and all were successful. Without
> > the patch 8 out of 10 failed with the "nobody cared" warning.
>
> I wouldn't call it "good", just less obvious or painful. This is still
> causing wasted CPU cycles that are used for taking the phantom interrupts.
>
> There is clearly a completion barrier missing somewhere that causes the
> interrupt request to linger beyond the point interrupts are reenabled at
> the CPU.
>
> One way to attempt to narrow it down might be taking a backtrace from
> where IRQ 14 is found to be spurious. This would indicate the offending
> interrupt unmask action. E.g. I see no explicit completion barrier
The first spurious IRQ is right after the driver registers:
[ 4.732000] [<ffffffff8020efac>] show_stack+0x90/0x140
[ 4.732000] [<ffffffff8052850c>] ata_bmdma_interrupt+0x2b4/0x39c
[ 4.732000] [<ffffffff80260368>] __handle_irq_event_percpu+0xb0/0x178
[ 4.732000] [<ffffffff80260464>] handle_irq_event_percpu+0x34/0x9c
[ 4.732000] [<ffffffff80260508>] handle_irq_event+0x3c/0x74
[ 4.732000] [<ffffffff80264d28>] handle_level_irq+0x118/0x154
[ 4.732000] [<ffffffff8025f978>] generic_handle_irq+0x34/0x50
[ 4.732000] [<ffffffff806b9600>] do_IRQ+0x18/0x24
[ 4.732000] [<ffffffff80208ce4>] handle_int+0x17c/0x188
[ 4.732000] [<ffffffff806b30c8>] arch_local_irq_restore+0x18/0x30
[ 4.732000] [<ffffffff802621f0>] __setup_irq+0x660/0x7a0
[ 4.732000] [<ffffffff80262798>] request_threaded_irq+0x114/0x19c
[ 4.732000] [<ffffffff80265d7c>] devm_request_threaded_irq+0xa0/0x10c
[ 4.732000] [<ffffffff80527f00>] ata_pci_sff_activate_host+0x1c0/0x274
[ 4.732000] [<ffffffff80528a30>] ata_pci_init_one+0x170/0x1c4
[ 4.732000] [<ffffffff8052a288>] cs5536_init_one+0x94/0xb8
and the following ones do not seem to provide much info as I can only
see the IRQ stack:
[ 4.736000] [<ffffffff8020efac>] show_stack+0x90/0x140
[ 4.736000] [<ffffffff8052850c>] ata_bmdma_interrupt+0x2b4/0x39c
[ 4.736000] [<ffffffff80260368>] __handle_irq_event_percpu+0xb0/0x178
[ 4.736000] [<ffffffff80260464>] handle_irq_event_percpu+0x34/0x9c
[ 4.736000] [<ffffffff80260508>] handle_irq_event+0x3c/0x74
[ 4.736000] [<ffffffff80264d28>] handle_level_irq+0x118/0x154
[ 4.736000] [<ffffffff8025f978>] generic_handle_irq+0x34/0x50
[ 4.736000] [<ffffffff806b9600>] do_IRQ+0x18/0x24
[ 4.736000] [<ffffffff80208ce4>] handle_int+0x17c/0x188
[ 4.736000] [<ffffffff8022f330>] irq_exit+0x68/0xcc
> between the final `outb' in `mask_and_ack_8259A' and the following call to
> `raw_spin_unlock_irqrestore', which are obviously otherwise unordered WRT
> each other (because `outb' is I/O or MMIO and `raw_spin_unlock_irqrestore'
> is contained within the CPU on UP). I can see provisions however for
> issuing an architecture-specific barrier in `do_raw_spin_unlock', which is
> the workhorse for `raw_spin_unlock_irqrestore', so maybe this is the place
> to look into?
>
> Also how's IRQ 14 registered as indicated by /proc/interrupts?
Not sure what you mean but here's the output:
$ cat /proc/interrupts
CPU0
2: 0 XT-PIC 2 cascade
3: 20 XT-PIC 3 ttyS0
5: 543358 XT-PIC 5 timer
11: 0 XT-PIC 11 ehci_hcd:usb1, ohci_hcd:usb2
14: 100000 XT-PIC 14 pata_cs5536
18: 0 MIPS 2 cascade
22: 0 MIPS 6 cascade
36: 3052 bonito_irq eth0
ERR: 0
A.
On Wed, Jun 12, 2019 at 12:25 PM Aaro Koskinen <[email protected]> wrote:
>
> Hi,
>
> On Wed, Jun 12, 2019 at 06:55:28AM +0100, Maciej W. Rozycki wrote:
> > On Tue, 11 Jun 2019, Aaro Koskinen wrote:
> >
> > > However, with your patch the "nobody cared" is never reached so all is
> > > good. I tried 10 boots with the patch and all were successful. Without
> > > the patch 8 out of 10 failed with the "nobody cared" warning.
> >
> > I wouldn't call it "good", just less obvious or painful. This is still
> > causing wasted CPU cycles that are used for taking the phantom interrupts.
> >
> > There is clearly a completion barrier missing somewhere that causes the
> > interrupt request to linger beyond the point interrupts are reenabled at
> > the CPU.
> >
> > One way to attempt to narrow it down might be taking a backtrace from
> > where IRQ 14 is found to be spurious. This would indicate the offending
> > interrupt unmask action. E.g. I see no explicit completion barrier
>
> The first spurious IRQ is right after the driver registers:
>
> [ 4.732000] [<ffffffff8020efac>] show_stack+0x90/0x140
> [ 4.732000] [<ffffffff8052850c>] ata_bmdma_interrupt+0x2b4/0x39c
> [ 4.732000] [<ffffffff80260368>] __handle_irq_event_percpu+0xb0/0x178
> [ 4.732000] [<ffffffff80260464>] handle_irq_event_percpu+0x34/0x9c
> [ 4.732000] [<ffffffff80260508>] handle_irq_event+0x3c/0x74
> [ 4.732000] [<ffffffff80264d28>] handle_level_irq+0x118/0x154
> [ 4.732000] [<ffffffff8025f978>] generic_handle_irq+0x34/0x50
> [ 4.732000] [<ffffffff806b9600>] do_IRQ+0x18/0x24
> [ 4.732000] [<ffffffff80208ce4>] handle_int+0x17c/0x188
> [ 4.732000] [<ffffffff806b30c8>] arch_local_irq_restore+0x18/0x30
> [ 4.732000] [<ffffffff802621f0>] __setup_irq+0x660/0x7a0
> [ 4.732000] [<ffffffff80262798>] request_threaded_irq+0x114/0x19c
> [ 4.732000] [<ffffffff80265d7c>] devm_request_threaded_irq+0xa0/0x10c
> [ 4.732000] [<ffffffff80527f00>] ata_pci_sff_activate_host+0x1c0/0x274
> [ 4.732000] [<ffffffff80528a30>] ata_pci_init_one+0x170/0x1c4
> [ 4.732000] [<ffffffff8052a288>] cs5536_init_one+0x94/0xb8
>
> and the following ones do not seem to provide much info as I can only
> see the IRQ stack:
>
> [ 4.736000] [<ffffffff8020efac>] show_stack+0x90/0x140
> [ 4.736000] [<ffffffff8052850c>] ata_bmdma_interrupt+0x2b4/0x39c
> [ 4.736000] [<ffffffff80260368>] __handle_irq_event_percpu+0xb0/0x178
> [ 4.736000] [<ffffffff80260464>] handle_irq_event_percpu+0x34/0x9c
> [ 4.736000] [<ffffffff80260508>] handle_irq_event+0x3c/0x74
> [ 4.736000] [<ffffffff80264d28>] handle_level_irq+0x118/0x154
> [ 4.736000] [<ffffffff8025f978>] generic_handle_irq+0x34/0x50
> [ 4.736000] [<ffffffff806b9600>] do_IRQ+0x18/0x24
> [ 4.736000] [<ffffffff80208ce4>] handle_int+0x17c/0x188
> [ 4.736000] [<ffffffff8022f330>] irq_exit+0x68/0xcc
>
> > between the final `outb' in `mask_and_ack_8259A' and the following call to
> > `raw_spin_unlock_irqrestore', which are obviously otherwise unordered WRT
> > each other (because `outb' is I/O or MMIO and `raw_spin_unlock_irqrestore'
> > is contained within the CPU on UP). I can see provisions however for
> > issuing an architecture-specific barrier in `do_raw_spin_unlock', which is
> > the workhorse for `raw_spin_unlock_irqrestore', so maybe this is the place
> > to look into?
> >
> > Also how's IRQ 14 registered as indicated by /proc/interrupts?
>
> Not sure what you mean but here's the output:
>
> $ cat /proc/interrupts
> CPU0
> 2: 0 XT-PIC 2 cascade
> 3: 20 XT-PIC 3 ttyS0
> 5: 543358 XT-PIC 5 timer
> 11: 0 XT-PIC 11 ehci_hcd:usb1, ohci_hcd:usb2
> 14: 100000 XT-PIC 14 pata_cs5536
> 18: 0 MIPS 2 cascade
> 22: 0 MIPS 6 cascade
> 36: 3052 bonito_irq eth0
> ERR: 0
>
> A.
Has any more progress been made?
Thanks,
Matt
Hi,
On Wed, Jan 22, 2020 at 04:20:26PM -0800, Matt Turner wrote:
> Has any more progress been made?
I have given up using this hardware.
A.