2008-06-16 18:24:46

by Arjan van de Ven

Subject: Oops report for the week preceding June 16th, 2008

This week, a total of 3877 oopses and warnings have been reported,
compared to 3390 reports in the previous week.

Recently, Fedora put out an updated kernel that contained a wireless update;
unfortunately, this update was rather broken and caused various things to show up
in the top 20. A few days later, another update fixing the most obvious ones
was released, with the result that only ranks 2 and 9 are from the broken update,
rather than a lot more.

A new feature this week: for certain types of oopses (they need to happen in vmlinux, etc.),
the website now shows a mixed view of code/disassembly of the oops.
For example, on http://www.kerneloops.org/searchweek.php?search=page_remove_rmap
you can hover your mouse over the "Decode" word and it'll show the disassembled view.
This information is also part of the detailed view of each suitable oops,
and is also in the git exported view (git clone git://www.kerneloops.org/ )



Per file statistics
570 external/madwifi/wrapper (P)
323 drivers/net/wireless/b43/main.c
284 external/madwifi/binary (P)
276 drivers/parport/procfs.c
230 fs/sysfs/dir.c
208 fs/jbd/journal.c
174 security/selinux/hooks.c
137 net/mac80211/util.c
81 kernel/time/tick-broadcast.c
58 fs/ext3/super.c
49 net/mac80211/main.c
48 external/nvidia/binary (P)
45 mm/rmap.c


Seen with untainted kernels
---------------------------
Rank 2: b43_generate_noise_sample (warning)
Reported 323 times (389 total reports)
[fixed] too strict WARN_ON in b43 driver
Fix available; not merged in mainline yet
This warning was last seen in version 2.6.26-rc4-git5, and first seen in 2.6.25.3.
More info: http://www.kerneloops.org/searchweek.php?search=b43_generate_noise_sample

Rank 4: parport_device_proc_register (warning)
Reported 276 times (1290 total reports)
Duplicate /proc registration in the parport driver
This warning was last seen in version 2.6.26-rc5-git7, and first seen in 2.6.24-rc5.
More info: http://www.kerneloops.org/searchweek.php?search=parport_device_proc_register

Rank 5: sysfs_add_one (warning)
Reported 217 times (1280 total reports)
Standard duplicate device name registration issue, still very much alive
This warning was last seen in version 2.6.26-rc3, and first seen in 2.6.24-rc6.
More info: http://www.kerneloops.org/searchweek.php?search=sysfs_add_one

Rank 6: journal_update_superblock (warning)
Reported 208 times (972 total reports)
Likely caused by the user removing a USB stick while mounted
This warning was last seen in version 2.6.26, and first seen in 2.6.24-rc6-git1.
More info: http://www.kerneloops.org/searchweek.php?search=journal_update_superblock

Rank 9: ieee80211_iterate_active_interfaces (warning)
Reported 137 times (240 total reports)
[fedora] Fedora merged a buggy rt25xx wireless driver patch
This warning was last seen in version 2.6.25.4, and first seen in 2.6.25.4.
More info: http://www.kerneloops.org/searchweek.php?search=ieee80211_iterate_active_interfaces

Rank 10: tick_broadcast_oneshot_control (softlockup)
Reported 81 times (425 total reports)
Hard to trace down issue; but it's too frequent to be a fluke
This softlockup was last seen in version 2.6.25.6, and first seen in 2.6.24-rc4.
More info: http://www.kerneloops.org/searchweek.php?search=tick_broadcast_oneshot_control

Rank 11: ext3_commit_super (warning)
Reported 53 times (237 total reports)
Likely caused by the user removing a USB stick while mounted
This warning was last seen in version 2.6.25.6, and first seen in 2.6.24.
More info: http://www.kerneloops.org/searchweek.php?search=ext3_commit_super

Rank 12: default_idle (oops)
Reported 44 times (118 total reports)
Similar to the tick_broadcast_oneshot_control issue; hard to trace down
This oops was last seen in version 2.6.26-rc5, and first seen in 2.6.21.3.
More info: http://www.kerneloops.org/searchweek.php?search=default_idle


Only seen with tainted kernels
------------------------------
Rank 1: ath_dynamic_sysctl_register (warning)
Reported 376 times (3007 total reports)
[external] Bug in the proprietary madwifi driver
warning only shows up in tainted kernels
This warning was last seen in version 2.6.25.6, and first seen in 2.6.24.
More info: http://www.kerneloops.org/searchweek.php?search=ath_dynamic_sysctl_register

Rank 3: init_ath_hal (warning)
Reported 284 times (1843 total reports)
[external] Bug in the proprietary madwifi driver
warning only shows up in tainted kernels
This warning was last seen in version 2.6.25.6, and first seen in 2.6.24.
More info: http://www.kerneloops.org/searchweek.php?search=init_ath_hal

Rank 7: ath_sysctl_register (warning)
Reported 194 times (810 total reports)
[external] Bug in the proprietary madwifi driver
warning only shows up in tainted kernels
This warning was last seen in version 2.6.25.6, and first seen in 2.6.24-rc4-git4.
More info: http://www.kerneloops.org/searchweek.php?search=ath_sysctl_register

Rank 8: task_has_capability (warning)
Reported 166 times (580 total reports)
[out of tree] Bug in the proprietary firegl driver
warning only shows up in tainted kernels
This warning was last seen in version 2.6.25.6, and first seen in 2.6.25.
More info: http://www.kerneloops.org/searchweek.php?search=task_has_capability

Rank 13: VNetBridgeDown (warning)
Reported 41 times (230 total reports)
[external] Bug in the proprietary VMWare drivers
warning only shows up in tainted kernels
This warning was last seen in version 2.6.25.6, and first seen in 2.6.24.
More info: http://www.kerneloops.org/searchweek.php?search=VNetBridgeDown


2008-06-17 09:21:17

by Ingo Molnar

Subject: Re: Oops report for the week preceding June 16th, 2008


* Arjan van de Ven wrote:

> This week, a total of 3877 oopses and warnings have been reported,
> compared to 3390 reports in the previous week.
>
> Recently, Fedora put out an updated kernel that contained a wireless
> update; unfortunately, this update was rather broken and caused
> various things to show up in the top 20. A few days later, another
> update fixing the most obvious ones got released with the result that
> only rank 2 and 9 are from the broken update, rather than a lot more..

sidenote: i suspect Fedora has done this to enable more hardware, and/or
to fix mainline wireless bugs?

I wish we would do such new driver merging in mainline instead, so that
we had a single point of testing and single point of effort.

Same for Nouveau: Fedora carries it and i dont understand why such a
major piece of work is not done in mainline and not _helped by_
mainline. It's not like there would be any big risk from having such a
new, experimental 3D driver around - instead of people running nvidia.ko
that causes trouble in all sorts of other subsystems.

All the years of moaning about nvidia.ko and finally we have some real
OSS project and real chance of action but after a year of development
Nouveau still has not been picked up ...

When distros feel the need to add large and risky patches that IMO shows
process failure on our part and further isolates mainline from distros
and from testers.

While we dont want to merge anything that gets thrown at us, not merging
new, major, new-hardware-enabling OSS drivers in the mainline kernel is
almost the same thing as intentionally hurting OSS projects and helping
binary-only drivers.

Ingo

2008-06-17 09:27:03

by David Miller

Subject: Re: Oops report for the week preceding June 16th, 2008

From: Ingo Molnar
Date: Tue, 17 Jun 2008 11:20:23 +0200

> When distros feel the need to add large and risky patches that IMO shows
> process failure on our part and further isolates mainline from distros
> and from testers.

You say this to the point where you sound like a broken record. It's
a bit tiring, and nothing positive ever comes of these rants.

And I think you massively oversimplify the situation, to top it
off.

On the wireless front, I severely doubt... in fact I know because I'm
looking at every wireless merge going into my tree, that John Linville
is not holding back new drivers submissions from the current 2.6.26
tree. Neither is Jeff Garzik for non-wireless net drivers.

If the Fedora9 tree is based off of 2.6.25 or similar (it is), well
that's how life works. Stuff gets backported from mainline into
whatever they are using and stuff breaks from time to time.

Did you investigate any of these facts to figure out what the specific
situation is here? Or did some of your favorite keywords pop up in Arjan's
report so that you could unleash your favorite whine?

2008-06-17 15:34:53

by Ingo Molnar

Subject: Re: Oops report for the week preceding June 16th, 2008


* David Miller wrote:

> On the wireless front, I severely doubt... in fact I know because I'm
> looking at every wireless merge going into my tree, that John Linville
> is not holding back new drivers submissions from the current 2.6.26
> tree. Neither is Jeff Garzik for non-wireless net drivers.

i have no gripes about the current situation of wireless in linux-next,
other than it all came 1-2 years too late:

$ for ((v=12; v<27; v++)); do v2=v2.6.$[$v+1]; \
[ $v = 25 ] && v2=linus/master; \
[ $v = 26 ] && v2=linux-next/master; \
echo -n v2.6.$[$v+1]": "; \
git-diff --shortstat -M v2.6.$v..$v2 drivers/net/wireless/; done

v2.6.13: 16 files changed, 1707 insertions(+), 1353 deletions(-)
v2.6.14: 46 files changed, 40734 insertions(+), 756 deletions(-)
v2.6.15: 53 files changed, 8016 insertions(+), 4183 deletions(-)
v2.6.16: 37 files changed, 1818 insertions(+), 2513 deletions(-)
v2.6.17: 64 files changed, 17829 insertions(+), 2214 deletions(-)
v2.6.18: 78 files changed, 11159 insertions(+), 1427 deletions(-)
v2.6.19: 63 files changed, 3441 insertions(+), 1500 deletions(-)
v2.6.20: 58 files changed, 1290 insertions(+), 1028 deletions(-)
v2.6.21: 42 files changed, 729 insertions(+), 678 deletions(-)
v2.6.22: 85 files changed, 18989 insertions(+), 552 deletions(-)
v2.6.23: 42 files changed, 2824 insertions(+), 356 deletions(-)
v2.6.24: 208 files changed, 100960 insertions(+), 4303 deletions(-)
v2.6.25: 227 files changed, 54467 insertions(+), 23126 deletions(-)
-git: 214 files changed, 21940 insertions(+), 34143 deletions(-)
-next: 126 files changed, 13585 insertions(+), 10146 deletions(-)
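
The loop above builds one revision range per release and hands it to git diff.
A minimal sketch of just that range construction, runnable without a kernel
checkout, might look like the following. The function name release_range is
hypothetical, and the sketch only mirrors the special cases in the posted
loop (the labels "-git" and "-next" in the output above do not match what the
echo would print, so the posted output was presumably tidied by hand).

```shell
#!/bin/sh
# Print the revision range the loop diffs for release number v (12..26).
# release_range is a hypothetical helper name, not part of git.
release_range() {
    v=$1
    next="v2.6.$((v + 1))"
    # The loop substitutes branch heads for the two ranges that end in
    # not-yet-released versions.
    if [ "$v" -eq 25 ]; then
        next=linus/master
    elif [ "$v" -eq 26 ]; then
        next=linux-next/master
    fi
    echo "v2.6.$v..$next"
}

release_range 12   # prints v2.6.12..v2.6.13
release_range 25   # prints v2.6.25..linus/master
release_range 26   # prints v2.6.26..linux-next/master
```

Each printed range is what would be passed as
git diff --shortstat -M <range> drivers/net/wireless/ in a tree that has the
linus and linux-next remotes configured under those names.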

up to v2.6.24 (released only 4 months ago!) we amassed a huge backlog of
~100+ KLOC wireless changes - there were OSS wireless drivers that
havent been merged for up to 1.5 years.

v2.6.24 was no doubt a huge step in the right direction but it came too
late and we are still suffering from the fallout today as we have not
reached test cycle equilibrium yet: by the time mainline gets the
patches a new large batch comes up, invalidating much of mainline's role
and forcing distros to gamble with (much untested and thus detached from
reality) experimental branches.

That's my main point: when we mess up and dont merge OSS driver code
that was out there in time - and we messed up big time with wireless -
we should admit the screwup and swallow the bitter pill.

I.e. should merge the _full_ pipeline, open up to every developer who is
willing to help with the mess, face instability for a short while until
the dust settles and go for absolutely short turnaround for fixes and
even enhancements - because there's little QA value in the existing
code. Instead of pretending that we are "stable" (in this area of the
kernel) - with a code base that distros end up skipping over.

Have a look at Fedora's kernel-2.6.25.6-24.fc8.src.rpm (which is the
Fedora kernel Arjan referred to and which we are talking about here) to
see how this all ends up in distros in practice:

earth4:/usr/src/redhat/SOURCES> ls -ldt *wireless*

-rw-r--r-- 1 root root 4102957 2008-06-03 23:01 linux-2.6-wireless.patch

(excluding renames: 214 files changed, 21940 insertions, 34143 deletions)

-rw-r--r-- 1 root root 1663540 2008-05-29 20:46 linux-2.6-wireless-pending.patch

(excluding renames: 126 files changed, 13585 insertions, 10146 deletions)

-rw-r--r-- 1 root root 38430 2008-05-29 20:46 linux-2.6-wireless-fixups.patch

linux-2.6-wireless.patch [4 MB patch, 55KLOC flux] is what v2.6.26 will
be in a month or so, and it is already an obsolete, historic version,
compared to what Fedora ships today...

linux-2.6-wireless-pending.patch [1.6 MB patch, 23 KLOC flux] is what is
in linux-next in essence and what will go into v2.6.27. Do you think
Fedora jumped to the linux-next version of wireless because the current
(not even released) mainline version was working so well?
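
The "flux" figures quoted above are not defined in the mail. Assuming flux
means insertions plus deletions from the shortstat line, the arithmetic can be
sketched as below. The helper name shortstat_flux is hypothetical, and the
input strings are copied from the shortstat output earlier in this message.

```shell
#!/bin/sh
# Sum insertions and deletions from one `git diff --shortstat` line.
# "flux" here is this sketch's reading of the term used in the mail
# (an assumption), not a git concept.
shortstat_flux() {
    printf '%s\n' "$1" | awk '{
        total = 0
        # The count sits in the field just before "insertion(s)(+)" and
        # before "deletion(s)(-)".
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^insertion/ || $i ~ /^deletion/) total += $(i - 1)
        }
        print total
    }'
}

shortstat_flux "214 files changed, 21940 insertions(+), 34143 deletions(-)"   # 56083
shortstat_flux "126 files changed, 13585 insertions(+), 10146 deletions(-)"   # 23731
shortstat_flux "208 files changed, 100960 insertions(+), 4303 deletions(-)"   # 105263
```

Under that assumption the sums come out at about 56 KLOC and 24 KLOC for the
two Fedora patches, close to the 55 KLOC and 23 KLOC quoted, and at about
105 KLOC for v2.6.24, in line with the "~100+ KLOC" backlog mentioned earlier.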

And lets finally admit that this pain is all happening to us because we:

_didnt merge drivers soon enough_

Just about anyone who tried to use 3D and wireless on a Linux PC in the
past 3 years will attest to that, without the need for much background
research ;-)

IMO we are not learning and are repeating history once again, as the
Nouveau situation is building up towards a similar "we didnt merge it in
time" pain point. From kernel-2.6.25.6-24.fc8.src.rpm:

-rw-r--r-- 1 root root 513639 2008-05-22 04:31 nouveau-drm.patch

39 files changed, 13960 insertions(+), 5 deletions(-)

Nouveau has been started in 2006, about two years ago. It's a lot less
painful (not the least it is a lot faster as well) if such things are
developed gradually in mainline.

Ingo

2008-06-17 17:18:58

by Bob Copeland

Subject: Re: Oops report for the week preceding June 16th, 2008

On Mon, Jun 16, 2008 at 2:24 PM, Arjan van de Ven wrote:
> Only seen with tainted kernels
> ------------------------------
> Rank 1: ath_dynamic_sysctl_register (warning)
> Reported 376 times (3007 total reports)
> [external] Bug in the proprietary madwifi driver
> warning only shows up in tainted kernels
> This warning was last seen in version 2.6.25.6, and first seen in
> 2.6.24.
> More info:
> http://www.kerneloops.org/searchweek.php?search=ath_dynamic_sysctl_register

I just looked at my moldy old copy of madwifi - AFAICT this is in the non-HAL
section. We can probably fix this one by filling out newer fields in the
ctl_table, just to get it off the #1 list.

Even though they should be using ath5k anyway...

--
Bob Copeland %% http://www.bobcopeland.com

2008-06-17 17:55:47

by Greg KH

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, Jun 17, 2008 at 05:33:56PM +0200, Ingo Molnar wrote:
> IMO we are not learning and are repeating history once again, as the
> Nouveau situation is building up towards a similar "we didnt merge it in
> time" pain point. From kernel-2.6.25.6-24.fc8.src.rpm:
>
> -rw-r--r-- 1 root root 513639 2008-05-22 04:31 nouveau-drm.patch
>
> 39 files changed, 13960 insertions(+), 5 deletions(-)
>
> Nouveau has been started in 2006, about two years ago. It's a lot less
> painful (not the least it is a lot faster as well) if such things are
> developed gradually in mainline.

Not to dispute your original claim of wanting to merge drivers earlier,
but a lot of the time, there are good reasons why the code doesn't get
merged.

As recently pointed out by the nouveau driver authors on the xorg
mailing list, they don't want the driver to be added to the main
kernel.org tree yet as they feel that their userspace/kernelspace
interface is not complete and will change in the future.

We try to respect the authors of the code when not including them into
the kernel tree for situations like this :)

As for why Fedora added it, it might be because they can control both
sides of the boundary with matching packages much easier.

thanks,

greg k-h

2008-06-17 18:21:59

by Dave Jones

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, Jun 17, 2008 at 10:54:14AM -0700, Greg KH wrote:

> We try to respect the authors of the code when not including them into
> the kernel tree for situations like this :)
>
> As for why Fedora added it, it might be because they can control both
> sides of the boundry with matching packages much easier.

That's exactly it. Dave Airlie keeps both the X and kernel side of DRI
in check in Fedora, and with him being the DRI maintainer, he tends to have
a good handle on the state of things.

Nouveau has been a bit bumpy, and isn't ready for mass-use, which is why
we don't enable it by default. We ship it, but a user has to actually
install it, and set it up to explicitly use it instead of the 'nv' X driver
right now. Given it's there as a sort of 'preview' for interested parties,
I don't think the world is ending because we jumped the gun by shipping
this even though it's not upstream.

Dave

--
http://www.codemonkey.org.uk

2008-06-17 18:43:57

by Daniel Barkalow

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, 17 Jun 2008, Greg KH wrote:

> On Tue, Jun 17, 2008 at 05:33:56PM +0200, Ingo Molnar wrote:
> > IMO we are not learning and are repeating history once again, as the
> > Nouveau situation is building up towards a similar "we didnt merge it in
> > time" pain point. From kernel-2.6.25.6-24.fc8.src.rpm:
> >
> > -rw-r--r-- 1 root root 513639 2008-05-22 04:31 nouveau-drm.patch
> >
> > 39 files changed, 13960 insertions(+), 5 deletions(-)
> >
> > Nouveau has been started in 2006, about two years ago. It's a lot less
> > painful (not the least it is a lot faster as well) if such things are
> > developed gradually in mainline.
>
> Not to dispute your original claim of wanting to merge drivers earlier,
> but a lot of the time, there are good reasons why the code doesn't get
> merged.
>
> As recently pointed out by the nouveau driver authors on the xorg
> mailing list, they don't want the driver to be added to the main
> kernel.org tree yet as they feel that their userspace/kernelspace
> inteface is not complete and will change in the future.

That's the same reason the wireless drivers didn't get merged sooner, with
the slight difference that they were waiting on the new 802.11 stack's
interface to stabilize, rather than their own interface.

On the other hand, it would be good if there were a way to include
unstable APIs in the mainline kernel so that they could get some exposure
before they're set in stone, and that would also eliminate that reason for
keeping drivers out so long.

-Daniel
*This .sig left intentionally blank*

2008-06-17 19:25:17

by Johannes Berg

Subject: Re: Oops report for the week preceding June 16th, 2008


> i have no gripes about the current situation of wireless in linux-next,
> other than it all came 1-2 years too late:

Clearly, you don't have a clue about wireless. I'll admit to being
pissed off by statements like this because I personally spent a lot of
time getting wireless code into shape for merging, and it took a long
time.

If we'd have merged the existing wireless drivers 2 years ago, we would
have (at least) four 802.11 stacks in the kernel now, at least two
legally questionable drivers (the ath5k legal situation would probably
never have been cleared up, acx100 still isn't), no uniform API so it
would be impossible to write userspace support tools etc.

johannes



2008-06-17 19:32:30

by Johannes Berg

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, 2008-06-17 at 14:43 -0400, Daniel Barkalow wrote:

> That's the same reason the wireless drivers didn't get merged sooner, with
> the slight difference that they were waiting on the new 802.11 stack's
> interface to stabilize, rather than their own interface.
>
> On the other hand, it would be good if there were a way to include
> unstable APIs in the mainline kernel so that they could get some exposure
> before they're set in stone, and that would also eliminate that reason for
> keeping drivers out so long.

Small correction here: We actually did evolve the mac80211 APIs quite
radically in mainline (and new API revamps will be landing in .26
and .27).

Most drivers, however, were waiting to be written against the mac80211
API, i.e. there were drivers against net80211 or (even more of them)
drivers that had their own 802.11 stack in the driver, which meant the
driver needed to be "ported" (rewritten) to mac80211's API.

Stuff like that takes time when each driver has at best one or two
interested developers.

johannes



2008-06-17 19:48:56

by Dave Jones

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, Jun 17, 2008 at 09:24:14PM +0200, Johannes Berg wrote:
>
> > i have no gripes about the current situation of wireless in linux-next,
> > other than it all came 1-2 years too late:
>
> Clearly, you don't have a clue about wireless. I'll admit to being
> pissed off by statements like this because I personally spent a lot of
> time getting wireless code into shape for merging, and it took a long
> time.
>
> If we'd have merged the existing wireless drivers 2 years ago, we would
> have (at least) four 802.11 stacks in the kernel now, at least two
> legally questionable drivers (the ath5k legal situation would probably
> never have been cleared up, acx100 still isn't), no uniform API so it
> would be impossible to write userspace support tools etc.

FWIW, the fact that there's so much churn happening in wireless right
now is IMO, a sign of its health. When I told John "commit whatever
wireless bits you think need to be in Fedora" many months back, I admit
I wasn't expecting as much churn as there has been.

It's been something of a double edged sword. It's great that users are
getting the latest drivers & fixes, but at the same time, it means they
get exposed to all the latest breakage at the same time.
Given the volume of change occurring, cherry-picking isn't an enviable task,
so distros are stuck between this reality, or leaving users hanging until we
get to the next point release.

FWIW, wireless isn't unique in this regard. For eg, the last few months we've
always been shipping the latest ALSA bits rather than what's in kernel.org too,
for similar reasons -- when bugs appear, the developers want to know
"does it still happen with the latest bits?"

The situation isn't perfect, but I don't think it's quite as bleak
as Ingo painted it to be.

Dave

--
http://www.codemonkey.org.uk

2008-06-17 21:52:09

by David Miller

Subject: Re: Oops report for the week preceding June 16th, 2008

From: Ingo Molnar
Date: Tue, 17 Jun 2008 17:33:56 +0200

> v2.6.24 was no doubt a huge step in the right direction but it came too
> late and we are still suffering from the fallout today as we have not
> reached test cycle equilibrium yet: by the time mainline gets the
> patches a new large batch comes up, invalidating much of mainline's role
> and forcing distros to gamble with (much untested and thus detached from
> reality) experimental branches.
>
> That's my main point: when we mess up and dont merge OSS driver code
> that was out there in time - and we messed up big time with wireless -
> we should admit the screwup and swallow the bitter pill.

Your point seems to be that, even though we've acknowledged and
entirely corrected the problem now, you still will whack us over the
head and complain because it took in your opinion too long to get to
that point.

How nice. That makes the wireless folks feel great I imagine.

You also have no idea what infrastructure or other invasive wireless
subsystem changes might have been necessary to merge in some of those
drivers. Of course, that doesn't suit your goal of making the
wireless folks look like a bunch of incompetent twits, so it doesn't
surprise me that you haven't investigated any such facts.

It is impossible, therefore, to please you since we cannot change the
past. So all we can do at this point is continue doing the right
thing and completely ignore your pointless whines.

In this context your complaints are beyond unfair and beyond
pointless, therefore you're finally in my kill file now, have a nice
day Ingo.

2008-06-17 22:51:18

by Greg KH

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, Jun 17, 2008 at 02:43:02PM -0400, Daniel Barkalow wrote:
>
> On the other hand, it would be good if there were a way to include
> unstable APIs in the mainline kernel so that they could get some exposure
> before they're set in stone, and that would also eliminate that reason for
> keeping drivers out so long.

That's exactly what the documentation in Documentation/ABI is there for.
Document your "experimental" API, along with any userspace programs that
are using it, and work to try to finalize it.

thanks,

greg k-h

2008-06-18 02:41:06

by Daniel Barkalow

Subject: Re: Oops report for the week preceding June 16th, 2008

On Tue, 17 Jun 2008, Greg KH wrote:

> On Tue, Jun 17, 2008 at 02:43:02PM -0400, Daniel Barkalow wrote:
> >
> > On the other hand, it would be good if there were a way to include
> > unstable APIs in the mainline kernel so that they could get some exposure
> > before they're set in stone, and that would also eliminate that reason for
> > keeping drivers out so long.
>
> That's exactly what the documentation in Documentation/ABI is there for.
> Document your "experimental" API, along with any userspace programs that
> are using it, and work to try to finalize it.

Documentation/ABI/README doesn't list an "experimental" level of
stability. I suppose a developer with an API they expect to change could
create it as "obsolete" (since the experimental version will get removed
when the real one is done), but that's a little odd as a use of that
category. Also, that doesn't stop people from looking through sysfs for
useful stuff they expect to be undocumented but easy enough to figure out,
and starting to use it without realizing that it's not intended to be
maintained. And the "stable/syscalls" entry implies that all syscalls are
stable when they get merged, which means that a patch that adds a syscall
can't stabilize in mainline.

If there are people, like the Nouveau developers, using the instability of
their userspace API as a reason not to submit their drivers, and we would
ideally like the drivers to stabilize with mainline exposure, then we need
to do something more to address these authors' concerns.

-Daniel
*This .sig left intentionally blank*

2008-06-18 03:35:11

by Arjan van de Ven

Subject: Re: Oops report for the week preceding June 16th, 2008

Dave Jones wrote:
> On Tue, Jun 17, 2008 at 09:24:14PM +0200, Johannes Berg wrote:
> >
> > > i have no gripes about the current situation of wireless in linux-next,
> > > other than it all came 1-2 years too late:
> >
> > Clearly, you don't have a clue about wireless. I'll admit to being
> > pissed off by statements like this because I personally spent a lot of
> > time getting wireless code into shape for merging, and it took a long
> > time.
> >
> > If we'd have merged the existing wireless drivers 2 years ago, we would
> > have (at least) four 802.11 stacks in the kernel now, at least two
> > legally questionable drivers (the ath5k legal situation would probably
> > never have been cleared up, acx100 still isn't), no uniform API so it
> > would be impossible to write userspace support tools etc.
>
> FWIW, the fact that there's so much churn happening in wireless right
> now is IMO, a sign of its health.

I totally agree with that. In fact I'm quite happy with the progress.


> It's been something of a double edged sword. It's great that users are
> getting the latest drivers & fixes, but at the same time, it means they
> get exposed to all the latest breakage at the same time.
> Given the volume of change occuring, cherry-picking isn't an enviable task,
> so distros are stuck between this reality, or leaving users hanging until we
> get to the next point release.
>
> FWIW, wireless isn't unique in this regard. For eg, the last few months we've
> always been shipping the latest ALSA bits rather than what's in kernel.org too,
> for similar reasons -- when bugs appear, the developers want to know
> "does it still happen with the latest bits?"
>


this is the part that concerns me. The fact that you feel the need to use "not yet in mainline" pieces
(I'm not so much talking about backporting from 2.6.26-git to 2.6.25; that's perfectly fine, but I'm
talking about code not in 2.6.26-git) is NOT a healthy sign.... if that truly is the case then that code surely
deserves to be in mainline as well?

2008-06-18 07:18:33

by Johannes Berg

Subject: Re: Oops report for the week preceding June 16th, 2008


> > It's been something of a double edged sword. It's great that users are
> > getting the latest drivers & fixes, but at the same time, it means they
> > get exposed to all the latest breakage at the same time.
> > Given the volume of change occuring, cherry-picking isn't an enviable task,
> > so distros are stuck between this reality, or leaving users hanging until we
> > get to the next point release.
> >
> > FWIW, wireless isn't unique in this regard. For eg, the last few months we've
> > always been shipping the latest ALSA bits rather than what's in kernel.org too,
> > for similar reasons -- when bugs appear, the developers want to know
> > "does it still happen with the latest bits?"
> >
>
>
> this is the part that concerns me. The fact that you feel the need to use "not yet in mainline" pieces
> (I'm not so much talking about backporting from 2.6.26-git to 2.6.25; that's perfectly fine, but I'm
> talking about code not in 2.6.26-git) is NOT a healthy sign.... if that truely is the case then that code surely
> deserves to be in mainline as well?

That's more a case of Fedora living on the bleeding edge. The code is
fairly stable, all in linux-next, but the churn tends to be high because
of internal API changes that affect all drivers. Currently, I don't
think there is actually any _feature_ pending in linux-next, only
internal cleanups. Such cleanups are desirable, but at the same time can
lead to instability, hence being kept out of .26-git for the time being,
and are in -next for .27. Mostly because we only wrote them after .26
started.

johannes



2008-06-18 14:23:17

by Arjan van de Ven

Subject: Re: Oops report for the week preceding June 16th, 2008

Johannes Berg wrote:
>>> FWIW, wireless isn't unique in this regard. For eg, the last few months we've
>>> always been shipping the latest ALSA bits rather than what's in kernel.org too,
>>> for similar reasons -- when bugs appear, the developers want to know
>>> "does it still happen with the latest bits?"
>>>
>>
>> this is the part that concerns me. The fact that you feel the need to use "not yet in mainline" pieces
>> (I'm not so much talking about backporting from 2.6.26-git to 2.6.25; that's perfectly fine, but I'm
>> talking about code not in 2.6.26-git) is NOT a healthy sign.... if that truely is the case then that code surely
>> deserves to be in mainline as well?
>
> That's more a case of Fedora living on the bleeding edge. The code is
> fairly stable, all in linux-next, but the churn tends to be high because
> of internal API changes that affect all drivers. Currently, I don't
> think there is actually any _feature_ pending in linux-next, only
> internal cleanups. Such cleanups are desirable, but at the same time can
> lead to instability, hence being kept out of .26-git for the time being,
> and are in -next for .27. Mostly because we only wrote them after .26
> started.
>

My concern is that if there's something technological in the "bleeding tree" that is so valuable to users
that distros feel that it's ready "enough" and that they need to pick it up for their users, we have a flaw
in our processes in moving too slowly for users. From what you described that's not the case for wireless
(more a case of Fedora jumping off the bridge while forgetting to tie down the bungee cord ;-), and
that's good. I hope the same applies for the ALSA parts....

2008-06-19 00:22:24

by Ingo Molnar

Subject: Re: Oops report for the week preceding June 16th, 2008


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 17 Jun 2008 17:33:56 +0200
>
> > v2.6.24 was no doubt a huge step in the right direction but it came
> > too late and we are still suffering from the fallout today as we
> > have not reached test cycle equilibrium yet: by the time mainline
> > gets the patches a new large batch comes up, invalidating much of
> > mainline's role and forcing distros to gamble with (much untested
> > and thus detached from reality) experimental branches.
> >
> > That's my main point: when we mess up and dont merge OSS driver code
> > that was out there in time - and we messed up big time with wireless
> > - we should admit the screwup and swallow the bitter pill.
>
> Your point seems to be that, even though we've acknowledged and
> entirely corrected the problem now, you still will whack us over the
> head and complain because it took in your opinion too long to get to
> that point.

from the discussion it was not at all clear to me that you appear to
agree with me - all i saw really was that you tried to ridicule my
position.

> How nice. That makes the wireless folks feel great I imagine.

my only worry was about the current situation, which, according to
kerneloops.org, with 17442 oopses reported against v2.6.25, isnt
anything to feel too great about. (And that's not limited to wireless in
any way - there is a rather prominent tick_broadcast_oneshot_control()
soft lockup entry as well that we are trying to figure out.)

It will all get better i'm sure - we now finally have objective
visibility of bugs as they happen to users.

Ingo

2008-06-20 06:01:49

by Len Brown

Subject: Re: Oops report for the week preceding June 16th, 2008



> my only worry was about the current situation, which, according to
> kerneloops.org, with 17442 oopses reported against v2.6.25, isnt
> anything to feel too great about. (And that's not limited to wireless in
> any way - there is a rather prominent tick_broadcast_oneshot_control()
> soft lockup entry as well that we are trying to figure out.)
>
> It will all get better i'm sure - we now finally have objective
> visibility of bugs as they happen to users.

kerneloops.org is indeed a wonderful thing (kudos to Arjan, once again!).

Note, however, that the large number of recent reports isn't necessarily a
fair comparison with numbers for previous releases, because the number of
clients reporting to kerneloops.org is not constant. (e.g. the client was
included with Fedora Core 9, but not with Fedora Core 8)

cheers,
-Len

2008-06-23 17:17:16

by John W. Linville

Subject: Re: Oops report for the week preceding June 16th, 2008

Arjan van de Ven wrote:
> Johannes Berg wrote:
> > That's more a case of Fedora living on the bleeding edge. The code is
> > fairly stable, all in linux-next, but the churn tends to be high because
> > of internal API changes that affect all drivers. Currently, I don't
> > think there is actually any _feature_ pending in linux-next, only
> > internal cleanups. Such cleanups are desirable, but at the same time can
> > lead to instability, hence being kept out of .26-git for the time being,
> > and are in -next for .27. Mostly because we only wrote them after .26
> > started.
> >
>
> My concern is that if there's something technological in the "bleeding tree" that is
> so valuable to users that distros feel that it's ready "enough" and that they need to
> pick it up for their users, we have a flaw in our processes in moving too slowly for
> users. From what you described that's not the case for wireless (more a case of
> Fedora jumping off the bridge while forgetting to tie down the bungee cord ;-), and
> that's good. I hope the same applies for the ALSA parts....

My first question is "how do you guys know to start these discussions
when I go on vacation?" :-) I was out of town last week and missed
this little eruption, so forgive my late reply. Given my pertinent
role in the topic, I thought I should still remark.

I would remind anyone that at one time there was still a lot of pressure
to _not_ merge the new wireless bits upstream. The reasons for that
pressure were mentioned elsewhere in this thread -- mostly fear of
introducing new/unstable userland ABI as well as general concerns
about the design/implementation of what is now the mac80211 component.
My own lack of experience as a maintainer contributed here, as I
was often uncertain about how to get things moving along sooner.
Thankfully my experience dealing with these maintenance issues
has increased. Moreover the external pressure against merging has
subsided due both to some technical resolutions and also perhaps to
a shift in attitude about what is mergeable upstream. I don't think
there is any remaining logjam with regard to upstream wireless merges.

The practice of pushing cutting-edge wireless stuff into Fedora
started as a means of getting testers. Once it was in Rawhide, it
never made sense to yank it away from users. So, I have continued
the process of merging what is now known as -next wireless bits into
Fedora. This is at least partly because I haven't figured out how to
gracefully stop doing that. :-) In fact, in Fedora 9 I have started
to stage those bits more slowly between -next, Rawhide, F9, and F8.
FWIW, I think this staging (rather than pushing new -next stuff into
F{10,9,8} more-or-less immediately) may have created the window for
releasing the bad Fedora kernels that plagued kerneloops.org last week.

Anyway, the wireless bits in Fedora are all on their way upstream.
The ones that aren't in linux-2.6 are only missing due to the
"bugfixes only after -rc" policy, not some systemic refusal to merge.
Given the current process, it would be impossible to get them upstream
any faster. In fact, getting exposure in Fedora gives us an early jump
in _avoiding_ upstream regressions when these bits get into 2.6.27-rc1.
The fact that it _usually_ makes things better for Fedora wireless
users is just gravy. :-)

Thanks,

John
--
John W. Linville
[email protected]