2007-01-22 22:15:41

by Chuck Ebbert

Subject: 2.6.18-stable release plans?

Is there going to be another 2.6.18-stable release?


2007-01-23 00:24:00

by Jesper Juhl

Subject: Re: 2.6.18-stable release plans?

On 22/01/07, Chuck Ebbert <[email protected]> wrote:
> Is there going to be another 2.6.18-stable release?

Now that 2.6.19 is out, most likely not. -stable releases are made
for the latest stable 2.6.x kernel, once 2.6.x+1 is out that's the one
-stable patches are made for (2.6.16 is an exception).


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2007-01-23 13:38:59

by Sunil Naidu

Subject: Re: [Proposal] 2.6.18-stable release plans?

On 1/23/07, Jesper Juhl <[email protected]> wrote:
>
> Now that 2.6.19 is out, most likely not. -stable releases are made
> for the latest stable 2.6.x kernel, once 2.6.x+1 is out that's the one
> -stable patches are made for (2.6.16 is an exception).

Earlier I was going through the stable patches which apply to these
kernel versions:

http://kernelnewbies.org/Linux_2_6_19
http://kernelnewbies.org/Linux_2_6_18
http://kernelnewbies.org/Linux_2_6_16

I have to dig deep into the patch & kernel version to understand what
are the features/implementations or fixes (patch). Problem here is 2
ways:-

1) Identifying which is a better kernel (features) for
Desktop/Embedded/Server (I know, info mentioned in Changelog. But
still...)

2) Easily knowing which stable patch applies for which arch or sub
system (better way)

Proposal: Can maintainers implement this for better understanding of
the Linux community (picking the best kernel or patch for the
requirement) by adding a column on http://www.kernel.org home page?

Examples:

The latest stable version of the Linux kernel is: 2.6.19.2
*networking subsystem* 2007-01-10 19:11 UTC F V VI C Changelog

The latest prepatch for the stable Linux kernel tree is: 2.6.20-rc5
*x86 32/64* 2007-01-12 19:29 UTC B V VI C Changelog

> --
> Jesper Juhl <[email protected]>

Thanks,

Akula2

2007-01-23 17:41:32

by Stefan Richter

Subject: Re: [Proposal] 2.6.18-stable release plans?

Sunil Naidu wrote:
> I have to dig deep into the patch & kernel version to understand what
> are the features/implementations or fixes (patch). Problem here is 2
> ways:-
>
> 1) Identifying which is a better kernel (features) for
> Desktop/Embedded/Server (I know, info mentioned in Changelog. But
> still...)
>
> 2) Easily knowing which stable patch applies for which arch or sub
> system (better way)
>
> Proposal: Can maintainers implement this for better understanding of
> the Linux community (picking the best kernel or patch for the
> requirement) by adding a column on http://www.kernel.org home page?
...

This would be hard to organize and support. There are news sites like
LWN which give outlines of important kernel changes, and there are
mailinglists or community sites for architectures or driver subsystems
if you are interested in special platforms or drivers, and there is the
git repository metadata (via gitweb or directly from a locally cloned
git repo).

I for one am actually posting release notes to a users' mailinglist of
the drivers I'm interested in. But I can do this only because these are
drivers with very low rate of fixes or feature additions.

It's also not only a question of who writes such release notes, but also
of who the intended audience is. How fine-grained should the release
notes be?

Anyway --- if in doubt, your distributor's current kernel is the best one.
--
Stefan Richter
-=====-=-=== ---= =-===
http://arcgraph.de/sr/

2007-01-23 19:25:40

by Sunil Naidu

Subject: Re: [Proposal] 2.6.18-stable release plans?

On 1/23/07, Stefan Richter <[email protected]> wrote:
>
> This would be hard to organize and support. There are news sites like
> LWN which give outlines of important kernel changes, and there are
> mailinglists or community sites for architectures or driver subsystems
> if you are interested in special platforms or drivers, and there is the
> git repository metadata (via gitweb or directly from a locally cloned
> git repo).

That's what the problem is all about. I don't have problems finding info about
a kernel feature or about a driver patch. But even for me it takes
some time to figure it out. (I don't work only with Linux......do work
with Mac OS, and Solaris too. But less of Windows!)

I'm talking from the Linux user's point of view (who needs to know deep
about the kernel development, etc.). Imagine a student or a
professional who wants to build a kernel with a patch for an i686
machine. Can't we simplify this at http://www.kernel.org home page instead of
the user being lost searching on Google or wherever?

> It's also not only a question of who writes such release notes, but also
> of who the intended audience is. How fine-grained should the release
> notes be?

I would like to contribute here, in any way. Fine graining might be a
problem. But we can give info in general? Example, if patch-2.6.19.3
comes out - simply we can say as *x86 32/64* based on the weightage of
fixes in that patch. What do you say?

> Anyway --- if in doubt, your distributor's current kernel is the best one.

I never depend on a distributor's kernel, especially Fedora has a
Generic one with many things enabled into the kernel. The moment I
install a distro....my hands would be itchy to build my own kernel
which is very specific to my H/W. No less or more. In this way I can
live with my best made kernel ;-)


[OT] Some posts I send do not appear on LKML. Even this one, I
couldn't find on list :( What might be the problem?

> Stefan Richter

Thanks,

~Akula2

2007-01-23 20:07:35

by Stefan Richter

Subject: Re: [Proposal] 2.6.18-stable release plans?

Sunil Naidu wrote:
...
> I'm talking from the Linux user's point of view (who needs to know deep
> about the kernel development, etc.). Imagine a student or a
> professional who wants to build a kernel with a patch for an i686
> machine.

You speak of advanced users with very special requirements. They can
watch respective mailinglists, git trees, bugzillas, ... The file
MAINTAINERS contains some pointers for them.

...
> But we can give info in general? Example, if patch-2.6.19.3
> comes out - simply we can say as *x86 32/64* based on the weightage of
> fixes in that patch. What do you say?

This cannot be measured objectively.
--
Stefan Richter
-=====-=-=== ---= =-===
http://arcgraph.de/sr/

2007-01-23 20:35:33

by Chuck Ebbert

Subject: Re: 2.6.18-stable release plans?

Jesper Juhl wrote:
> On 22/01/07, Chuck Ebbert <[email protected]> wrote:
>> Is there going to be another 2.6.18-stable release?
>>
>
> Now that 2.6.19 is out, most likely not. -stable releases are made
> for the latest stable 2.6.x kernel, once 2.6.x+1 is out that's the one
> -stable patches are made for (2.6.16 is an exception).
>
Great... just as 2.6.18 approaches actual stability.

Adrian, how much longer are you going to support 2.6.16? Would you consider
moving to 2.6.18 any time soon?

2007-01-23 20:55:57

by Adrian Bunk

Subject: Re: 2.6.18-stable release plans?

On Tue, Jan 23, 2007 at 03:33:48PM -0500, Chuck Ebbert wrote:
> Jesper Juhl wrote:
> > On 22/01/07, Chuck Ebbert <[email protected]> wrote:
> >> Is there going to be another 2.6.18-stable release?
> >>
> >
> > Now that 2.6.19 is out, most likely not. -stable releases are made
> > for the latest stable 2.6.x kernel, once 2.6.x+1 is out that's the one
> > -stable patches are made for (2.6.16 is an exception).
> >
> Great... just as 2.6.18 approaches actual stability.
>
> Adrian, how much longer are you going to support 2.6.16? Would you consider
> moving to 2.6.18 any time soon?

I'll continue to maintain 2.6.16 - moving to 2.6.18 or any other kernel
would defeat the purpose of what I am doing.

I haven't yet decided whether I'll create other stable kernel branches,
but if this happens it won't be 2.6.18, more like 2.6.25 or 2.6.30.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2007-01-24 04:50:48

by Daniel Barkalow

Subject: Re: 2.6.18-stable release plans?

On Tue, 23 Jan 2007, Jesper Juhl wrote:

> Now that 2.6.19 is out, most likely not. -stable releases are made
> for the latest stable 2.6.x kernel, once 2.6.x+1 is out that's the one
> -stable patches are made for (2.6.16 is an exception).

There's generally a bit of overlap. 2.6.17.14 was about the same time as
2.6.18.1, and 2.6.18.6 was after 2.6.19.1. But 2.6.18.x must be over now,
because the -stable team didn't release a 2.6.18.7 to match 2.6.19.2, and
all of 2.6.x except for 2.6.19.2 has that weird file corruption bug
(although rarely triggered).

-Daniel
*This .sig left intentionally blank*

2007-01-24 13:30:39

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

> But 2.6.18.x must be over now, because the -stable team didn't release a 2.6.18.7 to match
> 2.6.19.2, and all of 2.6.x except for 2.6.19.2 has that weird file corruption bug.

Personally, I dumped 2.6.19.x like a hot coal as soon as I tripped over this bug:

http://bugzilla.kernel.org/show_bug.cgi?id=7707

It didn't take much to trigger it, either. But the silence has been deafening.

Cheers,
Chris







2007-01-24 14:37:12

by Hugh Dickins

Subject: Re: 2.6.18-stable release plans?

On Wed, 24 Jan 2007, Chris Rankin wrote:
>
> Personally, I dumped 2.6.19.x like a hot coal as soon as I tripped over this bug:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=7707
>
> It didn't take much to trigger it, either. But the silence has been deafening.

Oh, the page_remove_rmap BUG, page_mapcount negative.

Sorry for the deafening silence, see I was CC'ed but dropped the ball.

That's surely no reason to dump 2.6.19.x, you'll find the occasional
such report on every(?) release since page mapcount went into 2.6.7.

Oftentimes it's bad RAM (try memtest86), sometimes it's a bad driver
(probably the case for the tainted P report appended to your untainted
one), sometimes it's unidentified memory corruption. Not once (except
during experimental patch testing) has it been proved due to an actual
VM problem.

Hugh

2007-01-24 15:06:25

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

> That's surely no reason to dump 2.6.19.x, you'll find the occasional
> such report on every(?) release since page mapcount went into 2.6.7.

This was the only time I've seen it, before or since.

> Oftentimes it's bad RAM (try memtest86)

There is nothing wrong with my RAM. I tested it quite extensively when I upgraded to 2 GB.

> sometimes it's a bad driver (probably the case for the tainted P report appended
> to your untainted one), sometimes it's unidentified memory corruption.

But MY kernel is clearly untainted. So what other explanation is there apart from a kernel bug?

Cheers,
Chris





2007-01-24 15:40:15

by Hugh Dickins

Subject: Re: 2.6.18-stable release plans?

On Wed, 24 Jan 2007, Chris Rankin wrote:
>
> But MY kernel is clearly untainted.
> So what other explanation is there apart from a kernel bug?

If it's me you're asking: I don't know (overheating, cosmic rays, ...)

Hugh

2007-01-24 15:53:22

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

> > But MY kernel is clearly untainted.
> > So what other explanation is there apart from a kernel bug?

> If it's me you're asking: I don't know (overheating, cosmic rays, ...)

I suppose what I'm *really* asking is what the basis is for assuming that this *isn't* a kernel
bug and can therefore be safely ignored, seeing as the oops is real, the hardware is fine and the
kernel is untainted? That seems to cover the bases from where I'm sitting.

Cheers,
Chris

P.S. No micro-heatwaves have occurred here, either. Do we all need to install muon detectors in
our homes before reporting bugs now, so that we can exclude cosmic ray events too?




2007-01-24 16:12:34

by Hugh Dickins

Subject: Re: 2.6.18-stable release plans?

On Wed, 24 Jan 2007, Chris Rankin wrote:
>
> I suppose what I'm *really* asking is what the basis is for assuming that this *isn't* a kernel
> bug and can therefore be safely ignored, seeing as the oops is real, the hardware is fine and the
> kernel is untainted? That seems to cover the bases from where I'm sitting.

All I'm claiming is that it's no more a reason to avoid 2.6.19*
than to avoid any other release (the kernels before 2.6.7 happened
to have no such check, but that doesn't imply they were any safer).

It may indeed be due to a kernel bug, but I can't tell you where: sorry.

Hugh

2007-01-24 16:28:50

by Mark D Rustad

Subject: Re: 2.6.18-stable release plans?

On Jan 24, 2007, at 9:53 AM, Chris Rankin wrote:

>>> But MY kernel is clearly untainted.
>>> So what other explanation is there apart from a kernel bug?
>
>> If it's me you're asking: I don't know (overheating, cosmic
>> rays, ...)
>
> I suppose what I'm *really* asking is what the basis is for
> assuming that this *isn't* a kernel
> bug and can therefore be safely ignored, seeing as the oops is
> real, the hardware is fine and the
> kernel is untainted? That seems to cover the bases from where I'm
> sitting.
>
> Cheers,
> Chris
>
> P.S. No micro-heatwaves have occurred here, either. Do we all need
> to install muon detectors in
> our homes before reporting bugs now, so that we can exclude cosmic
> ray events too?

Well, do you have ECC memory? If not, it is at least possible that
the solar flares that occurred last month may have affected your
system. There were three X-class solar flares in the month of
December. Even if you have ECC memory, it is still possible to suffer
data corruption because many BIOSes do not turn on bus parity error
detection. And are the memories in your disk drive controllers ECC or
parity protected? I wouldn't bet on it...

All too often, modern PCs are data corruption accidents waiting to
happen.

--
Mark Rustad, [email protected]


2007-01-24 17:33:44

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Hugh Dickins <[email protected]> wrote:
> All I'm claiming is that it's no more a reason to avoid 2.6.19*
> than to avoid any other release (the kernels before 2.6.7 happened
> to have no such check, but that doesn't imply they were any safer).

There is *one* reason to avoid 2.6.19.x: it has actually bitten me, while none of the others has.
And if I can't trust a kernel to compile something then what good is it?

Cheers,
Chris







2007-01-24 22:37:24

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Mark Rustad <[email protected]> wrote:
> Well, do you have ECC memory? If not, it is at least possible that
> the solar flares that occurred last month may have affected your
> system.

I am going to assume that you are being facetious, because it would be the rarified pinnacle of
supreme arrogance to suggest that a cosmic ray event is a more likely explanation than a bug in
the kernel.

Cheers,
Chris





2007-01-24 23:00:27

by Alan

Subject: Re: 2.6.18-stable release plans?

> I am going to assume that you are being facetious, because it would be the rarified pinnacle of
> supreme arrogance to suggest that a cosmic ray event is a more likely explanation than a bug in
> the kernel.

A one off non repeatable error experienced by two people out of the
millions using it does fit the cosmic ray description quite well. That's
not to say there isn't a bug, but you don't have enough data to even
begin debugging it unless it's rather more reproducible.

Alan

2007-01-24 23:05:49

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Alan <[email protected]> wrote:
> A one off non repeatable error experienced by two people out of the
> millions using it does fit the cosmic ray description quite well.

Actually it's "unrepeated", not "non repeatable". And that's because I switched back to 2.6.18.x
immediately since I no longer trusted 2.6.19.x.

Cheers,
Chris







2007-01-24 23:33:05

by Mark D Rustad

Subject: Re: 2.6.18-stable release plans?

On Jan 24, 2007, at 5:11 PM, Alan wrote:

>> I am going to assume that you are being facetious, because it
>> would be the rarified pinnacle of
>> supreme arrogance to suggest that a cosmic ray event is a more
>> likely explanation than a bug in
>> the kernel.
>
> A one off non repeatable error experienced by two people out of the
> millions using it does fit the cosmic ray description quite well.
> That's
> not to say there isn't a bug, but you don't have enough data to even
> begin debugging it unless it's rather more reproducible.

Exactly. Halting use of a version of the kernel based on a single
incident provides no insight to the source of the problem. It could
be anything...

--
Mark Rustad, [email protected]


2007-01-24 23:45:59

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Mark Rustad <[email protected]> wrote:
> Exactly. Halting use of a version of the kernel based on a single
> incident provides no insight to the source of the problem. It could
> be anything...

There is a world of difference between a polite request for more information (although I gave you
everything I had), and fobbing someone off with a story about cosmic rays.

Cheers,
Chris





2007-01-25 01:00:29

by Ken Moffat

Subject: Re: 2.6.18-stable release plans?

On Wed, Jan 24, 2007 at 11:45:57PM +0000, Chris Rankin wrote:
>
> There is a world of difference between a polite request for more information (although I gave you
> everything I had), and fobbing someone off with a story about cosmic rays.
>
Chris,

I doubt there was a single version of the kernel which ever worked
well for all its users. In a production environment, reverting to an
older version may be the best short-term answer, but if nobody
recognises the problem you won't get any closer to a proper fix. At
the moment, you have a problem that nobody recognises. If you're not
willing to test if the problem happens repeatably, (you appear to
have had one failure and immediately reverted to an old kernel), who
do you think will be able to fix it? And if it turns out it doesn't
fail repeatably, maybe the responses you've received could be
correct.

The stable team are only there to maintain the current release of
the kernel. There is no maintenance of earlier releases (except
Adrian's work on 2.6.16), other than what a distro chooses to do to
backport fixes. Of course, if the problem _is_ reproducible on your
machine and config, you might get asked to try to identify when it
got introduced (e.g. was 2.6.19 itself, or an arbitrary 2.6.19-rc,
ok?).

Ken
--
das eine Mal als Tragödie, das andere Mal als Farce

2007-01-25 03:05:14

by Mark D Rustad

Subject: Re: 2.6.18-stable release plans?

On Jan 24, 2007, at 5:45 PM, Chris Rankin wrote:
>> --- Mark Rustad <[email protected]> wrote:
>> Exactly. Halting use of a version of the kernel based on a single
>> incident provides no insight to the source of the problem. It could
>> be anything...
>
> There is a world of difference between a polite request for more
> information (although I gave you
> everything I had), and fobbing someone off with a story about
> cosmic rays.

I'm sorry. I didn't mean to imply anything like that. I just happened
to notice that the date of the bug report appeared to correlate
pretty well with one of the solar flare events last month. I was
really trying to share some information that just conceivably might
have been related, based on the earlier messages in this thread
regarding memory errors.

I don't normally follow solar activity. I have been looking into some
system failures that happened last month. The systems had been
running with all bus error detection enabled, the hardware set to
spontaneously reboot on any uncorrectable error. Since our systems
are redundant, performing a reset simply means that the redundant
partner will take over, so the reset is the best way to be certain
that there is no data corruption. I eventually recalled a radio
report last month about a coronal mass ejection on the sun and how
things might be disrupted here. I checked out http://www.spaceweather.com
and found that December was a very active month, with three separate
X-class flares. I have no way to conclude that the failures that I
have seen were influenced by events on the sun, but it seems
possible. Compared to our systems, most PCs and even many server-class
hardware systems are likely to corrupt a bit and just keep on going.

We'll never know if any of these things were correlated with the
solar flares because they all seem to be one-off failures. I do find
it interesting though. Our systems seem to be doing statistically
better this month. What do you think?

--
Mark Rustad, [email protected]



2007-01-25 08:51:22

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Mark Rustad <[email protected]> wrote:
> We'll never know if any of these things were correlated with the
> solar flares because they all seem to be one-off failures. I do find
> it interesting though. Our systems seem to be doing statistically
> better this month. What do you think?

Personally, I think it's all a bit moot unless you also have particle detectors above and below
all your machines so that you can interpolate particle tracks. I certainly see no reason why a
random high-energy particle passing through two different machines is ever likely to cause the
same error, unless that error is an outright systems crash. (Although that second report was from
someone with a tainted kernel, which makes it suspect.)

Cheers,
Chris





2007-01-25 09:16:08

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Ken Moffat <[email protected]> wrote:
> At the moment, you have a problem that nobody recognises. If you're not
> willing to test if the problem happens repeatably, (you appear to
> have had one failure and immediately reverted to an old kernel), who
> do you think will be able to fix it?

This bug seems to be in the kernel's "memory management", and the last memory-related bug I had
(caused by a bad DIMM on another machine) caused creeping filesystem corruption. However, this
machine is my main desktop, and so I am keen to keep the filesystems intact. So yes, that involves
not running a kernel that has shown itself to be unreliable.

I was hoping that someone with a deeper knowledge of the differences between 2.6.18 and 2.6.19
would have an idea of what might have triggered this problem, and yes, I was also thinking that
some more people would trip over it and help debug it.

But anyway - can someone please tell me what "Eeek! page_mapcount(page) went negative! (-1)" is
*really* saying/implying? Because I am currently translating this as "I WANT TO EAT YOUR
FILESYSTEMS".

Cheers,
Chris







2007-01-25 19:36:53

by Ken Moffat

Subject: Re: 2.6.18-stable release plans?

On Thu, Jan 25, 2007 at 09:16:04AM +0000, Chris Rankin wrote:
>
> But anyway - can someone please tell me what "Eeek! page_mapcount(page) went negative! (-1)" is
> *really* saying/implying? Because I am currently translating this as "I WANT TO EAT YOUR
> FILESYSTEMS".
>
I can't, but Dave Jones had a similar problem earlier this month,
archived at http://uwsg.iu.edu/hypermail/linux/kernel/0701.0/1822.html
which I think is a followup from
http://www.mail-archive.com/[email protected]/msg105370.html
- and seems to be a possible hardware failure (bulging capacitors)
becoming apparent under load.

Ken
--
das eine Mal als Tragödie, das andere Mal als Farce

2007-01-25 21:16:59

by Matt Mackall

Subject: Re: 2.6.18-stable release plans?

On Wed, Jan 24, 2007 at 11:11:53PM +0000, Alan wrote:
> > I am going to assume that you are being facetious, because it would be the rarified pinnacle of
> > supreme arrogance to suggest that a cosmic ray event is a more likely explanation than a bug in
> > the kernel.
>
> A one off non repeatable error experienced by two people out of the
> millions using it does fit the cosmic ray description quite well. That's
> not to say there isn't a bug, but you don't have enough data to even
> begin debugging it unless it's rather more reproducible.

The soft error rate (cosmic rays, alpha decay, etc.) for modern memory
at sea level is estimated to be somewhere around 1000 - 5000
FIT/Mbit[1]. FIT is Failures in Time - errors per billion hours of
use. If you've got 1GB of memory, you've got 8000Mbits. So you'd
expect 8M - 40M errors per billion hours on your machine. Or 8 to 40
errors per 1000 hours. That's about one single-bit error per week to
one per day.
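
The arithmetic above can be checked with a minimal sketch. It assumes the 1000 - 5000 FIT/Mbit range and the rounding of 1GB to 8000 Mbit used in this message; the function names are illustrative and not taken from any library.

```python
# FIT = failures per billion (1e9) device-hours.
HOURS_PER_FIT_WINDOW = 1e9


def errors_per_hour(fit_per_mbit, memory_mbit):
    """Expected soft errors per hour for a FIT/Mbit rate and a memory size in Mbit."""
    return fit_per_mbit * memory_mbit / HOURS_PER_FIT_WINDOW


def hours_between_errors(fit_per_mbit, memory_mbit):
    """Mean time between soft errors, in hours."""
    return 1.0 / errors_per_hour(fit_per_mbit, memory_mbit)


MEMORY_MBIT = 8000  # 1GB, rounded as in the message above

for fit in (1000, 5000):
    per_1000h = errors_per_hour(fit, MEMORY_MBIT) * 1000
    days = hours_between_errors(fit, MEMORY_MBIT) / 24
    print(f"{fit} FIT/Mbit: {per_1000h:.0f} errors per 1000 h, "
          f"one every {days:.1f} days")
```

At the low end this gives one error roughly every five days, and at the high end roughly one per day, which is where the "one per week to one per day" estimate comes from.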

Yes, that's a lot. Can it really be that high? Big supercomputer
installations actually measure it in errors per day or hour.

Most of these errors will go completely unnoticed because they happen
in data structures that aren't revisited (stale cache, unused code,
empty memory). The remainder will often look like random disk read or
write errors or random application bugs/crashes. Sound familiar? That's why
people buy ECC memory.

Now if we say that 10% of that 1GB of RAM (~100MB) is kernel code/data
(not including page cache) and that, say, 1-10% of errors trigger
BUG/WARN code, we'll see these bug messages once every 100 days to
once every 1000 weeks (per GB per user).
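
As a rough check of that estimate, here is a small sketch that starts from the 8 to 40 errors per 1000 hours quoted earlier in this message and applies the two assumed fractions (10% of RAM holding kernel code/data, 1-10% of errors hitting BUG/WARN code). The function name is hypothetical and the fractions are the guesses stated above, not measurements.

```python
def hours_between_bug_messages(errors_per_1000h, kernel_fraction, trigger_fraction):
    """Mean hours between soft errors that land in kernel memory AND trip a check."""
    visible_rate_per_hour = (errors_per_1000h / 1000) * kernel_fraction * trigger_fraction
    return 1.0 / visible_rate_per_hour


# Rarest case: lowest error rate, only 1% of kernel-memory errors are noticed.
rarest_hours = hours_between_bug_messages(8, 0.10, 0.01)
# Most frequent case: highest error rate, 10% of kernel-memory errors are noticed.
frequent_hours = hours_between_bug_messages(40, 0.10, 0.10)

print(f"rarest: about {rarest_hours / (24 * 7):.0f} weeks between messages")
print(f"most frequent: about {frequent_hours / 24:.0f} days between messages")
```

The two extremes come out at roughly 100 days and several hundred weeks, the same order of magnitude as the "100 days to 1000 weeks" range given above.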

As for the relative error rate vs kernel bugs - there is no shortage
of Linux boxes with trouble-free uptimes much longer than the 100 days
above.

So yes, if a user reports a bug that's attributable to a single bit
memory error that's otherwise unreproduced and unexplained, it's
totally reasonable to chalk it up to cosmic rays until some sort of
pattern of reports emerges.

As for your particular bug:

Eeek! page_mapcount(page) went negative! (-1)
page->flags = 14
page->count = 0
page->mapping = 00000000

This check occurs whenever the last mapping is removed from a page.
It's a very heavily used piece of code. The check is there as
sanity-checking from when this logic was introduced. If there were a
new bug here that could be triggered by gcc or telnet, odds are very
good that it would trigger for TONS of people.

So more likely theories are: a) pointer scribble from something
completely unrelated or b) cosmic rays. As the nearby data (flags,
count, mapping) doesn't appear to be scribbled on, (a) looks less
promising.

[1] http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf

--
Mathematics is the supreme nostalgia of our time.

2007-01-25 23:26:59

by Alistair John Strachan

Subject: Re: 2.6.18-stable release plans?

On Thursday 25 January 2007 09:16, Chris Rankin wrote:
> But anyway - can someone please tell me what "Eeek! page_mapcount(page)
> went negative! (-1)" is *really* saying/implying? Because I am currently
> translating this as "I WANT TO EAT YOUR FILESYSTEMS".

Hugh already did, multiple times. If there's an external hardware event that
corrupts memory, code executing on your CPU is no longer going to behave
deterministically. So cases that are typically "impossible" in the design of
the code have a chance to trigger.
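
To illustrate the point, here is a toy model, not the kernel's actual code: a counter of mappings with a sanity check on removal, and a simulated single-bit flip. The class and function names are invented for this sketch.

```python
class Page:
    """Toy stand-in for a page that tracks how many mappings point at it."""

    def __init__(self):
        self.mapcount = 0


def add_mapping(page):
    page.mapcount += 1


def remove_mapping(page):
    """Drop one mapping. Returns False if the sanity check fires."""
    page.mapcount -= 1
    if page.mapcount < 0:
        print(f"sanity check: mapcount went negative ({page.mapcount})")
        return False
    return True


def flip_bit(value, bit):
    """Simulate a soft error by toggling one bit of an integer."""
    return value ^ (1 << bit)


# Healthy case: every add is matched by exactly one remove.
healthy = Page()
add_mapping(healthy)
healthy_ok = remove_mapping(healthy)

# Corrupted case: the bookkeeping code is identical, but one bit of the
# counter is flipped between the add and the remove.
corrupted = Page()
add_mapping(corrupted)
corrupted.mapcount = flip_bit(corrupted.mapcount, 0)  # 1 becomes 0
corrupted_ok = remove_mapping(corrupted)

print("healthy:", healthy_ok, "corrupted:", corrupted_ok)
```

The code path is the same in both cases; only the corrupted counter makes the "impossible" negative value reachable.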

You can continue to flame 2.6.19, but you're an extreme minority when it comes
to this kind of bug and as, again, Hugh already said, almost all of the
reports of this and similar other bugs have led to hardware problems that
were either unchecked or difficult to detect.

Imagine this scenario. It might seem unrealistic to you, but it's not
impossible!

First Use of Linux -> Upgrading to 2.6.19
Undetected hardware error never triggered.

Running 2.6.19
Hardware error triggers. Linux crashes.

Going back to 2.6.18
Hardware error has not yet triggered again.

Will it eat your filesystem? Maybe. But it probably won't. If, as you claim, the
memory is tested, it could have been a single bit error, or a cosmic ray
event, or a brownout, or anything similar. It's much more likely to simply
crash your machine, as it did.

Not running the affected kernel again is a sure way to have _nobody_ listen to
your complaints about 2.6.19 having a real software bug, because you're
totally unwilling to test the kernel again and see if it triggers. A single
report is simply not enough evidence.

Additionally, reports from other users (who may have a million different
experimental variables involved) are also insufficient, for reasons which
have already been explained (drivers, proprietary code, et cetera).

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

2007-01-26 13:02:15

by Chris Rankin

Subject: Re: 2.6.18-stable release plans?

--- Ken Moffat <[email protected]> wrote:
> I can't, but Dave Jones had a similar problem earlier this month,
> archived at http://uwsg.iu.edu/hypermail/linux/kernel/0701.0/1822.html
> which I think is a followup from
> http://www.mail-archive.com/[email protected]/msg105370.html
> - and seems to be a possible hardware failure (bulging capacitors)
> becoming apparent under load.

Interesting, although I don't believe I have a hardware fault. The box is perfectly stable under
2.6.18.x. Anyway, my particular problem happened very quickly last time so I am hoping that
recompiling xine from scratch will trigger something again. This time, however, I have built a
2.6.19.x kernel with a few memory debugging options turned on.

Cheers,
Chris




2007-02-02 04:02:10

by Valdis Klētnieks

Subject: Re: 2.6.18-stable release plans?

On Wed, 24 Jan 2007 22:37:20 GMT, Chris Rankin said:
> --- Mark Rustad <[email protected]> wrote:
> > Well, do you have ECC memory? If not, it is at least possible that
> > the solar flares that occurred last month may have affected your
> > system.
>
> I am going to assume that you are being facetious, because it would be the rarified pinnacle of
> supreme arrogance to suggest that a cosmic ray event is a more likely explanation than a bug in
> the kernel.

Sorry for the late reply, but cosmic ray events (actually, self-induced alpha
particle events from decays within the chipset itself) *are* a likely
explanation:

http://stason.org/TULARC/pc/pc_hardware_faq/2_20_What_does_parity_ECC_memory_protect_the_system_from.html

Most important take-away here:

"With 100 million computers in use today, we should expect roughly 6 million
single bit errors per year. Computer hardware and software companies must
receive thousands of "side effect" bug reports and support calls due to memory
errors alone. The costs of NOT including parity memory must be huge!"
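
For scale, a short sketch of what the quoted figures imply per machine, taking the 100 million computers and 6 million errors per year at face value; the function name is illustrative.

```python
def errors_per_machine_year(total_errors_per_year, machines):
    """Average single-bit errors per machine per year across the whole population."""
    return total_errors_per_year / machines


rate = errors_per_machine_year(6_000_000, 100_000_000)
years_between = 1 / rate

print(f"{rate:.2f} single-bit errors per machine per year")
print(f"about one every {years_between:.1f} years per machine")
```

Under these figures an individual machine rarely sees an error, yet the population as a whole still produces millions of them per year.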



2007-02-02 06:48:33

by Jon Masters

Subject: Re: 2.6.18-stable release plans?

[email protected] wrote:

> "With 100 million computers in use today, we should expect roughly 6 million
> single bit errors per year. Computer hardware and software companies must
> receive thousands of "side effect" bug reports and support calls due to memory
> errors alone. The costs of NOT including parity memory must be huge!"

I must be weird or something, but I often think about this and the sheer
number of clock cycles executing at any one time around the world. Have
you ever stopped to think how many copies of schedule() (or whatever)
are currently running somewhere in the world? It's just nuts :-)

More seriously, if nobody cared about this stuff then we wouldn't have
all this MCE reporting and tools to handle differentiating between
actual failing DRAMs and temporary bit transitions in ECC memory.

Jon.

2007-02-02 08:18:29

by Valdis Klētnieks

Subject: Re: 2.6.18-stable release plans?

On Fri, 02 Feb 2007 01:47:38 EST, Jon Masters said:

> I must be weird or something, but I often think about this and the sheer
> number of clock cycles executing at any one time around the world. Have
> you ever stopped to think how many copies of schedule() (or whatever)
> are currently running somewhere in the world? It's just nuts :-)

That's why we count single cycles in fast-path sections of code. :)

