2005-02-23 20:41:47

by Ammar T. Al-Sayegh

Subject: kernel BUG at mm/rmap.c:483!

Hi All,

I recently installed Fedora Core 3 (FC3) on a new server.
The kernel is 2.6.10-1.741_FC3smp. The server
crashes every few days. When I examine /var/log/messages,
I find the following line just before the crash:

Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!

No further debug lines are given to diagnose the
source of the problem.
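
A minimal sketch for pulling the BUG line, plus whatever the kernel logged right after it, out of the syslog file. The helper name `show_bug_context` and the default of 40 trailing lines are assumptions for illustration; only the log path comes from the message above.

```shell
# Print each "kernel BUG at" line plus the lines logged after it.
# Usage: show_bug_context [logfile] [lines_after]
show_bug_context() {
    logfile="${1:-/var/log/messages}"
    after="${2:-40}"
    awk -v n="$after" '
        /kernel BUG at/ { left = n + 1 }   # open (or reopen) a window
        left > 0        { print; left-- }  # emit while the window is open
    ' "$logfile"
}
```

If nothing but the BUG line itself comes back, the trace probably never reached the log file, which would match what is described here.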

I have been using kernel 2.4 for a few years now without
any problems. This is the first time I have seen this problem
with kernel 2.6. I'm not sure if this is related to
the kernel itself, the new hardware, or some other
installed software. I'm thinking about downgrading to
kernel 2.4. Do you think this will resolve the issue?
Do you have any suggestions on what else I can do to mitigate
this problem?

Thanks.


-ammar


2005-02-23 20:59:00

by Arjan van de Ven

Subject: Re: kernel BUG at mm/rmap.c:483!

On Wed, 2005-02-23 at 15:41 -0500, Ammar T. Al-Sayegh wrote:
> Hi All,
>
> I recently installed Fedora RC3 on a new server.
> The kernel is 2.6.10-1.741_FC3smp. The server
> crashes every few days. When I examine /var/log/messages,
> I find the following line just before the crash:
>
> Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
> Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!
>
> No further debug lines are given to diagnose the
> source of the problem.

no oops at all?

which modules are you using?

2005-02-23 21:34:16

by Hugh Dickins

Subject: Re: kernel BUG at mm/rmap.c:483!

On Wed, 23 Feb 2005, Ammar T. Al-Sayegh wrote:
>
> I recently installed Fedora RC3 on a new server.
> The kernel is 2.6.10-1.741_FC3smp.

I can't really speak for Fedora Core 3 kernels:
perhaps there's some special patch in there that happens to
trigger it for you. But there have certainly been occasional
other reports of this BUG with vanilla kernel.org kernels.

> The server
> crashes every few days. When I examine /var/log/messages,
> I find the following line just before the crash:
>
> Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
> Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!
>
> No further debug lines are given to diagnose the
> source of the problem.

It's odd that you get no more lines, but it doesn't really
matter in this case. Sadly, the debug info accompanying this
BUG has done very little to shed light on its causes (and it's
on my todo list to change it to something less of a hindrance).

> I have been using kernel 2.4 for few years now without
> any problem. This is the first time I see this problem
> with kernel 2.6. I'm not sure if this is related to
> the kernel itself, the new hardware, or some other
> installed software. I'm thinking about downgrading to
> kernel 2.4. Do you think this will resolve this issue?

Downgrading to 2.4 will most certainly stop that particular
BUG, since 2.4 has no equivalent. But it won't necessarily
fix the underlying issue.

> Any suggestion on what else I can do to mitigate this
> problem?

The first thing to do is to give memtest86 a good (say
overnight) run. Many of the rmap.c BUG reporters have
subsequently found memtest86 failures, and we believe those
instances are accounted for by bad memory. And if that's so
in your case, you don't really want to be running 2.4 on it.

But not all cases can be accounted for in that way. If you
report back that memtest86 ran cleanly, then I'll have to
rework a debug patch against your Fedora Core 3 kernel to try
to give us more info - though quite possibly you cannot afford
such experiments on this server, and will revert to 2.4 for now.

Hugh

2005-02-23 21:47:15

by Ammar T. Al-Sayegh

Subject: Re: kernel BUG at mm/rmap.c:483!

> On Wed, 2005-02-23 at 15:41 -0500, Ammar T. Al-Sayegh wrote:
>> Hi All,
>>
>> I recently installed Fedora RC3 on a new server.
>> The kernel is 2.6.10-1.741_FC3smp. The server
>> crashes every few days. When I examine /var/log/messages,
>> I find the following line just before the crash:
>>
>> Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
>> Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!
>>
>> No further debug lines are given to diagnose the
>> source of the problem.
>
> no oops at all?

No. Is there a way to enable the kernel to give more
diagnostic debug output next time this error happens?

Is there a way to at least make the server reboot itself
next time the kernel is alerted to this problem before
crashing? When the server is rebooted, it works fine for
a few more days before encountering this problem and
crashing again.
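
One commonly used approach to the reboot question, sketched here as an assumption rather than something confirmed in this thread: the `kernel.panic` sysctl makes the kernel reboot a given number of seconds after a panic, and `kernel.panic_on_oops` turns an oops (a BUG included) into a panic. Whether both are present and effective on this particular Fedora kernel has not been verified; the helper name `panic_reboot_conf` and the 30-second default are illustrative.

```shell
# Emit sysctl.conf lines: panic on oops, then reboot after a delay (seconds).
# Usage (as root): panic_reboot_conf 30 >> /etc/sysctl.conf && sysctl -p
panic_reboot_conf() {
    delay="${1:-30}"
    printf 'kernel.panic_on_oops = 1\n'
    printf 'kernel.panic = %s\n' "$delay"
}
```

This would not help with a hard hang that never reaches the panic path, and it trades a frozen machine for an automatic reboot without fixing the underlying cause.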


> which modules are you using?

[root ~]# lsmod
Module Size Used by
ip_conntrack_ftp 76145 0
md5 8001 1
ipv6 236769 78
autofs4 21829 0
i2c_dev 13249 0
i2c_core 24513 1 i2c_dev
sunrpc 135077 1
ipt_REJECT 10561 2
ipt_state 5825 79
ip_conntrack 45317 2 ip_conntrack_ftp,ipt_state
iptable_filter 7489 1
ip_tables 20929 3 ipt_REJECT,ipt_state,iptable_filter
microcode 11489 0
dm_mod 57925 0
video 19653 0
button 10577 0
battery 13253 0
ac 8773 0
uhci_hcd 33497 0
ehci_hcd 33737 0
e1000 84629 0
floppy 56913 0
ext3 117961 6
jbd 57177 1 ext3
3w_xxxx 30561 0
ata_piix 12485 7
libata 44101 1 ata_piix
sd_mod 20545 9
scsi_mod 116033 3 3w_xxxx,libata,sd_mod

Any clue?


-ammar

2005-02-23 22:03:36

by Arjan van de Ven

Subject: Re: kernel BUG at mm/rmap.c:483!

On Wed, 2005-02-23 at 16:45 -0500, Ammar T. Al-Sayegh wrote:
> > On Wed, 2005-02-23 at 15:41 -0500, Ammar T. Al-Sayegh wrote:
> >> Hi All,
> >>
> >> I recently installed Fedora RC3 on a new server.
> >> The kernel is 2.6.10-1.741_FC3smp. The server
> >> crashes every few days. When I examine /var/log/messages,
> >> I find the following line just before the crash:
> >>
> >> Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
> >> Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!
> >>
> >> No further debug lines are given to diagnose the
> >> source of the problem.
> >
> > no oops at all?
>
> No. Is there a way to enable the kernel to give more
> diagnostic debug output next time this error happens?

not really; it was supposed to do that already

> i2c_dev 13249 0
> i2c_core 24513 1 i2c_dev

try for fun to not use i2c for a while

> microcode 11489 0
same for microcode... try removing that so that the microcode of your
system doesn't get updated at boot
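
A small sketch for checking whether the modules named above are currently loaded, by filtering `lsmod` output. The helper name `loaded_modules` is hypothetical; it only reads the first column of the listing and assumes the usual header row.

```shell
# Read lsmod-style output on stdin; print each named module that is loaded.
# Usage: lsmod | loaded_modules i2c_dev i2c_core microcode
loaded_modules() {
    awk -v want="$*" '
        BEGIN {
            n = split(want, names, " ")
            for (i = 1; i <= n; i++) wanted[names[i]] = 1
        }
        NR > 1 && ($1 in wanted) { print $1 }   # NR > 1 skips the header row
    '
}
```

An empty result means none of the named modules are loaded.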



2005-02-23 22:09:57

by Nick Warne

Subject: Re: kernel BUG at mm/rmap.c:483!

> But not all cases could be accounted in that way. If you
> report back that memtest86 ran cleanly...

Hugh,

Nothing to do with the 'problem' in this thread, but an aside that is perhaps
relevant.

On my main gateway, I couldn't get any kernel greater than 2.6.4 to run
without an 'oops' after x amount of time. It was always a kswapd or memory
oops that caused it.

I ran memtest86 a few times with no errors - reseated everything, new fans
etc. etc. No go.

I upgraded memory - all 4 sticks - over Christmas, and after a few weeks
uptime, tried 2.6.10 again.

I have had no problems since - so perhaps I did have bad memory (it was old).
But all tests never showed anything untoward.

I was always suspicious why my 2.6.4 build ran OK, but newer builds always
failed. Could it be a subtle fault in memory whilst building kernels that
does it?

Nick
--
"When you're chewing on life's gristle,
Don't grumble, Give a whistle..."

2005-02-23 22:47:05

by Ammar T. Al-Sayegh

Subject: Re: kernel BUG at mm/rmap.c:483!

----- Original Message -----
From: "Hugh Dickins"
To: "Ammar T. Al-Sayegh"
Sent: Wednesday, February 23, 2005 4:31 PM
Subject: Re: kernel BUG at mm/rmap.c:483!


> On Wed, 23 Feb 2005, Ammar T. Al-Sayegh wrote:
>>
>> Any suggestion on what else I can do to mitigate this
>> problem?
>
> The first thing to do is to give memtest86 a good (say
> overnight) run. Many of the rmap.c BUG reporters have
> subsequently found memtest86 failures, and we believe those
> instances are accounted for by bad memory. And if that's so
> in your case, you don't really want to be running 2.4 on it.
>
> But not all cases could be accounted in that way. If you
> report back that memtest86 ran cleanly, then I'll have to
> rework a debug patch against your Fedora RC3 kernel to try
> to give us more info - though quite possibly you cannot afford
> such experiments on this server, and will revert to 2.4 for now.

The problem is that my server is already in production.
I'm running a large portion of my business on it,
and there is very little tolerance for downtime.
Because the server is located in a remote datacenter,
every time it goes down it takes several hours to have
someone sent up there to manually reboot it for a hefty
emergency fee. So this bug has already cost me a lot of
money, and I'm worried that it will cost me a lot of my
clients as well if it persists.

Remote hands are rather expensive, so it will cost me
$100/hr to have someone run memtest86 on my server,
since I can't perform it remotely. I'll do it though,
since that's your recommendation for the time being.
I hope it will not take more than an hour to run the
test, and I hope it turns out to be bad memory modules as
you expect, because I would hate to downgrade after all the
time and money I spent on the upgrade.


-ammar

2005-02-23 22:51:25

by Ammar T. Al-Sayegh

Subject: Re: kernel BUG at mm/rmap.c:483!

----- Original Message -----
From: "Arjan van de Ven"
To: "Ammar T. Al-Sayegh"
Sent: Wednesday, February 23, 2005 5:01 PM
Subject: Re: kernel BUG at mm/rmap.c:483!


> On Wed, 2005-02-23 at 16:45 -0500, Ammar T. Al-Sayegh wrote:
>> > On Wed, 2005-02-23 at 15:41 -0500, Ammar T. Al-Sayegh wrote:
>> >> Hi All,
>> >>
>> >> I recently installed Fedora RC3 on a new server.
>> >> The kernel is 2.6.10-1.741_FC3smp. The server
>> >> crashes every few days. When I examine /var/log/messages,
>> >> I find the following line just before the crash:
>> >>
>> >> Feb 22 23:50:35 hostname kernel: ------------[ cut here ]------------
>> >> Feb 22 23:50:35 hostname kernel: kernel BUG at mm/rmap.c:483!
>> >>
>> >> No further debug lines are given to diagnose the
>> >> source of the problem.
>> >
>> > no oops at all?
>>
>> No. Is there a way to enable the kernel to give more
>> diagnostic debug output next time this error happens?
>
> not really; it was supposed to do that already
>
>> i2c_dev 13249 0
>> i2c_core 24513 1 i2c_dev
>
> try for fun to not use i2c for a while
>
>> microcode 11489 0
> same for microcode... try removing that so that the microcode of your
> system doesn't get updated at boot

What do these two modules do in particular? And how can I disable
them so that they don't get reloaded at boot time? Do I need
to disable both i2c_dev and i2c_core or just one of them?

Thanks.


-ammar

2005-02-24 05:31:49

by Hugh Dickins

Subject: Re: kernel BUG at mm/rmap.c:483!

On Wed, 23 Feb 2005, Ammar T. Al-Sayegh wrote:
> ----- Original Message ----- From: "Hugh Dickins"
> > though quite possibly you cannot afford
> > such experiments on this server, and will revert to 2.4 for now.
>
> The problem is that my server is already in production
> mode. I'm running great portion of my business on it,
> where there is very little tolerance for downtime.

I feared as much.

> Because the server is located in a remote datacenter,
> every time it goes down it takes several hours to have
> someone sent up there to manually reboot it for a hefty
> emergency fee. So this bug has already cost me a lot of
> money, and I'm worried that it will cost me a lot of my
> clients as well if it persists.

I'm very sorry for that.

> Remote hands are rather expensive, so it will cost me
> $100/hr to have someone runs memtest86 on my server
> since I can't perform it remotely. I'll do it though
> since that's your recommendation for the time being.
> Hope it will not take more than an hour to run the
> test, and hope it turns out as bad memory modules as
> you expect because I hate to downgrade after all the
> time and money I expended on the upgrade.

One hour will be enough if it does find a problem in that time,
so it's worth a shot; but it's not enough to give confidence in the
memory if it does not find one: 12 hours would be better. I actually
wonder whether rmap.c:483 is the best memory tester (the serious answer
would be: in some cases yes, but not in all).

Do let me know. If I can find time to rejig the debug patch
against your kernel, it would itself keep your server running,
replacing the BUG_ON by printks and safety. But without knowing
what it will report, I can't judge how satisfactory that would
be (and it's unlikely to lead us to the final answer in one go).

Hugh

2005-02-24 08:39:18

by Arjan van de Ven

Subject: Re: kernel BUG at mm/rmap.c:483!

> > not really; it was supposed to do that already
> >
> >> i2c_dev 13249 0
> >> i2c_core 24513 1 i2c_dev
> >
> > try for fun to not use i2c for a while
> >
> >> microcode 11489 0
> > same for microcode... try removing that so that the microcode of your
> > system doesn't get updated at boot
>
> What do these two modules do in particular? and how can I disable
> them so that they don't get reloaded during boot time? do I need
> to disable both i2c_dev and i2c_core or just one of them?

i2c is used to directly talk to motherboard hardware such as temperature
sensors. I've seen cases of certain chipset bugs leading to cacheline
corruption when stuff talked to the slow i2c bus and did other stuff in
parallel.

microcode changes the microcode of the cpu (a part of your cpu is
actually written in "software" that can be updated); however updating
this behind the back of the bios might not always be a good idea. (but I
have no hard proof of any failures due to this)

As for how to disable these.. you could just rename the respective .ko
files to .notko or something....
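
The rename trick can be sketched as below. This is only an illustration of the suggestion above, not a tested procedure for this server: `hide_module` is a hypothetical helper, and the module tree path in the usage line is an assumption about where the distribution keeps its `.ko` files. Note that `lsmod` reports names with underscores while the file on disk may use hyphens (for example `i2c-dev.ko`), so both spellings are tried.

```shell
# Rename <module>.ko to <module>.notko under a module tree so it cannot be loaded.
# Usage (as root): hide_module i2c_dev "/lib/modules/$(uname -r)"
hide_module() {
    name="$1"
    tree="$2"
    dashed="$(printf '%s' "$name" | tr '_' '-')"   # i2c_dev -> i2c-dev
    find "$tree" -type f \( -name "$name.ko" -o -name "$dashed.ko" \) |
    while IFS= read -r path; do
        mv -- "$path" "${path%.ko}.notko" && printf 'renamed %s\n' "$path"
    done
}
```

A module that is already loaded stays loaded until it is removed or the machine reboots, and since the listing shows i2c_core in use by i2c_dev, i2c_dev would have to go first.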


2005-02-24 09:25:50

by Ammar T. Al-Sayegh

Subject: Re: kernel BUG at mm/rmap.c:483!

>> > not really; it was supposed to do that already
>> >
>> >> i2c_dev 13249 0
>> >> i2c_core 24513 1 i2c_dev
>> >
>> > try for fun to not use i2c for a while
>> >
>> >> microcode 11489 0
>> > same for microcode... try removing that so that the microcode of your
>> > system doesn't get updated at boot
>>
>> What do these two modules do in particular? and how can I disable
>> them so that they don't get reloaded during boot time? do I need
>> to disable both i2c_dev and i2c_core or just one of them?
>
> i2c is used to directly talk to motherboard hardware such as temperature
> sensors. I've seen cases of certain chipset bugs leading to cacheline
> corruption when stuff talked to the slow i2c bus and did other stuff in
> parallel.
>
> microcode changes the microcode of the cpu (a part of your cpu is
> actually written in "software" that can be updated); however updating
> this behind the back of the bios might not always be a good idea. (but I
> have no hard proof of any failures due to this)
>
> As for how to disable these.. you could just rename the respective .ko
> files to .notko or something....

Done. Following is my new loaded module list:

[root ~]# lsmod
Module Size Used by
ip_conntrack_ftp 76145 0
md5 8001 1
ipv6 236769 38
autofs4 21829 0
sunrpc 135077 1
ipt_REJECT 10561 2
ipt_state 5825 79
ip_conntrack 45317 2 ip_conntrack_ftp,ipt_state
iptable_filter 7489 1
ip_tables 20929 3 ipt_REJECT,ipt_state,iptable_filter
dm_mod 57925 0
video 19653 0
button 10577 0
battery 13253 0
ac 8773 0
uhci_hcd 33497 0
ehci_hcd 33737 0
e1000 84629 0
floppy 56913 0
ext3 117961 6
jbd 57177 1 ext3
3w_xxxx 30561 0
ata_piix 12485 7
libata 44101 1 ata_piix
sd_mod 20545 9
scsi_mod 116033 3 3w_xxxx,libata,sd_mod

Looks better now?

I guess I can no longer monitor the processor temperature and
such after preventing i2c from loading, but what's the
penalty of preventing microcode from loading? A performance
hit?

I will be testing memory as suggested by Hugh Dickins as well.
Hopefully, your trick or Hugh's suggestion will help reveal
the source of the problem, if it's not the kernel itself.


-ammar

2005-02-24 09:30:41

by Arjan van de Ven

Subject: Re: kernel BUG at mm/rmap.c:483!


> I guess I can no longer monitor the processor temperature and
> such after preventing i2c from loading,

yup

> but what's the
> penalty of preventing microcode from loading? a performance
> hit?

not even that; in theory a few cpu bugs may have been fixed. Nobody
really knows since there's no changelog for the microcode..


2005-02-24 12:16:21

by Hugh Dickins

Subject: Re: kernel BUG at mm/rmap.c:483!

On Wed, 23 Feb 2005, Nick Warne wrote:
>
> I upgraded memory - all 4 sticks - over Christmas, and after a few weeks
> uptime, tried 2.6.10 again.
>
> I have had no problems since - so perhaps I did have bad memory (it was old).
> But all tests never showed anything untoward.
>
> I was always suspicious why my 2.6.4 build ran OK, but newer builds always
> failed. Could it be a subtle fault in memory whilst building kernels that
> does it?

Perhaps, though I don't remember hearing of any example of that.

I think what typically happens, on a build the size of the kernel,
is that one of the compilations collapses with a SIGSEGV because some
pointer within gcc gets corrupted by the bad memory, so the build
fails to complete; rather than completing with a bad vmlinux output.
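
Bad memory tends to show up as results that differ between identical runs. A rough sketch of that idea, assuming the command under test is deterministic: run it several times and compare checksums of its output. `repeat_and_compare` is a hypothetical helper, not something from this thread, and a clean result proves nothing about the memory.

```shell
# Run a command N times; succeed only if every run exits 0 with identical output.
# Usage: repeat_and_compare 5 sh -c 'gzip -c < somefile | cksum'   (somefile is a placeholder)
repeat_and_compare() {
    runs="$1"; shift
    first=""
    i=1
    while [ "$i" -le "$runs" ]; do
        out="$("$@" 2>&1)" || { echo "run $i: command failed" >&2; return 2; }
        sum="$(printf '%s' "$out" | cksum)"
        if [ "$i" -eq 1 ]; then
            first="$sum"
        elif [ "$sum" != "$first" ]; then
            echo "run $i: output differs from run 1" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

A return status of 2 (a run failed outright) corresponds to the SIGSEGV case described above; a status of 1 would be the rarer case of a run that completes with different output.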

A more likely cause for what you saw, if the bad memory is low down
or high up (and what I mean by high may change: wli made an important
change to memory allocation ordering in 2.6.8 which would affect it),
is that one kernel's image or system initialization will happen to
allocate the bad memory to something scarcely used, where another
may allocate it to something vital.

But please don't place too much weight on my idle speculations,
others have a better appreciation of these issues.

Hugh

2005-02-24 13:25:47

by Horst H. von Brand

Subject: Re: kernel BUG at mm/rmap.c:483!

Hugh Dickins said:
> On Wed, 23 Feb 2005, Ammar T. Al-Sayegh wrote:
> > I recently installed Fedora RC3 on a new server.
> > The kernel is 2.6.10-1.741_FC3smp.

> I can't really speak for Fedora RC3 kernels,
> perhaps there's some special patch in there that happens to
> trigger it for you, but certainly there have been occasional
> other reports of this BUG with vanilla kernel.org kernels.

That kernel is outdated, current is 2.6.10-1.766_FC3. Before reporting any
bugs, update everything! And in case of problems with vendor kernels, it is
more useful for everybody involved to report to the distribution.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-02-28 10:43:15

by Giacomo A. Catenazzi

Subject: Re: kernel BUG at mm/rmap.c:483!

Arjan van de Ven wrote:

>>but what's the
>>penalty of preventing microcode from loading? a performance
>>hit?
>
>
> not even that; in theory a few cpu bugs may have been fixed. Nobody
> really knows since there's no changelog for the microcode..

You can see the processor bugs on Intel's website, e.g.:
ftp://download.intel.com/design/Xeon/specupdt/24967847.pdf

The following sentence (IMHO) means that the bug is corrected in microcode:
"Workaround: It is possible for the BIOS to contain a workaround
for this erratum."

ciao
cate

2005-02-28 11:09:24

by Arjan van de Ven

Subject: Re: kernel BUG at mm/rmap.c:483!

On Mon, 2005-02-28 at 11:43 +0100, Giacomo A. Catenazzi wrote:
> Arjan van de Ven wrote:
>
> >>but what's the
> >>penalty of preventing microcode from loading? a performance
> >>hit?
> >
> >
> > not even that; in theory a few cpu bugs may have been fixed. Nobody
> > really knows since there's no changelog for the microcode..
>
> You can see the processor bugs on Intel's website, e.g.:
> ftp://download.intel.com/design/Xeon/specupdt/24967847.pdf
>
> The following sentence (IMHO) means that the bug is corrected in microcode:
> "Workaround: It is possible for the BIOS to contain a workaround
> for this erratum."

yeah, but it doesn't say in which microcode. E.g. it's not possible to find
out what a specific microcode update changes over the one the BIOS
already put in...