Subject: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Hello kernel-hackers,

I bought a PC intended to be used as a server with the recent Gigabyte
motherboard "GA-X48-DQ6":
http://www.giga-byte.es/products/mb/specs/ga_x48_dq6.html

It includes two RTL8111/8168B NICs, which seem the cause of the kernel
crashes (*but I'm not sure*)..., a Quad core, 4GB RAM and two 500GB HDs.

I installed Debian Linux (4.0) on it and I'm getting *kernel panic*
errors (r8168 module) when set to production state. I couldn't get any
kernel debug messages since I'm using Debian stock kernel
(2.6.18-6-686-bigmem) and I'm not a kernel hacker either. It happens only
from time to time, so it is not easily reproducible, or at least I couldn't
find a way to trigger it on purpose (I stressed the server by forcing huge
sftp transfers and/or using the iperf tool, without success). But the problem
is there and it looks like a NIC driver problem.

Moreover, the machine is at a remote location, so I have to rely on
other people to read the screen messages for me, etc. It's not easy to debug
in this situation. I've been told that the oopses occurred in the r8168
module, from time to time, and that they also saw a single crash while
booting the server "due to SATA" (???). The server uses software RAID and
was re-syncing to a second disk, since the second disk was new (so expect
high disk load). I'm a bit confused: it seems that the problem could be the
r8168 driver, but I cannot be sure at all.

Could you help me? Do you have reports of similar problems with a similar NIC
/ motherboard? How can I solve it? I'm sorry for not providing more info
(lspci, etc.) but, as I said, the server is at a remote location
and it's currently powered off. Any hints?

I'm using the latest Realtek driver for RTL8111/8168B from:
ftp://202.65.194.212/cn/nic/r8168-8.010.00.tar.bz2
(The r8169 driver is *not* loaded and is blacklisted in
/etc/modprobe.d/blacklist, since it didn't work for my 8111B, and the initrd
was rebuilt with the new r8168 driver; so the only NIC driver loaded is r8168,
compiled from the above .tar.bz2, version 8.010.)
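
For readers who want to verify such a setup, here is a minimal sketch (not
part of the original mail) that checks whether a module has an active
"blacklist" entry in a modprobe configuration file. The helper name
is_blacklisted is hypothetical; the file path is the one mentioned above, and
the parsing assumes the usual "blacklist <module>" line format with "#"
starting a comment line.

```python
def is_blacklisted(config_text, module):
    """Return True if config_text has an active 'blacklist <module>' line."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 2 and fields[0] == "blacklist" and fields[1] == module:
            return True
    return False


if __name__ == "__main__":
    # Path taken from the mail above; it may differ on other systems.
    path = "/etc/modprobe.d/blacklist"
    try:
        with open(path) as handle:
            print("r8169 blacklisted:", is_blacklisted(handle.read(), "r8169"))
    except OSError as exc:
        print("could not read %s: %s" % (path, exc))
```

This only inspects one file; it says nothing about whether the module was
already packed into the initrd or loaded by other means.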

As a quick workaround (since I need to put that server into production
*ASAP*), would you recommend booting with safer options (noacpi, etc.)?
Which ones exactly? (It's no problem if they degrade performance a bit or
reduce power saving; in this case, stability and uptime are preferred.)

Thank you for your understanding and cooperation.
-Roman


2009-01-09 21:53:00

by Frans Pop

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Roman Medina-Heigl Hernandez wrote:
> It includes two RTL8111/8168B NICs, which seem the cause of the kernel
> crashes (*but I'm not sure*)..., a Quad core, 4GB RAM and two 500GB HDs.
>
> I installed Debian Linux (4.0) on it and I'm getting *kernel panic*
> errors (r8168 module) when set to production state. I couldn't get any
> kernel debug messages since I'm using Debian stock kernel
> (2.6.18-6-686-bigmem) and I'm not a kernel hacker either.

Have you tried the 2.6.24 kernel that is available for Debian Etch? That
seems to me the most logical thing to try first as a possible solution.

It was included as part of the "Etch and a half" stable release update
[1].

Cheers,
FJP

[1] http://www.debian.org/releases/etch/etchnhalf

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver


Frans Pop wrote:
> Roman Medina-Heigl Hernandez wrote:
>> It includes two RTL8111/8168B NICs, which seem the cause of the kernel
>> crashes (*but I'm not sure*)..., a Quad core, 4GB RAM and two 500GB HDs.
>>
>> I installed Debian Linux (4.0) on it and I'm getting *kernel panic*
>> errors (r8168 module) when set to production state. I couldn't get any
>> kernel debug messages since I'm using Debian stock kernel
>> (2.6.18-6-686-bigmem) and I'm not a kernel hacker either.
>
> Have you tried the 2.6.24 kernel that is available for Debian Etch? That
> seems to me the most logical thing to try first as a possible solution.

I didn't (although I thought of it). It seems to me like a blind attempt
and the problem is that I cannot afford to waste another "production
attempt" and that I cannot reproduce the crash whenever I want, so if I
upgrade I'll not be able to know whether the problem is fixed or not until
I got a new crash (or not). Moreover, if the root cause of the problem is
r8168 driver, it will crash in both cases (the driver doesn't change, it's
compiled from source by me).

Anyway, it's a good idea; I'll see if I can get a new change window. Thank you.

--

Saludos,
-Roman

PGP Fingerprint:
09BB EFCD 21ED 4E79 25FB 29E1 E47F 8A7D EAD5 6742
[Key ID: 0xEAD56742. Available at KeyServ]

2009-01-10 12:55:37

by Alistair John Strachan

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

On Friday 09 January 2009 23:31:30 Roman Medina-Heigl Hernandez wrote:
> Frans Pop wrote:
> > Roman Medina-Heigl Hernandez wrote:
> >> It includes two RTL8111/8168B NICs, which seem the cause of the kernel
> >> crashes (*but I'm not sure*)..., a Quad core, 4GB RAM and two 500GB HDs.
> >>
> >> I installed Debian Linux (4.0) on it and I'm getting *kernel panic*
> >> errors (r8168 module) when set to production state. I couldn't get any
> >> kernel debug messages since I'm using Debian stock kernel
> >> (2.6.18-6-686-bigmem) and I'm not a kernel hacker either.
> >
> > Have you tried the 2.6.24 kernel that is available for Debian Etch? That
> > seems to me the most logical thing to try first as a possible solution.
>
> I didn't (although I thought of it). It seems to me like a blind attempt
> and the problem is that I cannot afford to waste another "production
> attempt" and that I cannot reproduce the crash whenever I want, so if I
> upgrade I'll not be able to know whether the problem is fixed or not until
> I got a new crash (or not). Moreover, if the root cause of the problem is
> r8168 driver, it will crash in both cases (the driver doesn't change, it's
> compiled from source by me).

One thing to bear in mind is that if you're using a CPU which goes via swiotlb
(such as a pre-nehalem Intel Quad) 2.6.24 won't be new enough. The r8169
driver until recently had a grave bug on >=4GB RAM systems where the kernel
would crash after some amount of transfer.

My advice to you is to install the very latest Debian kernel (2.6.27) in which
this bug has been fixed. That said, I think if you do this, you won't have any
further problems. Using the vendor driver instead of the one in mainline
probably isn't a good idea.
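
As a rough way to see whether a given box matches that description, the
following sketch (an illustration, not part of the original advice) reads the
total memory and the loaded Realtek modules from /proc. The function names are
hypothetical. Note that MemTotal is somewhat lower than the installed RAM, so
the numbers are printed for a human to judge; no threshold is applied.

```python
def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    return None


def loaded_realtek_modules(modules_text):
    """Return which of r8168/r8169 appear in /proc/modules text."""
    # The first whitespace-separated field of each line is the module name.
    names = {line.split()[0] for line in modules_text.splitlines() if line.strip()}
    return sorted(names & {"r8168", "r8169"})


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            print("MemTotal (kB):", mem_total_kib(handle.read()))
        with open("/proc/modules") as handle:
            print("Realtek modules loaded:", loaded_realtek_modules(handle.read()))
    except OSError as exc:
        print("could not read /proc:", exc)
```

A driver built into the kernel image does not show up in /proc/modules, so an
empty list does not prove that neither driver is in use.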

--
Cheers,
Alistair.

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Alistair John Strachan wrote:
> On Friday 09 January 2009 23:31:30 Roman Medina-Heigl Hernandez wrote:
>> Frans Pop wrote:
>>> Roman Medina-Heigl Hernandez wrote:
>>>> It includes two RTL8111/8168B NICs, which seem the cause of the kernel
>>>> crashes (*but I'm not sure*)..., a Quad core, 4GB RAM and two 500GB HDs.
>>>>
>>>> I installed Debian Linux (4.0) on it and I'm getting *kernel panic*
>>>> errors (r8168 module) when set to production state. I couldn't get any
>>>> kernel debug messages since I'm using Debian stock kernel
>>>> (2.6.18-6-686-bigmem) and I'm not a kernel hacker either.
>>> Have you tried the 2.6.24 kernel that is available for Debian Etch? That
>>> seems to me the most logical thing to try first as a possible solution.
>> I didn't (although I thought of it). It seems to me like a blind attempt
>> and the problem is that I cannot afford to waste another "production
>> attempt" and that I cannot reproduce the crash whenever I want, so if I
>> upgrade I'll not be able to know whether the problem is fixed or not until
>> I got a new crash (or not). Moreover, if the root cause of the problem is
>> r8168 driver, it will crash in both cases (the driver doesn't change, it's
>> compiled from source by me).
>
> One thing to bear in mind is that if you're using a CPU which goes via swiotlb
> (such as a pre-nehalem Intel Quad) 2.6.24 won't be new enough. The r8169
> driver until recently had a grave bug on >=4GB RAM systems where the kernel
> would crash after some amount of transfer.

Oh!

> My advice to you is to install the very latest Debian kernel (2.6.27) in which
> this bug has been fixed. That said, I think if you do this, you won't have any
> further problems. Using the vendor driver instead of the one in mainline
> probably isn't a good idea.

I'm using the vendor driver because mainline (r8169) didn't work for my
RTL8111B chipset. The r8169 driver recognized both NICs but it kept
failing to detect the link (sometimes it worked, sometimes not) and the
NIC LEDs weren't working properly either (a clear sign that something
was wrong...). When I blacklisted the r8169 module and installed/loaded the
vendor driver (r8168), the NIC problems disappeared. That's why I was using
the vendor driver. I didn't notice stability problems back then; they only
appeared when running in the production environment (Murphy's law!) :-(

Another curiosity: I've discovered that *another* machine I'm administering
(different customer, different hardware, different ISP... but same Debian
version and same kernel: 2.6.18-6-686-bigmem) is using the r8169 module
without any problems at all. And guess what, it has the same NIC hardware (!!!):
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B
PCI Express Gigabit Ethernet controller (rev 01)

The working machine is an AMD dual core with 2 GB RAM and 1 rtl8111B NIC
(being the non-working an Intel Quad core with 4 GB RAM and 2 rtl8111B
NICs). So yes, as you said, the problems could be due to Quad / 4GB RAM.

There's no Debian-official 2.6.27 deb packages (latest is 2.6.26),
according to:
http://packages.qa.debian.org/l/linux-2.6.html
Best for etch I've found is a 2.6.26 deb package (backport).

I suppose you're referring to the ultra-experimental "trunk" branch at:
deb http://kernel-archive.buildserver.net/debian-kernel trunk main
Right?

I'll give it a try, it seems a good choice (given the circumstances).
Another option: Do you think it's worth the pain to compile a custom 2.6.28
kernel? (in other words, do you think it contains fixes that could solve
the strange issues I've been affected with? I don't follow kernel
development so I cannot answer this question but perhaps you could...).

Thank you very much for your comments. They are greatly appreciated.

--

Saludos,
-Roman

PGP Fingerprint:
09BB EFCD 21ED 4E79 25FB 29E1 E47F 8A7D EAD5 6742
[Key ID: 0xEAD56742. Available at KeyServ]

2009-01-10 20:35:36

by Alistair John Strachan

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Hi Roman,

On Saturday 10 January 2009 18:47:47 Roman Medina-Heigl Hernandez wrote:
> Alistair John Strachan wrote:
> > My advice to you is to install the very latest Debian kernel (2.6.27) in
> > which this bug has been fixed. That said, I think if you do this, you
> > won't have any further problems. Using the vendor driver instead of the
> > one in mainline probably isn't a good idea.
[snip]
> The working machine is an AMD dual core with 2 GB RAM and 1 rtl8111B NIC
> (being the non-working an Intel Quad core with 4 GB RAM and 2 rtl8111B
> NICs). So yes, as you said, the problems could be due to Quad / 4GB RAM.

Francois Romieu would be the best person to confirm, but I suspect you are
correct and it does not surprise me that the vendor driver has the same bug.

It would manifest exactly like this -- the NIC would be basically non-
functional with 4GB RAM, but the minute it stops using the swiotlb or AMD GART
IOMMU there was no problem.

> There's no Debian-official 2.6.27 deb packages (latest is 2.6.26),
> according to:
> http://packages.qa.debian.org/l/linux-2.6.html
> Best for etch I've found is a 2.6.26 deb package (backport).
>
> I suppose you're referring to the ultra-experimental "trunk" branch at:
> deb http://kernel-archive.buildserver.net/debian-kernel trunk main
> Right?
>
> I'll give it a try, it seems a good choice (given the circumstances).
> Another option: Do you think it's worth the pain to compile a custom 2.6.28
> kernel? (in other words, do you think it contains fixes that could solve
> the strange issues I've been affected with? I don't follow kernel
> development so I cannot answer this question but perhaps you could...).

Yes, vanilla 2.6.27 and of course 2.6.28 have the fixes. There's no
requirement to use Debian's mildly patched kernels if you don't want to. The
reason I recommended the Debian package (though unfortunately as you point out
it doesn't yet exist) is that it's less faff and is presumably more likely to
work in your production environment.
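
Going only by the version numbers named in this thread, that rule of thumb
could be sketched as below. The function names are hypothetical, and the
2.6.27 threshold is simply the statement above; a distribution kernel may
carry the fix as a backport (or not), so the version string is a hint and
not proof.

```python
import re

# Per the statement above: vanilla 2.6.27 and later carry the fix.
FIXED_SINCE = (2, 6, 27)


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a release string like '2.6.18-6-686-bigmem'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_fix_by_version(release):
    """True if the release is at least FIXED_SINCE, judged by version alone."""
    return parse_kernel_version(release) >= FIXED_SINCE


if __name__ == "__main__":
    # Kernel versions mentioned in this thread.
    for release in ("2.6.18-6-686-bigmem", "2.6.24", "2.6.26", "2.6.27.10", "2.6.28"):
        print(release, "->", has_fix_by_version(release))
```

On a running system the release string can be obtained with
platform.release() or "uname -r".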

Another (good) option would be to grab the Debian kernel sources (apt-get
source linux-image-<whatever>) and then patch it with the fix, available from
this bugzilla thread:

http://bugzilla.kernel.org/show_bug.cgi?id=9468

Then debuild it as normal (hopefully you're familiar with this process).

> Thank you very much for your comments. They are greatly appreciated.

No problem. I ran into similar problems with this hardware many months ago and
nobody knew anything about it. It's just unfortunate that Debian are behind
the curve in terms of kernel version.

--
Cheers,
Alistair.

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Alistair John Strachan wrote:
> Hi Roman,
>
> On Saturday 10 January 2009 18:47:47 Roman Medina-Heigl Hernandez wrote:
>> Alistair John Strachan wrote:
>>> My advice to you is to install the very latest Debian kernel (2.6.27) in
>>> which this bug has been fixed. That said, I think if you do this, you
>>> won't have any further problems. Using the vendor driver instead of the
>>> one in mainline probably isn't a good idea.
> [snip]
>> The working machine is an AMD dual core with 2 GB RAM and 1 rtl8111B NIC
>> (being the non-working an Intel Quad core with 4 GB RAM and 2 rtl8111B
>> NICs). So yes, as you said, the problems could be due to Quad / 4GB RAM.
>
> Francois Romieu would be the best person to confirm, but I suspect you are
> correct and it does not surprise me that the vendor driver has the same bug.
>
> It would manifest exactly like this -- the NIC would be basically non-
> functional with 4GB RAM, but the minute it stops using the swiotlb or AMD GART
> IOMMU there was no problem.
>
>> There's no Debian-official 2.6.27 deb packages (latest is 2.6.26),
>> according to:
>> http://packages.qa.debian.org/l/linux-2.6.html
>> Best for etch I've found is a 2.6.26 deb package (backport).
>>
>> I suppose you're referring to the ultra-experimental "trunk" branch at:
>> deb http://kernel-archive.buildserver.net/debian-kernel trunk main
>> Right?
>>
>> I'll give it a try, it seems a good choice (given the circumstances).
>> Another option: Do you think it's worth the pain to compile a custom 2.6.28
>> kernel? (in other words, do you think it contains fixes that could solve
>> the strange issues I've been affected with? I don't follow kernel
>> development so I cannot answer this question but perhaps you could...).
>
> Yes, vanilla 2.6.27 and of course 2.6.28 have the fixes. There's no
> requirement to use Debian's mildly patched kernels if you don't want to. The
> reason I recommended the Debian package (though unfortunately as you point out
> it doesn't yet exist) is that it's less faff and is presumably more likely to
> work in your production environment.

I've built two custom .deb packages for kernels:
- 2.6.28
- 2.6.27.10

The reason why I built the second one is due to this:
http://marc.info/?l=linux-kernel&m=123074869611409&w=4

It scared me a bit!

Now the dilemma is: which kernel to choose? 2.6.28 should be better, but if
the above bug report is confirmed... Kernel 2.6.28 has several r8169
patches merged; this could be good... or not... who knows. On the other
hand, 2.6.27.10 is perhaps new enough, and it's reported to work OK
(well, not exactly: the bug report said "r8169 worked fine in 2.6.27.8",
and since there are no changes in r8169 from 2.6.27.7 to 2.6.27.10, it
should be safe).

I'll probably go for 2.6.27.10. If you think I'm not right, please, correct
me :-)

> Another (good) option would be to grab the Debian kernel sources (apt-get
> source linux-image-<whatever>) and then patch it with the fix, available from
> this bugzilla thread:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9468
>
> Then debuild it as normal (hopefully you're familiar with this process).

If I have to go through the PITA of maintaining my own .deb kernel package,
I prefer to use the latest kernels (2.6.18 is pretty old) so at least I'll
have the benefits of better hardware support (drivers) and (hopefully) more
fixes applied :)

>> Thank you very much for your comments. They are greatly appreciated.
>
> No problem. I ran into similar problems with this hardware many months ago and
> nobody knew anything about it. It's just unfortunate that Debian are behind
> the curve in terms of kernel version.

Yes, but Debian also has its reasons for doing this, so that the "stable"
release can keep being called "stable" :-) Anyway, I don't want to start a
flamewar here.

I'll try to install one of the new kernels tomorrow evening...

--

Saludos,
-Roman

PGP Fingerprint:
09BB EFCD 21ED 4E79 25FB 29E1 E47F 8A7D EAD5 6742
[Key ID: 0xEAD56742. Available at KeyServ]

Subject: Re: Oops with Gigabyte motherboard "GA-X48-DQ6" and r8168 Realtek driver

Roman Medina-Heigl Hernandez wrote:
> I've built two custom .deb packages for kernels:
> - 2.6.28
> - 2.6.27.10
>
> I'll probably go for 2.6.27.10. If you think I'm not right, please, correct
> me :-)
>
> I'll try to install one of the new kernels tomorrow evening...

Hello,

I installed 2.6.27.10 on Monday. No problems have arisen since then; it seems
stable. I'm using the r8169 built-in kernel driver instead of the r8168 vendor
module. To be honest, I'm not sure whether the problem was due to the r8168
driver, or whether 2.6.18 is simply too old for a modern PC (with a
Quad core).

Thank you all for your contributions, especially Alistair.

Cheers,
-Roman