2006-09-28 01:37:36

by Sukadev Bhattiprolu

Subject: Network problem with 2.6.18-mm1 ?



I am unable to get networking to work with 2.6.18-mm1 on my system.

The 2.6.18 kernel works fine on the same system. Here is some info about
the system and my debug attempts. The lspci output and config are attached.

Appreciate any help. Please let me know if you need more info.

Suka

System info:

x326, 2 CPU (AMD Opteron Processor 250)

Kernel info:

$ uname -a
Linux elm3b166 2.6.18-mm1 #4 SMP PREEMPT Tue Sep 26 18:11:58 PDT 2006
x86_64 GNU/Linux

Config tokens differing between the 2.6.18 kernel that works and
the 2.6.18-mm1 that does not are:

Tokens in 2.6.18 but not in 2.6.18-mm1 config

CONFIG_SCSI_FC_ATTRS=y
CONFIG_SCSI_SATA_SIL=y
CONFIG_SCSI_SATA=y

Tokens in 2.6.18-mm1 but not in 2.6.18 config

CONFIG_PROC_SYSCTL=y
CONFIG_SATA_SIL=y
CONFIG_ATA=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_BLOCK=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_ZONE_DMA=y
CONFIG_FB_DDC=y

All drivers compiled into kernel in both cases.
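
A token comparison like the one above can be reproduced with a short script. This is only a minimal sketch, assuming both kernel configs are available as plain text; the function names and the abbreviated sample data are illustrative and not part of the original report.

```python
def config_tokens(text):
    """Return the set of CONFIG_*=value lines, skipping comments and blanks."""
    tokens = set()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            tokens.add(line)
    return tokens


def diff_configs(old_text, new_text):
    """Return (tokens only in old, tokens only in new), each sorted."""
    old, new = config_tokens(old_text), config_tokens(new_text)
    return sorted(old - new), sorted(new - old)


# Abbreviated sample data. In practice these strings would be read from the
# two .config files; CONFIG_EXAMPLE is a made-up placeholder.
OLD = """\
CONFIG_X86_64=y
CONFIG_SCSI_SATA=y
CONFIG_SCSI_SATA_SIL=y
# CONFIG_EXAMPLE is not set
"""
NEW = """\
CONFIG_X86_64=y
CONFIG_ATA=y
CONFIG_SATA_SIL=y
"""

only_old, only_new = diff_configs(OLD, NEW)
print("only in old:", only_old)
print("only in new:", only_new)
```

Comparing whole `NAME=value` lines means a token that changes from `=y` to `=m` shows up on both sides, which is usually what one wants when hunting for a regression.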

Debug info:

Checked hardware connections :-)
(Rebooting on 2.6.18 kernel works - consistently)

$ ethtool -i eth0
driver: e1000
version: 7.2.7-k2
firmware-version: N/A

$ ip addr
seems fine (up, broadcasting etc)

$ ip -s link
shows no errors/drops/overruns

$ ip route
shows the correct gw

$ ethtool -S eth0

shows non-zero tx/rx packets/bytes but *rx_missed_errors*
quite large (~138K) and increasing over time

$ ping <own-ip-addr>
works fine

$ ping <gateway>
no response.

$ tcpdump -i eth0 host <broken-host>

while pinging gateway, tcpdump shows messages like:

18:03:45.936161 arp who-has <gateway> tell <broken-host>

(Config file and lspci output are attached)


Attachments:
(No filename) (1.68 kB)
lspci.out (1.94 kB)
lspci.out
config-mm1 (34.27 kB)
config-2.6.18-mm1

2006-09-28 02:04:20

by Kok, Auke

Subject: Re: Network problem with 2.6.18-mm1 ?

Sukadev Bhattiprolu wrote:
>
> I am unable to get networking to work with 2.6.18-mm1 on my system.
>
> But 2.6.18 kernel on same system works fine. Here is some info about
> the system/debug attempts. Attached are the lspci output and config.
>
> Appreciate any help. Please let me know if you need more info.
>
> Suka
>
> System info:
>
> x326, 2 CPU (AMD Opteron Processor 250)
>
> Kernel info:
>
> $ uname -a
> Linux elm3b166 2.6.18-mm1 #4 SMP PREEMPT Tue Sep 26 18:11:58 PDT 2006
> x86_64 GNU/Linux
>
> Config tokens differing between the 2.6.18 kernel that works and
> the 2.6.18-mm1 that does not are:
>
> Tokens in 2.6.18 but not in 2.6.18-mm1 config
>
> CONFIG_SCSI_FC_ATTRS=y
> CONFIG_SCSI_SATA_SIL=y
> CONFIG_SCSI_SATA=y
>
> Tokens in 2.6.18-mm1 but not in 2.6.18 config
>
> CONFIG_PROC_SYSCTL=y
> CONFIG_SATA_SIL=y
> CONFIG_ATA=y
> CONFIG_ARCH_POPULATES_NODE_MAP=y
> CONFIG_CRYPTO_ALGAPI=y
> CONFIG_MICROCODE_OLD_INTERFACE=y
> CONFIG_BLOCK=y
> CONFIG_VIDEO_V4L1_COMPAT=y
> CONFIG_ZONE_DMA=y
> CONFIG_FB_DDC=y
>
> All drivers compiled into kernel in both cases.
>
> Debug info:
>
> Checked hardware connections :-)
> (Rebooting on 2.6.18 kernel works - consistently)
>
> $ ethtool -i eth0
> driver: e1000
> version: 7.2.7-k2
> firmware-version: N/A
>
> $ ip addr
> seems fine (up, broadcasting etc)
>
> $ ip -s link
> shows no errors/drops/overruns
>
> $ ip route
> shows the correct gw
>
> $ ethtool -S eth0
>
> shows non-zero tx/rx packets/bytes but *rx_missed_errors*
> quite large (~138K) and increasing over time
>
> $ ping <own-ip-addr>
> works fine
>
> $ ping <gateway>
> no response.
>
> $ tcpdump -i eth0 host <broken-host>
>
> while pinging gateway, tcpdump shows messages like:
>
> 18:03:45.936161 arp who-has <gateway> tell <broken-host>
>
> (Config file and lspci output are attached)

How about dmesg? Perhaps it shows some valuable information.

Also, since this is a networking problem, please include `ifconfig eth0` and the full
output of `ethtool eth0` and `ethtool -S eth0`.

Cheers,

Auke

2006-09-28 18:52:29

by Sukadev Bhattiprolu

Subject: Re: Network problem with 2.6.18-mm1 ?

Thanks. See below for additional info

Auke Kok wrote:
| Sukadev Bhattiprolu wrote:
| >
| >I am unable to get networking to work with 2.6.18-mm1 on my system.
| >
| >But 2.6.18 kernel on same system works fine. Here is some info about
| >the system/debug attempts. Attached are the lspci output and config.
| >
| >Appreciate any help. Please let me know if you need more info.
| >
| >Suka
| >
| >System info:
| >
| > x326, 2 CPU (AMD Opteron Processor 250)
| >
| >Kernel info:
| >
| > $ uname -a
| > Linux elm3b166 2.6.18-mm1 #4 SMP PREEMPT Tue Sep 26 18:11:58 PDT 2006
| > x86_64 GNU/Linux
| >
| > Config tokens differing between the 2.6.18 kernel that works and
| > the 2.6.18-mm1 that does not are:
| >
| > Tokens in 2.6.18 but not in 2.6.18-mm1 config
| >
| > CONFIG_SCSI_FC_ATTRS=y
| > CONFIG_SCSI_SATA_SIL=y
| > CONFIG_SCSI_SATA=y
| >
| > Tokens in 2.6.18-mm1 but not in 2.6.18 config
| >
| > CONFIG_PROC_SYSCTL=y
| > CONFIG_SATA_SIL=y
| > CONFIG_ATA=y
| > CONFIG_ARCH_POPULATES_NODE_MAP=y
| > CONFIG_CRYPTO_ALGAPI=y
| > CONFIG_MICROCODE_OLD_INTERFACE=y
| > CONFIG_BLOCK=y
| > CONFIG_VIDEO_V4L1_COMPAT=y
| > CONFIG_ZONE_DMA=y
| > CONFIG_FB_DDC=y
| >
| > All drivers compiled into kernel in both cases.
| >
| >Debug info:
| >
| > Checked hardware connections :-)
| > (Rebooting on 2.6.18 kernel works - consistently)
| >
| > $ ethtool -i eth0
| > driver: e1000
| > version: 7.2.7-k2
| > firmware-version: N/A
| >
| > $ ip addr
| > seems fine (up, broadcasting etc)
| >
| > $ ip -s link
| > shows no errors/drops/overruns
| >
| > $ ip route
| > shows the correct gw
| >
| > $ ethtool -S eth0
| >
| > shows non-zero tx/rx packets/bytes but *rx_missed_errors*
| > quite large (~138K) and increasing over time
| >
| > $ ping <own-ip-addr>
| > works fine
| >
| > $ ping <gateway>
| > no response.
| >
| > $ tcpdump -i eth0 host <broken-host>
| >
| > while pinging gateway, tcpdump shows messages like:
| >
| > 18:03:45.936161 arp who-has <gateway> tell <broken-host>
| >
| >(Config file and lspci output are attached)
|
| how about dmesg? Perhaps it shows some valuable information.

Am attaching the dmesg.out

|
| also, since this is a networking problem, please include `ifconfig eth0`
| and the full output of `ethtool eth0` and `ethtool -S eth0`

$ ifconfig eth0

eth0 Link encap:Ethernet HWaddr 00:02:B3:9D:D4:D7
inet addr:10.0.67.166 Bcast:10.0.67.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:564 errors:0 dropped:5927 overruns:0 frame:0
TX packets:105 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:81803 (79.8 KiB) TX bytes:6720 (6.5 KiB)
Base address:0x3400 Memory:e8240000-e8260000


$ ethtool eth0

Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: g
Current message level: 0x00000007 (7)
Link detected: yes

$ ethtool -S eth0

NIC statistics:
rx_packets: 564
tx_packets: 105
rx_bytes: 81803
tx_bytes: 6720
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 11
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 310
rx_missed_errors: 5865
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 81803
rx_csum_offload_good: 0
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0

Hope this helps. Let me know if you need more info.

Suka


Attachments:
(No filename) (4.39 kB)
dmesg-mm1.out (18.16 kB)
dmesg-2.6.18-mm1.out

2006-09-28 21:10:06

by Jesse Brandeburg

Subject: Re: Network problem with 2.6.18-mm1 ?

On 9/28/06, Sukadev Bhattiprolu wrote:
> Thanks. See below for additional info
>
> Auke Kok wrote:
> | Sukadev Bhattiprolu wrote:
> | >
> | >I am unable to get networking to work with 2.6.18-mm1 on my system.
> | >
> | >But 2.6.18 kernel on same system works fine. Here is some info about
> | >the system/debug attempts. Attached are the lspci output and config.
> | >
> | >Appreciate any help. Please let me know if you need more info.

It seems you're having interrupt delivery problems or interrupts are
getting lost.
rx_missed_errors indicates frames that were dropped due to the e1000
adapter's FIFO getting full and overflowing.
> rx_no_buffer_count: 310
> rx_missed_errors: 5865
rx_no_buffer_count indicates that the driver didn't return buffers to
the hardware soon enough, but the hardware was able to store the
packet (at the time of reception) in the fifo to try again.

Both these indicate to me that there is something wrong with
interrupts. Maybe interrupt sharing
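
The two counters discussed above can be pulled out of saved `ethtool -S` output with a few lines of Python. This is a rough sketch, assuming the output has the usual `name: value` layout; the helper names are made up for illustration, and a non-zero value is only a hint to look at interrupt delivery, not proof of it.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "NIC statistics:" header has no value and is skipped here.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_starvation_counters(stats):
    """Return the non-zero counters that suggest the RX ring is not serviced."""
    names = ("rx_no_buffer_count", "rx_missed_errors")
    return {n: stats[n] for n in names if stats.get(n, 0) > 0}


# Sample taken from the statistics posted earlier in this thread (abbreviated).
SAMPLE = """\
NIC statistics:
     rx_packets: 564
     rx_errors: 0
     rx_no_buffer_count: 310
     rx_missed_errors: 5865
     alloc_rx_buff_failed: 0
"""

stats = parse_ethtool_stats(SAMPLE)
for name, value in rx_starvation_counters(stats).items():
    print(f"{name} = {value}")
```

Running the parser on two captures taken a few seconds apart and subtracting the values would show whether the counters are still climbing.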

Can you possibly try a back-to-back connection with another Linux box
and run tcpdump on both ends, then ping? It will tell us if traffic is
truly getting out and coming in okay.

Also, please send the output of lspci -vv and cat /proc/interrupts.

Jesse

2006-09-29 00:52:15

by Sukadev Bhattiprolu

Subject: Re: Network problem with 2.6.18-mm1 ?

Jesse Brandeburg wrote:
| On 9/28/06, Sukadev Bhattiprolu wrote:
| >Thanks. See below for additional info
| >
| >Auke Kok wrote:
| >| Sukadev Bhattiprolu wrote:
| >| >
| >| >I am unable to get networking to work with 2.6.18-mm1 on my system.
| >| >
| >| >But 2.6.18 kernel on same system works fine. Here is some info about
| >| >the system/debug attempts. Attached are the lspci output and config.
| >| >
| >| >Appreciate any help. Please let me know if you need more info.
|
| It seems you're having interrupt delivery problems or interrupts are
| getting lost.
| rx_missed_errors indicates frames that were dropped due to the e1000
| adapter's FIFO getting full and overflowing.
| >rx_no_buffer_count: 310
| >rx_missed_errors: 5865
| rx_no_buffer_count indicates that the driver didn't return buffers to
| the hardware soon enough, but the hardware was able to store the
| packet (at the time of reception) in the fifo to try again.
|
| Both these indicate to me that there is something wrong with
| interrupts. Maybe interrupt sharing
|
| can you possibly try a back to back connection with another linux box
| and run tcpdump on both ends then ping? it will tell us if traffic is
| truly getting out and coming in okay.

Unfortunately, I can't try this week, but can try it early next week.

|
| also please send output of lspci -vv and cat /proc/interrupts

lspci-vv.out is attached. Here is the /proc/interrupts:

$ cat /proc/interrupts

           CPU0       CPU1
  0:      18316          0    IO-APIC-edge     timer
  2:          0          0    XT-PIC-level     cascade
  4:       1023          0    IO-APIC-edge     serial
  8:          0          0    IO-APIC-edge     rtc
 17:       3380          0    IO-APIC-fasteoi  libata
 19:        174          0    IO-APIC-fasteoi  ohci_hcd:usb1, ohci_hcd:usb2
 28:          0          0    IO-APIC-fasteoi  eth0
NMI:         96         35
LOC:      18251      18524
ERR:          0


Attachments:
(No filename) (1.97 kB)
lspci-vv.txt (10.19 kB)
lspci-vv.txt

2006-09-29 18:08:14

by Jesse Brandeburg

Subject: Re: Network problem with 2.6.18-mm1 ?

On 9/28/06, Sukadev Bhattiprolu wrote:
> $ cat /proc/interrupts
>
> CPU0 CPU1
> 28: 0 0 IO-APIC-fasteoi eth0
> NMI: 96 35
> LOC: 18251 18524
> ERR: 0

You should be getting an interrupt every two seconds from the eth0
(e1000) driver. You are having interrupt delivery problems, probably
due to something screwing up interrupt routing in the kernel.
Normally these issues are associated with MSI interrupts, but your
adapter doesn't support those and is using generic IRQ.
I'm guessing that if you somehow enable interrupts on your vga card on
the same bus as e1000 (bus 3) it will have interrupt delivery problems
as well. Maybe try xorg?

Jesse

2006-09-29 23:31:50

by Eric W. Biederman

Subject: Re: Network problem with 2.6.18-mm1 ?

"Jesse Brandeburg" <[email protected]> writes:

> On 9/28/06, Sukadev Bhattiprolu wrote:
>> $ cat /proc/interrupts
>>
>> CPU0 CPU1
>> 28: 0 0 IO-APIC-fasteoi eth0
>> NMI: 96 35
>> LOC: 18251 18524
>> ERR: 0
>
> you should be getting an interrupt every two seconds from the eth0
> (e1000) driver. You are having interrupt delivery problems probably
> due to something screwing up interrupt routing in the kernel.
> Normally these issues are associated with MSI interrupts but your
> adapter doesn't support those and is using generic IRQ
>
> I'm guessing that if you somehow enable interrupts on your vga card on
> the same bus as e1000 (bus 3) it will have interrupt delivery problems
> as well. Maybe try xorg?

To summarize:

We have an e1000 plugged into a PCI-X slot on an Opteron system with
an AMD chipset.

That motherboard has 3 ioapics (one on each PCI-X bridge and
one on the 8111 for handling everything else).

We know we are getting interrupts through the 8111 ioapic.

We don't know which ioapic the PCI-X bus is hooked to.

So either the ioapics on the 8131 are having problems,
or we have a problem parsing the irq routing tables.

We see in dmesg:
[ 0.000000] I/O APIC #2 at 0xFEC00000.
[ 0.000000] I/O APIC #3 at 0xE8000000.
[ 0.000000] I/O APIC #4 at 0xE8001000.
...
[ 97.410411] PCI: Cannot allocate resource region 0 of device 0000:00:0a.1
[ 97.423945] PCI: Cannot allocate resource region 0 of device 0000:00:0b.1

We see in lspci:
>
> 0000:00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> (prog-if 10 [IO-APIC])
> Subsystem: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
> Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
> Latency: 0
> Region 0: Memory at 88100000 (64-bit, non-prefetchable) [size=4K]
>
> 0000:00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> (prog-if 10 [IO-APIC])
> Subsystem: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
> Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
> Latency: 0
> Region 0: Memory at 88101000 (64-bit, non-prefetchable) [size=4K]

So it looks like the kernel moved the ioapics.
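
The comparison behind that conclusion can be sketched in Python: take the I/O APIC addresses reported at boot in dmesg and check whether the Region 0 addresses that lspci shows for the IOAPIC devices are among them. This is only an illustration; the function names are hypothetical, and it assumes the lspci text has already been cut down to the IO-APIC device stanzas.

```python
import re


def dmesg_ioapic_addrs(text):
    """Map I/O APIC id -> address as reported in the boot log."""
    pattern = r"I/O APIC #(\d+) at 0x([0-9A-Fa-f]+)"
    return {int(m.group(1)): int(m.group(2), 16) for m in re.finditer(pattern, text)}


def lspci_region0_addrs(text):
    """Return the Region 0 memory addresses found in lspci -vv output."""
    pattern = r"Region 0: Memory at ([0-9A-Fa-f]+)"
    return [int(m.group(1), 16) for m in re.finditer(pattern, text)]


def moved_bars(dmesg_text, lspci_text):
    """Return Region 0 addresses that match no I/O APIC address from dmesg."""
    boot_addrs = set(dmesg_ioapic_addrs(dmesg_text).values())
    return [a for a in lspci_region0_addrs(lspci_text) if a not in boot_addrs]


# Lines copied from the dmesg and lspci excerpts in this message.
DMESG = """\
[ 0.000000] I/O APIC #2 at 0xFEC00000.
[ 0.000000] I/O APIC #3 at 0xE8000000.
[ 0.000000] I/O APIC #4 at 0xE8001000.
"""
LSPCI = """\
Region 0: Memory at 88100000 (64-bit, non-prefetchable) [size=4K]
Region 0: Memory at 88101000 (64-bit, non-prefetchable) [size=4K]
"""

for addr in moved_bars(DMESG, LSPCI):
    print(f"IOAPIC BAR at 0x{addr:08X} does not match any address from dmesg")
```

With the excerpts above, both 8131 IOAPIC BARs are reported as not matching, which is consistent with the devices having been relocated after the boot-time addresses were recorded.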

The following patch in 2.6.18-mm1 is known to have that effect.
x86_64-mm-insert-ioapics-and-local-apic-into-resource-map

Can you please try reverting that one patch?

There is a fix, an updated version of that patch, I think in -mm2,
but I haven't had a chance to see if it fixes the problem yet.

Eric

2006-10-02 23:18:24

by Badari Pulavarty

Subject: Re: Network problem with 2.6.18-mm1 ?

On Fri, 2006-09-29 at 17:30 -0600, Eric W. Biederman wrote:
....
>
> So it looks like the kernel moved the ioapics.
>
> The following patch in 2.6.18-mm1 is known to have that effect.
> x86_64-mm-insert-ioapics-and-local-apic-into-resource-map
>
> Can you please try reverting that one patch?
>
> There is a fix, an updated version of that patch, I think in -mm2,
> but I haven't had a chance to see if it fixes the problem yet.
>

Bingo !! Reverting this patch fixed my networking problem on
2.6.18-mm2.

Thanks,
Badari