2003-01-24 06:40:22

by Michael Fu

Subject: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec

After the kernel was booted through the kexec command, the NIC failed to
initialize. The 2.5.52 kernel was patched with the kexec and kexec-hwfix
patches.

The following is the dmesg output:


Linux version 2.5.52 (root@aminoacin) (gcc version 2.96 20000731 (Red
Hat Linux 7.1 2.96-81)) #1 SMP Fri Jan 24 14:17:58 CST 2003
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 0000000000100000 - 000000000ff87000 (usable)
BIOS-e820: 000000000ff87000 - 000000000ffa6000 (ACPI data)
BIOS-e820: 000000000ffa6000 - 0000000010000000 (reserved)
BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
255MB LOWMEM available.
On node 0 totalpages: 65415
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 61319 pages, LIFO batch:14
HighMem zone: 0 pages, LIFO batch:1
ACPI: RSDP (v000 DELL ) @ 0x000fd730
ACPI: RSDT (v001 DELL GX400 00000.00005) @ 0x000fd744
ACPI: FADT (v001 DELL GX400 00000.00005) @ 0x000fd774
ACPI: SSDT (v001 DELL st_ex 00000.04096) @ 0xfffe7279
ACPI: BOOT (v001 DELL GX400 00000.00005) @ 0x000fd7e8
ACPI: DSDT (v001 DELL dt_ex 00000.04096) @ 0x00000000
ACPI: BIOS passes blacklist
ACPI: MADT not present
Building zonelist for node : 0
Kernel command line: auto ro root=/dev/hda5
No local APIC present or hardware disabled
Initializing CPU#0
Detected 1993.714 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 3932.16 BogoMIPS
Memory: 253164k/261660k available (2804k kernel code, 7792k reserved,
1539k data, 140k init, 0k highmem)
Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: Before vendor init, caps: 3febf9ff 00000000 00000000, vendor = 0
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 256K
CPU: Hyper-Threading is disabled
CPU: After vendor init, caps: 3febf9ff 00000000 00000000 00000000
CPU: After generic, caps: 3febf9ff 00000000 00000000 00000000
CPU: Common caps: 3febf9ff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU#0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU#0: Thermal monitoring enabled
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
CPU0: Intel(R) Pentium(R) 4 CPU 2.00GHz stepping 02
per-CPU timeslice cutoff: 731.31 usecs.
task migration cache decay timeout: 1 msecs.
SMP motherboard not detected.
Local APIC not detected. Using dummy APIC emulation.
Starting migration thread for cpu 0
CPUS done 32
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
device class 'cpu': registering
device class cpu: adding driver system:cpu
PCI: PCI BIOS revision 2.10 entry at 0xfc0be, last bus=2
PCI: Using configuration type 1
device class cpu: adding device CPU 0
interfaces: adding device CPU 0
BIO: pool of 256 setup, 14Kb (56 bytes/bio)
biovec pool[0]: 1 bvecs: 256 entries (12 bytes)
biovec pool[1]: 4 bvecs: 256 entries (48 bytes)
biovec pool[2]: 16 bvecs: 256 entries (192 bytes)
biovec pool[3]: 64 bvecs: 256 entries (768 bytes)
biovec pool[4]: 128 bvecs: 256 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 256 entries (3072 bytes)
ACPI: Subsystem revision 20021212
tbxface-0099 [03] Acpi_load_tables : ACPI Tables successfully
acquired
Parsing all Control
Methods:.............................................................................................................
Table [DSDT] - 297 Objects with 29 Devices 109 Methods 19 Regions
Parsing all Control Methods:
Table [SSDT] - 0 Objects with 0 Devices 0 Methods 0 Regions
ACPI Namespace successfully loaded at root c05b2b7c
evxfevnt-0063 [04] Acpi_enable : System is already in ACPI
mode
evgpe-0259: *** Info: GPE Block0 defined as GPE0 to GPE15
evgpe-0259: *** Info: GPE Block1 defined as GPE16 to GPE31
Executing all Device _STA and_INI methods:.............................
29 Devices found containing: 29 _STA, 3 _INI methods
Completing Region/Field/Buffer/Package
initialization:....................................
Initialized 13/19 Regions 0/0 Fields 10/10 Buffers 13/13 Packages (304
nodes)
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (00:00)
PCI: Probing PCI hardware (bus 00)
Transparent bridge - Intel Corp. 82801BA/CA/DB PCI Br
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 *10 11 12 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 *10 11 12 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11 12 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11 12 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 *5 6 7 9 10 11 12 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 *9 10 11 12 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 7 9 10 11 12 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 *9 10 11 12 15)
Linux Plug and Play Support v0.93 (c) Adam Belay
block request queues:
128 requests per read queue
128 requests per write queue
8 requests per batch
enter congestion at 31
exit congestion at 33
SCSI subsystem driver Revision: 1.00
device class 'scsi-host': registering
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: if you experience problems, try using option 'pci=noacpi' or even
'acpi=off'
SBF: Simple Boot Flag extension found and enabled.
SBF: Setting boot flags 0x80
aio_setup: sizeof(struct page) = 40
Journalled Block Device driver loaded
Installing knfsd (copyright (C) 1996 [email protected]).
udf: registering filesystem
ACPI: Power Button (FF) [PWRF]
ACPI: Processor [CPU0] (supports C1)
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 (0x778) [PCSPP,TRISTATE,EPP]
parport0: irq 7 detected
parport0: cpp_daisy: aa5500ff(08)
parport0: assign_addrs: aa5500ff(08)
parport0: cpp_daisy: aa5500ff(08)
parport0: assign_addrs: aa5500ff(08)
pty: 256 Unix98 ptys configured
lp0: using parport0 (polling).
i810_rng: RNG not detected
Linux agpgart interface v1.0 (c) Dave Jones
agpgart: Detected Intel i850 chipset
agpgart: Maximum main memory to use for agp memory: 203M
agpgart: AGP aperture is 64M @ 0xf8000000
[drm] AGP 1.0 on Intel i850 @ 0xf8000000 64MB
[drm] Initialized radeon 1.7.0 20020828 on minor 0
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/100 Network Driver - version 2.1.24-k2
Copyright (c) 2002 Intel Corporation






PCI: Enabling device 02:09.0 (0000 -> 0003)
PCI: Setting latency timer of device 02:09.0 to 64
e100: selftest timeout
e100: Failed to initialize, instance #0






Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with
idebus=xx
ICH2: IDE controller at PCI slot 00:1f.1
ICH2: chipset revision 4
ICH2: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio
hda: IC35L040AVER07-0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hdc: SAMSUNG CD-ROM SC-148C, ATAPI CD/DVD-ROM drive
hdc: Disabling (U)DMA for SAMSUNG CD-ROM SC-148C
ide1 at 0x170-0x177,0x376 on irq 15
hda: host protected area => 1
hda: 78165360 sectors (40021 MB) w/1916KiB Cache, CHS=77545/16/63,
UDMA(100)
hda: hda1 hda2 < hda5 hda6 >
hdc: ATAPI 48X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
end_request: I/O error, dev hdc, sector 0
scsi HBA driver <NULL> didn't set max_sectors, please fix the template
scsi HBA driver Qlogic ISP 1280/12160 didn't set max_sectors, please fix
the template
request_module[scsi_hostadapter]: not ready
request_module[scsi_hostadapter]: not ready
request_module[scsi_hostadapter]: not ready
Linux Kernel Card Services 3.1.22
options: [pci] [cardbus] [pm]
Initializing USB Mass Storage driver...
drivers/usb/core/usb.c: registered new driver usb-storage
USB Mass Storage support registered.
device class 'input': registering
register interface 'mouse' with class 'input'
mice: PS/2 mouse device common for all mice
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10
19:48:18 2002 UTC).
request_module[snd-card-0]: not ready
request_module[snd-card-1]: not ready
request_module[snd-card-2]: not ready
request_module[snd-card-3]: not ready
request_module[snd-card-4]: not ready
request_module[snd-card-5]: not ready
request_module[snd-card-6]: not ready
request_module[snd-card-7]: not ready
PCI: Setting latency timer of device 00:1f.5 to 64
intel8x0: clocking to 41138
ALSA device list:
#0: Intel 82801BA-ICH2 at 0xd800, irq 10
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 1024 buckets, 16Kbytes
TCP: Hash tables configured (established 8192 bind 10922)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
ds: no socket drivers loaded!
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 140k freed
Adding 530104k swap on /dev/hda6. Priority:-1 extents:1
warning: process `update' used the obsolete bdflush system call
Fix your initscripts?
warning: process `update' used the obsolete bdflush system call
Fix your initscripts?



I doubt this is a bug in E100 actually.
--
Michael Fu <[email protected]>
Not speaking for Intel, opinions are my own


2003-01-24 13:53:16

by Eric W. Biederman

Subject: Re: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec

Michael Fu <[email protected]> writes:

> After the kernel was booted through the kexec command, the NIC failed to
> initialize. The 2.5.52 kernel was patched with the kexec and kexec-hwfix
> patches.

Interesting... The patch goes cleanly onto newer kernels so feel
free to play with them. You are running a single cpu system
so the kexec-hwfix patch should not make a difference at this point.

Your interrupt routing is via ACPI, interesting...

>
> The following is the dmesg output:

[snip]
> Intel(R) PRO/100 Network Driver - version 2.1.24-k2
> Copyright (c) 2002 Intel Corporation
>
>
>
>
>
>
> PCI: Enabling device 02:09.0 (0000 -> 0003)
> PCI: Setting latency timer of device 02:09.0 to 64
> e100: selftest timeout
> e100: Failed to initialize, instance #0

[snip]

> I doubt this is a bug in E100 actually.

Given that everything else was working correctly this is almost
certainly an e100 driver or a hardware bug. On x86 everything has
been working well enough that finding something that is not a
hardware/driver bug as a failure case is currently quite a challenge.

Q1: Is this reproducible?
Q2: Is this reproducible with the eepro100 driver?

You were doing the easy case of 2.5.52 to 2.5.52. I have gotten so many
false positives, with things working when I reboot the exact same kernel,
that I barely consider it a valid test case any more...

If it is a bug in the driver, a shutdown method can be used to clean up
before reboot and place the device in a quiescent state.
Either that, or the driver's initialization code can be enhanced to
handle more strange states.

I know the eepro100 driver issues a reset before playing with the
card. The e100 driver is doing this in a different order, and it is
dying before it resets the card so that looks like the issue to me.
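
To make the ordering argument concrete, here is a minimal user-space sketch. It is a toy model, not driver code: `struct toy_nic` and every `toy_*` and `init_*` function are hypothetical names invented for illustration, and the single `busy` flag stands in for whatever state the previous kernel may have left on the card.

```c
#include <stdbool.h>

/* Toy stand-in for a NIC; this is NOT the real e100/eepro100 register layout. */
struct toy_nic {
    bool busy; /* left set if the previous kernel never quiesced the card */
};

/* Hypothetical reset: puts the toy device back into a known idle state. */
static void toy_reset(struct toy_nic *nic)
{
    nic->busy = false;
}

/* Hypothetical self-test: only completes if the toy device is idle. */
static bool toy_selftest(const struct toy_nic *nic)
{
    return !nic->busy;
}

/* Ordering A: self-test first, reset afterwards.
 * Fails when the device is inherited in a busy state. */
static bool init_selftest_first(struct toy_nic *nic)
{
    if (!toy_selftest(nic))
        return false;
    toy_reset(nic);
    return true;
}

/* Ordering B: reset first, then self-test.
 * Tolerates whatever state the previous owner left behind. */
static bool init_reset_first(struct toy_nic *nic)
{
    toy_reset(nic);
    return toy_selftest(nic);
}

/* Shutdown hook run by the outgoing kernel: quiesce before handing over. */
static void toy_shutdown(struct toy_nic *nic)
{
    toy_reset(nic);
}
```

In this model either remedy is enough on its own: calling `toy_shutdown()` before the handover lets `init_selftest_first()` succeed, and `init_reset_first()` succeeds even without a clean shutdown.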

Doing a clean user space shutdown may also help, though your kexecwrapper
script looked like it was probably doing that OK.

Eric

2003-01-24 14:48:52

by Andrey Nekrasov

Subject: Re: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec

Hello Eric W. Biederman,

Once you wrote about "Re: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec":
> Michael Fu <[email protected]> writes:
>
> > After the kernel was booted through the kexec command, the NIC failed to
> > initialize. The 2.5.52 kernel was patched with the kexec and kexec-hwfix
> > patches.
>
> Interesting... The patch goes cleanly onto newer kernels so feel
> free to play with them. You are running a single cpu system
> so the kexec-hwfix patch should not make a difference at this point.
>
> Your interrupt routing is via ACPI, interesting...
>
> >
> > The following is the dmesg output:
>
> [snip]
> > Intel(R) PRO/100 Network Driver - version 2.1.24-k2
> > Copyright (c) 2002 Intel Corporation
> >
> >
> >
> >
> >
> >
> > PCI: Enabling device 02:09.0 (0000 -> 0003)
> > PCI: Setting latency timer of device 02:09.0 to 64
> > e100: selftest timeout
> > e100: Failed to initialize, instance #0

I use an EEPro100+ NIC on an Intel STL2 motherboard. With the "eepro100" driver, everything works OK.

Or with the "e100" driver and this patch:


--- drivers/net/e100/e100.h- Wed Dec 4 15:16:08 2002
+++ drivers/net/e100/e100.h Wed Dec 4 15:16:20 2002
@@ -100,7 +100,7 @@

#define E100_MAX_NIC 16

-#define E100_MAX_SCB_WAIT 100 /* Max udelays in wait_scb */
+#define E100_MAX_SCB_WAIT 5000 /* Max udelays in wait_scb */
#define E100_MAX_CU_IDLE_WAIT 50 /* Max udelays in wait_cus_idle */

/* HWI feature related constant */

everything works OK.
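
For readers unfamiliar with what such a constant bounds, the sketch below shows a bounded polling loop in user-space C. It is an illustration only, assuming a device that becomes ready after some number of polls. `struct toy_scb`, `toy_scb_ready()` and `toy_wait_scb()` are hypothetical names, not the e100 driver's code, and the real driver would delay between polls rather than spin.

```c
#include <stdbool.h>

/* Toy device: reports "ready" only after polls_needed unsuccessful polls. */
struct toy_scb {
    int polls_needed; /* how many polls must fail before the device is ready */
    int polls_seen;   /* how many polls have been made so far */
};

/* Hypothetical status check; each call counts as one poll. */
static bool toy_scb_ready(struct toy_scb *scb)
{
    return ++scb->polls_seen > scb->polls_needed;
}

/* Bounded poll: give up after max_wait attempts.
 * A real driver would insert a short delay between attempts. */
static bool toy_wait_scb(struct toy_scb *scb, int max_wait)
{
    for (int i = 0; i < max_wait; i++) {
        if (toy_scb_ready(scb))
            return true;
    }
    return false;
}
```

With a device that needs 3000 polls, a bound of 100 gives up while a bound of 5000 succeeds. Raising the bound only lengthens the wait; it does not change why the device is slow to respond.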

2003-01-24 15:58:44

by Jeff Garzik

Subject: Re: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec

On Fri, Jan 24, 2003 at 05:57:55PM +0300, Andrey Nekrasov wrote:
> or "e100" driver and with patch:
>
>
> --- drivers/net/e100/e100.h- Wed Dec 4 15:16:08 2002
> +++ drivers/net/e100/e100.h Wed Dec 4 15:16:20 2002
> @@ -100,7 +100,7 @@
>
> #define E100_MAX_NIC 16
>
> -#define E100_MAX_SCB_WAIT 100 /* Max udelays in wait_scb */
> +#define E100_MAX_SCB_WAIT 5000 /* Max udelays in wait_scb */
> #define E100_MAX_CU_IDLE_WAIT 50 /* Max udelays in wait_cus_idle */




No, don't use this patch, it's awful. The latest Marcelo tree in
BitKeeper has this fixed... the right way. See the following e100
patch, which is what Intel emailed me, and what I merged into the
Marcelo tree.

Jeff




# --------------------------------------------
# 03/01/16 [email protected] 1.884.23.2
# [netdrvr e100] udelay a better way
# * Bug Fix: TCO workaround after hard reset of controller to wait for TCO
# traffic to settle. Workaround requires issuing a CU load base command
# after hard reset, followed by a wait for scb and finally a wait for
# TCO traffic bit to clear. Affects 82559s and above wired to SMBus.
# --------------------------------------------
#
diff -Nru a/drivers/net/e100/e100_main.c b/drivers/net/e100/e100_main.c
--- a/drivers/net/e100/e100_main.c Fri Jan 24 11:06:35 2003
+++ b/drivers/net/e100/e100_main.c Fri Jan 24 11:06:35 2003
@@ -196,6 +196,7 @@
 char *e100_get_brand_msg(struct e100_private *);
 static u8 e100_pci_setup(struct pci_dev *, struct e100_private *);
 static u8 e100_sw_init(struct e100_private *);
+static void e100_tco_walkaround(struct e100_private *);
 static unsigned char e100_alloc_space(struct e100_private *);
 static void e100_dealloc_space(struct e100_private *);
 static int e100_alloc_tcb_pool(struct e100_private *);
@@ -213,7 +214,7 @@

 static unsigned char e100_clr_cntrs(struct e100_private *);
 static unsigned char e100_load_microcode(struct e100_private *);
-static unsigned char e100_hw_init(struct e100_private *, u32);
+static unsigned char e100_hw_init(struct e100_private *);
 static unsigned char e100_setup_iaaddr(struct e100_private *, u8 *);
 static unsigned char e100_update_stats(struct e100_private *bdp);

@@ -1265,7 +1266,7 @@
         /* read NIC's part number */
         e100_rd_pwa_no(bdp);

-        if (!e100_hw_init(bdp, PORT_SOFTWARE_RESET)) {
+        if (!e100_hw_init(bdp)) {
                 printk(KERN_ERR "e100: hw init failed\n");
                 return false;
         }
@@ -1314,10 +1315,46 @@
         return 1;
 }

+static void __devinit
+e100_tco_walkaround(struct e100_private *bdp)
+{
+        int i;
+
+        /* Do software reset */
+        e100_sw_reset(bdp, PORT_SOFTWARE_RESET);
+
+        /* Do a dummy LOAD CU BASE command. */
+        /* This gets us out of pre-driver to post-driver. */
+        e100_exec_cmplx(bdp, 0, SCB_CUC_LOAD_BASE);
+
+        /* Wait 20 msec for reset to take effect */
+        set_current_state(TASK_UNINTERRUPTIBLE);
+        schedule_timeout(HZ / 50);
+
+        /* disable interrupts since they are enabled */
+        /* after device reset */
+        e100_disable_clear_intr(bdp);
+
+        /* Wait for command to be cleared up to 1 sec */
+        for (i=0; i<1000; i++) {
+                if (!readb(&bdp->scb->scb_cmd_low))
+                        break;
+                set_current_state(TASK_UNINTERRUPTIBLE);
+                schedule_timeout(HZ / 1000);
+        }
+
+        /* Wait for TCO request bit in PMDR register to be clear */
+        for (i=0; i<500; i++) {
+                if (!(readb(&bdp->scb->scb_ext.d101m_scb.scb_pmdr) & BIT_1))
+                        break;
+                set_current_state(TASK_UNINTERRUPTIBLE);
+                schedule_timeout(HZ / 1000);
+        }
+}
+
 /**
  * e100_hw_init - initialized tthe hardware
  * @bdp: atapter's private data struct
- * @reset_cmd: s/w reset or selective reset
  *
  * This routine performs a reset on the adapter, and configures the adapter.
  * This includes configuring the 82557 LAN controller, validating and setting
@@ -1329,13 +1366,16 @@
  * false - If the adapter failed initialization
  */
 unsigned char __devinit
-e100_hw_init(struct e100_private *bdp, u32 reset_cmd)
+e100_hw_init(struct e100_private *bdp)
 {
         if (!e100_phy_init(bdp))
                 return false;

-        /* Issue a software reset to the e100 */
-        e100_sw_reset(bdp, reset_cmd);
+        e100_sw_reset(bdp, PORT_SELECTIVE_RESET);
+
+        /* Only 82559 or above needs TCO walkaround */
+        if (bdp->rev_id >= D101MA_REV_ID)
+                e100_tco_walkaround(bdp);

         /* Load the CU BASE (set to 0, because we use linear mode) */
         if (!e100_wait_exec_cmplx(bdp, 0, SCB_CUC_LOAD_BASE, 0))

Subject: Re: [BUG] e100 driver fails to initialize the hardware after kernel bootup through kexec

Jeff Garzik <[email protected]> writes:

>+ /* Wait 20 msec for reset to take effect */
>+ set_current_state(TASK_UNINTERRUPTIBLE);
>+ schedule_timeout(HZ / 50);

Hm. This assumes HZ=100, doesn't it?

>+ /* Wait for command to be cleared up to 1 sec */
>+ for (i=0; i<1000; i++) {
>+ if (!readb(&bdp->scb->scb_cmd_low))
>+ break;
>+ set_current_state(TASK_UNINTERRUPTIBLE);
>+ schedule_timeout(HZ / 1000);
>+ }

HZ = 100 -> HZ / 1000 == 0 ?
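
The arithmetic behind this question can be checked in user space. The sketch below is a minimal illustration in plain C with the tick rate passed as a parameter. `naive_ticks_per_ms()` and `ms_to_ticks()` are hypothetical helper names, not kernel functions, and rounding up is just one possible way to avoid a zero-tick request.

```c
/* Conversion as written in the quoted loops: one millisecond expressed as
 * hz / 1000. Integer division truncates, so this is 0 whenever hz < 1000. */
static long naive_ticks_per_ms(long hz)
{
    return hz / 1000;
}

/* Hypothetical alternative: convert milliseconds to ticks, rounding up,
 * so that any non-zero wait asks for at least one tick. */
static long ms_to_ticks(long ms, long hz)
{
    return (ms * hz + 999) / 1000;
}
```

At a tick rate of 100, `naive_ticks_per_ms(100)` is 0, so each loop iteration would request a zero-tick sleep, while `ms_to_ticks(1, 100)` is 1. The 20 msec wait written as `HZ / 50` is unaffected: it gives 2 ticks at a rate of 100 and 20 ticks at a rate of 1000, which is 20 msec in both cases.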

This whole patch scares me. :-)

Regards
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20