LinuxLists.cc - Re: Gianfar driver failing on MPC8641D based board

Subject: Re: Gianfar driver failing on MPC8641D based board

Martyn Welch wrote:
> Martyn Welch wrote:
>
>> I have recently attempted to boot an 8641D based board from an NFS root.
>> The boot process grinds to a halt not long after the first access of the
>> NFS root and I receive multiple "nfs: server 192.168.0.1 not responding,
>> still trying" messages. Wireshark suggests that there is no further
>> traffic from this board at this point on. The NFS server seems to
>> eventually try sending duplicate packets it's already sent, which
>> results in "nfs: server 192.168.0.1 OK" messages, but the "not
>> responding" messages resume with no further traffic from the board.
>>
>> I am able to boot to a ramdisk fine and the network seems to work -
>> though I haven't really pushed the interface from it.
>>
>> I have attempted to git bisect, though I wasn't able to get much further
>> than discovering the problem was introduced in the 2.6.33 merge window -
>> at which point the gianfar network driver fails to compile (I have tried
>> to git bisect skip many, many times to no avail).
>>
>> NFS booting fails for this board on today's linux-next, the master branch
>> of Kumar's PPC tree and the head of the main tree. I have also been able
>> to NFS boot from a random x86 based board that I have, using the head of
>> the main tree and the linux-next tree.
>>
>> Copying the gianfar drivers from 2.6.32 into the head of the main tree
>> restores the correct behaviour and I'm able to NFS boot. I have heard
>> from others that the latest drivers work on 83xx and 85xx based boards,
>> but it seems to be broken on at least the 8641D.
>>
>> I can see there has been a fair amount of work done on the gianfar
>> driver, I assume that this is a bug introduced by the multiple queue
>> support, but I'm way out of my depth on this.
>>
>>
> I have just compiled 2.6.33 for the Freescale MPC8641_HPCN demo board
> and am still experiencing the problems outlined in my previous
> email, though I have noticed that I tend to be able to boot from cold,
> but my boot fails on reboot. Hitting the reset button doesn't help, I
> need to actually power the machine on and off again for it to work.
>
> As before, I'm way out of my depth in this. Does anyone have any ideas?
> Below is a dump of the failed boot process:
>
> U-Boot 2009.01-00181-gc1b7c70 (Jan 30 2009 - 11:17:31)
>
> Freescale PowerPC
> CPU:
> Core: E600 Core 0, Version: 0.2, (0x80040202)
> System: Unknown, Version: 2.0, (0x80900120)
> Clocks: CPU:1000 MHz, MPX: 400 MHz, DDR: 200 MHz, LBC: 25 MHz
> L2: Enabled
> Board: MPC8641HPCN, System ID: 0x10, System Version: 0x10, FPGA Version:
> 0x22
> I2C: ready
> DRAM: DDR: 1 GB
> FLASH: 8 MB
> Invalid ID (ff ff ff ff)
> Scanning PCI bus 01
> PCI-EXPRESS 1 on bus 00 - 02
> PCI-EXPRESS 2 on bus 03 - 03
> Video: No radeon video card found!
> In: serial
> Out: serial
> Err: serial
> SCSI: AHCI 0001.0000 32 slots 4 ports 3 Gbps 0xf impl IDE mode
> flags: ncq ilck pm led clo pmp pio slum part
> scanning bus for devices...
> Net: eTSEC1, eTSEC2, eTSEC3, eTSEC4
> => tftp 4000000 hpcn/uImage-torvalds-linux-2.6
> Speed: 1000, full duplex
> Using eTSEC1 device
> TFTP from server 192.168.0.1; our IP address is 192.168.0.30
> Filename 'hpcn/uImage-torvalds-linux-2.6'.
> Load address: 0x4000000
> Loading: #################################################################
> #################################################################
> #######################################################
> done
> Bytes transferred = 2709050 (29563a hex)
> => tftp 5000000 hpcn/mpc8641_hpcn-torvalds-linux-2.6.dtb
> Speed: 1000, full duplex
> Using eTSEC1 device
> TFTP from server 192.168.0.1; our IP address is 192.168.0.30
> Filename 'hpcn/mpc8641_hpcn-torvalds-linux-2.6.dtb'.
> Load address: 0x5000000
> Loading: #
> done
> Bytes transferred = 11523 (2d03 hex)
> => setenv bootargs "root=/dev/nfs rw
> nfsroot=192.168.0.1:/tftpboot/hpcn/root/ i"
> => bootm 4000000 - 5000000
> WARNING: adjusting available memory to 10000000
> ## Booting kernel from Legacy Image at 04000000 ...
> Image Name: Linux-2.6.33-00001-gbaac35c
> Image Type: PowerPC Linux Kernel Image (gzip compressed)
> Data Size: 2708986 Bytes = 2.6 MB
> Load Address: 00000000
> Entry Point: 00000000
> Verifying Checksum ... OK
> ## Flattened Device Tree blob at 05000000
> Booting using the fdt blob at 0x5000000
> Uncompressing Kernel Image ... OK
> Loading Device Tree to 007fa000, end 007ffd02 ... OK
> Using MPC86xx HPCN machine description
> Total memory = 1024MB; using 2048kB for hash table (at cfe00000)
> Linux version 2.6.33-00001-gbaac35c (welchma@ES-J7S4D2J) (gcc version
> 4.1.2) #20
> CPU maps initialized for 1 thread per core
> bootconsole [udbg0] enabled
> setup_arch: bootmem
> mpc86xx_hpcn_setup_arch()
> Found FSL PCI host bridge at 0x00000000ffe08000. Firmware bus number: 0->2
> PCI host bridge /pcie@ffe08000 (primary) ranges:
> MEM 0x0000000080000000..0x000000009fffffff -> 0x0000000080000000
> IO 0x00000000ffc00000..0x00000000ffc0ffff -> 0x0000000000000000
> /pcie@ffe08000: PCICSRBAR @ 0xfff00000
> Found FSL PCI host bridge at 0x00000000ffe09000. Firmware bus number: 0->0
> PCI host bridge /pcie@ffe09000 ranges:
> MEM 0x00000000a0000000..0x00000000bfffffff -> 0x00000000a0000000
> IO 0x00000000ffc10000..0x00000000ffc1ffff -> 0x0000000000000000
> /pcie@ffe09000: PCICSRBAR @ 0xfff00000
> MPC86xx HPCN board from Freescale Semiconductor
> arch: exit
> Zone PFN ranges:
> DMA 0x00000000 -> 0x00030000
> Normal 0x00030000 -> 0x00030000
> HighMem 0x00030000 -> 0x00040000
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 0: 0x00000000 -> 0x00040000
> PERCPU: Embedded 7 pages/cpu @c1003000 s7712 r8192 d12768 u65536
> pcpu-alloc: s7712 r8192 d12768 u65536 alloc=16*4096
> pcpu-alloc: [0] 0 [0] 1
> Built 1 zonelists in Zone order, mobility grouping on. Total pages: 260096
> Kernel command line: root=/dev/nfs rw
> nfsroot=192.168.0.1:/tftpboot/hpcn/root/ p
> PID hash table entries: 4096 (order: 2, 16384 bytes)
> Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> Memory: 1030864k/1048576k available (5228k kernel code, 17004k reserved,
> 196k d)
> Kernel virtual memory layout:
> * 0xfffc1000..0xfffff000 : fixmap
> * 0xff800000..0xffc00000 : highmem PTEs
> * 0xff7da000..0xff800000 : early ioremap
> * 0xf1000000..0xff7da000 : vmalloc & ioremap
> SLUB: Genslabs=13, HWalign=32, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
> Hierarchical RCU implementation.
> NR_IRQS:512 nr_irqs:512
> mpic: Setting up MPIC " MPIC " version 1.2 at ffe40000, max 2 CPUs
> mpic: ISU size: 256, shift: 8, mask: ff
> mpic: Initializing for 256 sources
> i8259 legacy interrupt controller initialized
> clocksource: timebase mult[2800000] shift[22] registered
> Console: colour dummy device 80x25
> Mount-cache hash table entries: 512
> mpic: requesting IPIs ...
> Processor 1 found.
> Brought up 2 CPUs
> NET: Registered protocol family 16
>
> PCI: Probing PCI hardware
> pci 0000:00:00.0: ignoring class b20 (doesn't match header type 01)
> pci 0000:00:00.0: PCI bridge to [bus 01-ff]
> pci 0000:02:1d.0: unsupported PM cap regs version (4)
> pci 0000:01:00.0: PCI bridge to [bus 02-ff] (subtractive decode)
> pci 0001:03:00.0: ignoring class b20 (doesn't match header type 01)
> pci 0001:03:00.0: PCI bridge to [bus 04-ff]
> pci 0000:01:00.0: PCI bridge to [bus 02-02]
> pci 0000:01:00.0: bridge window [io 0x1000-0x1fff]
> pci 0000:01:00.0: bridge window [mem 0x80000000-0x800fffff]
> pci 0000:01:00.0: bridge window [mem pref disabled]
> pci 0000:00:00.0: PCI bridge to [bus 01-02]
> pci 0000:00:00.0: bridge window [io 0x0000-0xffff]
> pci 0000:00:00.0: bridge window [mem 0x80000000-0x9fffffff]
> pci 0000:00:00.0: bridge window [mem pref disabled]
> pci 0000:00:00.0: enabling device (0106 -> 0107)
> pci 0001:03:00.0: PCI bridge to [bus 04-04]
> pci 0001:03:00.0: bridge window [io 0xfffee000-0xffffdfff]
> pci 0001:03:00.0: bridge window [mem 0xa0000000-0xbfffffff]
> pci 0001:03:00.0: bridge window [mem pref disabled]
> pci 0001:03:00.0: enabling device (0106 -> 0107)
> bio: create slab <bio-0> at 0
> vgaarb: loaded
> SCSI subsystem initialized
> usbcore: registered new interface driver usbfs
> usbcore: registered new interface driver hub
> usbcore: registered new device driver usb
> Advanced Linux Sound Architecture Driver Version 1.0.21.
> Switching to clocksource timebase
> NET: Registered protocol family 2
> IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
> TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
> TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
> TCP: Hash tables configured (established 131072 bind 65536)
> TCP reno registered
> UDP hash table entries: 512 (order: 2, 16384 bytes)
> UDP-Lite hash table entries: 512 (order: 2, 16384 bytes)
> NET: Registered protocol family 1
> RPC: Registered udp transport module.
> RPC: Registered tcp transport module.
> RPC: Registered tcp NFSv4.1 backchannel transport module.
> audit: initializing netlink socket (disabled)
> type=2000 audit(0.144:1): initialized
> highmem bounce pool size: 64 pages
> Installing knfsd (copyright (C) 1996 [email protected]).
> NTFS driver 2.1.29 [Flags: R/O].
> msgmni has been set to 1502
> alg: No test for stdrng (krng)
> io scheduler noop registered
> io scheduler deadline registered
> io scheduler cfq registered (default)
> Generic non-volatile memory driver v1.1
> Serial: 8250/16550 driver, 2 ports, IRQ sharing enabled
> serial8250.0: ttyS0 at MMIO 0xffe04500 (irq = 42) is a 16550A
> console [ttyS0] enabled, bootconsole disabled
> console [ttyS0] enabled, bootconsole disabled
> serial8250.0: ttyS1 at MMIO 0xffe04600 (irq = 28) is a 16550A
> brd: module loaded
> loop: module loaded
> nbd: registered device at major 43
> st: Version 20081215, fixed bufsize 32768, s/g segs 256
> ahci 0000:02:1f.1: AHCI 0001.0000 32 slots 4 ports 3 Gbps 0xf impl SATA mode
> ahci 0000:02:1f.1: flags: ncq sntf ilck pm led clo pmp pio slum part
> scsi0 : ahci
> scsi1 : ahci
> scsi2 : ahci
> scsi3 : ahci
> ata1: SATA max UDMA/133 abar m1024@0x80006000 port 0x80006100 irq 5
> ata2: SATA max UDMA/133 abar m1024@0x80006000 port 0x80006180 irq 5
> ata3: SATA max UDMA/133 abar m1024@0x80006000 port 0x80006200 irq 5
> ata4: SATA max UDMA/133 abar m1024@0x80006000 port 0x80006280 irq 5
> scsi4 : pata_ali
> scsi5 : pata_ali
> ata5: PATA max UDMA/133 cmd 0x1200 ctl 0x1208 bmdma 0x1220 irq 14
> ata6: PATA max UDMA/133 cmd 0x1210 ctl 0x1218 bmdma 0x1228 irq 14
> eth0: Gianfar Ethernet Controller Version 1.2, 00:e0:0c:00:00:01
> eth0: Running with NAPI enabled
> eth0: :RX BD ring size for Q[0]: 256
> eth0:TX BD ring size for Q[0]: 256
> eth1: Gianfar Ethernet Controller Version 1.2, 00:e0:0c:00:01:fd
> eth1: Running with NAPI enabled
> eth1: :RX BD ring size for Q[0]: 256
> eth1:TX BD ring size for Q[0]: 256
> eth2: Gianfar Ethernet Controller Version 1.2, 00:e0:0c:00:02:fd
> eth2: Running with NAPI enabled
> eth2: :RX BD ring size for Q[0]: 256
> eth2:TX BD ring size for Q[0]: 256
> eth3: Gianfar Ethernet Controller Version 1.2, 00:e0:0c:00:03:fd
> eth3: Running with NAPI enabled
> eth3: :RX BD ring size for Q[0]: 256
> eth3:TX BD ring size for Q[0]: 256
> Freescale PowerQUICC MII Bus: probed
> Freescale PowerQUICC MII Bus: probed
> Freescale PowerQUICC MII Bus: probed
> Freescale PowerQUICC MII Bus: probed
> usbmon: debugfs is not available
> ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> ehci_hcd 0000:02:1c.3: EHCI Host Controller
> ehci_hcd 0000:02:1c.3: new USB bus registered, assigned bus number 1
> ehci_hcd 0000:02:1c.3: debug port 1
> ehci_hcd 0000:02:1c.3: Enabling legacy PCI PM
> ehci_hcd 0000:02:1c.3: irq 11, io mem 0x80003000
> ehci_hcd 0000:02:1c.3: USB 2.0 started, EHCI 1.00
> hub 1-0:1.0: USB hub found
> hub 1-0:1.0: 8 ports detected
> ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
> ohci_hcd 0000:02:1c.0: OHCI Host Controller
> ata5.00: ATAPI: SONY DVD RW AW-Q170A, 1.73, max UDMA/66
> ata5.00: WARNING: ATAPI DMA disabled for reliability issues. It can be
> enabled
> ata5.00: WARNING: via pata_ali.atapi_dma modparam or corresponding sysfs
> node.
> ata5.00: configured for UDMA/66
> ohci_hcd 0000:02:1c.0: new USB bus registered, assigned bus number 2
> ohci_hcd 0000:02:1c.0: irq 12, io mem 0x80000000
> hub 2-0:1.0: USB hub found
> hub 2-0:1.0: 3 ports detected
> ohci_hcd 0000:02:1c.1: OHCI Host Controller
> ohci_hcd 0000:02:1c.1: new USB bus registered, assigned bus number 3
> ohci_hcd 0000:02:1c.1: irq 9, io mem 0x80001000
> ata3: SATA link down (SStatus 0 SControl 300)
> ata1: SATA link down (SStatus 0 SControl 300)
> ata4: SATA link down (SStatus 0 SControl 300)
> ata2: SATA link down (SStatus 0 SControl 300)
> hub 3-0:1.0: USB hub found
> hub 3-0:1.0: 3 ports detected
> ohci_hcd 0000:02:1c.2: OHCI Host Controller
> ohci_hcd 0000:02:1c.2: new USB bus registered, assigned bus number 4
> scsi 4:0:0:0: CD-ROM SONY DVD RW AW-Q170A 1.73 PQ: 0 ANSI: 5
> ohci_hcd 0000:02:1c.2: irq 10, io mem 0x80002000
> sr0: scsi3-mmc drive: 48x/48x writer cd/rw xa/form2 cdda tray
> Uniform CD-ROM driver Revision: 3.20
> sr 4:0:0:0: Attached scsi generic sg0 type 5
> hub 4-0:1.0: USB hub found
> hub 4-0:1.0: 3 ports detected
> Initializing USB Mass Storage driver...
> usbcore: registered new interface driver usb-storage
> USB Mass Storage support registered.
> i8042.c: No controller found.
> rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
> rtc0: alarms up to one day, 114 bytes nvram
> usbcore: registered new interface driver usbhid
> usbhid: USB HID core driver
> intel8x0_measure_ac97_clock: measured 50231 usecs (2424 samples)
> intel8x0: clocking to 48000
> ALSA device list:
> #0: ALi M5455 with ALC650F at irq 6
> IPv4 over IPv4 tunneling driver
> GRE over IPv4 tunneling driver
> TCP cubic registered
> Initializing XFRM netlink socket
> NET: Registered protocol family 10
> IPv6 over IPv4 tunneling driver
> NET: Registered protocol family 17
> rtc_cmos rtc_cmos: setting system clock to 2002-03-11 18:46:05 UTC
> (1015872365)
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> ADDRCONF(NETDEV_UP): eth1: link is not ready
> ADDRCONF(NETDEV_UP): eth2: link is not ready
> ADDRCONF(NETDEV_UP): eth3: link is not ready
> Sending DHCP requests .
> PHY: mdio@ffe24520:00 - Link is Up - 1000/Full
> ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> ., OK
> IP-Config: Got DHCP answer from 192.168.0.1, my address is 192.168.0.241
> IP-Config: Complete:
> device=eth0, addr=192.168.0.241, mask=255.255.255.0, gw=192.168.0.1,
> host=192.168.0.241, domain=Radstone.Local, nis-domain=(none),
> bootserver=192.168.0.1, rootserver=192.168.0.1, rootpath=
> Looking up port of RPC 100003/2 on 192.168.0.1
> Looking up port of RPC 100005/1 on 192.168.0.1
> VFS: Mounted root (nfs filesystem) on device 0:13.
> Freeing unused kernel memory: 220k init
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
> nfs: server 192.168.0.1 not responding, still trying
>

Further testing has shown that this isn't restricted to warm reboots, it
happens from cold as well. In addition, the exact timing of the failure
seems to vary, some boots have got further before failing.

Martyn

--
Martyn Welch (Principal Software Engineer) | Registered in England and
GE Intelligent Platforms | Wales (3828642) at 100
T +44(0)127322748 | Barbirolli Square, Manchester,
E [email protected] | M2 3AB VAT:GB 927559189

2010-02-25 16:51:48

Subject: Re: Gianfar driver failing on MPC8641D based board

On Thu, Feb 25, 2010 at 04:46:54PM +0000, Martyn Welch wrote:
[...]
> > nfs: server 192.168.0.1 not responding, still trying
> >
>
> Further testing has shown that this isn't restricted to warm reboots, it
> happens from cold as well. In addition, the exact timing of the failure
> seems to vary, some boots have got further before failing.

Unfortunately I don't have any 8641 boards near me, so I can't
debug this myself. Though, I tested gianfar on MPC8568E-MDS with
2.6.33 kernel, and it seems to work just fine.

I see you use SMP. Can you try to turn it off? If that will fix
the issue, then it'll be a good data point.

Meanwhile, I'll try SMP kernel on MPC8568 (UP), and let you
know the results.

Thanks,

--
Anton Vorontsov
email: [email protected]
irc://irc.freenode.net/bd2

2010-02-25 17:49:38

Subject: Re: Gianfar driver failing on MPC8641D based board

On Thu, Feb 25, 2010 at 07:51:41PM +0300, Anton Vorontsov wrote:
> On Thu, Feb 25, 2010 at 04:46:54PM +0000, Martyn Welch wrote:
> [...]
> > > nfs: server 192.168.0.1 not responding, still trying
> > >
> >
> > Further testing has shown that this isn't restricted to warm reboots, it
> > happens from cold as well. In addition, the exact timing of the failure
> > seems to vary, some boots have got further before failing.
>
> Unfortunately I don't have any 8641 boards near me, so I can't
> debug this myself. Though, I tested gianfar on MPC8568E-MDS with
> 2.6.33 kernel, and it seems to work just fine.
>
> I see you use SMP. Can you try to turn it off? If that will fix
> the issue, then it'll be a good data point.
>
> Meanwhile, I'll try SMP kernel on MPC8568 (UP), and let you
> know the results.

Nope, no luck. Can't trigger the issue. :-/
Tested with NFS boot, TCP and UDP netperf tests.

--
Anton Vorontsov
email: [email protected]
irc://irc.freenode.net/bd2

2010-02-25 18:27:39

by Kumar Gala

Subject: Re: Gianfar driver failing on MPC8641D based board

On Feb 25, 2010, at 10:46 AM, Martyn Welch wrote:

>
> Further testing has shown that this isn't restricted to warm reboots, it
> happens from cold as well. In addition, the exact timing of the failure
> seems to vary, some boots have got further before failing.

what mechanism do you use for warm resets?

- k

2010-02-26 00:53:34

Subject: Re: Gianfar driver failing on MPC8641D based board

On Thu, Feb 25, 2010 at 12:49 PM, Anton Vorontsov
<[email protected]> wrote:
> On Thu, Feb 25, 2010 at 07:51:41PM +0300, Anton Vorontsov wrote:
>> On Thu, Feb 25, 2010 at 04:46:54PM +0000, Martyn Welch wrote:
>> [...]
>> > > nfs: server 192.168.0.1 not responding, still trying
>> > >
>> >
>> > Further testing has shown that this isn't restricted to warm reboots, it
>> > happens from cold as well. In addition, the exact timing of the failure
>> > seems to vary, some boots have got further before failing.
>>
>> Unfortunately I don't have any 8641 boards near me, so I can't
>> debug this myself. Though, I tested gianfar on MPC8568E-MDS with
>> 2.6.33 kernel, and it seems to work just fine.
>>
>> I see you use SMP. Can you try to turn it off? If that will fix
>> the issue, then it'll be a good data point.
>>
>> Meanwhile, I'll try SMP kernel on MPC8568 (UP), and let you
>> know the results.
>
> Nope, no luck. Can't trigger the issue. :-/
> Tested with NFS boot, TCP and UDP netperf tests.

I was able to reproduce it on an 8641D and bisected it down to this:

-----------
commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
Author: Anton Vorontsov <[email protected]>
Date: Tue Nov 10 14:11:10 2009 +0000

gianfar: Revive SKB recycling

Before calling gfar_clean_tx_ring() the driver grabs an irqsave
spinlock, and then tries to recycle skbs. But since
skb_recycle_check() returns 0 with IRQs disabled, we'll never
recycle any skbs.

It appears that gfar_clean_tx_ring() and gfar_start_xmit() are
mostly idependent and can work in parallel, except when they
modify num_txbdfree.

So we can drop the lock from most sections and thus fix the skb
recycling.
-----------

...which probably explains why you weren't seeing it on non-SMP.
I'd imagine it would show up on any of the e500mc boards too.

I'd done a rev-list on gianfar.[ch] from 32 to 33-rc1, and then
cherry-picked those onto a 32 baseline to reduce the scale of
the bisection, but I don't think that should impact the final
result I got in any meaningful way.

Paul.

2010-02-26 03:14:55

Subject: Re: Gianfar driver failing on MPC8641D based board

On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
[...]
> I was able to reproduce it on an 8641D and bisected it down to this:
>
> -----------
> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
> Author: Anton Vorontsov <[email protected]>
> Date: Tue Nov 10 14:11:10 2009 +0000
>
> gianfar: Revive SKB recycling

Thanks for the bisect. I have a guess why tx hangs in
SMP case. Could anyone try the patch down below?

[...]
> ...which probably explains why you weren't seeing it on non-SMP.
> I'd imagine it would show up on any of the e500mc boards too.

Yeah.. Pity, I don't have SMP boards anymore. I'll try
to get one though.

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 8bd3c9f..3ff3bd0 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2614,6 +2614,8 @@ static int gfar_poll(struct napi_struct *napi, int budget)
tx_queue = priv->tx_queue[rx_queue->qindex];

tx_cleaned += gfar_clean_tx_ring(tx_queue);
+ if (!tx_cleaned && !tx_queue->num_txbdfree)
+ tx_cleaned += 1; /* don't complete napi */
rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
budget_per_queue);
rx_cleaned += rx_cleaned_per_queue;
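
A minimal user-space sketch of what the two added lines are meant to do, for readers who do not have the driver open. It assumes that the poll routine keeps polling whenever tx_cleaned is non-zero and otherwise completes once RX work is below the budget; that flow is an assumption, not something shown in this thread. `struct txq_model` and `poll_would_complete()` are hypothetical names, not driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified TX queue: only the field the patch tests. */
struct txq_model {
	unsigned int num_txbdfree;	/* free TX buffer descriptors */
};

/*
 * Model of the decision at the end of one poll pass. Returns true when
 * the poll would be completed, i.e. polling stops until the next
 * interrupt. Assumption: any non-zero tx_cleaned keeps the poll going.
 */
bool poll_would_complete(const struct txq_model *q, int tx_cleaned,
			 int rx_cleaned, int budget, bool patched)
{
	assert(budget > 0 && tx_cleaned >= 0 && rx_cleaned >= 0);

	/* Proposed change: a full ring with nothing reclaimed counts as pending TX work. */
	if (patched && !tx_cleaned && !q->num_txbdfree)
		tx_cleaned += 1;

	if (tx_cleaned)
		return false;		/* keep polling */
	return rx_cleaned < budget;	/* complete only if RX is also idle */
}
```

With a full ring and nothing cleaned, the unpatched model completes the poll while the patched one keeps polling. In every other case the two behave the same.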

2010-02-26 04:58:55

by Kumar Gopalpet-B05799

Subject: RE: Gianfar driver failing on MPC8641D based board

>-----Original Message-----
>From: Anton Vorontsov [mailto:[email protected]]
>Sent: Friday, February 26, 2010 8:45 AM
>To: Paul Gortmaker
>Cc: Martyn Welch; linuxppc-dev list; [email protected];
>[email protected]; Kumar Gopalpet-B05799;
>[email protected]; Kumar Gala
>Subject: Re: Gianfar driver failing on MPC8641D based board
>
>On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
>[...]
>> I was able to reproduce it on an 8641D and bisected it down to this:
>>
>> -----------
>> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
>> Author: Anton Vorontsov <[email protected]>
>> Date: Tue Nov 10 14:11:10 2009 +0000
>>
>> gianfar: Revive SKB recycling
>
>Thanks for the bisect. I have a guess why tx hangs in SMP
>case. Could anyone try the patch down below?
>
>[...]
>> ...which probably explains why you weren't seeing it on non-SMP.
>> I'd imagine it would show up on any of the e500mc boards too.
>
>Yeah.. Pity, I don't have SMP boards anymore. I'll try to get
>one though.
>
>
>diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
>index 8bd3c9f..3ff3bd0 100644
>--- a/drivers/net/gianfar.c
>+++ b/drivers/net/gianfar.c
>@@ -2614,6 +2614,8 @@ static int gfar_poll(struct napi_struct *napi, int budget)
> tx_queue = priv->tx_queue[rx_queue->qindex];
>
> tx_cleaned += gfar_clean_tx_ring(tx_queue);
>+ if (!tx_cleaned && !tx_queue->num_txbdfree)
>+ tx_cleaned += 1; /* don't complete napi */
> rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
> budget_per_queue);
> rx_cleaned += rx_cleaned_per_queue;
>

Anton,

There is also one more issue that I have been observing with the patch
"gianfar: Revive SKB recycling".
The issue appears when I run an IPv4 forwarding test with
bidirectional flows (SMP environment). I am using Spirent SmartBits
(SmartFlow) for automated testing and I frequently see SmartFlow
report "Rx packet count greater than Tx packet count. Duplicate
packets might have been received".

As a workaround I removed this patch, and I no longer saw the
issue.

To a certain extent I could get over the problem by using atomic_t for
num_txbdfree (atomic_add and atomic_dec instructions for updating the
num_txbdfree) and completely removing the spin_locks in the tx routines.
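
A rough user-space analogue of the atomic counter idea, using C11 <stdatomic.h> in place of the kernel's atomic_t. This is a sketch under that assumption, not the driver change itself; `struct txq_counter`, `txq_reserve()` and `txq_release()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the queue's free-descriptor counter. */
struct txq_counter {
	atomic_int num_txbdfree;
};

void txq_counter_init(struct txq_counter *q, int ring_size)
{
	assert(ring_size > 0);
	atomic_init(&q->num_txbdfree, ring_size);
}

/* Transmit side: claim nr descriptors, or fail without changing the count. */
bool txq_reserve(struct txq_counter *q, int nr)
{
	int cur = atomic_load(&q->num_txbdfree);

	assert(nr > 0);
	while (cur >= nr) {
		/* On failure, cur is reloaded with the current value and we retry. */
		if (atomic_compare_exchange_weak(&q->num_txbdfree, &cur, cur - nr))
			return true;
	}
	return false;	/* ring too full; the caller would stop the queue */
}

/* Cleanup side: hand nr descriptors back once they have been reclaimed. */
void txq_release(struct txq_counter *q, int nr)
{
	assert(nr > 0);
	atomic_fetch_add(&q->num_txbdfree, nr);
}
```

The counter is the only shared state modelled here. Whether the descriptor contents themselves are seen in the right order by the other CPU is a separate question that this sketch does not address.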

Also, I feel we might want to make some more changes to the
gfar_clean_tx_ring() and gfar_start_xmit() routines so that they can
operate in parallel.

I am really sorry for not posting this earlier; I have been caught up
with some urgent issues.

--

Thanks
Sandeep

2010-02-26 11:48:07

Subject: Re: Gianfar driver failing on MPC8641D based board

Anton Vorontsov wrote:
> On Thu, Feb 25, 2010 at 04:46:54PM +0000, Martyn Welch wrote:
> [...]
>
>>> nfs: server 192.168.0.1 not responding, still trying
>>>
>>>
>> Further testing has shown that this isn't restricted to warm reboots, it
>> happens from cold as well. In addition, the exact timing of the failure
>> seems to vary, some boots have got further before failing.
>>
>
> Unfortunately I don't have any 8641 boards near me, so I can't
> debug this myself. Though, I tested gianfar on MPC8568E-MDS with
> 2.6.33 kernel, and it seems to work just fine.
>
> I see you use SMP. Can you try to turn it off? If that will fix
> the issue, then it'll be a good data point.
>
> Meanwhile, I'll try SMP kernel on MPC8568 (UP), and let you
> know the results.
>
> Thanks

I removed the second core from the dts file rather than truly disabling
SMP in the kernel config. Doing this allowed the board to boot reliably.

Martyn

--
Martyn Welch (Principal Software Engineer) | Registered in England and
GE Intelligent Platforms | Wales (3828642) at 100
T +44(0)127322748 | Barbirolli Square, Manchester,
E [email protected] | M2 3AB VAT:GB 927559189

2010-02-26 12:03:16

Subject: Re: Gianfar driver failing on MPC8641D based board

Anton Vorontsov wrote:
> On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
> [...]
>
>> I was able to reproduce it on an 8641D and bisected it down to this:
>>
>> -----------
>> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
>> Author: Anton Vorontsov <[email protected]>
>> Date: Tue Nov 10 14:11:10 2009 +0000
>>
>> gianfar: Revive SKB recycling
>>
>
> Thanks for the bisect. I have a guess why tx hangs in
> SMP case. Could anyone try the patch down below?
>

Yup, no problem. I'm afraid it doesn't resolve the problem for me.

> [...]
>
>> ...which probably explains why you weren't seeing it on non-SMP.
>> I'd imagine it would show up on any of the e500mc boards too.
>>
>
> Yeah.. Pity, I don't have SMP boards anymore. I'll try
> to get one though.
>
>
> diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
> index 8bd3c9f..3ff3bd0 100644
> --- a/drivers/net/gianfar.c
> +++ b/drivers/net/gianfar.c
> @@ -2614,6 +2614,8 @@ static int gfar_poll(struct napi_struct *napi, int budget)
> tx_queue = priv->tx_queue[rx_queue->qindex];
>
> tx_cleaned += gfar_clean_tx_ring(tx_queue);
> + if (!tx_cleaned && !tx_queue->num_txbdfree)
> + tx_cleaned += 1; /* don't complete napi */
> rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
> budget_per_queue);
> rx_cleaned += rx_cleaned_per_queue;
>

--
Martyn Welch (Principal Software Engineer) | Registered in England and
GE Intelligent Platforms | Wales (3828642) at 100
T +44(0)127322748 | Barbirolli Square, Manchester,
E [email protected] | M2 3AB VAT:GB 927559189

2010-02-26 14:35:36

Subject: Re: Gianfar driver failing on MPC8641D based board

On Fri, Feb 26, 2010 at 12:06:15PM +0000, Martyn Welch wrote:
> Anton Vorontsov wrote:
> > On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
> > [...]
> >
> >> I was able to reproduce it on an 8641D and bisected it down to this:
> >>
> >> -----------
> >> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
> >> Author: Anton Vorontsov <[email protected]>
> >> Date: Tue Nov 10 14:11:10 2009 +0000
> >>
> >> gianfar: Revive SKB recycling
> >>
> >
> > Thanks for the bisect. I have a guess why tx hangs in
> > SMP case. Could anyone try the patch down below?
> >
>
> Yup, no problem. I'm afraid it doesn't resolve the problem for me.

Hm.. I found a p2020 board and I was able to reproduce the issue.
The patch down below fixed it completely for me... hm.

I'll look further, thanks!

> > [...]
> >
> >> ...which probably explains why you weren't seeing it on non-SMP.
> >> I'd imagine it would show up on any of the e500mc boards too.
> >>
> >
> > Yeah.. Pity, I don't have SMP boards anymore. I'll try
> > to get one though.
> >
> >
> > diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
> > index 8bd3c9f..3ff3bd0 100644
> > --- a/drivers/net/gianfar.c
> > +++ b/drivers/net/gianfar.c
> > @@ -2614,6 +2614,8 @@ static int gfar_poll(struct napi_struct *napi, int budget)
> > tx_queue = priv->tx_queue[rx_queue->qindex];
> >
> > tx_cleaned += gfar_clean_tx_ring(tx_queue);
> > + if (!tx_cleaned && !tx_queue->num_txbdfree)
> > + tx_cleaned += 1; /* don't complete napi */
> > rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
> > budget_per_queue);
> > rx_cleaned += rx_cleaned_per_queue;
> >

--
Anton Vorontsov
email: [email protected]
irc://irc.freenode.net/bd2

2010-02-26 14:52:37

Subject: Re: Gianfar driver failing on MPC8641D based board

On 10-02-26 09:35 AM, Anton Vorontsov wrote:
> On Fri, Feb 26, 2010 at 12:06:15PM +0000, Martyn Welch wrote:
>> Anton Vorontsov wrote:
>>> On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
>>> [...]
>>>
>>>> I was able to reproduce it on an 8641D and bisected it down to this:
>>>>
>>>> -----------
>>>> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
>>>> Author: Anton Vorontsov<[email protected]>
>>>> Date: Tue Nov 10 14:11:10 2009 +0000
>>>>
>>>> gianfar: Revive SKB recycling
>>>>
>>>
>>> Thanks for the bisect. I have a guess why tx hangs in
>>> SMP case. Could anyone try the patch down below?
>>>
>>
>> Yup, no problem. I'm afraid it doesn't resolve the problem for me.
>
> Hm.. I found a p2020 board and I was able to reproduce the issue.
> The patch down below fixed it completely for me... hm.

Interesting. I just tested the patch on the sbc8641d, and it
still has the issue with your patch applied. I'm using NFSroot
just like Martyn was and it still appears bound up on that
gianfar tx lock. I'll see if I can get a SysRq backtrace in
case that will help you see how it manages to get there...

Paul.

----

nfs: server not responding, still trying

[repeated ~15 times, then...]

INFO: task rc.sysinit:837 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rc.sysinit D 0fef73f4 0 837 836 0x00000000
Call Trace:
[dfb7d9b0] [c000a144] __switch_to+0x8c/0xf8
[dfb7d9d0] [c03443dc] schedule+0x380/0x954
[dfb7da50] [c0344a0c] io_schedule+0x5c/0x90
[dfb7da70] [c0074b0c] sync_page+0x4c/0x74
[dfb7da80] [c0344f44] __wait_on_bit_lock+0xb0/0x148
[dfb7dab0] [c0074a8c] __lock_page+0x94/0xa4
[dfb7dae0] [c0074d5c] find_lock_page+0x8c/0xa4
[dfb7db00] [c0075674] filemap_fault+0x1ec/0x4fc
[dfb7db40] [c008d548] __do_fault+0x98/0x53c
[dfb7dba0] [c0018478] do_page_fault+0x2d0/0x500
[dfb7dc50] [c00149d4] handle_page_fault+0xc/0x80
--- Exception: 301 at __clear_user+0x14/0x7c
LR = load_elf_binary+0x670/0x1270
[dfb7dd10] [c00f6ca0] load_elf_binary+0x620/0x1270 (unreliable)
[dfb7dd90] [c00b1f78] search_binary_handler+0x17c/0x394
[dfb7dde0] [c00f4f50] load_script+0x274/0x288
[dfb7de90] [c00b1f78] search_binary_handler+0x17c/0x394
[dfb7dee0] [c00b3580] do_execve+0x240/0x29c
[dfb7df20] [c000a46c] sys_execve+0x68/0xa4
[dfb7df40] [c00145a4] ret_from_syscall+0x0/0x38

2010-02-26 15:15:57

Subject: Re: Gianfar driver failing on MPC8641D based board

Paul Gortmaker wrote:
> On 10-02-26 09:35 AM, Anton Vorontsov wrote:
>
>> On Fri, Feb 26, 2010 at 12:06:15PM +0000, Martyn Welch wrote:
>>
>>> Anton Vorontsov wrote:
>>>
>>>> On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
>>>> [...]
>>>>
>>>>
>>>>> I was able to reproduce it on an 8641D and bisected it down to this:
>>>>>
>>>>> -----------
>>>>> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
>>>>> Author: Anton Vorontsov<[email protected]>
>>>>> Date: Tue Nov 10 14:11:10 2009 +0000
>>>>>
>>>>> gianfar: Revive SKB recycling
>>>>>
>>>>>
>>>> Thanks for the bisect. I have a guess why tx hangs in
>>>> SMP case. Could anyone try the patch down below?
>>>>
>>>>
>>> Yup, no problem. I'm afraid it doesn't resolve the problem for me.
>>>
>> Hm.. I found a p2020 board and I was able to reproduce the issue.
>> The patch down below fixed it completely for me... hm.
>>
>
> Interesting. I just tested the patch on the sbc8641d, and it
> still has the issue with your patch applied. I'm using NFSroot
> just like Martyn was and it still appears bound up on that
> gianfar tx lock. I'll see if I can get a SysRq backtrace in
> case that will help you see how it manages to get there...
>

I've got a p2020ds here as well, so I'll give NFSroot on that a try with
your patch.

Martyn

--
Martyn Welch (Principal Software Engineer) | Registered in England and
GE Intelligent Platforms | Wales (3828642) at 100
T +44(0)127322748 | Barbirolli Square, Manchester,
E [email protected] | M2 3AB VAT:GB 927559189

2010-02-26 15:31:11


Subject: Re: Gianfar driver failing on MPC8641D based board

Martyn Welch wrote:
> Paul Gortmaker wrote:
>
>> On 10-02-26 09:35 AM, Anton Vorontsov wrote:
>>
>>
>>> On Fri, Feb 26, 2010 at 12:06:15PM +0000, Martyn Welch wrote:
>>>
>>>
>>>> Anton Vorontsov wrote:
>>>>
>>>>
>>>>> On Thu, Feb 25, 2010 at 07:53:30PM -0500, Paul Gortmaker wrote:
>>>>> [...]
>>>>>
>>>>>
>>>>>
>>>>>> I was able to reproduce it on an 8641D and bisected it down to this:
>>>>>>
>>>>>> -----------
>>>>>> commit a3bc1f11e9b867a4f49505ecac486a33af248b2e
>>>>>> Author: Anton Vorontsov<[email protected]>
>>>>>> Date: Tue Nov 10 14:11:10 2009 +0000
>>>>>>
>>>>>> gianfar: Revive SKB recycling
>>>>>>
>>>>>>
>>>>>>
>>>>> Thanks for the bisect. I have a guess why tx hangs in
>>>>> SMP case. Could anyone try the patch down below?
>>>>>
>>>>>
>>>>>
>>>> Yup, no problem. I'm afraid it doesn't resolve the problem for me.
>>>>
>>>>
>>> Hm.. I found a p2020 board and I was able to reproduce the issue.
>>> The patch down below fixed it completely for me... hm.
>>>
>>>
>> Interesting. I just tested the patch on the sbc8641d, and it
>> still has the issue with your patch applied. I'm using NFSroot
>> just like Martyn was and it still appears bound up on that
>> gianfar tx lock. I'll see if I can get a SysRq backtrace in
>> case that will help you see how it manages to get there...
>>
>>
>
> I've got a p2020ds here as well, so I'll give NFSroot on that a try with
> your patch.
>

Out of 10 boot attempts, 7 failed.

Martyn

--
Martyn Welch (Principal Software Engineer) | Registered in England and
GE Intelligent Platforms | Wales (3828642) at 100
T +44(0)127322748 | Barbirolli Square, Manchester,
E [email protected] | M2 3AB VAT:GB 927559189

2010-02-26 16:11:05


Subject: Re: Gianfar driver failing on MPC8641D based board

On Fri, Feb 26, 2010 at 03:34:07PM +0000, Martyn Welch wrote:
[...]
> Out of 10 boot attempts, 7 failed.

OK, I see why. With ip=on (dhcp boot) it's much harder to trigger
it. With static ip config I can see the same.

2010-02-26 16:27:56


Subject: Re: Gianfar driver failing on MPC8641D based board

On 10-02-26 11:10 AM, Anton Vorontsov wrote:
> On Fri, Feb 26, 2010 at 03:34:07PM +0000, Martyn Welch wrote:
> [...]
>> Out of 10 boot attempts, 7 failed.
>
> OK, I see why. With ip=on (dhcp boot) it's much harder to trigger
> it. With static ip config can I see the same.

I'd kind of expected to see us stuck in gianfar on that lock, but
the SysRQ-T doesn't show us hung up anywhere in gianfar itself.
[This was on a base 2.6.33, with just a small sysrq fix patch]

Paul.

----------

SysRq : Changing Loglevel
Loglevel set to 9
nfs: server not responding, still trying
SysRq : Show State
task PC stack pid father
init D 0ff1c380 0 1 0 0x00000000
Call Trace:
[df841a30] [c0009fc4] __switch_to+0x8c/0xf8
[df841a50] [c0350160] schedule+0x354/0x92c
[df841ae0] [c0331394] rpc_wait_bit_killable+0x2c/0x54
[df841af0] [c0350eb0] __wait_on_bit+0x9c/0x108
[df841b10] [c0350fc0] out_of_line_wait_on_bit+0xa4/0xb4
[df841b40] [c0331cf0] __rpc_execute+0x16c/0x398
[df841b90] [c0329abc] rpc_run_task+0x48/0x9c
[df841ba0] [c0329c40] rpc_call_sync+0x54/0x88
[df841bd0] [c015e780] nfs_proc_lookup+0x94/0xe8
[df841c20] [c014eb60] nfs_lookup+0x12c/0x230
[df841d50] [c00b9680] do_lookup+0x118/0x288
[df841d80] [c00bb904] link_path_walk+0x194/0x1118
[df841df0] [c00bcb08] path_walk+0x8c/0x168
[df841e20] [c00bcd6c] do_path_lookup+0x74/0x7c
[df841e40] [c00be148] do_filp_open+0x5d4/0xba4
[df841f10] [c00abe94] do_sys_open+0xac/0x190
[df841f40] [c001437c] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xff1c380
LR = 0xfec6d98
kthreadd S 00000000 0 2 0 0x00000000
Call Trace:
[df843e50] [c002e788] wake_up_new_task+0x128/0x16c (unreliable)
[df843f10] [c0009fc4] __switch_to+0x8c/0xf8
[df843f30] [c0350160] schedule+0x354/0x92c
[df843fc0] [c004d154] kthreadd+0x130/0x134
[df843ff0] [c00141a0] kernel_thread+0x4c/0x68
migration/0 S 00000000 0 3 2 0x00000000
Call Trace:
[df847de0] [ffffffff] 0xffffffff (unreliable)
[df847ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df847ec0] [c0350160] schedule+0x354/0x92c
[df847f50] [c002d074] migration_thread+0x29c/0x448
[df847fb0] [c004d020] kthread+0x80/0x84
[df847ff0] [c00141a0] kernel_thread+0x4c/0x68
ksoftirqd/0 S 00000000 0 4 2 0x00000000
Call Trace:
[df84be10] [00000800] 0x800 (unreliable)
[df84bed0] [c0009fc4] __switch_to+0x8c/0xf8
[df84bef0] [c0350160] schedule+0x354/0x92c
[df84bf80] [c0038454] run_ksoftirqd+0x14c/0x1e0
[df84bfb0] [c004d020] kthread+0x80/0x84
[df84bff0] [c00141a0] kernel_thread+0x4c/0x68
watchdog/0 S 00000000 0 5 2 0x00000000
Call Trace:
[df84dee0] [c0009fc4] __switch_to+0x8c/0xf8
[df84df00] [c0350160] schedule+0x354/0x92c
[df84df90] [c006b8e8] watchdog+0x48/0x88
[df84dfb0] [c004d020] kthread+0x80/0x84
[df84dff0] [c00141a0] kernel_thread+0x4c/0x68
migration/1 S 00000000 0 6 2 0x00000000
Call Trace:
[df84fea0] [c0009fc4] __switch_to+0x8c/0xf8
[df84fec0] [c0350160] schedule+0x354/0x92c
[df84ff50] [c002d074] migration_thread+0x29c/0x448
[df84ffb0] [c004d020] kthread+0x80/0x84
[df84fff0] [c00141a0] kernel_thread+0x4c/0x68
ksoftirqd/1 S 00000000 0 7 2 0x00000000
Call Trace:
[df853ed0] [c0009fc4] __switch_to+0x8c/0xf8
[df853ef0] [c0350160] schedule+0x354/0x92c
[df853f80] [c0038454] run_ksoftirqd+0x14c/0x1e0
[df853fb0] [c004d020] kthread+0x80/0x84
[df853ff0] [c00141a0] kernel_thread+0x4c/0x68
watchdog/1 S 00000000 0 8 2 0x00000000
Call Trace:
[df857ee0] [c0009fc4] __switch_to+0x8c/0xf8
[df857f00] [c0350160] schedule+0x354/0x92c
[df857f90] [c006b8e8] watchdog+0x48/0x88
[df857fb0] [c004d020] kthread+0x80/0x84
[df857ff0] [c00141a0] kernel_thread+0x4c/0x68
events/0 S 00000000 0 9 2 0x00000000
Call Trace:
[df859ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df859ec0] [c0350160] schedule+0x354/0x92c
[df859f50] [c0048718] worker_thread+0x1fc/0x200
[df859fb0] [c004d020] kthread+0x80/0x84
[df859ff0] [c00141a0] kernel_thread+0x4c/0x68
events/1 S 00000000 0 10 2 0x00000000
Call Trace:
[df85bea0] [c0009fc4] __switch_to+0x8c/0xf8
[df85bec0] [c0350160] schedule+0x354/0x92c
[df85bf50] [c0048718] worker_thread+0x1fc/0x200
[df85bfb0] [c004d020] kthread+0x80/0x84
[df85bff0] [c00141a0] kernel_thread+0x4c/0x68
khelper S 00000000 0 11 2 0x00000000
Call Trace:
[df85dde0] [c0030564] do_fork+0x1b0/0x344 (unreliable)
[df85dea0] [c0009fc4] __switch_to+0x8c/0xf8
[df85dec0] [c0350160] schedule+0x354/0x92c
[df85df50] [c0048718] worker_thread+0x1fc/0x200
[df85dfb0] [c004d020] kthread+0x80/0x84
[df85dff0] [c00141a0] kernel_thread+0x4c/0x68
async/mgr S 00000000 0 15 2 0x00000000
Call Trace:
[df8a7df0] [000000fc] 0xfc (unreliable)
[df8a7eb0] [c0009fc4] __switch_to+0x8c/0xf8
[df8a7ed0] [c0350160] schedule+0x354/0x92c
[df8a7f60] [c00565c0] async_manager_thread+0x120/0x174
[df8a7fb0] [c004d020] kthread+0x80/0x84
[df8a7ff0] [c00141a0] kernel_thread+0x4c/0x68
sync_supers S 00000000 0 85 2 0x00000000
Call Trace:
[df951e30] [00000400] 0x400 (unreliable)
[df951ef0] [c0009fc4] __switch_to+0x8c/0xf8
[df951f10] [c0350160] schedule+0x354/0x92c
[df951fa0] [c008d714] bdi_sync_supers+0x30/0x5c
[df951fb0] [c004d020] kthread+0x80/0x84
[df951ff0] [c00141a0] kernel_thread+0x4c/0x68
bdi-default S 00000000 0 87 2 0x00000000
Call Trace:
[df957e30] [c0009fc4] __switch_to+0x8c/0xf8
[df957e50] [c0350160] schedule+0x354/0x92c
[df957ee0] [c0350b14] schedule_timeout+0x15c/0x23c
[df957f30] [c008e510] bdi_forker_task+0x2f8/0x30c
[df957fb0] [c004d020] kthread+0x80/0x84
[df957ff0] [c00141a0] kernel_thread+0x4c/0x68
kblockd/0 S 00000000 0 88 2 0x00000000
Call Trace:
[df8bdde0] [00000800] 0x800 (unreliable)
[df8bdea0] [c0009fc4] __switch_to+0x8c/0xf8
[df8bdec0] [c0350160] schedule+0x354/0x92c
[df8bdf50] [c0048718] worker_thread+0x1fc/0x200
[df8bdfb0] [c004d020] kthread+0x80/0x84
[df8bdff0] [c00141a0] kernel_thread+0x4c/0x68
kblockd/1 S 00000000 0 89 2 0x00000000
Call Trace:
[df959de0] [00000800] 0x800 (unreliable)
[df959ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df959ec0] [c0350160] schedule+0x354/0x92c
[df959f50] [c0048718] worker_thread+0x1fc/0x200
[df959fb0] [c004d020] kthread+0x80/0x84
[df959ff0] [c00141a0] kernel_thread+0x4c/0x68
rpciod/0 S 00000000 0 111 2 0x00000000
Call Trace:
[df93fea0] [c0009fc4] __switch_to+0x8c/0xf8
[df93fec0] [c0350160] schedule+0x354/0x92c
[df93ff50] [c0048718] worker_thread+0x1fc/0x200
[df93ffb0] [c004d020] kthread+0x80/0x84
[df93fff0] [c00141a0] kernel_thread+0x4c/0x68
rpciod/1 S 00000000 0 112 2 0x00000000
Call Trace:
[df931de0] [00000001] 0x1 (unreliable)
[df931ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df931ec0] [c0350160] schedule+0x354/0x92c
[df931f50] [c0048718] worker_thread+0x1fc/0x200
[df931fb0] [c004d020] kthread+0x80/0x84
[df931ff0] [c00141a0] kernel_thread+0x4c/0x68
khungtaskd S 00000000 0 141 2 0x00000000
Call Trace:
[df979db0] [00000800] 0x800 (unreliable)
[df979e70] [c0009fc4] __switch_to+0x8c/0xf8
[df979e90] [c0350160] schedule+0x354/0x92c
[df979f20] [c0350b14] schedule_timeout+0x15c/0x23c
[df979f70] [c006bd38] watchdog+0x98/0x294
[df979fb0] [c004d020] kthread+0x80/0x84
[df979ff0] [c00141a0] kernel_thread+0x4c/0x68
kswapd0 S 00000000 0 142 2 0x00000000
Call Trace:
[df97bd60] [c04383a0] 0xc04383a0 (unreliable)
[df97be20] [c0009fc4] __switch_to+0x8c/0xf8
[df97be40] [c0350160] schedule+0x354/0x92c
[df97bed0] [c00868a8] kswapd+0x81c/0x858
[df97bfb0] [c004d020] kthread+0x80/0x84
[df97bff0] [c00141a0] kernel_thread+0x4c/0x68
aio/0 S 00000000 0 143 2 0x00000000
Call Trace:
[df97dde0] [ffffffff] 0xffffffff (unreliable)
[df97dea0] [c0009fc4] __switch_to+0x8c/0xf8
[df97dec0] [c0350160] schedule+0x354/0x92c
[df97df50] [c0048718] worker_thread+0x1fc/0x200
[df97dfb0] [c004d020] kthread+0x80/0x84
[df97dff0] [c00141a0] kernel_thread+0x4c/0x68
aio/1 S 00000000 0 144 2 0x00000000
Call Trace:
[df97fde0] [ffffffff] 0xffffffff (unreliable)
[df97fea0] [c0009fc4] __switch_to+0x8c/0xf8
[df97fec0] [c0350160] schedule+0x354/0x92c
[df97ff50] [c0048718] worker_thread+0x1fc/0x200
[df97ffb0] [c004d020] kthread+0x80/0x84
[df97fff0] [c00141a0] kernel_thread+0x4c/0x68
nfsiod S 00000000 0 145 2 0x00000000
Call Trace:
[df9a5de0] [00000003] 0x3 (unreliable)
[df9a5ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df9a5ec0] [c0350160] schedule+0x354/0x92c
[df9a5f50] [c0048718] worker_thread+0x1fc/0x200
[df9a5fb0] [c004d020] kthread+0x80/0x84
[df9a5ff0] [c00141a0] kernel_thread+0x4c/0x68
crypto/0 S 00000000 0 146 2 0x00000000
Call Trace:
[df9a7de0] [00000800] 0x800 (unreliable)
[df9a7ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df9a7ec0] [c0350160] schedule+0x354/0x92c
[df9a7f50] [c0048718] worker_thread+0x1fc/0x200
[df9a7fb0] [c004d020] kthread+0x80/0x84
[df9a7ff0] [c00141a0] kernel_thread+0x4c/0x68
crypto/1 S 00000000 0 147 2 0x00000000
Call Trace:
[df9a9ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df9a9ec0] [c0350160] schedule+0x354/0x92c
[df9a9f50] [c0048718] worker_thread+0x1fc/0x200
[df9a9fb0] [c004d020] kthread+0x80/0x84
[df9a9ff0] [c00141a0] kernel_thread+0x4c/0x68
mtdblockd S 00000000 0 779 2 0x00000000
Call Trace:
[dfae1e00] [00000800] 0x800 (unreliable)
[dfae1ec0] [c0009fc4] __switch_to+0x8c/0xf8
[dfae1ee0] [c0350160] schedule+0x354/0x92c
[dfae1f70] [c02232dc] mtd_blktrans_thread+0x1c4/0x394
[dfae1fb0] [c004d020] kthread+0x80/0x84
[dfae1ff0] [c00141a0] kernel_thread+0x4c/0x68
kstriped S 00000000 0 826 2 0x00000000
Call Trace:
[df935de0] [00000800] 0x800 (unreliable)
[df935ea0] [c0009fc4] __switch_to+0x8c/0xf8
[df935ec0] [c0350160] schedule+0x354/0x92c
[df935f50] [c0048718] worker_thread+0x1fc/0x200
[df935fb0] [c004d020] kthread+0x80/0x84
[df935ff0] [c00141a0] kernel_thread+0x4c/0x68
ksnapd S 00000000 0 828 2 0x00000000
Call Trace:
[dfae9de0] [00000800] 0x800 (unreliable)
[dfae9ea0] [c0009fc4] __switch_to+0x8c/0xf8
[dfae9ec0] [c0350160] schedule+0x354/0x92c
[dfae9f50] [c0048718] worker_thread+0x1fc/0x200
[dfae9fb0] [c004d020] kthread+0x80/0x84
[dfae9ff0] [c00141a0] kernel_thread+0x4c/0x68

Sched Debug Version: v0.09, 2.6.33-00001-g8c31d07 #1
now at 35747.705693 msecs
.jiffies : 4294901234
.sysctl_sched_latency : 10.000000
.sysctl_sched_min_granularity : 2.000000
.sysctl_sched_wakeup_granularity : 2.000000
.sysctl_sched_child_runs_first : 0.000000
.sysctl_sched_features : 7917179
.sysctl_sched_tunable_scaling : 1 (logaritmic)

cpu#0
.nr_running : 0
.load : 0
.nr_switches : 2809
.nr_load_updates : 8950
.nr_uninterruptible : 1
.next_balance : 4294.901248
.curr->pid : 0
.clock : 35832.063536
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

cfs_rq[0] for UID: 0
.exec_clock : 0.000000
.MIN_vruntime : 0.000001
.min_vruntime : 4129.195888
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 4048.261385
.nr_running : 0
.load : 0
.nr_spread_over : 0
.shares : 0
.se->exec_start : 35836.116992
.se->vruntime : 80.934503
.se->sum_exec_runtime : 123.815984
.se->load.weight : 1024

rt_rq[0]:
.rt_nr_running : 0
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 950.000000

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------

cpu#1
.nr_running : 0
.load : 0
.nr_switches : 4069
.nr_load_updates : 8689
.nr_uninterruptible : 0
.next_balance : 4294.901019
.curr->pid : 0
.clock : 34909.104304
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0

cfs_rq[1] for UID: 0
.exec_clock : 0.000000
.MIN_vruntime : 0.000001
.min_vruntime : 509.424556
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 428.490053
.nr_running : 0
.load : 0
.nr_spread_over : 0
.shares : 0
.se->exec_start : 34909.104304
.se->vruntime : 273.153007
.se->sum_exec_runtime : 503.971344
.se->load.weight : 1024

rt_rq[1]:
.rt_nr_running : 0
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 950.000000

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------

2010-02-26 21:38:32


Subject: Re: Gianfar driver failing on MPC8641D based board

On Fri, Feb 26, 2010 at 11:27:42AM -0500, Paul Gortmaker wrote:
> On 10-02-26 11:10 AM, Anton Vorontsov wrote:
> > On Fri, Feb 26, 2010 at 03:34:07PM +0000, Martyn Welch wrote:
> > [...]
> >> Out of 10 boot attempts, 7 failed.
> >
> > OK, I see why. With ip=on (dhcp boot) it's much harder to trigger
> > it. With static ip config can I see the same.
>
> I'd kind of expected to see us stuck in gianfar on that lock, but
> the SysRQ-T doesn't show us hung up anywhere in gianfar itself.
> [This was on a base 2.6.33, with just a small sysrq fix patch]

> [df841a30] [c0009fc4] __switch_to+0x8c/0xf8
> [df841a50] [c0350160] schedule+0x354/0x92c
> [df841ae0] [c0331394] rpc_wait_bit_killable+0x2c/0x54
> [df841af0] [c0350eb0] __wait_on_bit+0x9c/0x108
> [df841b10] [c0350fc0] out_of_line_wait_on_bit+0xa4/0xb4
> [df841b40] [c0331cf0] __rpc_execute+0x16c/0x398
> [df841b90] [c0329abc] rpc_run_task+0x48/0x9c
> [df841ba0] [c0329c40] rpc_call_sync+0x54/0x88
> [df841bd0] [c015e780] nfs_proc_lookup+0x94/0xe8
> [df841c20] [c014eb60] nfs_lookup+0x12c/0x230
> [df841d50] [c00b9680] do_lookup+0x118/0x288
> [df841d80] [c00bb904] link_path_walk+0x194/0x1118
> [df841df0] [c00bcb08] path_walk+0x8c/0x168
> [df841e20] [c00bcd6c] do_path_lookup+0x74/0x7c
> [df841e40] [c00be148] do_filp_open+0x5d4/0xba4
> [df841f10] [c00abe94] do_sys_open+0xac/0x190

Yeah, I don't think this is gianfar-related. It must be something
else triggered by the fact that gianfar no longer sends stuff.

OK, I think I found what's happening in gianfar.

Some background...

start_xmit() prepares a new skb for transmission. Generally it does
three things:

1. Sets up all BDs (marks them ready to send), except the first one.
2. Stores the skb into tx_queue->tx_skbuff so that clean_tx_ring()
   can clean it up later.
3. Sets up the first BD, i.e. marks it ready.

Here is what clean_tx_ring() does:

1. Reads skbs from tx_queue->tx_skbuff.
2. Checks if the *last* BD is ready. If it's still ready [to send]
   then it hasn't been transmitted yet, so clean_tx_ring() returns.
   Otherwise it actually cleans up the BDs. All is OK.
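
As a rough illustration of the two lists above, here is a simplified
user-space model of the handshake for an skb that spans several BDs.
It is not the gianfar code: `struct bd`, `BD_READY`, `NR_BDS`,
`model_start_xmit()`, `model_hw_done()` and `model_clean_tx_ring()` are
made-up names, and the real driver tracks far more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BDS   3      /* hypothetical: one skb spread over three BDs */
#define BD_READY 0x1u   /* hypothetical "ready to send" bit in lstatus */

struct bd {
	unsigned int lstatus;
};

struct model_queue {
	struct bd bds[NR_BDS];
	const void *tx_skbuff;  /* non-NULL while an skb awaits cleanup */
};

/* Follows the three steps listed for start_xmit(). */
static void model_start_xmit(struct model_queue *q, const void *skb)
{
	int i;

	assert(skb != NULL);
	for (i = 1; i < NR_BDS; i++)      /* 1. every BD except the first */
		q->bds[i].lstatus |= BD_READY;
	q->tx_skbuff = skb;               /* 2. remember the skb for cleanup */
	q->bds[0].lstatus |= BD_READY;    /* 3. first BD: transmission may start */
}

/* Stand-in for the controller finishing the frame: it clears READY. */
static void model_hw_done(struct model_queue *q)
{
	int i;

	for (i = 0; i < NR_BDS; i++)
		q->bds[i].lstatus &= ~BD_READY;
}

/* Returns true if the skb was cleaned up, false if nothing was done. */
static bool model_clean_tx_ring(struct model_queue *q)
{
	if (q->tx_skbuff == NULL)
		return false;             /* 1. no skb queued */
	if (q->bds[NR_BDS - 1].lstatus & BD_READY)
		return false;             /* 2. last BD still pending */
	q->tx_skbuff = NULL;              /* transmitted: clean up */
	return true;
}
```

With more than one BD the last BD is already marked ready by the time
the skb is stored, so in this model a cleanup pass that runs too early
simply returns without touching anything.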

Now, if there is just one BD, the code flow is:

- start_xmit(): stores the skb into tx_skbuff. Note that the first BD
  (which is also the last one) isn't marked as ready yet.
- clean_tx_ring(): sees that the skb is not null, *and* its lstatus
  says that it is NOT ready (as if the BD had been sent), so it cleans
  it up (bad!)
- start_xmit(): marks the BD as ready [to send], but it's too late.

We can fix this simply by reordering lstatus/tx_skbuff writes.
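
To make the single-BD case concrete, here is a small sequential model
of both write orders. It is only a sketch under simplifying
assumptions: `struct one_bd_queue`, `BD_READY`, `model_clean()`,
`old_order_races()` and `new_order_races()` are hypothetical names, and
the "other CPU" is simulated by calling the cleanup at each possible
point between the two writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BD_READY 0x1u   /* hypothetical "ready to send" bit */

struct one_bd_queue {
	unsigned int lstatus;   /* the single BD: first and last at once */
	const void *tx_skbuff;
	bool cleaned_early;     /* set if cleanup ran before transmission */
};

/* Cleanup rule from the description: skb present + BD not ready => clean. */
static void model_clean(struct one_bd_queue *q)
{
	if (q->tx_skbuff != NULL && !(q->lstatus & BD_READY)) {
		q->tx_skbuff = NULL;
		q->cleaned_early = true;  /* hardware never ran in this model */
	}
}

/* Old order: the skb is stored before the BD is marked ready. */
static bool old_order_races(void)
{
	struct one_bd_queue q = { 0, NULL, false };
	int skb = 0;

	q.tx_skbuff = &skb;        /* store skb */
	model_clean(&q);           /* cleanup runs in the gap */
	q.lstatus |= BD_READY;     /* mark ready: too late */
	return q.cleaned_early;
}

/* New order: the BD is marked ready before the skb is stored. */
static bool new_order_races(void)
{
	struct one_bd_queue q = { 0, NULL, false };
	int skb = 0;

	model_clean(&q);           /* before anything: no skb, nothing to do */
	q.lstatus |= BD_READY;     /* mark ready first */
	model_clean(&q);           /* in the gap: still no skb */
	q.tx_skbuff = &skb;        /* then store skb */
	model_clean(&q);           /* afterwards: BD ready, cleanup waits */
	assert(q.tx_skbuff == &skb);
	return q.cleaned_early;
}
```

In this model old_order_races() reports a premature cleanup and
new_order_races() does not. The model is sequential, so it says nothing
about memory ordering between CPUs; that is what the eieio() in the
patch is there for.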

It works flawlessly on my p2020, please try it.

Thanks!

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 8bd3c9f..cccb409 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2021,7 +2021,6 @@ static int gfar_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}

 	/* setup the TxBD length and buffer pointer for the first BD */
-	tx_queue->tx_skbuff[tx_queue->skb_curtx] = skb;
 	txbdp_start->bufPtr = dma_map_single(&priv->ofdev->dev, skb->data,
 			skb_headlen(skb), DMA_TO_DEVICE);

@@ -2053,6 +2052,10 @@ static int gfar_start_xmit(struct sk_buff *skb, struct net_device *dev)

 	txbdp_start->lstatus = lstatus;

+	eieio(); /* force lstatus write before tx_skbuff */
+
+	tx_queue->tx_skbuff[tx_queue->skb_curtx] = skb;
+
 	/* Update the current skb pointer to the next entry we will use
 	 * (wrapping if necessary) */
 	tx_queue->skb_curtx = (tx_queue->skb_curtx + 1) &

2010-02-26 22:12:56