Date: Wed, 19 Nov 2008 19:00:12 -0800
From: "Matt Carlson"
To: "Roger Heflin"
cc: "Peter Zijlstra", LKML, netdev
Subject: Re: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0xfe/0x17e() with tg3 network
Message-ID: <20081120030012.GC26448@xw6200.broadcom.net>
References: <491954E1.2050002@gmail.com> <1226403067.7685.1598.camel@twins> <491E49AA.60407@gmail.com>
In-Reply-To: <491E49AA.60407@gmail.com>

On Fri, Nov 14, 2008 at 08:01:46PM -0800, Roger Heflin wrote:
> Peter Zijlstra wrote:
> > (netdev CC'ed)
> >
> > On Tue, 2008-11-11 at 03:48 -0600, Roger Heflin wrote:
> >> I have duplicated this with kernel 2.6.27.2 and 2.6.27.5, no
> >> extra modules, tg3 Gbit networking. I have not yet tested
> >> earlier kernels to see if this has been around for a while.
> >
> > How do more recent kernels do?
>
> I did not try more recent kernels, more testing seems to indicate
> that at the very least the bug depends on a certain version of
> either tg3 and/or firmware to happen, as my second tg3 port does not
> have it happen. More about this below.
>
> >
> >> So far I have had this error happen 5 times (MTBF is maybe
> >> 12 hours), 4 of the 5 times resulted in the networking being
> >> broken, one time things came back by itself without a reboot,
> >> I believe in this case the hang was traffic coming into the
> >> machine vs the other times going out of the machine.
> >>
> >> Unloading all of the network modules and reloading them did
> >> not correct the problem.
> >>
> >> Searching google finds a couple of other people getting the
> >> same error but they have a different network chipset (e1000
> >> and a rt811C chipset), which makes me think that there is
> >> something interacting badly with the network. Or does this
> >> error truly mean that the network chipset for some unknown reason
> >> locked itself up?
> >>
> >> http://www.google.com/url?sa=U&start=4&q=http://kerneltrap.org/mailarchive/linux-netdev/2008/8/6/2838184&ei=rU8ZScysAon8edz5xKgO&sig2=Wxp7IkUtdgORGZiflxvppg&usg=AFQjCNHzPwsCOmLGKmtX4q_FEpk6oubxxg
> >> http://article.gmane.org/gmane.linux.network/110238
> >>
> >> The changes I made recently were to upgrade my MB (old
> >> was E100 on a 100Mbit network, new is tg3 on a Gbit network,
> >> cpu and memory are the same, MB chipset is an intel 955
> >> chipset vs the old being an intel 915 chipset).
> >>
> >> Autoneg is turned on all around, the GBit switch is a
> >> 8-port Dlink switch. The network seems to otherwise be working
> >> correctly.
> >>
> >> I did test the network under decent load and the error did not
> >> appear to be any more likely under load, and typically the network
> >> is under very light load 2-3MB/second.
> >>
> >> The machine originally had 2 HT CPU's showing up, I turned off HT
> >> so that only one cpu was showing, but this did not change the error.
> >>
> >> I am first turning off all offload capabilities on tg3 and going
> >> to see if that changes anything.
>
> This made no difference in the error.
>
> >>
> >> The next thing I am going to be doing is to turn off GB capability
> >> on the networking and see if that does anything.
>
> Did not try.
>
> >>
> >> I also have a second tg3 port that is slightly different, so I may
> >> try that eventually.
>
> I tried this, and with the second port I don't appear to be getting
> the error. The first port is a 5789-v3.29a and the second port is a
> 5788-v3.04, I know the first port is faster (pcie-x1) than the second
> port (pci bus-built-in, unknown exact connection). The second port
> will sustain about 50MB/second, whereas the first port will get
> >90MB/second.

The 5789 is a PCIe device, the 5788 is a PCI device. There's a lot of
room for differences between the two devices.

> It seems to me to likely be the firmware on the tg3, and it would seem
> unlikely that the driver could do anything more than work around the
> issue that is in the firmware, and currently my system works on the
> second port, and the second port is fast enough for my needs.

I don't think the 5789 should have any firmware running at the time of
the failure. There's always the possibility that there is a problem
with the bootcode, but I'm guessing the problem is elsewhere.

> If someone else runs into this issue, since I have 2 ports I would be
> able to do some testing on it, right now my first port is locked up, and
> the machine is running fine on the second port.
>
> lspci -vvv for the first (bad) port:

Ah. There it is.

> 02:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5789 Gigabit
> Ethernet PCI Express (rev 11)
>         Subsystem: Foxconn International, Inc. Unknown device 0cc1
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>         Latency: 0, Cache Line Size: 32 bytes
>         Interrupt: pin A routed to IRQ 19
>         Region 0: Memory at fd8f0000 (64-bit, non-prefetchable) [size=64K]
>         Expansion ROM at [disabled]
>         Capabilities: [48] Power Management version 2
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
>                 Status: D3 PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] Vital Product Data
>         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3
> Enable-
>                 Address: 0101b8102a0f7b0c  Data: f21e
>         Capabilities: [d0] Express Endpoint IRQ 0
>                 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
>                 Device: Latency L0s <4us, L1 unlimited
>                 Device: AtnBtn- AtnInd- PwrInd-
>                 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                 Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                 Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
>                 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
>                 Link: Latency L0s <2us, L1 <64us
>                 Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
>                 Link: Speed 2.5Gb/s, Width x1
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [13c] Virtual Channel

Hmmm. No smoking gun. Perhaps the register dump will help.

> >> Nov 11 00:44:39 computer kernel: ------------[ cut here ]------------
> >> Nov 11 00:44:39 computer kernel: WARNING: at net/sched/sch_generic.c:219
> >> dev_watchdog+0xfe/0x17e()
> >> Nov 11 00:44:39 computer kernel: NETDEV WATCHDOG: eth0 (tg3): transmit timed out

Usually the tg3_tx_timeout function dumps a few registers before
resetting the chip, but I don't see that here. Have you seen any dumps
since then?
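
One way to check a saved log for further timeouts, and for anything the
driver printed right after them, is sketched below. It is only a rough
helper, not part of tg3 or any kernel tooling. The watchdog pattern is
copied from the warning quoted above. The function name, the 20-line
window, and the idea that driver output can be spotted by a "tg3:"
style prefix are all assumptions made for illustration.

```python
import re

# Pattern copied from the warning quoted in this thread:
#   NETDEV WATCHDOG: eth0 (tg3): transmit timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def find_watchdog_timeouts(lines, window=20):
    """Return one dict per watchdog timeout found in `lines`.

    `window` (hypothetical default) is how many following lines are
    searched for output that looks like it came from the same driver.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for i, line in enumerate(lines):
        match = WATCHDOG_RE.search(line)
        if match is None:
            continue
        driver = match.group("driver")
        # Assumption: driver messages carry a "<driver>:" prefix, e.g. "tg3:".
        # This is a heuristic, not a documented log format.
        marker = driver + ":"
        followups = [l for l in lines[i + 1 : i + 1 + window] if marker in l]
        hits.append(
            {
                "prefix": line[: match.start()].strip(),  # timestamp, host, "kernel:"
                "iface": match.group("iface"),
                "driver": driver,
                "followups": followups,
            }
        )
    return hits
```

It could be run over a syslog file (the path depends on the
distribution) by passing the open file object as `lines`; an empty
"followups" list for a hit would suggest that no dump was logged within
the window after that timeout.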
> >> Nov 11 00:44:39 computer kernel: Modules linked in: nfsd auth_rpcgss exportfs
> >> w83627ehf hwmon_vid hwmon nfs lockd nfs_acl sunrpc ipv6 xfs raid456 async_xor
> >> async_memcpy async_tx xor video output sbs sbshc battery ac lgdt330x cx88_dvb
> >> wm8775 cx88_vp3054_i2c cx25840 tuner_simple tuner_types tda9887 tda8290 tuner
> >> mt2131 s5h1409 snd_hda_intel snd_seq_dummy ivtv cx8800 snd_seq_oss cx88_alsa
> >> cx8802 cx88xx cx23885 snd_seq_midi_event snd_seq ir_common videodev v4l1_compat
> >> i2c_algo_bit cx2341x firewire_ohci iTCO_wdt snd_seq_device compat_ioctl32
> >> videobuf_dvb i2c_i801 firewire_core tveeprom floppy iTCO_vendor_support
> >> v4l2_common snd_pcm_oss dvb_core pcspkr tg3 sata_sil i2c_core btcx_risc
> >> videobuf_dma_sg crc_itu_t snd_mixer_oss libphy videobuf_core snd_pcm parport_pc
> >> parport snd_timer snd soundcore button snd_page_alloc sg dm_snapshot dm_zero
> >> dm_mirror dm_log dm_mod ahci ata_piix ata_generic libata sd_mod scsi_mod ext3
> >> jbd mbcache ehci_hcd ohci_hcd uhci_hcd [last unloaded: eeprom]
> >> Nov 11 00:44:39 computer kernel: Pid: 0, comm: swapper Not tainted 2.6.27.5 #2
> >> Nov 11 00:44:39 computer kernel: [] warn_slowpath+0x61/0x83
> >> Nov 11 00:44:39 computer kernel: [] usb_hcd_submit_urb+0x75c/0x811
> >> Nov 11 00:44:39 computer kernel: [] hiddev_hid_event+0x0/0x64
> >> Nov 11 00:44:39 computer kernel: [] hid_process_event+0x58/0x5f
> >> Nov 11 00:44:39 computer kernel: [] __next_cpu+0x12/0x21
> >> Nov 11 00:44:39 computer kernel: [] find_busiest_group+0x23e/0x672
> >> Nov 11 00:44:39 computer kernel: [] clocksource_get_next+0x39/0x3f
> >> Nov 11 00:44:39 computer kernel: [] update_wall_time+0x567/0x70c
> >> Nov 11 00:44:39 computer kernel: [] read_tsc+0x6/0x22
> >> Nov 11 00:44:39 computer kernel: [] getnstimeofday+0x37/0xc1
> >> Nov 11 00:44:39 computer kernel: [] uhci_scan_schedule+0x11b/0x6b0
> >> [uhci_hcd]
> >> Nov 11 00:44:39 computer kernel: [] dev_watchdog+0xfe/0x17e
> >> Nov 11 00:44:39 computer kernel: [] __mod_timer+0x99/0xa3
> >> Nov 11 00:44:39 computer kernel: [] rh_timer_func+0x0/0x5
> >> Nov 11 00:44:39 computer kernel: [] usb_hcd_poll_rh_status+0x12b/0x133
> >> Nov 11 00:44:39 computer kernel: [] tick_dev_program_event+0x1e/0x81
> >> Nov 11 00:44:39 computer kernel: [] dev_watchdog+0x0/0x17e
> >> Nov 11 00:44:39 computer kernel: [] run_timer_softirq+0x10e/0x167
> >> Nov 11 00:44:39 computer kernel: [] dev_watchdog+0x0/0x17e
> >> Nov 11 00:44:39 computer kernel: [] __do_softirq+0x5d/0xc1
> >> Nov 11 00:44:39 computer kernel: [] do_softirq+0x32/0x36
> >> Nov 11 00:44:39 computer kernel: [] smp_apic_timer_interrupt+0x6e/0x79
> >> Nov 11 00:44:39 computer kernel: [] apic_timer_interrupt+0x28/0x30
> >> Nov 11 00:44:39 computer kernel: [] mwait_idle+0x32/0x38
> >> Nov 11 00:44:39 computer kernel: [] cpu_idle+0xbd/0xd5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/