Hi all,
[Please CC me as I am not on the list.]
With both 2.6.27.19 and 2.6.28.7, I am experiencing "transmit timed out"
errors as reported by the netdev watchdog, for both my PCMCIA Ethernet
adapters, using the r8169 and 8139too drivers respectively.
I can readily reproduce the error by generating intensive traffic, most
reliably by initiating an scp of a large file.
I have attached the dmesg output of the error occurring for both cards, as well
as both kernel config files. I'll gladly provide more information as it is
requested.
Cheers!
~Mik
--
The geek shall inherit the earth.
- Rainer Wolfcastle in "Undercover Nerd"
Michael Büker <[email protected]> :
[...]
> With both 2.6.27.19 and 2.6.28.7, I am experiencing "transmit timed out"
> errors as reported by the netdev watchdog, for both my PCMCIA Ethernet
> adapters, using the r8169 and 8139too drivers respectively.
Can you describe the symptoms a bit more specifically ?
The kernel displays a scary warning, I can guess that it is almost surely
associated with some loss of network connectivity for a few seconds at the
very least but it is a bit hard to figure the real scale of your problem.
Please scare me. :o)
[...]
> as both kernel config files. I'll gladly provide more information as it is
> requested.
lspci -vx and a complete dmesg.
Can you identify a kernel which worked flawlessly ?
--
Ueimor
On Wednesday 04 March 2009, you wrote:
> The kernel displays a scary warning, I can guess that it is almost surely
> associated with some loss of network connectivity for a few seconds at the
> very least but it is a bit hard to figure the real scale of your problem.
I'll try to be more precise :)
I can readily reproduce the symptoms by initiating the scp transfer of a large
file. The transfer will stall somewhere between ~2MB and ~50MB of data
transferred. After that, the results differ for the two network cards:
With the r8169 card, the transfer never recovers, but after ~1min to ~2min,
there is a "link up" message in dmesg and the network will be back up.
With the 8139too card, the stall will be no longer than ~30sec and the
transfer will resume. There is also a "link up" message upon the network's
resurrection.
I have been able to verify that this problem does _not_ occur when using the
same network (and target server) through my wireless USB adapter. The average
speed this way is lower, however - ~690kb/s as opposed to ~1.8MB/s over the
wire (the reason this is so much lower than the physical limit for both types
of connections probably lies in the fact that we're dealing with a Pentium II
that has to do SSH encryption on the fly).
The two dmesg files I have attached contain the log of the following actions:
1. Booting with the respective card inserted
2. Initializing the file transfer
3. Aborting the transfer after it's been dead for ~10sec
4. Waiting for the network to come back up
The two attached lspci outputs were recorded with the two cards inserted
respectively.
> Can you identify a kernel which worked flawlessly ?
I'm quite sure I did not experience this problem with the last kernel I ran,
which was 2.6.25.15.
Hope to help!
Michael
--
I came for the quality, but I stayed for the freedom.
- Sean Neakums
[sorry if you get duplicates, my first try got blocked by vger because
my mailer jumped into stupid HTML mode]
Am Mittwoch, den 04.03.2009, 23:43 +0100 schrieb Francois Romieu:
> Michael Büker <[email protected]> :
> [...]
> > With both 2.6.27.19 and 2.6.28.7, I am experiencing "transmit timed out"
> > errors as reported by the netdev watchdog, for both my PCMCIA Ethernet
> > adapters, using the r8169 and 8139too drivers respectively.
>
> Can you describe the symptoms a bit more specifically ?
>
> The kernel displays a scary warning, I can guess that it is almost surely
> associated with some loss of network connectivity for a few seconds at the
> very least but it is a bit hard to figure the real scale of your problem.
>
> Please scare me. :o)
I'm also experiencing these transmit timeouts with the r8169 driver on
an Asus M3A-H/HDMI board in 32-bit mode.
Happens at least with 2.6.28, 2.6.28.4, 2.6.28.6 and 2.6.29-rc7.
This is a diskless box (on NFS) on GigE LAN, used for DVB-S
recordings/playback with vdr. I first suspected problems with the DVB
stuff since I have minor issues there too, but now that I've dug
around in the list I think that's unrelated.
Every once in a while, more likely with heavy usage of the vdr stuff (like
recording two shows and/or cutting a recording), these show up. Of course
the impact is really bad on a diskless system. It takes the system from
10 seconds to 3 minutes to recover.
With 2.6.29-rc7 I just tried to trigger this by moving data (dd from and
to NFS, rsync via ssh). All I managed to get was one
[61534.002671] r8169: eth0: link up
in dmesg while moving around 20GB.
I've found a couple of other, similar r8169-related bug reports here and
I don't know what the progress is there. If I can help with more
information or testing, just tell me what to do.
This is with 2.6.29-rc7:
[44469.085035] nfs: server 192.168.1.8 not responding, still trying
[44469.677055] nfs: server 192.168.1.8 not responding, still trying
[44529.997057] ------------[ cut here ]------------
[44529.997065] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1f1/0x200()
[44529.997071] Hardware name: System Product Name
[44529.997075] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[44529.997079] Modules linked in: autofs4 powernow_k8 cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_performance freq_table pci_slot sbs ac battery video backlight output sbshc container sbp2 loop lnbp21 stv0299 snd_hda_codec_atihdmi snd_seq_dummy snd_hda_codec_realtek snd_seq_oss snd_seq_midi snd_hda_intel snd_rawmidi snd_hda_codec snd_pcm_oss snd_seq_midi_event snd_mixer_oss dvb_ttpci dvb_core snd_seq snd_pcm saa7146_vv saa7146 videobuf_dma_sg rtc_cmos videobuf_core videodev snd_seq_device rtc_core snd_timer v4l1_compat rtc_lib evdev ttpci_eeprom wmi serio_raw snd psmouse k8temp button i2c_piix4 pcspkr soundcore i2c_core snd_page_alloc af_packet pata_acpi ata_generic sg sr_mod cdrom pata_atiixp ehci_hcd ahci ohci1394 ohci_hcd ieee1394 libata scsi_mod usbcore raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys hwmon fuse
[44529.997208] Pid: 0, comm: swapper Not tainted 2.6.29-rc7 #1
[44529.997213] Call Trace:
[44529.997226] [<c0229587>] warn_slowpath+0x87/0xe0
[44529.997233] [<c024a779>] tick_program_event+0x39/0x50
[44529.997241] [<c024230e>] hrtimer_interrupt+0xee/0x250
[44529.997250] [<c021cbf0>] place_entity+0x40/0xb0
[44529.997255] [<c021eb7f>] enqueue_entity+0x11f/0x190
[44529.997261] [<c021ee1e>] enqueue_task_fair+0x2e/0x90
[44529.997268] [<c0243d5d>] sched_clock_cpu+0x14d/0x1a0
[44529.997276] [<c04b0bfd>] _spin_lock+0xd/0x30
[44529.997283] [<c0369d1f>] strlcpy+0x1f/0x60
[44529.997289] [<c0419711>] dev_watchdog+0x1f1/0x200
[44529.997294] [<c021db7e>] __wake_up+0x3e/0x60
[44529.997302] [<c0232f8d>] cascade+0x5d/0x80
[44529.997308] [<c0233160>] run_timer_softirq+0x130/0x1f0
[44529.997314] [<c0419520>] dev_watchdog+0x0/0x200
[44529.997320] [<c0419520>] dev_watchdog+0x0/0x200
[44529.997326] [<c022ec3f>] __do_softirq+0x7f/0x130
[44529.997332] [<c022ed45>] do_softirq+0x55/0x60
[44529.997337] [<c022ef45>] irq_exit+0x75/0x90
[44529.997346] [<c0212eb7>] smp_apic_timer_interrupt+0x67/0xa0
[44529.997353] [<c0203b00>] apic_timer_interrupt+0x28/0x30
[44529.997360] [<c0209292>] default_idle+0x42/0x50
[44529.997367] [<c020941f>] c1e_idle+0x2f/0xf0
[44529.997372] [<c0202353>] cpu_idle+0x63/0xa0
[44529.997381] [<c04ac293>] start_secondary+0x19e/0x2eb
[44529.997386] ---[ end trace 41b1e5c0ec95bcd1 ]---
[44530.008513] nfs: server 192.168.1.8 OK
[44530.008984] r8169: eth0: link up
[44540.508051] nfs: server 192.168.1.8 not responding, still trying
[44568.832060] nfs: server 192.168.1.8 not responding, still trying
[44650.011418] nfs: server 192.168.1.8 OK
[44650.012137] nfs: server 192.168.1.8 OK
Attached are the config, dmesg after booting and lspci -vx.
Tom
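The length of each NFS stall in a log like the one above can be read off the dmesg timestamps. The following is a minimal Python sketch, not part of any tool mentioned in this thread; the function name `nfs_outages` and the `SAMPLE` string (copied from the log lines above) are illustrative assumptions. It pairs each first "not responding" message with the next "OK" message and reports the time in between.

```python
import re

# Matches the NFS client messages shown in the log above, e.g.
# "[44469.085035] nfs: server 192.168.1.8 not responding, still trying"
NFS_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+nfs: server \S+ (not responding|OK)")


def nfs_outages(dmesg_text):
    """Return a list of (start, end, duration) tuples in seconds since boot."""
    outages = []
    start = None  # timestamp of the first "not responding" of the current stall
    for line in dmesg_text.splitlines():
        match = NFS_LINE.search(line)
        if not match:
            continue
        stamp, state = float(match.group(1)), match.group(2)
        if state == "not responding":
            if start is None:
                start = stamp
        elif start is not None:
            # First "OK" after a stall closes it; repeated "OK" lines are ignored.
            outages.append((start, stamp, round(stamp - start, 3)))
            start = None
    return outages


# Sample taken from the 2.6.29-rc7 log quoted above.
SAMPLE = """\
[44469.085035] nfs: server 192.168.1.8 not responding, still trying
[44469.677055] nfs: server 192.168.1.8 not responding, still trying
[44530.008513] nfs: server 192.168.1.8 OK
[44530.008984] r8169: eth0: link up
[44540.508051] nfs: server 192.168.1.8 not responding, still trying
[44568.832060] nfs: server 192.168.1.8 not responding, still trying
[44650.011418] nfs: server 192.168.1.8 OK
[44650.012137] nfs: server 192.168.1.8 OK
"""

if __name__ == "__main__":
    for start, end, duration in nfs_outages(SAMPLE):
        print(f"stall from {start:.3f} to {end:.3f}: {duration:.1f} s")
```

For the sample this gives two stalls of roughly 61 and 110 seconds, which fits the "10 seconds to 3 minutes" recovery time described above.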
Francois Romieu wrote:
> Michael Büker <[email protected]> :
> [...]
>
>> With both 2.6.27.19 and 2.6.28.7, I am experiencing "transmit timed out"
>> errors as reported by the netdev watchdog, for both my PCMCIA Ethernet
>> adapters, using the r8169 and 8139too drivers respectively.
>>
>
>
This seems to be the problem I also reported:
http://lkml.org/lkml/2009/2/16/121
> Can you describe the symptoms a bit more specifically ?
>
> The kernel displays a scary warning, I can guess that it is almost surely
> associated with some loss of network connectivity for a few seconds at the
> very least but it is a bit hard to figure the real scale of your problem.
>
> Please scare me. :o)
>
Besides the data I sent in my previous message, here is my dmesg output:
Hardware name:
NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Modules linked in: iptable_filter ip_tables x_tables joydev i915 drm
i2c_algo_bit af_packet snd_pcm_oss snd_mixer_oss microcode snd_seq
snd_seq_device binfmt_misc fuse loop dm_mod snd_hda_codec_realtek(N)
snd_hda_intel snd_hda_codec(N) snd_hwdep snd_pcm snd_timer iTCO_wdt snd
ppdev iTCO_vendor_support rtc_cmos r8169 soundcore i2c_i801 rtc_core
parport_pc button snd_page_alloc intel_agp mii i2c_core pcspkr rtc_lib
parport sg floppy raid456 async_xor async_memcpy async_tx xor raid0
ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore edd raid1 ext3 mbcache jbd
fan thermal processor thermal_sys hwmon ide_pci_generic ide_core
ata_generic ata_piix libata scsi_mod
Supported: Yes
Pid: 0, comm: swapper Tainted: G N
2.6.29-rc5-git3-master_20090221181736_632072f6-default #1
Call Trace:
[<ffffffff8020ff2d>] try_stack_unwind+0x70/0x127
[<ffffffff8020f0c0>] dump_trace+0x9a/0x2a6
[<ffffffff8020fc7e>] show_trace_log_lvl+0x4c/0x58
[<ffffffff8020fc9a>] show_trace+0x10/0x12
[<ffffffff804efbb1>] dump_stack+0x72/0x7b
[<ffffffff802483f7>] warn_slowpath+0xb1/0xed
[<ffffffff80480b41>] dev_watchdog+0x13c/0x202
[<ffffffff80251eda>] run_timer_softirq+0x1a3/0x232
[<ffffffff8024dedc>] __do_softirq+0xd6/0x1f2
[<ffffffff8020d83c>] call_softirq+0x1c/0x30
[<ffffffff8020ea10>] do_softirq+0x44/0x8f
[<ffffffff8024db87>] irq_exit+0x3f/0x7e
[<ffffffff8021f012>] smp_apic_timer_interrupt+0x93/0xac
[<ffffffff8020d1f3>] apic_timer_interrupt+0x13/0x20
DWARF2 unwinder stuck at apic_timer_interrupt+0x13/0x20
Leftover inexact backtrace:
<IRQ> <EOI> [<ffffffff80212e38>] ? mwait_idle+0x6e/0x7a
[<ffffffff8020b450>] ? enter_idle+0x22/0x24
[<ffffffff8020b4ab>] ? cpu_idle+0x59/0x9a
[<ffffffff804de0fd>] ? rest_init+0x61/0x63
---[ end trace 28260c20fab8b205 ]---
r8169: eth0: link up
r8169: eth0: link up
r8169: eth0: link up
r8169: eth0: link up
Just a few other hints toward a possible solution:
1) The problem only seems to happen on TX, as Michael states. If I RX a
large file, the NIC does not stop working, probably because the accompanying
TX traffic is too light to crash it...
2) In my post referred to above, only the PCIe card has this problem. The
other three PCI NICs work flawlessly.
3) The way I test it is simply to scp a large file out. If I
detect the stall of the transfer at an early stage, the NIC will
recover. If not, the NIC rarely recovers.
> [...]
>
>> as both kernel config files. I'll gladly provide more information as it is
>> requested.
>>
>
> lspci -vx and a complete dmesg.
>
> Can you identify a kernel which worked flawlessly ?
>
I'm performing a git bisect to try to find the patch that caused it.
Here is the current status:
git bisect start
# bad: [fec6c6fec3e20637bee5d276fb61dd8b49a3f9cc] Linux 2.6.29-rc7
git bisect bad fec6c6fec3e20637bee5d276fb61dd8b49a3f9cc
# good: [0215ffb08ce99e2bb59eca114a99499a4d06e704] Linux 2.6.19
git bisect good 0215ffb08ce99e2bb59eca114a99499a4d06e704
# good: [836341a70471ba77657b0b420dd7eea3c30a038b] mac80211: remove sta
TIM flag, fix expiry TIM handling
git bisect good 836341a70471ba77657b0b420dd7eea3c30a038b ( This is a
2.6.25-rc3-master_20090221181736_632072f6 )
The bisect will take a while as the system is a dual core Atom...
This bisect will take a while as my machine usually will not boot on
2.6.27 kernels...
If I get any further I'll let you know.
Regards,
Rui Santos
Am Sonntag, den 08.03.2009, 11:27 +0100 schrieb Tom Weber:
> Happens at least with 2.6.28, 2.6.28.4, 2.6.28.6 and 2.6.29-rc7.
>
> This is a diskless box (on NFS) on GigE LAN, used for DVB-S
> recordings/playback with vdr.
[...]
> Every once in a while, more likely with heavy usage of the vdr stuff (like
> recording two shows and/or cutting a recording), these show up. Of course
> the impact is really bad on a diskless system. It takes the system from
> 10 seconds to 3 minutes to recover.
After running 2.6.29-rc7 for more than a day now with lots of data moved
around I got the impression that it is more stable than 2.6.28.6.
I only got one warning so far and a few
r8169: eth0: link up
messages.
I know this is a vague statement, but I think I would have seen more
warnings with 2.6.28.6 and all the stuff I've done.
Below is the latest / only Call trace I've got since my earlier mail.
Note the 'eth0: link up' messages. This is where most likely the
2.6.28.6 would have barfed on me (cutting of recordings).
This is the complete dmesg output since the first warning. I didn't
remove anything before or between the 'link up' messages.
[2.6.29-rc7 - same configs etc as described in my earlier mail]
[ 320.501076] nfs: server 192.168.1.8 not responding, still trying
[ 321.031065] nfs: server 192.168.1.8 not responding, still trying
[ 321.997042] ------------[ cut here ]------------
[ 321.997050] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x1f1/0x200()
[ 321.997056] Hardware name: System Product Name
[ 321.997060] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[ 321.997063] Modules linked in: autofs4 powernow_k8 cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_performance freq_table pci_slot sbs ac battery video backlight output sbshc container sbp2 loop lnbp21 stv0299 snd_hda_codec_atihdmi snd_seq_dummy snd_seq_oss snd_hda_codec_realtek snd_seq_midi snd_hda_intel dvb_ttpci snd_hda_codec dvb_core snd_pcm_oss saa7146_vv saa7146 snd_rawmidi snd_mixer_oss videobuf_dma_sg evdev videobuf_core serio_raw psmouse videodev snd_seq_midi_event v4l1_compat ttpci_eeprom k8temp snd_pcm pcspkr snd_seq rtc_cmos rtc_core rtc_lib i2c_piix4 i2c_core snd_timer snd_seq_device button snd wmi soundcore snd_page_alloc af_packet pata_acpi ata_generic sg sr_mod cdrom ohci1394 pata_atiixp ehci_hcd ohci_hcd ahci ieee1394 libata usbcore scsi_mod raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod dm_mirror dm_region_hash dm_log dm_snapshot dm_mod thermal processor fan thermal_sys hwmon fuse
[ 321.997191] Pid: 0, comm: swapper Not tainted 2.6.29-rc7 #1
[ 321.997195] Call Trace:
[ 321.997207] [<c0229587>] warn_slowpath+0x87/0xe0
[ 321.997216] [<c0402177>] pskb_copy+0x27/0x160
[ 321.997222] [<c0402177>] pskb_copy+0x27/0x160
[ 321.997229] [<c0214829>] ack_apic_level+0x59/0x260
[ 321.997236] [<c0269c00>] rcu_sched_grace_period+0x1e0/0x2c0
[ 321.997242] [<c026897b>] handle_fasteoi_irq+0x8b/0xe0
[ 321.997249] [<c022eefd>] irq_exit+0x2d/0x90
[ 321.997256] [<c0204e6d>] do_IRQ+0x4d/0x90
[ 321.997266] [<c0246b41>] getnstimeofday+0x51/0x110
[ 321.997272] [<c02039e7>] common_interrupt+0x27/0x2c
[ 321.997279] [<c0369d1f>] strlcpy+0x1f/0x60
[ 321.997285] [<c0419711>] dev_watchdog+0x1f1/0x200
[ 321.997291] [<c0240039>] posix_cpu_timer_set+0x409/0x450
[ 321.997298] [<c0243d5d>] sched_clock_cpu+0x14d/0x1a0
[ 321.997306] [<c02126a0>] lapic_next_event+0x10/0x20
[ 321.997312] [<c02493e3>] clockevents_program_event+0xa3/0x170
[ 321.997320] [<c0232f8d>] cascade+0x5d/0x80
[ 321.997326] [<c0233160>] run_timer_softirq+0x130/0x1f0
[ 321.997332] [<c0419520>] dev_watchdog+0x0/0x200
[ 321.997337] [<c0419520>] dev_watchdog+0x0/0x200
[ 321.997342] [<c022ec3f>] __do_softirq+0x7f/0x130
[ 321.997361] [<c022ed45>] do_softirq+0x55/0x60
[ 321.997366] [<c022ef45>] irq_exit+0x75/0x90
[ 321.997373] [<c0212eb7>] smp_apic_timer_interrupt+0x67/0xa0
[ 321.997384] [<c0203b00>] apic_timer_interrupt+0x28/0x30
[ 321.997392] [<c0209292>] default_idle+0x42/0x50
[ 321.997398] [<c020941f>] c1e_idle+0x2f/0xf0
[ 321.997403] [<c0202353>] cpu_idle+0x63/0xa0
[ 321.997412] [<c04ac293>] start_secondary+0x19e/0x2eb
[ 321.997417] [<c020b9ce>] user_enable_single_step+0xe/0x10
[ 321.997422] ---[ end trace c509771bca9f9e70 ]---
[ 322.009305] nfs: server 192.168.1.8 OK
[ 322.009589] r8169: eth0: link up
[ 323.152320] nfs: server 192.168.1.8 OK
[ 323.152397] nfs: server 192.168.1.8 OK
[ 323.152508] nfs: server 192.168.1.8 OK
[68818.001610] r8169: eth0: link up
[69004.003443] r8169: eth0: link up
[90178.001567] r8169: eth0: link up
Tom
Rui Santos wrote:
> I'm performing a git bisect to try to find the patch that caused it.
> Here is the current status:
> git bisect start
> # bad: [fec6c6fec3e20637bee5d276fb61dd8b49a3f9cc] Linux 2.6.29-rc7
> git bisect bad fec6c6fec3e20637bee5d276fb61dd8b49a3f9cc
> # good: [0215ffb08ce99e2bb59eca114a99499a4d06e704] Linux 2.6.19
> git bisect good 0215ffb08ce99e2bb59eca114a99499a4d06e704
> # good: [836341a70471ba77657b0b420dd7eea3c30a038b] mac80211: remove sta
> TIM flag, fix expiry TIM handling
> git bisect good 836341a70471ba77657b0b420dd7eea3c30a038b ( This is a
> 2.6.25-rc3-master_20090221181736_632072f6 )
>
> The bisect will take a while as the system is a dual core Atom...
> This bisect will take a while as my machine usually will not boot on
> 2.6.27 kernels...
> If I get any further I'll let you know.
>
>
Hi. Just to add some extra data that may help to find the problem: I
have still not been able to fully complete the git bisect, but here is the
current status:
~# git bisect log
git bisect start
# bad: [d2f8d7ee1a9b4650b4e43325b321801264f7c37a] Linux 2.6.29-rc5
git bisect bad d2f8d7ee1a9b4650b4e43325b321801264f7c37a
# good: [3fa8749e584b55f1180411ab1b51117190bac1e5] Linux 2.6.27
git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5
# bad: [759ef89fb096c4a6ef078d3cfd5682ac037bd789] iwlwifi: change email
contact information
git bisect bad 759ef89fb096c4a6ef078d3cfd5682ac037bd789
# bad: [bdbf0ac7e187b2b757216e653e64f8b808b9077e] Merge branch
'hwmon-for-linus' of git://jdelvare.pck.nerim.net/jdelvare-2.6
git bisect bad bdbf0ac7e187b2b757216e653e64f8b808b9077e
# skip: [5c3c4d9b5810c9aabd8c05219c62ca088aa83eb0] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
git bisect skip 5c3c4d9b5810c9aabd8c05219c62ca088aa83eb0
# skip: [ead9d23d803ea3a73766c3cb27bf7563ac8d7266] Merge phase #4
(X2APIC, APIC unification, CPU identification unification) of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect skip ead9d23d803ea3a73766c3cb27bf7563ac8d7266
# skip: [fd048088306656824958e7783ffcee27e241b361] Merge branch
'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect skip fd048088306656824958e7783ffcee27e241b361
# skip: [14835a3325c1f84c3ae6eaf81102a3917e84809e] [CIFS] cifs: remove
pointless lock and unlock of GlobalMid_Lock in header_assemble
git bisect skip 14835a3325c1f84c3ae6eaf81102a3917e84809e
# skip: [1efd325fbadc02c1338e0ef676f0a6669b251c7a] Fix RTC wakealarm
sysfs interface breakage.
git bisect skip 1efd325fbadc02c1338e0ef676f0a6669b251c7a
# skip: [325af5fb1418c79953db0954556de048e061d8b6] x86: ioperm user_regset
git bisect skip 325af5fb1418c79953db0954556de048e061d8b6
# skip: [77e841de8abac4755cc83ca224fdf71418d65380] jbd2: abort when
failed to log metadata buffers
git bisect skip 77e841de8abac4755cc83ca224fdf71418d65380
# skip: [7d3c6f8717ee6c2bf6cba5fa0bda3b28fbda6015] md: Fix
rdev_size_store with size == 0
git bisect skip 7d3c6f8717ee6c2bf6cba5fa0bda3b28fbda6015
ALSO AN IMPORTANT NOTE: If I disable CONFIG_PCI_MSI, this problem does
not occur on my machine with any of the kernels. Can anyone corroborate?
Regards,
Rui Santos
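To see at a glance how far a bisect like the one above has progressed, the verdicts in the `git bisect log` output can be tallied. This is a minimal Python sketch under the assumption that each verdict appears on its own line as `git bisect <good|bad|skip> <sha>`, as in the log above; the function name `summarize_bisect_log` is hypothetical and the sample contains only the command lines copied from that log.

```python
from collections import Counter


def summarize_bisect_log(log_text):
    """Count good/bad/skip verdicts in the output of `git bisect log`."""
    verdicts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        # Comment lines start with "#", so only real command lines match here.
        if (len(parts) >= 4 and parts[:2] == ["git", "bisect"]
                and parts[2] in ("good", "bad", "skip")):
            verdicts[parts[2]] += 1
    return verdicts


# Command lines copied from the bisect log above (comment lines omitted).
SAMPLE = """\
git bisect start
git bisect bad d2f8d7ee1a9b4650b4e43325b321801264f7c37a
git bisect good 3fa8749e584b55f1180411ab1b51117190bac1e5
git bisect bad 759ef89fb096c4a6ef078d3cfd5682ac037bd789
git bisect bad bdbf0ac7e187b2b757216e653e64f8b808b9077e
git bisect skip 5c3c4d9b5810c9aabd8c05219c62ca088aa83eb0
git bisect skip ead9d23d803ea3a73766c3cb27bf7563ac8d7266
git bisect skip fd048088306656824958e7783ffcee27e241b361
git bisect skip 14835a3325c1f84c3ae6eaf81102a3917e84809e
git bisect skip 1efd325fbadc02c1338e0ef676f0a6669b251c7a
git bisect skip 325af5fb1418c79953db0954556de048e061d8b6
git bisect skip 77e841de8abac4755cc83ca224fdf71418d65380
git bisect skip 7d3c6f8717ee6c2bf6cba5fa0bda3b28fbda6015
"""

if __name__ == "__main__":
    counts = summarize_bisect_log(SAMPLE)
    print(dict(counts))
```

A large share of skipped commits, as here, matters because git may then only be able to narrow the culprit down to a range of candidate commits rather than a single one.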
Rui Santos wrote:
> If I get any further I'll let you know.
>
> Regards,
> Rui Santos
>
>
I finally finished the big bisect operation. The commit responsible for
this issue is commit
b726e493e8dc13537d1d7f8cd66bcd28516606c3
labeled
'r8169: sync existing 8168 device hardware start sequences with vendor
driver'
With that patch removed, I was able to get a sustained 40MB/s+ on a
20GB ftp transfer. I wasn't able to do it with the specified patch applied.
Just as a side note, I was never able to reproduce it on a 100MBit
connection, only on a full 1GBit one...
However, this patch cannot be reverted on the current kernel.
What do you think the next step should be ?
As usual, if you need any testing, please do let me know.
Regards,
Rui Santos
Rui Santos <[email protected]> :
[...]
> I finally finished the big bisect operation. The commit responsible for
> this issue is commit
> b726e493e8dc13537d1d7f8cd66bcd28516606c3
> labeled
> 'r8169: sync existing 8168 device hardware start sequences with vendor
> driver'
Thanks a lot for the bisect.
> With that patch removed, I was able to get a sustained 40MB/s+ on a
> 20GB ftp transfer. I wasn't able to do it with the specified patch applied.
> Just as a side note, I was never able to reproduce it on a 100MBit
> connection, only on a full 1GBit one...
>
> However, this patch cannot be reverted on the current kernel.
>
> What do you think the next step should be ?
> As usual, if you need any testing, please do let me know.
Can you see if one of the attached patches or a combination of them
makes a difference for the current (-git) kernel ?
--
Ueimor
On Sunday 22 March 2009 22:12:00 Francois Romieu wrote:
> Rui Santos <[email protected]> :
> [...]
> > I finally finished the big bisect operation. The commit responsible for
> > this issue is commit
> > b726e493e8dc13537d1d7f8cd66bcd28516606c3
> > labeled
> > 'r8169: sync existing 8168 device hardware start sequences with vendor
> > driver'
>
> Thanks a lot for the bisect.
>
> > With that patch removed, I was able to get a sustained 40MB/s+ on a
> > 20GB ftp transfer. I wasn't able to do it with the specified patch applied.
> > Just as a side note, I was never able to reproduce it on a 100MBit
> > connection, only on a full 1GBit one...
I can trigger it on a 100MBit connection, but it takes a _lot_ more time (something
like 4 hours or so of continuous traffic). On a 1GBit connection it takes only a few minutes.
> > However, this patch cannot be reverted on the current kernel.
> >
> > What do you think the next step should be ?
> > As usual, if you need any testing, please do let me know.
>
> Can you see if one of the attached patches or a combination of them
> makes a difference for the current (-git) kernel ?
Thanks a lot. I will try these. Will take some time, because this is a
production server and I'll have to take care to not completely bring it down. :)
--
Greetings, Michael.
Michael Buesch <[email protected]> :
[...]
> Thanks a lot. I will try these. Will take some time, because this is a
> production server and I'll have to take care to not completely bring it
> down. :)
Please note that Rui's device identifies as XID 0x3c400c0. Yours may need
something different (I do not remember seeing the XID in the thread back
in February).
--
Ueimor
On Sunday 22 March 2009 23:00:32 Francois Romieu wrote:
> Michael Buesch <[email protected]> :
> [...]
> > Thanks a lot. I will try these. Will take some time, because this is a
> > production server and I'll have to take care to not completely bring it
> > down. :)
>
> Please note that Rui's device identifies as XID 0x3c400c0. Yours may need
> something different (I do not remember seeing the XID in the thread back
> in February).
How do I find out? I don't find "XID" in the kernel log.
--
Greetings, Michael.
Michael Buesch <[email protected]> :
[...]
> How do I find out? I don't find "XID" in the kernel log.
Check the log level ?
[...]
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:00:0a.0: no PCI Express capability
eth2: RTL8110s at 0xf8992400, 00:09:5b:bc:e9:9e, XID 04000000 IRQ 17
--
Ueimor
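When the probe message has scrolled out of the kernel ring buffer, the XID can still be pulled from a saved log file. Below is a minimal Python sketch, assuming the probe line has the same layout as the r8169 output quoted above (`ethN: <chip> at 0x..., <MAC>, XID <8 hex digits> IRQ <n>`); the function name `find_xids` is hypothetical, and other driver versions may print the line differently.

```python
import re

# Layout assumed from the r8169 probe line quoted above.
PROBE_LINE = re.compile(
    r"(eth\d+): (\S+) at 0x[0-9a-f]+, "
    r"([0-9a-f]{2}(?::[0-9a-f]{2}){5}), XID ([0-9a-f]{8}) IRQ (\d+)"
)


def find_xids(log_text):
    """Map each interface name found in the log to its XID string."""
    return {m.group(1): m.group(4) for m in PROBE_LINE.finditer(log_text)}


# Sample copied from the log lines quoted above.
SAMPLE = """\
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:00:0a.0: no PCI Express capability
eth2: RTL8110s at 0xf8992400, 00:09:5b:bc:e9:9e, XID 04000000 IRQ 17
"""

if __name__ == "__main__":
    for iface, xid in find_xids(SAMPLE).items():
        print(iface, xid)
```

Because the pattern is searched for anywhere in the line, it should also cope with syslog-style prefixes (date, hostname, timestamp) in front of the message.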
On Sunday 22 March 2009 23:27:32 Francois Romieu wrote:
> Michael Buesch <[email protected]> :
> [...]
> > How do I find out? I don't find "XID" in the kernel log.
>
> Check the log level ?
>
> [...]
> r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> r8169 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> r8169 0000:00:0a.0: no PCI Express capability
> eth2: RTL8110s at 0xf8992400, 00:09:5b:bc:e9:9e, XID 04000000 IRQ 17
>
Ok, it scrolled out of the buffer due to massive warnings from all over
the kernel (there also are unrelated problems elsewhere).
:)
Mar 13 12:38:07 quimby kernel: [ 0.723831] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Mar 13 12:38:07 quimby kernel: [ 0.723912] r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 13 12:38:07 quimby kernel: [ 0.723978] r8169 0000:01:00.0: setting latency timer to 64
Mar 13 12:38:07 quimby kernel: [ 0.724073] r8169 0000:01:00.0: irq 383 for MSI/MSI-X
Mar 13 12:38:07 quimby kernel: [ 0.724305] eth0: RTL8168c/8111c at 0xffffc20000542000, 00:1c:c0:8d:2b:47, XID 3c4000c0 IRQ 383
--
Greetings, Michael.
On Sunday 22 March 2009 22:12:00 Francois Romieu wrote:
> Can you see if one of the attached patches or a combination of them
> makes a difference for the current (-git) kernel ?
I think I'm forced to run the test on 2.6.28.8, because current linus-git has
major bugs in it that make pppd fail completely, and pppd is required for proper operation
of the machine.
These two patches apply cleanly (with a small offset) to 2.6.28.8, so I guess
there's no problem with trying these on the stable kernel.
--
Greetings, Michael.
On Monday 23 March 2009 12:47:13 Michael Buesch wrote:
> On Sunday 22 March 2009 22:12:00 Francois Romieu wrote:
> > Can you see if one of the attached patches or a combination of them
> > makes a difference for the current (-git) kernel ?
>
> I think I'm forced to run the test on 2.6.28.8, because current linus-git has
> major bugs in it that make pppd fail completely, and pppd is required for proper operation
> of the machine.
>
> These two patches apply cleanly (with a small offset) to 2.6.28.8, so I guess
> there's no problem with trying these on the stable kernel.
>
Ok, it still happens with both patches applied to 2.6.28.8
[ 0.723814] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 0.723895] r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 0.723960] r8169 0000:01:00.0: setting latency timer to 64
[ 0.724055] r8169 0000:01:00.0: irq 383 for MSI/MSI-X
[ 0.724286] eth0: RTL8168c/8111c at 0xffffc20000542000, 00:1c:c0:8d:2b:47, XID 3c4000c0 IRQ 383
...
Mar 23 13:41:12 quimby kernel: [ 185.663615] kjournald starting. Commit interval 5 seconds
Mar 23 13:41:12 quimby kernel: [ 185.693026] EXT3 FS on dm-1, internal journal
Mar 23 13:41:12 quimby kernel: [ 185.693103] EXT3-fs: mounted filesystem with ordered data mode.
Mar 23 13:41:13 quimby kernel: [ 187.056244] usb 1-7: USB disconnect, address 3
Mar 23 13:41:18 quimby kernel: [ 192.003563] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Mar 23 13:41:18 quimby kernel: [ 192.004133] NFSD: starting 90-second grace period
Mar 23 13:41:22 quimby kernel: [ 195.752381] warning: `ntpd' uses 32-bit capabilities (legacy support in use)
Mar 23 13:45:13 quimby kernel: [ 426.804022] ------------[ cut here ]------------
Mar 23 13:45:13 quimby kernel: [ 426.804077] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x22e/0x240()
Mar 23 13:45:13 quimby kernel: [ 426.804136] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Mar 23 13:45:13 quimby kernel: [ 426.804181] Modules linked in:
Mar 23 13:45:13 quimby kernel: [ 426.804247] Pid: 0, comm: swapper Not tainted 2.6.28.8 #10
Mar 23 13:45:13 quimby kernel: [ 426.804291] Call Trace:
Mar 23 13:45:13 quimby kernel: [ 426.804331] <IRQ> [<ffffffff80240aed>] warn_slowpath+0xcd/0x110
Mar 23 13:45:13 quimby kernel: [ 426.804412] [<ffffffff80235a53>] __wake_up+0x43/0x70
Mar 23 13:45:13 quimby kernel: [ 426.804458] [<ffffffff8023776c>] find_busiest_group+0x1dc/0x970
Mar 23 13:45:13 quimby kernel: [ 426.804510] [<ffffffff8025d838>] getnstimeofday+0x58/0xe0
Mar 23 13:45:13 quimby kernel: [ 426.804559] [<ffffffff8040bd81>] strlcpy+0x41/0x50
Mar 23 13:45:13 quimby kernel: [ 426.804605] [<ffffffff8062a2ce>] dev_watchdog+0x22e/0x240
Mar 23 13:45:13 quimby kernel: [ 426.804652] [<ffffffff8023db82>] scheduler_tick+0xd2/0x230
Mar 23 13:45:13 quimby kernel: [ 426.804699] [<ffffffff8062a0a0>] dev_watchdog+0x0/0x240
Mar 23 13:45:13 quimby kernel: [ 426.804747] [<ffffffff8024b1ee>] run_timer_softirq+0x12e/0x200
Mar 23 13:45:13 quimby kernel: [ 426.804796] [<ffffffff8025acfc>] ktime_get+0xc/0x50
Mar 23 13:45:13 quimby kernel: [ 426.804843] [<ffffffff80246833>] __do_softirq+0x93/0x160
Mar 23 13:45:13 quimby kernel: [ 426.804892] [<ffffffff8020d49c>] call_softirq+0x1c/0x30
Mar 23 13:45:13 quimby kernel: [ 426.804939] [<ffffffff8020ee45>] do_softirq+0x35/0x70
Mar 23 13:45:13 quimby kernel: [ 426.804984] [<ffffffff80246525>] irq_exit+0x95/0xa0
Mar 23 13:45:13 quimby kernel: [ 426.805030] [<ffffffff802200c6>] smp_apic_timer_interrupt+0x86/0xd0
Mar 23 13:45:13 quimby kernel: [ 426.805080] [<ffffffff8020ceeb>] apic_timer_interrupt+0x6b/0x70
Mar 23 13:45:13 quimby kernel: [ 426.805126] <EOI> [<ffffffff8021f7f0>] lapic_next_event+0x0/0x20
Mar 23 13:45:13 quimby kernel: [ 426.805202] [<ffffffff8021469c>] mwait_idle+0x3c/0x50
Mar 23 13:45:13 quimby kernel: [ 426.805248] [<ffffffff8020b34e>] cpu_idle+0x5e/0xb0
Mar 23 13:45:13 quimby kernel: [ 426.805292] ---[ end trace 2d80c75815a19f25 ]---
Mar 23 13:45:13 quimby kernel: [ 426.821419] r8169: eth0: link up
--
Greetings, Michael.
On Sunday 22 March 2009, Francois Romieu wrote:
> Can you see if one of the attached patches or a combination of them
> makes a difference for the current (-git) kernel ?
I have not tried your patches, but I'd like to point out that I can also
reproduce this error with a card that uses the 8139too driver, so it seems
unlikely to me that the root of it all lies in the r8169 code.
Please don't be annoyed if this comment is utterly unqualified, it just seems
like a point that should be made :)
Michael
--
Installing Windows XP on a Macintosh is like giving a dolphin AIDS.
Michael Büker wrote:
> On Sunday 22 March 2009, Francois Romieu wrote:
>
> I have not tried your patches, but I'd like to point out that I can also
> reproduce this error with a card that uses the 8139too driver, so it seems
> unlikely to me that the root of it all lies in the r8169 code.
>
> Please don't be annoyed if this comment is utterly unqualified, it just seems
> like a point that should be made :)
>
There are also a few other notes that should be taken into
consideration, which I already mentioned:
1) I have 3 other NICs on the same machine that also use the r8169 driver.
I cannot reproduce this problem on any of them. These are PCI cards.
2) The card on which I can reproduce this problem is a PCI Express one
(on board).
3) If I compile the kernel without MSI/MSI-X support, I also can't
reproduce the problem. You may test it by booting with pci=nomsi.
I believe that Francois's patches (correct me if I'm wrong) are meant
to give him a clue as to where the problem lies. I'll test those
patches in a few hours and will report back the results.
> Michael
>
>
--
Regards,
Rui Santos
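Since point 3 above ties the problem to MSI, it helps to confirm whether a given NIC was actually set up with MSI at boot. The following minimal Python sketch assumes the driver logs a line of the form `r8169 <pci-address>: irq <n> for MSI/MSI-X`, as in the 2.6.28 boot log quoted earlier in the thread; the function name `r8169_msi_irqs` is hypothetical, and the absence of that line is only a hint, not proof, that MSI is off.

```python
import re

# Line format assumed from the boot log quoted earlier in the thread.
MSI_LINE = re.compile(
    r"r8169 ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): irq (\d+) for MSI/MSI-X"
)


def r8169_msi_irqs(log_text):
    """Map PCI address -> MSI interrupt number for r8169 devices in the log."""
    return {m.group(1): int(m.group(2)) for m in MSI_LINE.finditer(log_text)}


# Both samples are copied from boot logs quoted earlier in the thread.
WITH_MSI = """\
r8169 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
r8169 0000:01:00.0: setting latency timer to 64
r8169 0000:01:00.0: irq 383 for MSI/MSI-X
"""

WITHOUT_MSI = """\
r8169 0000:00:0a.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
r8169 0000:00:0a.0: no PCI Express capability
"""

if __name__ == "__main__":
    print("with MSI:", r8169_msi_irqs(WITH_MSI))
    print("without MSI:", r8169_msi_irqs(WITHOUT_MSI))
```

Running the same check on a boot with pci=nomsi would be one way to verify that the option took effect before starting a long stress test.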
On Monday 23 March 2009, Rui Santos wrote:
> 3) If I compile the kernel without MSI/MSI-X support, I also can't
> reproduce the problem. You may test it by booting with pci=nomsi
I hate to make things more complicated, but my laptop doesn't use MSI (in
fact, it doesn't even seem to support it), but still I see our error occur.
Michael
--
Num mihi dolebit hoc?
// This isn't going to hurt, is it?
From "Practical Latin" by Henry Beard
Michael Buesch <[email protected]> :
[...]
> Ok, it still happens with both patches applied to 2.6.28.8
So no change of behavior for you : kernel warning _and_ unreliable
network connectivity ?
--
Ueimor
On Tuesday 24 March 2009 00:47:30 Francois Romieu wrote:
> Michael Buesch <[email protected]> :
> [...]
> > Ok, it still happens with both patches applied to 2.6.28.8
>
> So no change of behavior for you : kernel warning _and_ unreliable
> network connectivity ?
Well, the connectivity never was "unreliable", but due to the frequent watchdog
resets there were connectivity stalls of 5 seconds or so whenever it happened.
--
Greetings, Michael.
Francois Romieu wrote:
> Can you see if one of the attached patches or a combination of them
> makes a difference for the current (-git) kernel ?
>
I've concluded the tests. Unfortunately, the problem still remains.
However, with both patches applied (and only with both), the connection speed
was higher and it took a lot longer to reproduce the problem, but
it did occur.
Regards,
Rui Santos
On Wednesday 25 March 2009 12:40:18 Rui Santos wrote:
> Francois Romieu wrote:
> > Can you see if one of the attached patches or a combination of them
> > makes a difference for the current (-git) kernel ?
> >
>
> I've concluded the tests. Unfortunately, the problem still remains.
> However, with both patches applied (and only with both), the connection speed
> was higher and it took a lot longer to reproduce the problem, but
> it did occur.
I'm currently testing 2.6.29.1 without any additional patches but
with the pci=nomsi boot option.
I haven't noticed any hiccups yet. I've been running a stress test on a GBit link for quite
some time now. Earlier tests with older kernels and MSI burped earlier.
I will do more testing. If it turns out this is stable I will test the same kernel
with Message Signaled Interrupts to see if that causes some breakage.
--
Greetings, Michael.