so ever since arch rolled out 2.6.31.x I've been having problems with
my network (again): I've been losing a large number of packets
(testing with mtr shows somewhere between 30% and 50% loss). at first I
figured it was the same problem I had in 2.6.30.x (and maybe it is?),
but that appeared to get fixed. when I started bisecting, the bug wasn't
apparent in 2.6.31.0, but I knew for sure it was in .5 (I couldn't
remember if I had noticed it again in .3).
I'm attaching the bisection log and a 'good' dmesg output.
c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9 is the first bad commit
I'm not going to pretend to understand why this patch is breaking my
networking but between bisection and testing it appears to be... I've
never bisected before and I'm definitely not a kernel hacker (I can
barely read C).
I should also note that the wireshark dump here
http://bugzilla.kernel.org/show_bug.cgi?id=13835 is related to this.
and if it's not the same bug then possibly a new one should be opened.
P.S. I'm not subscribed to the list, so please CC me.
--
Caleb Cushing
http://xenoterracide.blogspot.com
Adding netdev in CC. Original message + attachments follow.
not to be impatient or ungrateful, but has anyone had time to look at
this? will I be able to upgrade to 31.6? will 32.0 work?
On Sat, Oct 31, 2009 at 1:44 PM, Frans Pop <[email protected]> wrote:
> Adding netdev in CC. Original message + attachments follow.
--
Caleb Cushing
http://xenoterracide.blogspot.com
Caleb Cushing <[email protected]> writes:
>>
>> I'm attaching the bisection log and a 'good' dmesg output.
>>
>> c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9 is the first bad commit
Just gives fatal: bad object c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
here on a standard Linus linux-2.6 tree.
It might be also useful if you could describe what kind
of network devices you use and how you determine
the packet loss.
-Andi
--
[email protected] -- Speaking for myself only.
On Wednesday 11 November 2009, Andi Kleen wrote:
> Caleb Cushing <[email protected]> writes:
> >> I'm attaching the bisection log and a 'good' dmesg output.
> >>
> >> c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9 is the first bad commit
>
> Just gives fatal: bad object c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
> here on a standard Linus linux-2.6 tree.
Looks to be a commit from a stable update:
commit c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
Author: Alan Stern <[email protected]>
Date: Tue Sep 1 11:38:34 2009 -0400
usb-serial: change referencing of port and serial structures
commit 41bd34ddd7aa46dbc03b5bb33896e0fa8100fe7b upstream.
Cheers,
FJP
On Wed, Nov 11, 2009 at 5:05 PM, Frans Pop <[email protected]> wrote:
> On Wednesday 11 November 2009, Andi Kleen wrote:
>> Caleb Cushing <[email protected]> writes:
>> >> I'm attaching the bisection log and a 'good' dmesg output.
>> >>
>> >> c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9 is the first bad commit
>>
>> Just gives fatal: bad object c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
>> here on a standard Linus linux-2.6 tree.
>
> Looks to be a commit from a stable update:
>
> commit c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
> Author: Alan Stern <[email protected]>
> Date: Tue Sep 1 11:38:34 2009 -0400
>
> usb-serial: change referencing of port and serial structures
>
> commit 41bd34ddd7aa46dbc03b5bb33896e0fa8100fe7b upstream.
>
> Cheers,
> FJP
>
yeah it is. it's from greg kroah-hartman's tree.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On 11-11-2009 23:48, Caleb Cushing wrote:
> On Wed, Nov 11, 2009 at 5:05 PM, Frans Pop <[email protected]> wrote:
>> On Wednesday 11 November 2009, Andi Kleen wrote:
>>> Caleb Cushing <[email protected]> writes:
>>>>> I'm attaching the bisection log and a 'good' dmesg output.
>>>>>
>>>>> c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9 is the first bad commit
>>> Just gives fatal: bad object c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
>>> here on a standard Linus linux-2.6 tree.
>> Looks to be a commit from a stable update:
>>
>> commit c9fb3ded7a8a6769f3bcb3ef3d9aed61d3e376a9
>> Author: Alan Stern <[email protected]>
>> Date: Tue Sep 1 11:38:34 2009 -0400
>>
>>     usb-serial: change referencing of port and serial structures
>>
>>     commit 41bd34ddd7aa46dbc03b5bb33896e0fa8100fe7b upstream.
>>
>> Cheers,
>> FJP
>>
>
> yeah it is. it's from greg kroah-hartman's tree.
Could you answer the previous question too:
On 11-11-2009 22:47, Andi Kleen wrote:
...
> It might be also useful if you could describe what kind
> of network devices you use and how you determine
> the packet loss.
Btw, you didn't send the stats you compared, and your wireshark dump
doesn't show anything wrong either.
Jarek P.
> On 11-11-2009 22:47, Andi Kleen wrote:
> ...
>> It might be also useful if you could describe what kind
>> of network devices you use and how you determine
>> the packet loss.
>
> Btw, you didn't send the stats you compared, and your wireshark dump
> doesn't show anything wrong either.
>
> Jarek P.
>
I didn't see that, sorry. I wasn't sure whether the dump would show
anything or not (I'm not a networking expert, I just know more than the
average joe).
from dmesg (networking device)
e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k2
from lspci
00:19.0 Ethernet controller: Intel Corporation 82562V-2 10/100 Network
Connection (rev 02)
the attached PNGs show mtr output, with "bad" being when I have the
problem. for those not familiar, mtr sends an ICMP packet to each hop
within 1 second, then loops. when I'm having this kind of packet loss
(and sometimes it's higher) all services including dhcp, dns, and http
(web browsing) get flaky or don't work at all (I really can't browse
the web).
--
Caleb Cushing
http://xenoterracide.blogspot.com
Caleb Cushing wrote, On 11/12/2009 02:46 PM:
>> On 11-11-2009 22:47, Andi Kleen wrote:
>> ...
>>> It might be also useful if you could describe what kind
>>> of network devices you use and how you determine
>>> the packet loss.
>> Btw, you didn't send the stats you compared, and your wireshark dump
>> doesn't show anything wrong either.
>>
>> Jarek P.
>>
>
> I didn't see that sorry. I wasn't sure if the dump would or not (I'm
> not a networking expert, just know more than the average joe).
>
> from dmesg (networking device)
>
> e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k2
>
> from lspci
>
> 00:19.0 Ethernet controller: Intel Corporation 82562V-2 10/100 Network
> Connection (rev 02)
So I assume it's your only network device on this box and according
to these reports it's 192.168.1.3 with 192.168.1.1 as the gateway,
and your only change is kernel on this 192.168.1.3 box, right?
> the attached png's show mtr with bad being when I have the problem.
> for those not familiar mtr sends an icmp packet to each hop in 1
> second then loops. when I'm having this kind of packet loss (and
> sometimes it's higher) all services including dhcp, dns, and http (web
> browsing) get flaky, or don't work at all (I really can't browse the
> web).
Since the loss is seen on the first hop already, it seems it should be
enough to query 192.168.1.1 only - did you try this? If so, does this
happen from the beginning of the test or after many loops? Could you
repeat this wireshark dump with more data than before (but just
enough to be sure there are a few unanswered pings)? If possible it
would be nice to have wireshark or tcpdump data from 192.168.1.1 too,
while it is pinged from 192.168.1.3. Please send it gzipped to bugzilla
only, plus ifconfig eth0 before and after the test (and let us know here).
Btw, mtr has text reporting too (--report). Send larger things to
bugzilla only.
Thanks,
Jarek P.
Jarek Poplawski wrote, On 11/12/2009 08:04 PM:
> Caleb Cushing wrote, On 11/12/2009 02:46 PM:
...
>>> Btw, you didn't send the stats you compared, and your wireshark dump
>>> doesn't show anything wrong either.
...
>> I didn't see that sorry. I wasn't sure if the dump would or not (I'm
>> not a networking expert, just know more than the average joe).
Hmm... I didn't see that either, sorry! After re-checking I can see
unanswered requests in this dump. Anyway, the main thing to test now is
the first hop to 192.168.1.1 (some info about it?), as I wrote before.
Jarek P.
> So I assume it's your only network device on this box and according
> to these reports it's 192.168.1.3 with 192.168.1.1 as the gateway,
> and your only change is kernel on this 192.168.1.3 box, right?
yes, and semi-obviously that router is my box (Linksys WRT54GL running
OpenWrt Kamikaze 8.09.1).
> Since the loss is seen on the first hop already, it seems it should be
> enough to query 192.168.1.1 only - did you try this? If so, does this
> happen from the beginning of the test or after many loops? Could you
> try to repeat this wireshark dump with more data than before (but just
> to be sure there are a few unanswered pings). If possible it would be
> nice to have wireshark or tcpdump data from 192.168.1.1 too, while
> pinged from 192.168.1.3. Please, send it gzipped to bugzilla only plus
> ifconfig eth0 before and after the test (and let us know here).
same bug, or a new bug? I'll see what I can do to get a tcpdump from
the router. yes, I tried pinging only 192.168.1.1; I can tell within the
first 10 pings. I should say I don't notice it on every kernel boot;
it's ~80% of reboots (but that's pulled from my behind). I haven't
noticed it on gfa31221 at all. it's reproducible in 31.6 too (arch just
added that).
> Btw, mtr has text reporting too (--report). Larger things send to
> bugzilla only.
didn't know that, although I should have guessed (or rtfm), thanks.
--
Caleb Cushing
http://xenoterracide.blogspot.com
any specific switches I should run tcpdump with? or any other tests I
should be trying while capturing? (on either end).
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Fri, Nov 13, 2009 at 11:25:25AM -0500, Caleb Cushing wrote:
> > So I assume it's your only network device on this box and according
> > to these reports it's 192.168.1.3 with 192.168.1.1 as the gateway,
> > and your only change is kernel on this 192.168.1.3 box, right?
>
> yes, and semi obviously that router is my box (LinkSys wrt 54gl
> openwrt kamikaze 8.09.1.
>
> > Since the loss is seen on the first hop already, it seems it should be
> > enough to query 192.168.1.1 only - did you try this? If so, does this
> > happen from the beginning of the test or after many loops? Could you
> > try to repeat this wireshark dump with more data than before (but just
> > to be sure there are a few unanswered pings). If possible it would be
> > nice to have wireshark or tcpdump data from 192.168.1.1 too, while
> > pinged from 192.168.1.3. Please, send it gzipped to bugzilla only plus
> > ifconfig eth0 before and after the test (and let us know here).
>
> same bug? or new bug? I can see what I can do to get a tcpdump from
> the router. yes I tried that, I can tell within the first 10 pings. I
> should say I don't notice it on every kernel boot, it's ~80% of
> reboots (but that's pulled from my behind). but I haven't noticed it
> on gfa31221 at all. it's reproducible in 31.6 too (arch just added
> that).
Might be the same bugzilla report, I guess. We need to establish if
these pings reach 192.168.1.1, so a short test and tcpdump without any
special options just to get a few lost cases as seen on both sides.
(And ifconfigs before and after the test.)
Btw, could you check with lsmod if usbserial module is loaded before
this test? I'd like to verify this git bisection result. (If the
module is loaded or you have CONFIG_USB_SERIAL=y instead of m, try to
recompile the kernel with this option turned off, for this test.)
Thanks,
Jarek P.
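The lsmod check requested above can also be scripted. This is only a minimal sketch, assuming the contents of /proc/modules (the file lsmod formats) are passed in as text; the sample lines, sizes and addresses below are invented for illustration. It compares the first field only, because a module name can also show up in the "used by" column of another module. A driver built in with CONFIG_USB_SERIAL=y is not listed in /proc/modules at all, which is why the config option is asked about separately.

```python
def module_loaded(proc_modules_text, name):
    """Return True if `name` is the first field of any /proc/modules line."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


# Hypothetical /proc/modules content; only the layout is meant to be realistic.
SAMPLE = """\
e1000e 123456 0 - Live 0xf8000000
usbcore 234567 2 usbserial,ehci_hcd, Live 0xf8100000
"""

# usbserial appears only as a user of usbcore here, so it is not reported.
print(module_loaded(SAMPLE, "usbserial"))
print(module_loaded(SAMPLE, "e1000e"))
```

On a live system the text would come from reading /proc/modules, e.g. module_loaded(open("/proc/modules").read(), "usbserial").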
> Might be the same bugzilla report, I guess. We need to establish if
> these pings reach 192.168.1.1, so a short test and tcpdump without any
> special options just to get a few lost cases as seen on both sides.
> (And ifconfigs before and after the test.)
>
> Btw, could you check with lsmod if usbserial module is loaded before
> this test? I'd like to verify this git bisection result. (If the
> module is loaded or you have CONFIG_USB_SERIAL=y instead of m, try to
> recompile the kernel with this option turned off, for this test.)
sorry for taking so long to get back. busy, problematic times.
the dumps and ifconfigs are a bit less 'clean' because the router
serves several other computers (all windows, none of which have this
issue). the ifconfig -a from the router is below.
usbserial is not loaded. actually, from reading the patch submission I
suspected the official cause might be off... but I'm not a kernel
programmer; all I know is where I could see the loss during tests, and I
haven't been able to reproduce it over dozens of reboots with this
2.6.31.1-test-00091-gfa31221 kernel.
I totally forgot to do it during the dumps, so I hope these are still useful.
I haven't rebooted this in a few weeks (the router)
br-lan    Link encap:Ethernet HWaddr 00:1D:7E:F8:21:66
          inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:60613991 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67849334 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2172912561 (2.0 GiB) TX bytes:3999263405 (3.7 GiB)

eth0      Link encap:Ethernet HWaddr 00:1D:7E:F8:21:66
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:144116625 errors:0 dropped:0 overruns:0 frame:0
          TX packets:122639966 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1986512923 (1.8 GiB) TX bytes:1548485891 (1.4 GiB)
          Interrupt:4

eth0.0    Link encap:Ethernet HWaddr 00:1D:7E:F8:21:66
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:57318567 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62317675 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3466538358 (3.2 GiB) TX bytes:2132301174 (1.9 GiB)

eth0.1    Link encap:Ethernet HWaddr 00:1D:7E:F8:21:66
          inet addr:68.42.198.183 Bcast:255.255.255.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:86777655 errors:0 dropped:0 overruns:0 frame:0
          TX packets:60312064 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:205005516 (195.5 MiB) TX bytes:3162930981 (2.9 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:168 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19706 (19.2 KiB) TX bytes:19706 (19.2 KiB)

wl0       Link encap:Ethernet HWaddr 00:1D:7E:F8:21:68
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:5114480 errors:0 dropped:0 overruns:0 frame:720205
          TX packets:7576790 errors:1902 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:762579947 (727.2 MiB) TX bytes:3981402458 (3.7 GiB)
          Interrupt:2 Base address:0x5000

this is the ifconfig -a from my desktop while experiencing the issue
eth0      Link encap:Ethernet HWaddr 00:21:9B:06:4C:C9
          inet addr:192.168.1.3 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::221:9bff:fe06:4cc9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:3465 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4951 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:1467320 (1.3 Mb) TX bytes:631808 (617.0 Kb)
          Memory:fdfc0000-fdfe0000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:624 errors:0 dropped:0 overruns:0 frame:0
          TX packets:624 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:64397 (62.8 Kb) TX bytes:64397 (62.8 Kb)
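
Since ifconfig output was requested both before and after the test, the comparison amounts to subtracting the counters of two snapshots. A rough sketch of that, assuming the classic net-tools output format shown above: the first snapshot uses the desktop eth0 counters posted in this thread, while the "later" snapshot is invented purely to demonstrate the subtraction and is not real data.

```python
import re

# Matches the counter lines printed by the classic net-tools ifconfig, e.g.
#   RX packets:3465 errors:0 dropped:0 overruns:0 frame:0
COUNTER_LINE = re.compile(r"^\s*(RX|TX) packets:(\d+)\s+(.*)$")


def parse_counters(ifconfig_text):
    """Return {'RX': {...}, 'TX': {...}} for a single interface stanza."""
    counters = {}
    for line in ifconfig_text.splitlines():
        match = COUNTER_LINE.match(line)
        if not match:
            continue
        direction, packets, rest = match.groups()
        fields = {"packets": int(packets)}
        for item in rest.split():
            name, _, value = item.partition(":")
            fields[name] = int(value)
        counters[direction] = fields
    return counters


def counter_delta(before, after):
    """Subtract two parse_counters() results field by field."""
    return {
        direction: {
            name: after[direction][name] - before[direction][name]
            for name in before[direction]
        }
        for direction in before
    }


# Desktop eth0 counters as posted in the thread.
during = parse_counters("""
RX packets:3465 errors:0 dropped:0 overruns:0 frame:0
TX packets:4951 errors:0 dropped:0 overruns:0 carrier:0
""")

# Hypothetical later snapshot, invented only to show the arithmetic.
later = parse_counters("""
RX packets:3565 errors:0 dropped:0 overruns:0 frame:0
TX packets:5151 errors:0 dropped:0 overruns:0 carrier:0
""")

print(counter_delta(during, later))
```

A growing errors, dropped or overruns delta on either side would point at where packets disappear; all-zero deltas, as in the desktop snapshot above, mean the interface counters alone do not show the loss.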
--
Caleb Cushing
http://xenoterracide.blogspot.com
p.s. dumps are on the old bug here...
http://bugzilla.kernel.org/show_bug.cgi?id=13835
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Wed, Nov 18, 2009 at 04:59:03AM -0500, Caleb Cushing wrote:
> > Might be the same bugzilla report, I guess. We need to establish if
> > these pings reach 192.168.1.1, so a short test and tcpdump without any
> > special options just to get a few lost cases as seen on both sides.
> > (And ifconfigs before and after the test.)
> >
> > Btw, could you check with lsmod if usbserial module is loaded before
> > this test? I'd like to verify this git bisection result. (If the
> > module is loaded or you have CONFIG_USB_SERIAL=y instead of m, try to
> > recompile the kernel with this option turned off, for this test.)
>
> sorry for taking so long to get back. busy problematic times.
No problem, don't hurry.
>
> the dumps and ifconfigs are a bit less 'clean' because the router
> serves several other computers (none of which have this issue
> (windows)) here's the ifconfig -a from the router.
Actually, I'm a little bit surprised. Maybe I missed something from
your previous messages, but I expected something more similar to the
first wireshark dump, which suggested to me there was only this mtr
traffic. Now there is a lot more (plus we know it's not all).
So, there is a basic question: can this mtr loss be seen while no
other traffic is present? After looking into these current dumps I
doubt it. There are e.g. 3 pings unanswered between 09:21:50 and
09:21:52 (21:31:34 to 21:31:38 router time), but also a lot of tcp
packets to and from 192.168.1.3, so it looks like they were simply
dropped, and we can guess the reason.
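
The side-by-side reading of the two captures can be sketched in code. The clocks on the two machines clearly differ (09:21 on the desktop, 21:31 on the router), so this sketch matches probes by ICMP sequence number instead of timestamp; that matching key, the function name and the sample sets are all assumptions made for illustration, not data from the actual dumps.

```python
def classify_pings(desktop_requests, desktop_replies,
                   router_requests, router_replies):
    """Classify each echo request by where it disappears.

    Every argument is a set of ICMP sequence numbers taken from one
    capture: requests and replies as seen on the desktop and on the router.
    """
    result = {}
    for seq in sorted(desktop_requests):
        if seq in desktop_replies:
            result[seq] = "answered"
        elif seq not in router_requests:
            result[seq] = "request never reached router"
        elif seq not in router_replies:
            result[seq] = "router saw request but sent no reply"
        else:
            result[seq] = "reply sent by router but not seen on desktop"
    return result


# Invented sequence numbers, one for each possible outcome.
outcome = classify_pings(
    desktop_requests={1, 2, 3, 4, 5},
    desktop_replies={1, 5},
    router_requests={1, 2, 3, 5},
    router_replies={1, 2, 5},
)
for seq, verdict in outcome.items():
    print(seq, verdict)
```

Requests that never show up in the router capture suggest loss on the desktop's transmit side or on the wire, while replies the router sent but the desktop never saw suggest loss on the desktop's receive side.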
>
> usbserial is not loaded. actually from reading the patch submission I
> suspected the official cause might be off... but I'm not kernel
> programmer all I know is where I could see the loss during tests.and I
> haven't been able to reproduce over dozens of reboots from this
> 2.6.31.1-test-00091-gfa31221 kernel.
Since this patch from the bisection is really limited to this one
module, I doubt we should follow this direction. IMHO it shows the
test wasn't reproducible enough. Probably the amount and/or kind of
other traffic really matters. If I'm wrong and missed something again,
let me know. Btw, could you check whether changing the txqueuelen of
the desktop's eth0 from 100 to 1000 with ifconfig changes anything
in this mtr test?
Jarek P.
> this is the ifconfig -a from my desktop while experiencing the issue
>
> eth0 Link encap:Ethernet HWaddr 00:21:9B:06:4C:C9
> inet addr:192.168.1.3 Bcast:192.168.1.255 Mask:255.255.255.0
> inet6 addr: fe80::221:9bff:fe06:4cc9/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:3465 errors:0 dropped:0 overruns:0 frame:0
> TX packets:4951 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:100
> RX bytes:1467320 (1.3 Mb) TX bytes:631808 (617.0 Kb)
> Memory:fdfc0000-fdfe0000
> Actually, I'm a little bit surprised. Maybe I missed something from
> your previous messages, but I expected something more similar to the
> first wireshark dump, which suggested to me there was only this mtr
> traffic. Now there is a lot more (plus we know it's not all).
probably just me being lazy at 5 am. should I have done the dump on the
router so that it didn't show traffic that's just idling from other
computers (windows likes to make a lot of noise)? I could filter it by ip...
> So, there is a basic question: can this mtr loss be seen while no
> other traffic is present? After looking into these current dumps I
> doubt. There are e.g. 3 pings unanswered between 09:21:50 and
> 09:21:52 (21:31:34 to 21:31:38 router time), but a lot of tcp
> packets to and from 192.168.1.3, so looks like simply dropped and
> we can guess the reason.
yes. this was at a fairly low traffic time of day. at 5am only 2 people
were up, and I was using the other computer during the test. I've had
everyone actively doing one or more of
downloading/uploading/video/voip/gaming on this network with no
noticeable packet loss. if really, really needed I can probably
restrict this network to 2 machines for the duration of the test.
> Since this patch from the bisection is really limited to this one
> module I doubt we should follow this direction. IMHO it shows the
> test wasn't reproducible enough. Probably the amount and/or kind of
> other traffic really matter. If I'm wrong and missed something again
> let me know. Btw, could you try if changing with ifconfig the
> txqueuelen of desktop's eth0 from 100 to 1000 changes anything
> in this mtr test?
yeah, I'm testing it under my known working config first. I'll get back to you later.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Wed, Nov 18, 2009 at 01:21:19PM -0500, Caleb Cushing wrote:
> > So, there is a basic question: can this mtr loss be seen while no
> > other traffic is present? After looking into these current dumps I
> > doubt. There are e.g. 3 pings unanswered between 09:21:50 and
> > 09:21:52 (21:31:34 to 21:31:38 router time), but a lot of tcp
> > packets to and from 192.168.1.3, so looks like simply dropped and
> > we can guess the reason.
>
> yes. this was at a fairly low traffic time of day. 5am only 2 people
> were up, and I was using the other computer during. I've had everyone
> actively doing one or more of downloading/uploading/video/voip/gaming
> stuff on this network with no noticeable packet loss. if really,
> really needed I can probably restrict this network to 2 machines for
> the duration of the test.
Alas, "fairly low traffic" can have fairly high surges, so it's not
easy to compare. Anyway, try to check, if it's still available, whether
there were any messages from the NIC in syslog etc. during this test
(~09:21:50).
>
> > Since this patch from the bisection is really limited to this one
> > module I doubt we should follow this direction. IMHO it shows the
> > test wasn't reproducible enough. Probably the amount and/or kind of
> > other traffic really matter. If I'm wrong and missed something again
> > let me know. Btw, could you try if changing with ifconfig the
> > txqueuelen of desktop's eth0 from 100 to 1000 changes anything
> > in this mtr test?
>
> yeah testing it under my known working config first. I'll get back w/ you later.
Btw, since dropping at the hardware (NIC) level seems more likely to me,
could you send 'ethtool eth0' and 'ethtool -S eth0' after such tests
(both sides)?
Jarek P.
On Wed, Nov 18, 2009 at 09:10:34PM +0100, Jarek Poplawski wrote:
> On Wed, Nov 18, 2009 at 01:21:19PM -0500, Caleb Cushing wrote:
> > yeah testing it under my known working config first. I'll get back w/ you later.
>
> Btw, since dropping at hardware (NIC) level seems more likely to me,
> could you send 'ethtool eth0', and 'ethtool -S eth0' after such tests
> (both sides).
Hmm... and 'netstat -s' before and after the test (both sides).
Jarek P.
haven't had time to do a test yet. but would it be of any use for you
all for me to throw another nic (it'd be a different driver for sure)
in this box and test that on a problematic kernel? I have some but not
with me.
On Wed, Nov 18, 2009 at 5:38 PM, Jarek Poplawski <[email protected]> wrote:
> On Wed, Nov 18, 2009 at 09:10:34PM +0100, Jarek Poplawski wrote:
>> On Wed, Nov 18, 2009 at 01:21:19PM -0500, Caleb Cushing wrote:
>> > yeah testing it under my known working config first. I'll get back w/ you later.
>>
>> Btw, since dropping at hardware (NIC) level seems more likely to me,
>> could you send 'ethtool eth0', and 'ethtool -S eth0' after such tests
>> (both sides).
>
> Hmm... and 'netstat -s' before and after the test (both sides).
>
> Jarek P.
>
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Sun, Nov 22, 2009 at 02:35:10PM -0500, Caleb Cushing wrote:
> haven't had time to do a test yet. but would it be of any use for you
> all for me to throw another nic (it'd be a different driver for sure)
> in this box and test that on a problematic kernel? I have some but not
> with me.
Of course it would be useful. Especially if you find new bugs. ;-)
I'm not sure it's the fastest way to diagnose this problem, but if
it's not a problem for you...
Btw, currently I don't think this dropping necessarily means there is
a bug. It could be the opposite, a feature: e.g. when a new kernel
can transmit faster, dropping can then happen in some other, slower
place.
Jarek P.
> Btw, currently I don't consider this dropping means there has to be
> a bug. It could be otherwise - a feature... e.g. when a new kernel
> can transmit faster (then dropping in some other, slower place can
> happen).
um... where would it be dropping that we wouldn't have a bug? I mean
sure faster is great... but if it makes my network not work right...
I've added all (I think) of the information you've asked for to the bug
http://bugzilla.kernel.org/show_bug.cgi?id=13835 except for ethtool
and netstat on the router side. ethtool complains about not having
driver or capability (maybe because it's a 2.4 kernel?) and the router's
version of netstat doesn't support -s. I disabled everything that I
can think of that would send/receive packets before doing the test
client side, except dhcp/dns; the windows boxes were probably sending
some broadcasts too. but the traffic should be pretty low. I did
remember to set the txqueuelen; it didn't seem to make a difference.
the only error I see in dmesg is
e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
but it's in working versions too.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Tue, Nov 24, 2009 at 01:17:09AM -0500, Caleb Cushing wrote:
> > Btw, currently I don't consider this dropping means there has to be
> > a bug. It could be otherwise - a feature... e.g. when a new kernel
> > can transmit faster (then dropping in some other, slower place can
> > happen).
>
> um... where would it be dropping that we wouldn't have a bug? I mean
> sure faster is great... but if it makes my network not work right...
E.g. if it were dropped because of a queue overflow (but it doesn't
seem to be the case, at least at your box) or because of memory
problems while handling a lot of traffic.
>
> I've added all (I think) information you've asked for to the bug
> http://bugzilla.kernel.org/show_bug.cgi?id=13835 except for ethtool
> and netstat on the router side. ethtool complains about not having
> driver or capability (maybe because it's a 2.4 kernel?) and the
> version of netstat doesn't support -s. I disabled everything that I
> can think of that would send/receive packets before doing the test
> client side, except dhcp/dns windows box's were probably sending some
> broadcasts too. but the traffic should be pretty low. I did remember
> to set the txqueuelen didn't seem to make a difference
Alas, it's not all the information I asked for. E.g. "netstat -s before
faulty kernel" and "netstat -s after faulty kernel" seem to be the same
file: netstat_after.slave4.log.gz. Anyway, since there are problems with
getting stats from the router we still can't compare them, or check
for the dropped stats. (Btw, could you also check
/proc/net/softnet_stat?)
So, it might be the kernel problem you reported, but there is not
enough data to prove it. Then my proposal is to try to repeat this
problem in more "testing friendly" conditions - preferably against
some other, more up-to-date linux box, if possible?
> only error in dmesg I see is
>
> e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
I added e1000e maintainers to CC to have a look at this warning.
Jarek P.
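For reading /proc/net/softnet_stat, mentioned above: the file has one row per CPU, and every column is a hexadecimal counter. The first three columns are packets processed, packets dropped because the backlog queue was full, and time_squeeze (the softirq ran out of budget with work left). A small sketch of decoding those three columns follows; the sample rows are made up for illustration and the remaining columns are ignored since their meaning varies between kernel versions.

```python
def parse_softnet_stat(text):
    """Return one dict per row (one row per CPU) of /proc/net/softnet_stat."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


# Hypothetical content for a two-CPU machine; the values are invented.
SAMPLE = """\
000003e8 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000
000007d0 0000000a 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""

stats = parse_softnet_stat(SAMPLE)
for cpu, row in enumerate(stats):
    print(cpu, row)
```

A non-zero and growing "dropped" value between two readings would mean packets were discarded in the kernel's receive backlog and not by the NIC.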
On Tue, Nov 24, 2009 at 11:19:46AM +0000, Jarek Poplawski wrote:
...
> Alas it's not all information I asked. E.g. "netstat -s before faulty
> kernel" and "netstat -s after faulty kernel" seem to be the same file:
> netstat_after.slave4.log.gz.
On the other hand, there is a lot of tcp retransmits there:
Tcp:
17 active connections openings
0 passive connection openings
14 failed connection attempts
0 connection resets received
0 connections established
45 segments received
49 segments send out
19 segments retransmited
0 bad segments received.
19 resets sent
So it might still point at the driver. It would be interesting to see
more of this: could you repeat "netstat -s" and "ethtool -S eth0"
after rebooting with both kernels and doing a few minutes of similar
tcp activities (against the router or some other "good" site)? Btw,
please remind us of the exact kernel versions. If you can, try
2.6.32-rc8 instead of 2.6.31.
Jarek P.
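To put a number on "a lot of tcp retransmits", the counters quoted above can be compared directly. This is only a sketch: the text is the Tcp block exactly as quoted in this message (spelling included), and the helper name tcp_counter is made up for the example.

```python
import re

# The Tcp section of "netstat -s" as quoted in the message above.
TCP_STATS = """\
17 active connections openings
0 passive connection openings
14 failed connection attempts
0 connection resets received
0 connections established
45 segments received
49 segments send out
19 segments retransmited
0 bad segments received.
19 resets sent
"""


def tcp_counter(text, label):
    """Return the integer that precedes `label` on a line of its own."""
    pattern = r"^\s*(\d+)\s+" + re.escape(label) + r"\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise KeyError(label)
    return int(match.group(1))


sent = tcp_counter(TCP_STATS, "segments send out")
retransmitted = tcp_counter(TCP_STATS, "segments retransmited")
ratio = retransmitted / sent

print(f"{retransmitted} retransmitted vs {sent} sent out: {ratio:.1%}")
```

With 19 retransmitted segments against 49 sent out, the ratio is roughly 39%, and 14 of the 17 active connection openings failed, which is consistent with the 30-50% loss seen in mtr.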
>-----Original Message-----
>From: Jarek Poplawski [mailto:[email protected]]
>Sent: Tuesday, November 24, 2009 3:20 AM
>To: Caleb Cushing
>Cc: [email protected]; [email protected]; Frans Pop;
>Brandeburg, Jesse; [email protected]; Andi Kleen; Kirsher,
>Jeffrey T
>Subject: Re: [E1000-devel] large packet loss take2 2.6.31.x
>
>On Tue, Nov 24, 2009 at 01:17:09AM -0500, Caleb Cushing wrote:
>> > Btw, currently I don't consider this dropping means there has to be
>> > a bug. It could be otherwise - a feature... e.g. when a new kernel
>> > can transmit faster (then dropping in some other, slower place can
>> > happen).
>>
>> um... where would it be dropping that we wouldn't have a bug? I mean
>> sure faster is great... but if it makes my network not work right...
>
>E.g. if it were dropped because of a queue overflow (but it doesn't
>seem to be the case, at least at your box) or because of memory
>problems while handling a lot of traffic.
>
>>
>> I've added all (I think) information you've asked for to the bug
>> http://bugzilla.kernel.org/show_bug.cgi?id=13835 except for ethtool
>> and netstat on the router side. ethtool complains about not having
>> driver or capability (maybe because it's a 2.4 kernel?) and the
>> version of netstat doesn't support -s. I disabled everything that I
>> can think of that would send/receive packets before doing the test
>> client side, except dhcp/dns; windows boxes were probably sending some
>> broadcasts too. but the traffic should be pretty low. I did remember
>> to set the txqueuelen; it didn't seem to make a difference
>
>Alas it's not all information I asked. E.g. "netstat -s before faulty
>kernel" and "netstat -s after faulty kernel" seem to be the same file:
>netstat_after.slave4.log.gz. Anyway, since there are problems with
>getting stats from the router we still can't compare them, or check
>for the dropped stats. (Btw, could you check for /proc/net/softnet_stat
>yet?)
>
>So, it might be the kernel problem you reported, but there is not
>enough data to prove it. Then my proposal is to try to repeat this
>problem in more "testing friendly" conditions - preferably against
>some other, more up-to-date linux box, if possible?
>
>> only error in dmesg I see is
>>
>> e1000e 0000:00:19.0: pci_enable_pcie_error_reporting failed 0xfffffffb
>
>I added e1000e maintainers to CC to have a look at this warning.
>
>Jarek P.
The "pci_enable_pcie_error_reporting failed" message is a non-fatal warning that has recently been removed.
On Tue, Nov 24, 2009 at 07:57:41AM -0800, Allan, Bruce W wrote:
> The "pci_enable_pcie_error_reporting failed" message is a non-fatal warning that has recently been removed.
>
Thanks for the explanation,
Jarek P.
> Alas it's not all information I asked. E.g. "netstat -s before faulty
> kernel" and "netstat -s after faulty kernel" seem to be the same file:
> netstat_after.slave4.log.gz.
sorry I guess I misunderstood what you wanted? (or maybe I just dorked
it when I created all the files) I upped a netstat -s from a good
kernel shortly after reboot.
> Anyway, since there are problems with
> getting stats from the router we still can't compare them, or check
> for the dropped stats. (Btw, could you check for /proc/net/softnet_stat
> yet?)
router? good kernel? bad kernel?
> So, it might be the kernel problem you reported, but there is not
> enough data to prove it. Then my proposal is to try to repeat this
> problem in more "testing friendly" conditions - preferably against
> some other, more up-to-date linux box, if possible?
yeah 2.6.31.6 works on my laptop fine I'll just have to see about
getting a direct connection to it. probably do that after I bring the
other NIC back from home, on thanksgiving.
--
Caleb Cushing
http://xenoterracide.blogspot.com
> yeah 2.6.31.6 works on my laptop fine I'll just have to see about
> getting a direct connection to it. probably do that after I bring the
> other NIC back from home, on thanksgiving.
meh scratch that sorta.. screen mounting brackets/joint on laptop
broke... I'm thinking of scrapping it for a netbook. but replacing may
take a few weeks.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Wed, Nov 25, 2009 at 09:06:30AM -0500, Caleb Cushing wrote:
> > Anyway, since there are problems with
> > getting stats from the router we still can't compare them, or check
> > for the dropped stats. (Btw, could you check for /proc/net/softnet_stat
> > yet?)
>
> router? good kernel? bad kernel?
router.
>
> > So, it might be the kernel problem you reported, but there is not
> > enough data to prove it. Then my proposal is to try to repeat this
> > problem in more "testing friendly" conditions - preferably against
> > some other, more up-to-date linux box, if possible?
>
> yeah 2.6.31.6 works on my laptop fine I'll just have to see about
> getting a direct connection to it. probably do that after I bring the
> other NIC back from home, on thanksgiving.
This other NIC is a really good idea, so let's wait and see.
Happy Thanksgiving!
Jarek P.
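
For the /proc/net/softnet_stat check on the router mentioned above, here is a small Python sketch of how that file is usually read: one line per CPU, hexadecimal columns, of which the first three are, as far as I know, packets processed, packets dropped (backlog queue full) and time_squeeze. The sample line and the function name are hypothetical, and the column layout on an old 2.4 router kernel is an assumption that should be checked against that kernel's source.

```python
# Hypothetical single-CPU sample, not taken from the router in this thread.
SAMPLE_SOFTNET = (
    "0000a1b2 00000003 00000001 00000000 00000000 "
    "00000000 00000000 00000000 00000000\n"
)


def parse_softnet_stat(text):
    """Return one dict per CPU with the first three softnet_stat counters."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "cpu": len(rows),
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


if __name__ == "__main__":
    for row in parse_softnet_stat(SAMPLE_SOFTNET):
        print(row)
```

If the router has no Python, the output of "cat /proc/net/softnet_stat" can be copied to another box and passed to the function as text. A non-zero and growing second column would suggest drops on the receive backlog.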
2.6.32-rc8 seemed to be affected (a guess, because my net didn't come up
on reboot; further testing will likely verify). also, during reboots I
found out that the version I've been thinking is good is afflicted too. I
suppose maybe I should try bisecting again, starting from that point?
not sure it'll do us much good if that version was able to slip by for
so long. I really hate intermittent bugs.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Fri, Nov 27, 2009 at 01:07:53PM -0500, Caleb Cushing wrote:
> 2.6.32-rc8 seemed to be affected (a guess, because my net didn't come up
> on reboot; further testing will likely verify). also, during reboots I
> found out that the version I've been thinking is good is afflicted too. I
> suppose maybe I should try bisecting again, starting from that point?
> not sure it'll do us much good if that version was able to slip by for
> so long. I really hate intermittent bugs.
I doubt bisecting is a good idea with such an unpredictable bug. First, you
should make sure it's not a hardware problem, so go back to the kernel
you trust most, and give it a really long try with a few recompilations
after slightly changing the config. Btw, I wonder if you tried e1000e
module parameters like IntMode=0 or 1.
Jarek P.
> I doubt bisecting is a good idea with such an unpredictable bug. First, you
> should make sure it's not a hardware problem, so go back to the kernel
> you trust most, and give it a really long try with a few recompilations
> after slightly changing the config. Btw, I wonder if you tried e1000e
> module parameters like IntMode=0 or 1.
no, how do I set those?
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Fri, Nov 27, 2009 at 05:35:34PM -0500, Caleb Cushing wrote:
> > I doubt bisecting is a good idea with such an unpredictable bug. First, you
> > should make sure it's not a hardware problem, so go back to the kernel
> > you trust most, and give it a really long try with a few recompilations
> > after slightly changing the config. Btw, I wonder if you tried e1000e
> > module parameters like IntMode=0 or 1.
>
> no, how do I set those?
modprobe -r e1000e
modprobe e1000e IntMode=0
Jarek P.
>
> modprobe -r e1000e
> modprobe e1000e IntMode=0
>
> Jarek P.
>
tested on a kernel that is behaving properly: no change. what do these modes do?
I've installed a 10/100 linksys nic into my system. it appears to be
working fine on a bad kernel (2.6.32-final tested and for sure
verified). I've only tested it once though. my laptop died so direct
connection between that won't work. can I test between these 2 nics?
(suppose no real reason why not) but what should I proceed with at
this point?
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Thu, Dec 03, 2009 at 08:49:17PM -0500, Caleb Cushing wrote:
> >
> > modprobe -r e1000e
> > modprobe e1000e IntMode=0
> >
> > Jarek P.
> >
> tested on a kernel that is behaving properly: no change. what do these modes do?
e1000e by default uses MSI-X interrupts if possible, which are most
modern. If there are some problems IntMode lets us try older types,
so I rather meant it for the misbehaving kernel.
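
One way to see which interrupt type the driver ended up with after loading it with IntMode is to look at the eth0 line in /proc/interrupts. The Python sketch below does that. To my knowledge the e1000e values are 0 = legacy, 1 = MSI, 2 = MSI-X, but please check the driver documentation. The sample text, the interface name and the function names are illustrative assumptions, and the exact labels (e.g. "PCI-MSI-edge") vary between kernel versions.

```python
# Hypothetical /proc/interrupts excerpt, not taken from the box in this thread.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
 16:        120          0   IO-APIC-fasteoi   uhci_hcd:usb3
 29:      51234          0   PCI-MSI-edge      eth0
"""


def interrupt_labels(text, ifname="eth0"):
    """Return the interrupt controller label of each line that names ifname."""
    labels = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue  # header line or something unexpected
        names = [f.rstrip(",") for f in fields]
        # Match "eth0" itself and per-vector names such as "eth0-rx-0".
        if not any(n == ifname or n.startswith(ifname + "-") for n in names):
            continue
        for field in fields[1:]:
            # "PIC" also matches "IO-APIC-..." and "XT-PIC".
            if "MSI" in field or "PIC" in field:
                labels.append(field)
                break
    return labels


def looks_like_msi(labels):
    """True if any matching line is routed through MSI or MSI-X."""
    return any("MSI" in label for label in labels)


if __name__ == "__main__":
    labels = interrupt_labels(SAMPLE_INTERRUPTS)
    print(labels, "MSI/MSI-X" if looks_like_msi(labels) else "legacy")
```

On a live system the text would come from reading /proc/interrupts. MSI and MSI-X typically carry the same label, so several per-vector lines for one interface are the usual hint that MSI-X is in use.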
>
> I've installed a 10/100 linksys nic into my system. it appears to be
> working fine on a bad kernel (2.6.32-final tested and for sure
> verified). I've only tested it once though. my laptop died so direct
> connection between that won't work. can I test between these 2 nics?
> (suppose no real reason why not) but what should I proceed with at
> this point?
If you have it fixed easily with another nic you should first
reconsider if this debugging is worth your time. Of course it
could be very useful for the kernel (unless it's a hardware fault),
but on the other hand this is a popular nic, tested by many people.
Then, if you find time for such testing, I'd suggest to try mainly
2.6.32 - I'm not sure if you tried it with e1000e. So if after longer
testing both linksys and e1000e you find only the latter has problems
I think you should open the new report in bugzilla for e1000e and
submit things like: dmesg, .config, lspci -v from 2.6.32, and if
possible the same things from the last kernel which didn't have these
problems. Add some references to previous attempts in bugzilla and
this thread. (Btw, any reproducible tests should be fine.)
Jarek P.
On Fri, Dec 4, 2009 at 4:05 AM, Jarek Poplawski <[email protected]> wrote:
> On Thu, Dec 03, 2009 at 08:49:17PM -0500, Caleb Cushing wrote:
>> >
>> > modprobe -r e1000e
>> > modprobe e1000e IntMode=0
>> >
>> > Jarek P.
>> >
>> tested on a kernel that is behaving properly: no change. what do these modes do?
>
> e1000e by default uses MSI-X interrupts if possible, which are most
> modern. If there are some problems IntMode lets us try older types,
> so I rather meant it for the misbehaving kernel.
>
>>
>> I've installed a 10/100 linksys nic into my system. it appears to be
>> working fine on a bad kernel (2.6.32-final tested and for sure
>> verified). I've only tested it once though. my laptop died so direct
>> connection between that won't work. can I test between these 2 nics?
>> (suppose no real reason why not) but what should I proceed with at
>> this point?
>
> If you have it fixed easily with another nic you should first
> reconsider if this debugging is worth your time. Of course it
> could be very useful for the kernel (unless it's a hardware fault),
> but on the other hand this is a popular nic, tested by many people.
trying to figure out if it's hardware, I wish I'd figured that out a
month ago because dell would have been shipping me a new mobo at that
point... oh well... I'm starting to think it is but given the age of
the computer... (just over 1 year now) I'm not happy about that. my
nic is a 10/100 card the e1000e is obviously gigabit though I don't
think I've a need (or network) for the gigabit atm
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Fri, Dec 04, 2009 at 01:28:26PM -0500, Caleb Cushing wrote:
> On Fri, Dec 4, 2009 at 4:05 AM, Jarek Poplawski <[email protected]> wrote:
> > If you have it fixed easily with another nic you should first
> > reconsider if this debugging is worth your time. Of course it
> > could be very useful for the kernel (unless it's a hardware fault),
> > but on the other hand this is a popular nic, tested by many people.
>
> trying to figure out if it's hardware, I wish I'd figured that out a
> month ago because dell would have been shipping me a new mobo at that
> point... oh well... I'm starting to think it is but given the age of
> the computer... (just over 1 year now) I'm not happy about that. my
> nic is a 10/100 card the e1000e is obviously gigabit though I don't
> think I've a need (or network) for the gigabit atm
For now there is no proof it's hardware, so don't worry ;-) It might
be firmware, bios etc. And might be kernel too. I meant it's hard to
debug, considering your bisection results, but easy to avoid with
other hardware, so it's up to you.
Btw, maybe we(?!) should've done it earlier, but I just did some googling,
and it looks like these NICs aren't as innocent as I assumed. I'm just
looking here:
https://bugs.launchpad.net/ubuntu/+bug/382671
and there:
http://bugzilla.kernel.org/show_bug.cgi?id=11998
and maybe it's a bit different story, but the actors are mainly the same.
So, again, if you're willing to debug this, the new bugzilla report
seems reasonable to me, plus maybe some notice to this #11998 too.
Jarek P.
> So, again, if you're willing to debug this, the new bugzilla report
> seems reasonable to me, plus maybe some notice to this #11998 too.
I will later today. I'm thinking since I now have 2 active nics...
would hooking the 1 card directly to the other and then running the
tests be helpful? (I wish this mobo had a 3rd pci slot, because I'd
have put a 3rd card in so I could connect to the net at the same time.)
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Fri, Dec 04, 2009 at 06:07:42PM -0500, Caleb Cushing wrote:
> > So, again, if you're willing to debug this, the new bugzilla report
> > seems reasonable to me, plus maybe some notice to this #11998 too.
>
> I will later today. I'm thinking since I now have 2 active nics...
> would hooking the 1 card directly to the other and then running the
> tests be helpful? (I wish this mobo had a 3rd pci slot, because I'd
> have put a 3rd card in so I could connect to the net at the same time.)
I guess, you should better wait with new tests for some assistance
from e1000e maintainers - it seems they might be interested in some
specific dumps and registers - like in this #11998 case.
Jarek P.
> I guess, you should better wait with new tests for some assistance
> from e1000e maintainers - it seems they might be interested in some
> specific dumps and registers - like in this #11998 case.
I reported here.
http://bugzilla.kernel.org/show_bug.cgi?id=14737
--
Caleb Cushing
http://xenoterracide.blogspot.com
I sadly wonder if this is why Dell pulled their 530n product line for
ubuntu; the last time I checked, I didn't see whether it had been replaced.
--
Caleb Cushing
http://xenoterracide.blogspot.com
On Sat, Dec 05, 2009 at 02:06:09AM -0500, Caleb Cushing wrote:
> > I guess, you should better wait with new tests for some assistance
> > from e1000e maintainers - it seems they might be interested in some
> > specific dumps and registers - like in this #11998 case.
>
> I reported here.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=14737
Please, remember to add at least standard things like dmesg, .config,
'lspci -vvv', /proc/interrupts etc. (linux-2.6/REPORTING_BUGS), and
maybe 'netstat -s' both for non-working and working case/boot.
And some summary (incl. the router type), so people don't have to
browse all this thread.
Jarek P.
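
The list of attachments above could be gathered in one go with something like the Python sketch below, run once under the working kernel and once under the non-working one. The file names, the tag argument and the function names are hypothetical; the commands are the ones named in this message, and the kernel .config is left out because its location depends on how the kernel was built.

```python
import os
import subprocess

# Output file name -> command, following the list in the message above.
REPORT_COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "lspci.txt": ["lspci", "-vvv"],
    "interrupts.txt": ["cat", "/proc/interrupts"],
    "netstat-s.txt": ["netstat", "-s"],
}


def run_command(cmd):
    """Run one command and return its stdout, or a note if it is missing."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return "(command not found: %s)\n" % cmd[0]
    return result.stdout


def collect(tag, runner=run_command):
    """Return {file name: output}, each name prefixed with tag ('good'/'bad')."""
    return {
        "%s-%s" % (tag, name): runner(cmd)
        for name, cmd in REPORT_COMMANDS.items()
    }


def save(outputs, directory="."):
    """Write each collected output to its own file under directory."""
    for name, text in outputs.items():
        with open(os.path.join(directory, name), "w") as handle:
            handle.write(text)


if __name__ == "__main__":
    # Tag the files with the running kernel release, e.g. "2.6.32-...".
    save(collect(os.uname().release))
```

Tagging the files with the kernel release keeps the two boots apart, which avoids attaching the same file twice by mistake.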