I am having a problem when using PPP over a particular PCMCIA based serial device and have
pinned the problem down using git-bisect to this particular commit that was made between
2.6.12.6 and 2.6.13:
59121003721a8fad11ee72e646fd9d3076b5679c is first bad commit
diff-tree 59121003721a8fad11ee72e646fd9d3076b5679c (from
799d19f6ec5ca2102c61122f5219a17f1c4e961a)
Author: Christoph Lameter
Date: Thu Jun 23 00:08:25 2005 -0700
[PATCH] i386: Selectable Frequency of the Timer Interrupt
Make the timer frequency selectable. The timer interrupt may cause bus
and memory contention in large NUMA systems since the interrupt occurs
on each processor HZ times per second.
The problem that I am seeing can be reproduced by attempting to send large packets via a
PPP interface on the modem, e.g. "ping -s 1000 www.kernel.org".
With this patch, the value of HZ was changed from the previously "hardcoded" 1000 to be
configurable with the default being 250. My ping packet size problem goes away if I
select 1000 as the new value. I do need to be able to run with HZ = 250 (actually 100 is
better for my situation).
It looks to me like the git-bisect did not really get me down to the core problem, which is
that there is something in the system that isn't happy with HZ == 250.
I have also tried a number of other kernels and the problem exists all the way to 2.6.15.6
but is fixed in 2.6.16, so I am going to git-bisect 2.6.15.6 to 2.6.16, but I thought I
would get this message out now in case someone has an inkling of what the problem is.
Please cc me on any responses. Russell, I copied you directly since I think you may be in
the best position to understand the problem.
Greg Lee
On Mon, 2006-03-27 at 18:46 -0500, Greg Lee wrote:
> I have also tried a number of other kernels and the problem exists all
> the way to 2.6.15.6
> but is fixed in 2.6.16, so I am going to git-bisect 2.6.15.6 to
> 2.6.16, but I thought I
> would get this message out now in case someone has an inkling of what
> the problem is.
If it's fixed in 2.6.16, what's the problem? It's not as if we can go
back and fix those old kernels...
Lee
> If it's fixed in 2.6.16, what's the problem? It's not as if we can go
> back and fix those old kernels...
>
> Lee
I wish! I can't switch kernels once this one has been qualified without major testing
headaches. :-(
Greg
On Tuesday 28 March 2006 03:03, Greg Lee wrote:
> > If it's fixed in 2.6.16, what's the problem? It's not as if we can go
> > back and fix those old kernels...
> >
> > Lee
>
> I wish! I can't switch kernels once this one has been qualified without
> major testing headaches. :-(
Just do what you were planning to; bisect down to the responsible patch. If
you find it, CC the -stable team if you think it's serious enough, and it
might end up in the next 2.6.15.x release..
--
Cheers,
Alistair.
'No sense being pessimistic, it probably wouldn't work anyway.'
Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
On Mon, 27 Mar 2006 20:27:19 EST, Lee Revell said:
> On Mon, 2006-03-27 at 18:46 -0500, Greg Lee wrote:
> > I have also tried a number of other kernels and the problem exists all
> > the way to 2.6.15.6
> > but is fixed in 2.6.16, so I am going to git-bisect 2.6.15.6 to
> > 2.6.16, but I thought I
> > would get this message out now in case someone has an inkling of what
> > the problem is.
>
> If it's fixed in 2.6.16, what's the problem? It's not as if we can go
> back and fix those old kernels...
I may be misreading Greg's concern, but I got the feeling that he's worried
that 2.6.16 isn't *really* fixed, but that something is just papering over the
driver's innate displeasure with HZ==250 (and thus it's likely that in .17 or
.18 or whenever, some *other* patch will make it re-manifest).
And we've seen *enough* bugs that only manifest in even or odd or
divisible-by-7.48 kernels that we know that "it works in 2.6.16" is vastly
different than being able to point at a changeset (or even a stream of fixes)
and say "fixed by that" or "probably went away when this stream of patches
totally reworked the code".
On Mon, Mar 27, 2006 at 06:46:02PM -0500, Greg Lee wrote:
> I have also tried a number of other kernels and the problem exists all
> the way to 2.6.15.6 but is fixed in 2.6.16, so I am going to git-bisect
> 2.6.15.6 to 2.6.16, but I thought I would get this message out now in
> case someone has an inkling of what the problem is.
Saying that the problem is between 2.6.15.6 and 2.6.16 is rather
meaningless because you're effectively omitting _all_ the development
work between 2.6.15 to 2.6.16, and that's likely where the problem
lies. Hence, you're omitting all the 2.6.16-rc kernels from your
testing.
> Please cc me on any responses. Russell I copied you directly since I
> think you may be in the best position to understand the problem.
I have no ideas at the moment.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
> I may be misreading Greg's concern, but I got the feeling that he's worried
> that 2.6.16 isn't *really* fixed, but that something is just papering over the
> driver's innate displeasure with HZ==250 (and thus it's likely that in .17 or
> .18 or whenever, some *other* patch will make it re-manifest).
To be clear, I am virtually locked into a 2.6.14.2 kernel since that is what has been
qualified for use in the product. I am "permitted" to patch the kernel with specifically
justified bug fixes. If these patches are too broad (a judgment call by a group of
engineers) then we have to re-qualify the kernel selection which is the (long) process
that I am trying to avoid. I performed a git bisect between 2.6.12.6 and 2.6.13 and found
the problem is first noticed when the commit that allows HZ==250 is made. Support for
this change is pretty wide ranging --- a lot of use of msleep() instead of busy loops,
etc.
I did try the brute force approach: I diffed the two kernels (an 800,000 line diff), looked
for any changes related to HZ (diff -Naur kernel1 kernel2 | egrep '^-.*HZ'), and then
studied the code that was changed to see if I thought it might be related to this problem.
Given my limited understanding of the kernel code I was really just hoping to get lucky.
That did not pan out, which is not surprising, since a diff cannot show the cases where the
code did not change but uses HZ in a way that is not friendly to HZ != 1000. This
pretty much leaves me at a dead end with this approach.
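The egrep filter above only looks at removed lines. A minimal Python sketch of the same scan, assuming the diff is available as text, that also reports added lines and matches HZ only as a whole word. The function name hz_changes and the sample diff are made up for illustration and are not taken from either kernel tree:

```python
import re

# Match HZ as a whole word, so identifiers such as CONFIG_HZ_250 or khz are skipped.
HZ_WORD = re.compile(r"\bHZ\b")


def hz_changes(diff_text):
    """Return the added or removed lines of a unified diff that mention HZ."""
    hits = []
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")) and HZ_WORD.search(line):
            hits.append(line)
    return hits


# Hypothetical diff fragment, made up for this example.
SAMPLE_DIFF = "\n".join([
    "--- a/drivers/example.c",
    "+++ b/drivers/example.c",
    "@@ -10,3 +10,3 @@",
    " int unrelated;",
    "-    timeout = jiffies + HZ / 100;",
    "+    timeout = jiffies + msecs_to_jiffies(10);",
    " int khz = 1000;",
])

for hit in hz_changes(SAMPLE_DIFF):
    print(hit)
```

As with the egrep approach, this only sees lines that changed. Code that was left alone but still assumes HZ == 1000 will not show up.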
Then we decided, what the heck, we'll try the latest kernel (2.6.16) just to see if it is
fixed and voila, it is. Jumping back one revision to 2.6.15.6 showed that the problem
existed again, so I decided to git-bisect those two versions (I'm down to 9 more
iterations) and then see if the change between those two versions yields any insight into
the core problem. I'm not very hopeful though, since this commit is in the path between
2.6.15.6 and 2.6.16:
commit 33f0f88f1c51ae5c2d593d26960c760ea154c2e2
Author: Alan Cox
Date: Mon Jan 9 20:54:13 2006 -0800
[PATCH] TTY layer buffering revamp
The API and code have been through various bits of initial review by
serial driver people but they definitely need to live somewhere for a
while so the unconverted drivers can get knocked into shape, existing
drivers that have been updated can be better tuned and bugs whacked out.
So, any recommendations for a better approach?
(please cc replies)
Greg
On Mon, 2006-03-27 at 18:46 -0500, Greg Lee wrote:
> I am having a problem when using PPP over a particular PCMCIA based serial device and have
> pinned the problem down using git-bisect to this particular commit that was made between
> 2.6.12.6 and 2.6.13:
That would make sense.
> I have also tried a number of other kernels and the problem exists all the way to 2.6.15.6
> but is fixed in 2.6.16, so I am going to git-bisect 2.6.15.6 to 2.6.16, but I thought I
> would get this message out now in case someone has an inkling of what the problem is.
I think I can tell you fairly accurately if you are running fairly high
data rates.
The old pre 2.6.16 tty code works something like this:

Each serial IRQ
    add chars to buffer if they fit

Each timer IRQ
    switch buffers
    process original buffer

So the higher HZ is the faster data speed you can do. With very high
data rates lower HZ means more dropped characters.
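
The effect of the tick rate can be made concrete with a small model. This is a rough Python sketch of the scheme described above, not the kernel code. It assumes a flip buffer of 512 bytes (believed to match TTY_FLIPBUF_SIZE in pre-2.6.16 kernels, but treat the value as an assumption), a steady arrival rate, and exactly one buffer switch per timer tick. The function names are invented for the example.

```python
FLIP_BUF_SIZE = 512  # assumed capacity of one flip buffer, in bytes


def max_lossless_rate(hz, buf_size=FLIP_BUF_SIZE):
    """Upper bound on sustained bytes/second if the buffer is emptied once per tick."""
    return hz * buf_size


def simulate(hz, bytes_per_second, seconds=1, buf_size=FLIP_BUF_SIZE):
    """Return (accepted, dropped) byte counts for a steady input stream.

    Bytes arriving between two ticks go into the buffer if they fit and are
    dropped otherwise. Each tick switches buffers, so the next interval
    starts with an empty one.
    """
    accepted = dropped = 0
    delivered_so_far = 0
    for tick in range(1, hz * seconds + 1):
        # Integer arithmetic keeps the per-tick arrivals free of rounding drift.
        due = bytes_per_second * tick // hz
        arrived = due - delivered_so_far
        delivered_so_far = due
        fit = min(arrived, buf_size)
        accepted += fit
        dropped += arrived - fit
    return accepted, dropped


for hz in (100, 250, 1000):
    accepted, dropped = simulate(hz, 200_000)
    print(hz, max_lossless_rate(hz), accepted, dropped)
```

In this model the ceiling scales linearly with HZ, which is consistent with the problem going away at HZ == 1000. Real traffic is bursty, so drops can start well below the sustained limit.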
2.6.16 implements the new tty layer which replaces this with a proper
buffering and queueing mechanism and is SMP aware (and thanks to Paul
rather SMP clever too).
If you want to do very high data rates you either
1. Up HZ
2. Set the tty to low latency mode (process each char as it arrives) and
pray you have enough CPU power
3. Fix the buffering. (2.6.16)
In theory you can change the flip buffer sizes but that needs care.
Alan
> Saying that the problem is between 2.6.15.6 and 2.6.16 is rather
> meaningless because you're effectively omitting _all_ the development
> work between 2.6.15 to 2.6.16, and that's likely where the problem
> lies. Hence, you're omitting all the 2.6.16-rc kernels from your
> testing.
Yes, I realized last night that the -rc kernels actually came before the 2.6.16. I'm in
the process of a git-bisect between 2.6.15 and 2.6.16 which will cover all of the changes
made in the -rc kernels, right?
Greg
Russell King wrote:
> On Mon, Mar 27, 2006 at 06:46:02PM -0500, Greg Lee wrote:
>> I have also tried a number of other kernels and the problem exists all
>> the way to 2.6.15.6 but is fixed in 2.6.16, so I am going to git-bisect
>> 2.6.15.6 to 2.6.16, but I thought I would get this message out now in
>> case someone has an inkling of what the problem is.
>
> Saying that the problem is between 2.6.15.6 and 2.6.16 is rather
> meaningless because you're effectively omitting _all_ the development
> work between 2.6.15 to 2.6.16, and that's likely where the problem
> lies. Hence, you're omitting all the 2.6.16-rc kernels from your
> testing.
But won't a git bisect cover all the bases even so? Aren't rc versions
just selected git pulls?
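
A toy model may help here. git bisect considers the commits reachable from the bad revision that are not reachable from the good one. With a history shaped like the one below (commit names are placeholders, and the real graph is far larger), the stable-branch commits drop out and the mainline work that went into the -rc kernels stays in the candidate set. A minimal Python sketch under those assumptions:

```python
# Toy commit graph: each commit maps to its list of parents.
# All names are placeholders chosen for this example.
PARENTS = {
    "v2.6.15": [],
    "stable-fix-1": ["v2.6.15"],
    "v2.6.15.6": ["stable-fix-1"],
    "v2.6.16-rc1": ["v2.6.15"],
    "v2.6.16-rc2": ["v2.6.16-rc1"],
    "v2.6.16": ["v2.6.16-rc2"],
}


def ancestors(commit, parents=PARENTS):
    """Return the commit itself plus everything reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents[current])
    return seen


def bisect_candidates(good, bad, parents=PARENTS):
    """Commits a bisection between good and bad has to choose from."""
    return ancestors(bad, parents) - ancestors(good, parents)


print(sorted(bisect_candidates("v2.6.15.6", "v2.6.16")))
print(sorted(bisect_candidates("v2.6.15", "v2.6.16")))
```

In this toy graph both starting points give the same candidates, because everything on the stable branch is outside the ancestry of v2.6.16.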
--
-bill davidsen
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
Alan Cox wrote:
> On Mon, 2006-03-27 at 18:46 -0500, Greg Lee wrote:
>> I am having a problem when using PPP over a particular PCMCIA based serial device and have
>> pinned the problem down using git-bisect to this particular commit that was made between
>> 2.6.12.6 and 2.6.13:
>
> That would make sense.
>
>> I have also tried a number of other kernels and the problem exists all the way to 2.6.15.6
>> but is fixed in 2.6.16, so I am going to git-bisect 2.6.15.6 to 2.6.16, but I thought I
>> would get this message out now in case someone has an inkling of what the problem is.
>
> I think I can tell you fairly accurately if you are running fairly high
> data rates.
>
> The old pre 2.6.16 tty code works something like this
>
> Each serial IRQ
>     add chars to buffer if they fit
>
> Each timer IRQ
>     switch buffers
>     process original buffer
>
>
> So the higher HZ is the faster data speed you can do. With very high
> data rates lower HZ means more dropped characters.
>
> 2.6.16 implements the new tty layer which replaces this with a proper
> buffering and queueing mechanism and is SMP aware (and thanks to Paul
> rather SMP clever too).
Just as an aside to this and thanks to Paul, this seems in practice to
work as well with HT (as I would expect) and handle fairly high rate
(230kb) connections perfectly. I hope it applies for dumb multiport
cards as well, I have a fair number of them here and there.
--
-bill davidsen
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me