2013-04-08 09:26:03

by Steffen Trumtrar

Subject: [BUG] increased us/sys-load due to tty-layer in 2.6.38+ ?!

Hi!

I noticed a problem with the tty subsystem on ARM. Starting with 2.6.38+, load
on the serial connection causes a 10-15% increase in system/userspace load.
This doesn't change up to v3.9-rc4.

The following setup was used:

telnet && screen microcom -p /dev/ttyUSB0
    |
    v
+-------+         +------+          +--------+
|       |         |      |          |        |
|  UUT  |<--USB-->| FTDI |<--UART-->|   PC   |
|       |         |      |          |        |
+-------+         +------+          +--------+
    ^
    |
telnet && top -d1

The unit under test (UUT) is connected via USB->FTDI->UART to a PC. On the PC,
a "while true; do find /; done" produces some random output.
I connect to the UUT via telnet and then open a serial connection to the PC
in a screen session, seeing the output produced on the PC. Then screen gets
detached. So, basically, what I'm trying to do is produce load only on the
USB->FTDI->UART connection and not on the UUT itself.
Then another telnet connection is opened to monitor the UUT with "top -d1".
An i.MX27, a Kirkwood, and an AT91 were used as UUTs.

To find the "offending" code, I bisected v2.6.38..v3.0, which gave the
following top output (not scientific, I know, but the shift in load
distribution is obvious nevertheless):

2.6.38             Cpu(s):  3.8%us,  1.9%sy, 0.0%ni, 94.3%id
2.6.38+            Cpu(s):  1.9%us,  3.8%sy, 0.0%ni, 94.3%id
last good commit   Cpu(s):  1.9%us,  2.8%sy, 0.0%ni, 95.3%id
first bad commit   Cpu(s):  4.8%us, 14.5%sy, 0.0%ni, 80.6%id
2.6.39-rc4         Cpu(s): 10.5%us,  8.9%sy, 0.0%ni, 79.8%id
3.0                Cpu(s): 15.9%us, 19.6%sy, 0.0%ni, 62.3%id

This resulted in
f23eb2b2b28547fc70df82dd5049eb39bec5ba12
tty: stop using "delayed_work" in the tty layer
as the possible cause. Reverting this commit by hand in v3.8 showed a load
distribution similar to 2.6.38.
What I haven't done is measure whether the load really increases or whether
top only tells me so. Maybe the algorithm that calculates this produces
different results because of the switch from schedule_delayed_work to
schedule_work?
So, is this a bug, a feature, a symptom, ...?
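
For reference, the heart of that commit, as I understand it, is that the tty
flip-buffer flush is no longer deferred by one jiffy but scheduled
immediately. Schematically (a simplified toy module with made-up names, not
the actual tty code):

  /*
   * Sketch of the scheduling change in f23eb2b, reduced to its core.
   * Before, the tty flip buffer was flushed to the line discipline via
   * delayed work with a 1-jiffy delay, which batched incoming chunks;
   * after, plain work is scheduled immediately for every chunk.
   */
  #include <linux/module.h>
  #include <linux/workqueue.h>

  static void flush_fn(struct work_struct *work)
  {
          /* stand-in for flush_to_ldisc(): push chars to the ldisc */
  }

  static DECLARE_DELAYED_WORK(old_flush, flush_fn); /* pre-f23eb2b  */
  static DECLARE_WORK(new_flush, flush_fn);         /* post-f23eb2b */

  static int __init demo_init(void)
  {
          /* old: at most ~HZ flushes per second, chunks get batched */
          schedule_delayed_work(&old_flush, 1);

          /* new: one worker wakeup per chunk; lower latency, but many
           * more wakeups/context switches under sustained input */
          schedule_work(&new_flush);
          return 0;
  }

  static void __exit demo_exit(void)
  {
          cancel_delayed_work_sync(&old_flush);
          cancel_work_sync(&new_flush);
  }

  module_init(demo_init);
  module_exit(demo_exit);
  MODULE_LICENSE("GPL");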

Regards,
Steffen

--
Pengutronix e.K.                           |                              |
Industrial Linux Solutions                 | http://www.pengutronix.de/   |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0     |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555  |


2013-04-08 15:06:17

by Greg Kroah-Hartman

Subject: Re: [BUG] increased us/sys-load due to tty-layer in 2.6.38+ ?!

On Mon, Apr 08, 2013 at 11:25:58AM +0200, Steffen Trumtrar wrote:
> Hi!
>
> I noticed a problem with the tty subsystem on ARM. Starting with 2.6.38+, load
> on the serial connection causes a 10-15% increase in system/userspace load.
> This doesn't change up to v3.9-rc4.
>
> [...]
>
> This resulted in
> f23eb2b2b28547fc70df82dd5049eb39bec5ba12
> tty: stop using "delayed_work" in the tty layer
>
> as the possible cause. Reverting this commit by hand in v3.8 showed a load
> distribution similar to 2.6.38.
> What I haven't done is measure whether the load really increases or whether
> top only tells me so. Maybe the algorithm that calculates this produces
> different results because of the switch from schedule_delayed_work to
> schedule_work?
> So, is this a bug, a feature, a symptom, ...?

It's a "fake" load (i.e., no extra CPU is being used, just a "busy" wait
is happening).

You should see increased throughput with that patch applied; have you
tested a real workload?

greg k-h

2013-04-09 07:30:55

by Steffen Trumtrar

Subject: Re: [BUG] increased us/sys-load due to tty-layer in 2.6.38+ ?!

On Mon, Apr 08, 2013 at 08:06:11AM -0700, Greg Kroah-Hartman wrote:
> On Mon, Apr 08, 2013 at 11:25:58AM +0200, Steffen Trumtrar wrote:
> > [...]
>
> It's a "fake" load (i.e., no extra CPU is being used, just a "busy" wait
> is happening).
>
> You should see increased throughput with that patch applied; have you
> tested a real workload?
>

Hi Greg,

we found this "fake" load via a normal userspace program that uses the
UART for communication, if that is what you mean by "real workload".
The next step will be to measure the throughput and the real load.
It sounds like we won't find anything, but we will at least have an
explanation for the load.
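
To take top's own averaging out of the picture, we could also sample
/proc/stat directly; something along these lines (a quick sketch, not a
tool we have actually run yet):

  /*
   * cpusample.c: read /proc/stat twice, 10 s apart, and print the
   * us/ni/sy/id split over that window.
   * Build: gcc -o cpusample cpusample.c
   */
  #include <stdio.h>
  #include <unistd.h>

  static int read_stat(unsigned long long v[4])
  {
          FILE *f = fopen("/proc/stat", "r");

          if (!f)
                  return -1;
          /* first line: "cpu  <user> <nice> <system> <idle> ..." */
          if (fscanf(f, "cpu %llu %llu %llu %llu",
                     &v[0], &v[1], &v[2], &v[3]) != 4) {
                  fclose(f);
                  return -1;
          }
          fclose(f);
          return 0;
  }

  int main(void)
  {
          unsigned long long a[4], b[4], d[4], total = 0;
          int i;

          if (read_stat(a))
                  return 1;
          sleep(10);
          if (read_stat(b))
                  return 1;
          for (i = 0; i < 4; i++) {
                  d[i] = b[i] - a[i];
                  total += d[i];
          }
          printf("%.1f%%us %.1f%%ni %.1f%%sy %.1f%%id\n",
                 100.0 * d[0] / total, 100.0 * d[1] / total,
                 100.0 * d[2] / total, 100.0 * d[3] / total);
          return 0;
  }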

Thanks,
Steffen

--
Pengutronix e.K.                           |                              |
Industrial Linux Solutions                 | http://www.pengutronix.de/   |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0     |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555  |

2013-04-22 15:56:26

by Marc Kleine-Budde

Subject: Re: [BUG] increased us/sys-load due to tty-layer in 2.6.38+ ?!

Hello Greg,

On 04/08/2013 05:06 PM, Greg Kroah-Hartman wrote:
> On Mon, Apr 08, 2013 at 11:25:58AM +0200, Steffen Trumtrar wrote:
>> [...]
>
> It's a "fake" load (i.e., no extra CPU is being used, just a "busy" wait
> is happening).

In order to have a more objective measurement, we hooked a scope to an
LED, which is driven via the cpu0 trigger (echo cpu0 >
/sys/class/leds/led1/trigger) or the 'old' LED framework on the older
kernels.

The scope calculated the average voltage over a period of 10 seconds,
where 0 V equals 100% busy and 3.24 V means 100% idle.

2.6.38 (good)    2.38 V    (load-led_2.6.38_good_20130422.png)
2.6.38 (bad)     1.71 V    (load-led_2.6.38_bad_20130422.png)

2.6.38 bad is "f23eb2b tty: stop using "delayed_work" in the tty layer"
2.6.38 good is f23eb2b^ == "f741a79 Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse"
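
Normalized to the 3.24 V full scale, that is about 2.38/3.24 = 73% idle on
the good kernel versus 1.71/3.24 = 53% idle on the bad one.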

Another test was to measure how long it takes to calculate 10k digits
of PI. Less time means less load put on the system by the tty layer:

                  idle    tty open    serial load

v2.6.38-good      20.9s   22.3s       32.8s
v2.6.38-bad       20.9s   22.3s       51.5s

We repeated the measurements on a recent kernel:

v3.9-rc8          20.8s   22.6s       36.3s
v3.9-rc8-revert   20.8s   22.6s       26.8s

v3.9-rc8-revert is v3.9-rc8 with

f23eb2b tty: stop using "delayed_work" in the tty layer

manually reverted.
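
The PI benchmark itself is nothing special, just a CPU-bound digit spigot;
any program along the following lines will do (a sketch, our actual test
binary differs):

  /*
   * pi.c: print 10000 digits of pi with the Rabinowitz-Wagon spigot.
   * Build: gcc -O2 -o pi pi.c; run under time(1) for the measurement.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
          const int n = 10000;            /* digits to compute */
          const int len = 10 * n / 3 + 2;
          int *a = malloc(len * sizeof(*a));
          int i, j, q, x, nines = 0, predigit = 0;

          for (i = 0; i < len; i++)
                  a[i] = 2;       /* pi = 2 + 2/3*(2 + 3/5*(2 + ...)) */

          for (j = 0; j < n; j++) {
                  q = 0;
                  for (i = len - 1; i > 0; i--) {
                          x = 10 * a[i] + q * (i + 1);
                          a[i] = x % (2 * i + 1);
                          q = x / (2 * i + 1);
                  }
                  a[0] = q % 10;
                  q /= 10;
                  if (q == 9) {
                          nines++;        /* hold 9s, a carry may ripple */
                  } else if (q == 10) {
                          printf("%d", predigit + 1);
                          for (; nines > 0; nines--)
                                  putchar('0');
                          predigit = 0;
                  } else {
                          if (j)
                                  printf("%d", predigit);
                          predigit = q;
                          for (; nines > 0; nines--)
                                  putchar('9');
                  }
          }
          /* trailing held 9s, if any, are dropped; fine for a benchmark */
          printf("%d\n", predigit);
          free(a);
          return 0;
  }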

On stock v3.9-rc8 we see input overruns, but I'll write a separate mail
once we've investigated that problem further. However, the input overruns
disappear with our v3.9-rc8-revert.

> You should see increased throughput with that patch applied; have you
> tested a real workload?

We've saturated the serial link, which runs at 115200 baud. The problem
was originally observed by a customer of ours in their real workload.

Marc

--
Pengutronix e.K.                 | Marc Kleine-Budde            |
Industrial Linux Solutions       | Phone: +49-231-2826-924      |
Vertretung West/Dortmund         | Fax:   +49-5121-206917-5555  |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de    |


Attachments:
load-led_2.6.38_good_20130422.png (26.92 kB)
load-led_2.6.38_bad_20130422.png (24.96 kB)


2013-04-23 13:38:54

by Marc Kleine-Budde

Subject: Re: [BUG] increased us/sys-load due to tty-layer in 2.6.38+ ?!

On 04/22/2013 09:55 PM, Marc Kleine-Budde wrote:
[...]

>>> This resulted in
>>> f23eb2b2b28547fc70df82dd5049eb39bec5ba12
>>> tty: stop using "delayed_work" in the tty layer
>>>
>>> as the possible cause. Reverting this commit by hand in v3.8 showed a load
>>> distribution similar to 2.6.38.
>>> What I haven't done is measure whether the load really increases or whether
>>> top only tells me so. Maybe the algorithm that calculates this produces
>>> different results because of the switch from schedule_delayed_work to
>>> schedule_work?
>>> So, is this a bug, a feature, a symptom, ...?
>>
>> It's a "fake" load (i.e., no extra CPU is being used, just a "busy" wait
>> is happening).

The test below shows that with the offending patch there is less CPU time
left to calculate PI. My conclusion from this is that the patch increases
the load generated by the processing of the serial line in the tty layer
and/or the overhead caused by more context switches/scheduling.

> Another test was to measure how long it takes to calculate 10k digits
> of PI. Less time means less load put on the system by the tty layer:
>
>                   idle    tty open    serial load
>
> v2.6.38-good      20.9s   22.3s       32.8s
> v2.6.38-bad       20.9s   22.3s       51.5s

Are there any options to decrease the load back to the old value while
keeping the better latency introduced by that patch?
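
One direction that comes to mind, purely as an untested sketch on top of the
pre-f23eb2b code (assuming tty->buf.work is still a delayed_work there; the
helper name is made up), would be to flush immediately only after an idle gap
and fall back to the 1-jiffy batching under sustained input:

  #include <linux/tty.h>
  #include <linux/jiffies.h>
  #include <linux/workqueue.h>

  /* untested idea, not a real patch: choose the flush delay adaptively */
  static void tty_schedule_flip_adaptive(struct tty_struct *tty)
  {
          static unsigned long last_flush; /* would be per-tty for real */
          /* batch if we already flushed within the last jiffy */
          unsigned long delay =
                  time_before(jiffies, last_flush + 2) ? 1 : 0;

          last_flush = jiffies;
          schedule_delayed_work(&tty->buf.work, delay);
  }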

Marc

--
Pengutronix e.K.                 | Marc Kleine-Budde            |
Industrial Linux Solutions       | Phone: +49-231-2826-924      |
Vertretung West/Dortmund         | Fax:   +49-5121-206917-5555  |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de    |

