2009-12-18 17:26:57

by Luis R. Rodriguez

Subject: git pull on linux-next makes my system crawl to its knees and beg for mercy

I can't describe it any better. It really is pissing me the fuck off,
it's as if I have invisible elves using my nuts as punching bags.
Something is seriously fucked with 2.6.32, my box, or linux-next, or
perhaps there is another possibility I am not considering that someone
might be able to enlighten me about. I'd also love to hear from others
to see whether I'm the only one, because if this issue is reproducible,
it would be bad. First let me describe the issue in detail.

I tend to always be on a 2.6.32 kernel + John's queued up patches for
wireless for the next kernel release (I use wireless-testing). My
system is a Thinkpad T61, userspace is Ubuntu 9.10 based (ships with
git 1.6.3.3) and I kept an ext3 filesystem to be able to go back in
time to 2.6.27 at will without issues. I git clone'd linux-next a few
weeks ago. After a few days I then tried to git pull and my system
became completely unusable. It took *ages* to open up a terminal and
start running commands. Even ssh'ing into my box became a hassle to
the point that my *entire* morning was spent trying to patiently wait
for the git pull to finish. I gave up. I don't recall seeing anything
in my kernel logs. Bewildered by this issue I set out to prove to
myself this was not a 2.6.32 issue and booted other kernels,
including Ubuntu's distro kernel on 2.6.31 and then later my own
freshly built 2.6.27.41 kernel. The issue was reproducible on all three
kernels!

This led me to believe this was a system / hard drive issue and I
braced myself for a system fix. I still needed to prove this was
indeed a system issue. I've been using my system without touching
linux-next for a while now and it works flawlessly, even when doing
testing with ath5k / ath9k for some random projects I have. I git pull
wireless-testing just fine, and pm-suspend just fine every day without
any hiccup.

I then started to suspect I probably got a fucked linux-next somehow.
I do recall I did pm-suspend during a pull of wireless-testing before
and never had issues after resume or with the tree at all. I don't
recall doing the pm-suspend with linux-next but it could be possible.
Since last giving up on the 'git pull' of linux-next I tried
'git reset --hard origin' and then a 'git pull' but saw my
system quickly becoming unusable again. I ctrl-c'd out of that quickly,
tried 'git fsck' and did find some complaints. I started to want to
blame my hard drive so I rm -rf'd linux-next and tried a fresh clone.
It pulled fine; my system was slow but nothing *that* unusual.

A couple of days ago I did a 'git pull' again and ... my system started
crying again, begging me to stop, so I did. My 'git describe' now
tells me I'm at next-20091211 and 'git fsck' tells me:

dangling tree 3500a4301d572e57c700d18d6730f4ac3e33b923
dangling tree e50022fd1e44c3ca63d57e5b263a8263fa5e291b
dangling tree c105e67e2b609e02eefe2b676e53f79b3e375a32
dangling tree 850b60a21ebf9721d16eeb7d68d6e6250893b558
dangling tree dd0bc64a4fe9eac9de3edb0db68d7a83d0477655
dangling tree ac135ee3b2031dcbed733af87c1b82833c1bd035
dangling tree 635a4c8728714746bc1a80692bd7b998af36c7fd
dangling tree 08708a50cd385efc23cfda7dd88cacf951db2237
dangling tree d27198a37d7b393ed9f5dc99c2f56e1c715c4572
dangling tree cc753417b1a3c1b61c0f1b37e7560be1f5404b93
dangling tree 237dcc5120bfe3aebcd2e19ab3640fbecb855ff2
dangling tree 32815626af9eb48dbe04fa790b154a9424871041
dangling tree 299294383dc096b0363ae3f7a49fe937a5e6027c
dangling tree 18acf23ea04f77d96c5eb092bbed0d598eb580ef
dangling tree 08c9e248624d407404e04481adf27d385c3a7e57
dangling tree beccf2344a596fed67cbe0f874210a98bd2b7c40
dangling tree d8df306d4dcf551d47f7d914bd7754c000e541f8
dangling tree 93e83c3ea3a1fef405546af4d99ecd1032ab9b09
dangling tree 65f9e4a9c3af938337fb7ec49eb354ebb19553f5
dangling tree 0a0c81ebe4e60b5941adf92494c45a9cc4ebbd85
dangling tree 5e1b8d853b32c4ee11cf4ef142be4d9a3096f679
dangling tree e11b176056fb40d5ef9cae77af37f53d7bf9342d
dangling tree 8a1e91d681554cafc74586c7ac3eb77299fbb091
dangling tree 10201bc2b2ceac311424bdbc3949a726926a6a3d
dangling tree fa21357016b62184d347f16f22fce54a0fc3aef5
dangling tree e33177806b80d35b0547a76e5fe26b59c55b5aa1
dangling tree cf3a11b7f1b83c86870881cb40f7c1af5b1daf9a
dangling tree e43d57bb730a492b73f6e6a8e5fd218d14e4b741
dangling tree 9f62ff527891032a5f0511f8f38cfe686b15ee5f
dangling tree 7476a37bdcedc8ecc568d73225a55b59899e70df
dangling tree 6085f588e0656a48e76f5e87dfb5ca03e7649bc4
dangling tree ea93955aa22995d17cc90f300343a535c8bdbf0c
dangling tree 1d95edfaf4d9065bf86cae97629fa28dd76e9fd3
dangling tree f29a99c8fee39c9934cca06a00e7ca5b48437ca0
dangling tree ff9bcf59a9392f7856a13c7541107165d8eb5659
dangling tree 3d9e29ac71b065829550996559094525e6f4ea4e
dangling tree cea7adf5352ce365f580066d1e2123e63b48f261
dangling tree 2fa8f7cd3a02839cf41ae7b01267047ffdbcfbe5
dangling tree 42ddc9e3585548880386d48dd7393d4111347ccd
dangling tree 83e30d38fb1e8ca59440f830f9bc203c615eff49
dangling tree 71e83d06ef0bb5bb105cf64bbb80bc6580bc06eb
dangling tree 7bedcf303aafcdedd6d0b3119bd8040d1fab3983
dangling tree 83f1a1eae41b9d6c7d2b6d549c35485a4e20847a
dangling tree eff8c7a202e29a8793b581b4c1b8d372a5289356
dangling tree 84f9bb0e7549269370c88a0878655fa4f5c09b27
dangling tree 4ffa81c9721df067d741fbd40227163d77ff7513

I'm starting to doubt this is a hard drive issue. I will clone
linux-next exactly as it is on my system onto another T61 I have (a
little bigger and with Nvidia graphics) by git clone'ing over ssh from
my linux-next/.git/, then scp over my linux-next/.git/config
and try a git pull to see if that system also goes ape shit.

I am wondering if others have experienced issues like this as well.

Here's my kernel config for wireless-testing:

http://bombadil.infradead.org/~mcgrof/configs/2009/12/wireless-testing.config

And my config for 2.6.27:

http://bombadil.infradead.org/~mcgrof/configs/2009/12/2.6.27.41.config

Even if a git tree gets terribly messed up the issues I'm seeing seem
too painful for an average user to experience. There has got to be
something major going on under the hood, and I'm not sure why I don't
see this sort of thing when following wireless-testing.

Luis


Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Friday 18 December 2009 06:26:29 pm Luis R. Rodriguez wrote:

> on my kernel logs. Bewildered with this issue I set out to prove to
> myself this issue was not a 2.6.32 issue and booted other kernels,
> including Ubuntu's distro kernel on 2.6.31 and then later my own built
> fresh 2.6.27.41 kernel. The issue was reproducible on all three
> kernels!
>
> This led me to believe this was a system / hard drive issue and I
> braced myself for a system fix. I still needed to prove this was

Just some hints for ruling out the system / hard drive problem.

smartctl -a /dev/sdx is your friend for checking your disk (keep an eye
on anything suspicious like re-allocated sector count going up etc.)

It could also be an fs-related issue that shows up only under specific
conditions (e.g. an almost full partition -- some file systems start to
crawl when the amount of available free space gets low).

HTH
--
Bartlomiej Zolnierkiewicz
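
A rough sketch of how the two checks suggested above could be scripted,
assuming `smartctl -A` (attribute table only) and `df -P` (POSIX output
format) are available. The helper names smart_raw_value and
df_use_percent are made up for illustration, and /dev/sda is only a
placeholder device.

```shell
# Print the RAW_VALUE column for one SMART attribute.
# Reads `smartctl -A` style output on stdin; $1 is the attribute name.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Print the Use% figure (without the % sign) for the mount point in $1.
# Reads `df -P` style output on stdin.
df_use_percent() {
    awk -v mnt="$1" 'NR > 1 && $6 == mnt { sub(/%/, "", $5); print $5 }'
}

# Typical use (smartctl needs root and a real device):
#   smartctl -A /dev/sda | smart_raw_value Reallocated_Sector_Ct
#   df -P / | df_use_percent /
```

A raw Reallocated_Sector_Ct that keeps growing between runs, or a Use%
close to 100, would point at the disk or the filesystem rather than at
git.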

2009-12-18 19:20:10

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 9:38 AM, Bartlomiej Zolnierkiewicz
<[email protected]> wrote:
> On Friday 18 December 2009 06:26:29 pm Luis R. Rodriguez wrote:
>
>> on my kernel logs. Bewildered with this issue I set out to prove to
>> myself this issue was not a 2.6.32 issue and booted other kernels,
>> including Ubuntu's distro kernel on 2.6.31 and then later my own built
>> fresh 2.6.27.41 kernel. The issue was reproducible on all three
>> kernels!
>>
>> This led me to believe this was a system / hard drive issue and I
>> braced myself for a system fix. I still needed to prove this was
>
> Just some hints for ruling out the system / hard drive problem.
>
> smartctl -a /dev/sdx is your friend for checking your disk (keep an eye
> on anything suspicious like re-allocated sector count going up etc.)

Sweet thanks, here's my current output, I'll try later after I get
some day work done to pull linux-next and make it moan. Let me know if
you see anything fishy.

smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: HITACHI HTS722010K9SA00
Serial Number: 080109DP0210DPG8DUEP
Firmware Version: DC2ZC75A
User Capacity: 100,030,242,816 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3f
Local Time is: Fri Dec 18 11:16:12 2009 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 645) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  39) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   116   116   040    Pre-fail  Offline      -       3380
  3 Spin_Up_Time            0x0007   253   253   033    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   098   098   000    Old_age   Always       -       3314
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   040    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   081   081   000    Old_age   Always       -       8401
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1571
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       65536
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3932351
193 Load_Cycle_Count        0x0012   045   045   000    Old_age   Always       -       559592
194 Temperature_Celsius     0x0002   134   134   000    Old_age   Always       -       41 (Lifetime Min/Max 13/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 8 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 8 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 9f 45 a5 e0 Error: IDNF at LBA = 0x00a5459f = 10831263

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 ff 01 9f 45 a5 e0 00 00:04:06.200 READ SECTOR(S) EXT
25 ff 01 9f 45 a5 e0 00 00:04:06.100 READ DMA EXT
34 ff 01 00 00 00 e0 00 00:04:04.100 WRITE SECTORS(S) EXT
25 ff 01 00 00 00 e0 00 00:04:04.100 READ DMA EXT
25 ff 01 c0 17 fa e0 00 00:04:04.100 READ DMA EXT

Error 7 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 9f 45 a5 e0 Error: IDNF 1 sectors at LBA = 0x00a5459f = 10831263

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 ff 01 9f 45 a5 e0 00 00:04:06.100 READ DMA EXT
34 ff 01 00 00 00 e0 00 00:04:04.100 WRITE SECTORS(S) EXT
25 ff 01 00 00 00 e0 00 00:04:04.100 READ DMA EXT
25 ff 01 c0 17 fa e0 00 00:04:04.100 READ DMA EXT
25 ff 01 3f 00 00 e0 00 00:04:04.100 READ DMA EXT

Error 6 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 9f 45 a5 e0 Error: IDNF at LBA = 0x00a5459f = 10831263

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 ff 01 9f 45 a5 e0 00 00:04:04.000 READ SECTOR(S) EXT
25 ff 01 9f 45 a5 e0 00 00:04:04.000 READ DMA EXT
34 ff 01 00 00 00 e0 00 00:04:02.000 WRITE SECTORS(S) EXT
35 ff 01 cf 17 fa e0 00 00:04:02.000 WRITE DMA EXT
35 ff 01 ce 17 fa e0 00 00:04:02.000 WRITE DMA EXT

Error 5 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 9f 45 a5 e0 Error: IDNF 1 sectors at LBA = 0x00a5459f = 10831263

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 ff 01 9f 45 a5 e0 00 00:04:04.000 READ DMA EXT
34 ff 01 00 00 00 e0 00 00:04:02.000 WRITE SECTORS(S) EXT
35 ff 01 cf 17 fa e0 00 00:04:02.000 WRITE DMA EXT
35 ff 01 ce 17 fa e0 00 00:04:02.000 WRITE DMA EXT
35 ff 01 cd 17 fa e0 00 00:04:02.000 WRITE DMA EXT

Error 4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 01 9f 45 a5 e0 Error: IDNF at LBA = 0x00a5459f = 10831263

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 ff 01 9f 45 a5 e0 00 00:04:01.900 READ SECTOR(S) EXT
25 ff 01 9f 45 a5 e0 00 00:04:01.800 READ DMA EXT
34 ff 01 00 00 00 e0 00 00:03:59.900 WRITE SECTORS(S) EXT
35 ff 01 4e 00 00 e0 00 00:03:59.900 WRITE DMA EXT
35 ff 01 4d 00 00 e0 00 00:03:59.900 WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Also available at:

http://bombadil.infradead.org/~mcgrof/logs/2009/12/smart-ctl-sda2.txt

> It could also be an fs-related issue that shows up only under specific
> conditions

OK -- I see, I used a fresh new ext3, did not make the jump to ext4.

> (e.g. an almost full partition -- some file systems start to
> crawl when the amount of available free space gets low).

Got it, thanks. The partition has a lot of room:

mcgrof@tux ~ $ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              91G   43G   44G  50% /

Also, I only have one partition.

Luis

2009-12-18 19:55:46

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 11:19 AM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Dec 18, 2009 at 9:38 AM, Bartlomiej Zolnierkiewicz
> <[email protected]> wrote:
>> On Friday 18 December 2009 06:26:29 pm Luis R. Rodriguez wrote:
>>
>>> on my kernel logs. Bewildered with this issue I set out to prove to
>>> myself this issue was not a 2.6.32 issue and booted other kernels,
>>> including Ubuntu's distro kernel on 2.6.31 and then later my own built
>>> fresh 2.6.27.41 kernel. The issue was reproducible on all three
>>> kernels!
>>>
>>> This led me to believe this was a system / hard drive issue and I
>>> braced myself for a system fix. I still needed to prove this was
>>
>> Just some hints for ruling out the system / hard drive problem.
>>
>> smartctl -a /dev/sdx is your friend for checking your disk (keep an eye
>> on anything suspicious like re-allocated sector count going up etc.)
>
> Sweet thanks, here's my current output, I'll try later after I get
> some day work done to pull linux-next and make it moan. Let me know if
> you see anything fishy.

<-- snip full log -->

> Also available at:
>
> http://bombadil.infradead.org/~mcgrof/logs/2009/12/smart-ctl-sda2.txt
>
>> It could also be an fs-related issue that shows up only under specific
>> conditions
>
> OK -- I see, I used a fresh new ext3, did not make the jump to ext4.
>
>> (e.g. an almost full partition -- some file systems start to
>> crawl when the amount of available free space gets low).
>
> Got it, thanks. The partition has a lot of room:
>
> mcgrof@tux ~ $ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda2              91G   43G   44G  50% /
>
> Also, I only have one partition.

GSmartControl is very cool, just ran the short self test and it passed
without issues. I'll now run the extended self tests. I'll note that
right after the self test I had to checkout the 2.6.32.y branch on
hpa's tree and noticed a similar type of slowdown as I did with pulling
linux-next. Only thing with linux-next is it takes ages to complete which
just makes waiting unbearable. This all makes me suspect it's something
else. But let's see what these results on the GSmartControl yield.

Luis

2009-12-18 20:51:34

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 11:55 AM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Dec 18, 2009 at 11:19 AM, Luis R. Rodriguez <[email protected]> wrote:
>> On Fri, Dec 18, 2009 at 9:38 AM, Bartlomiej Zolnierkiewicz
>> <[email protected]> wrote:
>>> On Friday 18 December 2009 06:26:29 pm Luis R. Rodriguez wrote:
>>>
>>>> on my kernel logs. Bewildered with this issue I set out to prove to
>>>> myself this issue was not a 2.6.32 issue and booted other kernels,
>>>> including Ubuntu's distro kernel on 2.6.31 and then later my own built
>>>> fresh 2.6.27.41 kernel. The issue was reproducible on all three
>>>> kernels!
>>>>
>>>> This led me to believe this was a system / hard drive issue and I
>>>> braced myself for a system fix. I still needed to prove this was
>>>
>>> Just some hints for ruling out the system / hard drive problem.
>>>
>>> smartctl -a /dev/sdx is your friend for checking your disk (keep an eye
>>> on anything suspicious like re-allocated sector count going up etc.)
>>
>> Sweet thanks, here's my current output, I'll try later after I get
>> some day work done to pull linux-next and make it moan. Let me know if
>> you see anything fishy.
>
> <-- snip full log -->
>
>> Also available at:
>>
>> http://bombadil.infradead.org/~mcgrof/logs/2009/12/smart-ctl-sda2.txt
>>
>>> It could also be an fs-related issue that shows up only under specific
>>> conditions
>>
>> OK -- I see, I used a fresh new ext3, did not make the jump to ext4.
>>
>>> (e.g. an almost full partition -- some file systems start to
>>> crawl when the amount of available free space gets low).
>>
>> Got it, thanks. The partition has a lot of room:
>>
>> mcgrof@tux ~ $ df -h
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/sda2              91G   43G   44G  50% /
>>
>> Also, I only have one partition.
>
> GSmartControl is very cool, just ran the short self test and it passed
> without issues. I'll now run the extended self tests. I'll note that
> right after the self test I had to checkout the 2.6.32.y branch on
> hpa's tree and noticed a similar type of slowdown as I did with pulling
> linux-next. Only thing with linux-next is it takes ages to complete which
> just makes waiting unbearable. This all makes me suspect it's something
> else. But let's see what these results on the GSmartControl yield.

I tested the same exact git pull on the other T61 laptop I have and
was able to see the same crippling effects but not as bad as with my
main T61. The difference between them is that the one where I see the
worst issue has an Intel(R) Core(TM)2 Duo CPU T8100 @ 1.80GHz while the
other one has the same CPU but at 2.10GHz. The only thing I see
different between linux-next and say wireless-testing is linux-next
will have a lot more new objects and the pull will end with a git
merge that will fail and require you to 'git reset --hard origin'. The
latter part shouldn't be taken into the equation though, as I see
the issue creeping up early on during the pull, while git is counting
objects and even later compressing.

I'm starting to glare at CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
with suspicious looks. On both boxes the CPU kept itself @ 800 MHz
during most of the git pull, I did see the CPU idle hitting 0
frequently and the CPU wait time ~ 20 or 30.

My GSmartControl extensive test is almost done.

I'll test 2.6.33-rc1 once John gets it into his tree.

Luis
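
A small sketch of how the CPU wait time mentioned above could be
measured from two samples of the aggregate "cpu" line in /proc/stat.
The function name iowait_percent is made up for illustration, and the
5 second sampling window in the usage comment is an arbitrary choice.
It assumes the usual field order (user nice system idle iowait irq
softirq steal).

```shell
# Given two aggregate "cpu ..." lines from /proc/stat (earlier sample first),
# print the share of elapsed CPU time spent in iowait as a whole percentage.
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq and steal; guest time is already counted in user.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6
        }
        END {
            d = t[2] - t[1]
            if (d > 0) {
                printf "%d\n", (w[2] - w[1]) * 100 / d
            } else {
                print 0
            }
        }'
}

# Typical use while the pull is running:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_percent "$a" "$b"
# The governor and current frequency can be read directly from sysfs:
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
```

A high iowait share together with a low CPU frequency would suggest the
box is waiting on the disk rather than being held back by the governor.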

2009-12-18 21:13:37

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 12:51 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Dec 18, 2009 at 11:55 AM, Luis R. Rodriguez <[email protected]> wrote:
>> On Fri, Dec 18, 2009 at 11:19 AM, Luis R. Rodriguez <[email protected]> wrote:
>>> On Fri, Dec 18, 2009 at 9:38 AM, Bartlomiej Zolnierkiewicz
>>> <[email protected]> wrote:
>>>> On Friday 18 December 2009 06:26:29 pm Luis R. Rodriguez wrote:
>>>>
>>>>> on my kernel logs. Bewildered with this issue I set out to prove to
>>>>> myself this issue was not a 2.6.32 issue and booted other kernels,
>>>>> including Ubuntu's distro kernel on 2.6.31 and then later my own built
>>>>> fresh 2.6.27.41 kernel. The issue was reproducible on all three
>>>>> kernels!
>>>>>
>>>>> This led me to believe this was a system / hard drive issue and I
>>>>> braced myself for a system fix. I still needed to prove this was
>>>>
>>>> Just some hints for ruling out the system / hard drive problem.
>>>>
>>>> smartctl -a /dev/sdx is your friend for checking your disk (keep an eye
>>>> on anything suspicious like re-allocated sector count going up etc.)
>>>
>>> Sweet thanks, here's my current output, I'll try later after I get
>>> some day work done to pull linux-next and make it moan. Let me know if
>>> you see anything fishy.
>>
>> <-- snip full log -->
>>
>>> Also available at:
>>>
>>> http://bombadil.infradead.org/~mcgrof/logs/2009/12/smart-ctl-sda2.txt
>>>
>>>> It could also be an fs-related issue that shows up only under specific
>>>> conditions
>>>
>>> OK -- I see, I used a fresh new ext3, did not make the jump to ext4.
>>>
>>>> (e.g. an almost full partition -- some file systems start to
>>>> crawl when the amount of available free space gets low).
>>>
>>> Got it, thanks. The partition has a lot of room:
>>>
>>> mcgrof@tux ~ $ df -h
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> /dev/sda2              91G   43G   44G  50% /
>>>
>>> Also, I only have one partition.
>>
>> GSmartControl is very cool, just ran the short self test and it passed
>> without issues. I'll now run the extended self tests. I'll note that
>> right after the self test I had to checkout the 2.6.32.y branch on
>> hpa's tree and noticed a similar type of slowdown as I did with pulling
>> linux-next. Only thing with linux-next is it takes ages to complete which
>> just makes waiting unbearable. This all makes me suspect it's something
>> else. But let's see what these results on the GSmartControl yield.
>
> I tested the same exact git pull on the other T61 laptop I have and
> was able to see the same crippling effects but not as bad as with my
> main T61. The difference between them is that the one where I see the
> worst issue has an Intel(R) Core(TM)2 Duo CPU T8100 @ 1.80GHz while the
> other one has the same CPU but at 2.10GHz. The only thing I see
> different between linux-next and say wireless-testing is linux-next
> will have a lot more new objects and the pull will end with a git
> merge that will fail and require you to 'git reset --hard origin'. The
> latter part shouldn't be taken into the equation though, as I see
> the issue creeping up early on during the pull, while git is counting
> objects and even later compressing.
>
> I'm starting to glare at CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
> with suspicious looks. On both boxes the CPU kept itself @ 800 MHz
> during most of the git pull, I did see the CPU idle hitting 0
> frequently and the CPU wait time ~ 20 or 30.
>
> My GSmartControl extensive test is almost done.

The test completed without any errors.

> I'll test 2.6.33-rc1 once John gets it into his tree.

Now to wait for this guy.

Luis

2009-12-18 21:56:53

by Stephen Rothwell

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

Hi Luis,

On Fri, 18 Dec 2009 09:26:29 -0800 "Luis R. Rodriguez" <[email protected]> wrote:
>
> I tend to always be on a 2.6.32 kernel + John's queued up patches for
> wireless for the next kernel release (I use wireless-testing). My
> system is a Thinkpad T61, userspace is Ubuntu 9.10 based (ships with
> git 1.6.3.3) and I kept an ext3 filesystem to be able to go back in
> time to 2.6.27 at will without issues. I git clone'd linux-next a few
> weeks ago. After a few days I then tried to git pull and my system
> became completely unusable. It took *ages* to open up a terminal and

The start of the daily linux-next boilerplate says:

> If you are tracking the linux-next tree using git, you should not use
> "git pull" to do so as that will try to merge the new linux-next release
> with the old one. You should use "git fetch" as mentioned in the FAQ on
> the wiki (see below).

(Unfortunately, the wiki seems to be unavailable at the moment)

I am guessing that the merge that git is attempting is killing your
laptop (though besides the number of common commits I am not sure why).
Please try using "git fetch" instead.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/
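
A minimal sketch of the fetch-based update described above, assuming
the checkout was cloned with the default remote name origin and that
the branch being followed is origin/master. The function name
update_next is made up for illustration. Note that 'git reset --hard'
throws away local commits and uncommitted changes, so this only suits
a tree kept purely for tracking.

```shell
# Update a linux-next checkout without attempting a merge: fetch the new
# history, then move the current branch to the fetched remote branch.
# $1 = path to the checkout, $2 = remote-tracking ref (default origin/master).
update_next() {
    repo=$1
    ref=${2:-origin/master}
    (
        cd "$repo" || exit 1
        git fetch origin || exit 1
        # Discards local commits and uncommitted changes in the checkout.
        git reset --hard "$ref"
    )
}

# Typical use (the path is a placeholder):
#   update_next ~/linux-next
```

Because each linux-next release is rebuilt from scratch, the new tip is
not a descendant of the old one, which is why a plain 'git pull' ends
up attempting a large merge.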



2009-12-18 22:03:36

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 1:56 PM, Stephen Rothwell <[email protected]> wrote:
> Hi Luis,
>
> On Fri, 18 Dec 2009 09:26:29 -0800 "Luis R. Rodriguez" <[email protected]> wrote:
>>
>> I tend to always be on a 2.6.32 kernel + John's queued up patches for
>> wireless for the next kernel release (I use wireless-testing). My
>> system is a Thinkpad T61, userspace is Ubuntu 9.10 based (ships with
>> git 1.6.3.3) and I kept an ext3 filesystem to be able to go back in
>> time to 2.6.27 at will without issues.  I git clone'd linux-next a few
>> weeks ago. After a few days I then tried to git pull and my system
>> became completely unusable. It took *ages* to open up a terminal and
>
> The start of the daily linux-next boilerplate says:
>
>> If you are tracking the linux-next tree using git, you should not use
>> "git pull" to do so as that will try to merge the new linux-next release
>> with the old one.  You should use "git fetch" as mentioned in the FAQ on
>> the wiki (see below).
>
> (Unfortunately, the wiki seems to be unavailable at the moment)
>
> I am guessing that the merge that git is attempting is killing your
> laptop (though besides the number of common commits I am not sure why).
> Please try using "git fetch" instead.

Indeed, I learned my lesson now. Thanks for the details.

Now granted, even if 'git merge' is killing my laptop due to the
conflicts of the insane merge I was trying to do, it *still* should not
make my box completely unresponsive for so long. And given that I'm
using mostly distribution specific kernel config options and have
ruled out my hard drive, it seems to be a serious general kernel issue
even down to 2.6.27. Whatever load git is generating, I'm sure other
userspace software can end up generating it too, and it would make any
user go completely bananas. I was about to rip my hair out.

Luis

2009-12-23 15:27:23

by Denys Vlasenko

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Fri, Dec 18, 2009 at 11:03 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Dec 18, 2009 at 1:56 PM, Stephen Rothwell <[email protected]> wrote:
>> Hi Luis,
>>
>> On Fri, 18 Dec 2009 09:26:29 -0800 "Luis R. Rodriguez" <[email protected]> wrote:
>>>
>>> I tend to always be on a 2.6.32 kernel + John's queued up patches for
>>> wireless for the next kernel release (I use wireless-testing). My
>>> system is a Thinkpad T61, userspace is Ubuntu 9.10 based (ships with
>>> git 1.6.3.3) and I kept an ext3 filesystem to be able to go back in
>>> time to 2.6.27 at will without issues. I git clone'd linux-next a few
>>> weeks ago. After a few days I then tried to git pull and my system
>>> became completely unusable. It took *ages* to open up a terminal and
>>
>> The start of the daily linux-next boilerplate says:
>>
>>> If you are tracking the linux-next tree using git, you should not use
>>> "git pull" to do so as that will try to merge the new linux-next release
>>> with the old one. You should use "git fetch" as mentioned in the FAQ on
>>> the wiki (see below).
>>
>> (Unfortunately, the wiki seems to be unavailable at the moment)
>>
>> I am guessing that the merge that git is attempting is killing your
>> laptop (though besides the number of common commits I am not sure why).
>> Please try using "git fetch" instead.
>
> Indeed, I learned my lesson now. Thanks for the details.
>
> Now granted, even if 'git merge' is killing my laptop due to the
> conflicts of the insane merge I was trying to do, it *still* should not
> make my box completely unresponsive for so long. And given that I'm
> using mostly distribution specific kernel config options and have
> ruled out my hard drive, it seems to be a serious general kernel issue
> even down to 2.6.27. Whatever load git is generating, I'm sure other
> userspace software can end up generating it too, and it would make any
> user go completely bananas. I was about to rip my hair out.

Git gurus would know it by heart, but I am not one. So if I were you,
I would just do a generic diagnostic run. What is git doing that
slows the machine down that much? Is it spawning a lot
of running processes? Is it allocating/using so much memory
that your box goes into a severe swap storm?

I guess it is the latter. If it is, then it's not a kernel problem -
the kernel can't magically make your system adequately handle a workload
which needs 3 GB for its working set when the box only has 2 GB of RAM.
It _will_ be very slow.
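
One way to check the swap-storm theory is to look at how much swap is
in use while the merge runs. What follows is only a minimal Python
sketch, not something posted in this thread: it assumes a Linux
/proc/meminfo with the usual SwapTotal and SwapFree fields, and the
function names are made up for illustration.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(info):
    """Fraction of swap in use; 0.0 if the box has no swap configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only; the path is an assumption)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling swap_used_fraction(read_meminfo()) before and during the merge
gives a rough before/after picture; a value that climbs toward 1.0
would be consistent with memory starvation rather than a kernel bug.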

--
vda

2009-12-23 16:21:14

by Luis R. Rodriguez

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Wed, Dec 23, 2009 at 7:27 AM, Denys Vlasenko
<[email protected]> wrote:
> On Fri, Dec 18, 2009 at 11:03 PM, Luis R. Rodriguez <[email protected]> wrote:
>> On Fri, Dec 18, 2009 at 1:56 PM, Stephen Rothwell <[email protected]> wrote:
>>> Hi Luis,
>>>
>>> On Fri, 18 Dec 2009 09:26:29 -0800 "Luis R. Rodriguez" <[email protected]> wrote:
>>>>
>>>> I tend to always be on a 2.6.32 kernel + John's queued up patches for
>>>> wireless for the next kernel release (I use wireless-testing). My
>>>> system is a Thinkpad T61, userspace is Ubuntu 9.10 based (ships with
>>>> git 1.6.3.3) and I kept an ext3 filesystem to be able to go back in
>>>> time to 2.6.27 at will without issues.  I git clone'd linux-next a few
>>>> weeks ago. After a few days I then tried to git pull and my system
>>>> became completely unusable, It took *ages* to open up a terminal and
>>>
>>> The start of the daily linux-next boilerplate says:
>>>
>>>> If you are tracking the linux-next tree using git, you should not use
>>>> "git pull" to do so as that will try to merge the new linux-next release
>>>> with the old one.  You should use "git fetch" as mentioned in the FAQ on
>>>> the wiki (see below).
>>>
>>> (Unfortunately, the wiki seems to be unavailable at the moment)
>>>
>>> I am guessing that the merge that git is attempting is killing your
>>> laptop (though besides the number of common commits I am not sure why).
>>> Please try using "get fetch" instead.
>>
>> Indeed, I learned my lesson now. Thanks for the details.
>>
>> Now granted, even if 'git merge' is killing my laptop due to the
>> conflicts of the insane merge I was trying to do it *still* should not
>> make my box completely unresponsive for so long. And given that I'm
>> using mostly distribution specific kernel config options and my have
>> ruled out my hard drive it seems a general serious kernel issue even
>> down to 2.6.27. Whatever git is doing I'm sure other userspace
>> software can also end up generating and would make any user go
>> completely bananas. I was about to rip my hair out.
>
> Git gurus would know it by heart, but I am not one. So if I were you,
> I would just do a generic diagnostic run.

Right, it's the first thing I did, but it's to the extent that even doing
that is not possible unless you're willing to wait 5-10 minutes for
some output. I'm not kidding.

> What is it git is doing
> so that machine slows down that much? Is it spawning a lot
> of running processes?

Doesn't seem like it. The only visible git process is git-merge. I
forgot to grep for all git processes, though, but I think that was the
only one.

> Is it allocating/using so much memory
> that your box goes into a severe swap storm?

Could be, 979M virtual, 298M resident size (non swapped), 58665 shared.
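
To avoid missing a second git process, and to get the same
virtual/resident numbers without waiting for top or htop to start, the
per-process status files can be read directly. This is a hedged Python
sketch assuming the Linux /proc/<pid>/status layout; the helper names
are hypothetical, and the VmSwap field may simply be absent on older
kernels.

```python
import os

FIELDS = ("Name", "VmSize", "VmRSS", "VmSwap")


def parse_status(text):
    """Extract a few fields from /proc/<pid>/status-style text."""
    result = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        value = rest.strip()
        if not sep or key not in FIELDS or not value:
            continue
        # Memory fields look like "305152 kB"; keep only the number (kB).
        result[key] = value if key == "Name" else int(value.split()[0])
    return result


def git_processes(proc="/proc"):
    """Yield (pid, fields) for every process whose name starts with 'git'."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                fields = parse_status(f.read())
        except OSError:
            continue  # the process exited while we were looking
        if fields.get("Name", "").startswith("git"):
            yield int(entry), fields
```

Looping over git_processes() and printing each entry would show every
git-related process along with its VmSize and VmRSS in kB.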

Unfortunately when this happens I cannot log into my box and run good
diagnostics, that's how much of a pain in the bolas this is. One
morning when I had enough patience I did leave vmstat and iostat
running and didn't see much out of the ordinary, except that CPU wait
time was pretty high. I did manage to get at least htop running once
and took a screenshot (and this took me about 10 minutes to generate):

http://bombadil.infradead.org/~mcgrof/images/2009/12/git-merge.jpg

So if anything it could be the latter, that of a swap storm.

What I should have running is sar, that way I can trek back in time
when I want to.
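
In the absence of sar, a small background logger can keep a
timestamped record of swap traffic to look at afterwards. The sketch
below is an illustration only: it assumes the pswpin and pswpout
counters (pages swapped in and out since boot) in the Linux
/proc/vmstat file, and the function names, log file and default
interval are arbitrary choices.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}


def log_swap_activity(logfile, interval=10, samples=6, path="/proc/vmstat"):
    """Append one timestamped line of swap deltas per interval."""
    def snapshot():
        with open(path) as f:
            return parse_vmstat(f.read())

    previous = snapshot()
    with open(logfile, "a") as log:
        for _ in range(samples):
            time.sleep(interval)
            current = snapshot()
            delta = swap_delta(previous, current)
            log.write("%s pswpin=%d pswpout=%d\n" % (
                time.strftime("%Y-%m-%d %H:%M:%S"),
                delta["pswpin"], delta["pswpout"]))
            log.flush()  # keep the record even if the box locks up
            previous = current
```

Started before the merge, something like log_swap_activity("swap.log")
leaves a record that can be read once the machine is responsive again.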

But even when compiling the kernel my machine becomes unusable for a
few seconds when the linking of vmlinux.o starts, and in that case my
swap usage is about 45-125 MB. A silly example: pandora reliably poos
out on firefox while vmlinux.o is linking, requiring a pkill on firefox
to get it back.

> I guess it is the latter.

Except it also seems to happen with other things, like compiling the kernel.

I'll see if I can upgrade the memory on this thing.

> If it is, then it's not a kernel problem -
> kernel can't magically make your system adequately handle a workload
> which needs 3 GB for working set when the box only has 2 GB of RAM.
> It _will_ be very slow.

Sure, I'll try to keep an eye on swap overuse; I suppose it could be that.

I started to be suspicious about the CPU freq governor, but I'll note
that on both systems I still had issues even with the freq statically
set to the highest. I'll also note that on both 2.6.27 and 2.6.32 I used:

CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y

I started testing CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE=y

but don't really notice an improvement.

Luis

2009-12-23 17:11:40

by Denys Vlasenko

Subject: Re: git pull on linux-next makes my system crawl to its knees and beg for mercy

On Wed, Dec 23, 2009 at 5:20 PM, Luis R. Rodriguez <[email protected]> wrote:
>> Is it allocating/using so much memory
>> that your box goes into a severe swap storm?
>
> Could be, 979M virtual, 298M resident size (non swapped), 58665 shared.
>
> Unfortunately when this happens I cannot log into my box and run good
> diagnostics, that's how much of a pain in the bolas this is.

Nicing it may make it easier to do diagnostic work.
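
As a rough illustration of the nicing suggestion, here is a minimal
Python sketch that prefixes a command with the standard nice utility.
The wrapper names are made up, and "origin/master" in the usage note
is only a placeholder for whatever is actually being merged. Note that
nice only lowers CPU scheduling priority; it does not limit memory use
or swap traffic, so it may not help much if the box is in a swap storm.

```python
import subprocess


def niced(command, level=19):
    """Prefix a command so it runs at the given niceness (19 = lowest priority)."""
    return ["nice", "-n", str(level)] + list(command)


def run_niced(command, level=19):
    """Run the command at low CPU priority and return its exit status."""
    return subprocess.call(niced(command, level))
```

For example, run_niced(["git", "merge", "origin/master"]) would start
the merge at the lowest CPU priority, leaving more CPU time for a shell
running vmstat or top.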

> Some
> morning I had enough patience I did leave vmstat and iostat running
> and didn't see much out of the ordinary except CPU wait time was
> pretty high. I did manage to get at least htop running once and took a
> screenshot (and this took me about 10 minutes to generate):
>
> http://bombadil.infradead.org/~mcgrof/images/2009/12/git-merge.jpg

Looks like swap space is 2/3 used. This is an indication
of memory starvation.

It may be a residual condition - you have a lot of potentially
bloated programs running. What do you see if you reproduce
this situation soon after boot, with a minimum of other programs running?
For one, definitely do not start web browser(s) and such.
Ideally, do not run X at all.

If you still see a lot of swap used, then this is it - git
requires more memory for this task. The possibility that
the kernel has a bug where it needlessly swaps out is remote.

--
vda