2010-07-14 06:12:52

by Zeno Davatz

Subject: kmemleak, cpu usage jump out of nowhere

Hi

I got a new Intel core-8 i7 processor.

I am on kernel uname -a

Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux

Sometimes, out of nowhere, all 8 of my cores suddenly go to 100% CPU
usage and my machine lags, hangs and is not usable anymore. Some random
process just grabs a bunch of CPUs according to htop.

dmesg tells me:

kmemleak: 38 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

I am attaching the file from /sys/kernel/debug/kmemleak.

Let me know if you need anything else.

Best
Zeno


Attachments:
kmemleak (32.67 kB)

2010-07-14 08:06:05

by Pekka Enberg

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
> I got a new Intel core-8 i7 processor.
>
> I am on kernel uname -a
>
> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>
> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
> are at 100% CPU usage and my machine really lags and hangs and is not
> useable anymore. Some random process just grabs a bunch CPUs according
> to htop.

Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
likely the source of these pauses.

> dmesg tell me that
>
> kmemleak: 38 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>
> I am attaching you the file from /sys/kernel/debug/kmemleak

Zeno, can you post your dmesg and .config, please?

We have a bunch of suspected leaks here. The first class of leaks is
related to reserve_region():

unreferenced object 0xf6d80740 (size 64):
comm "swapper", pid 1, jiffies 4294892590 (age 57258.752s)
hex dump (first 32 bytes):
00 00 ee c7 00 00 00 00 ff b7 ee c7 00 00 00 00 ................
7c 09 52 c1 00 00 00 80 00 f2 5e c1 20 ac 6f c1 |.R.......^. .o.
backtrace:
[<c145d4eb>] kmemleak_alloc+0x27/0x4d
[<c10ad53f>] kmem_cache_alloc+0xa3/0xd4
[<c163b782>] __reserve_region_with_split+0x29/0x149
[<c163b86a>] __reserve_region_with_split+0x111/0x149
[<c163b89a>] __reserve_region_with_split+0x141/0x149
[<c163b89a>] __reserve_region_with_split+0x141/0x149
[<c163b89a>] __reserve_region_with_split+0x141/0x149
[<c163b8de>] reserve_region_with_split+0x3c/0x4f
[<c162e307>] e820_reserve_resources_late+0xea/0x108
[<c16504e6>] pcibios_resource_survey+0x23/0x2a
[<c1652022>] pcibios_init+0x61/0x73
[<c165172b>] pci_subsys_init+0x43/0x48
[<c1001114>] do_one_initcall+0x27/0x178
[<c162b357>] kernel_init+0x129/0x1c7
[<c10238b6>] kernel_thread_helper+0x6/0x10
[<ffffffff>] 0xffffffff

unreferenced object 0xf6d232a0 (size 32):
comm "swapper", pid 1, jiffies 4294892601 (age 57258.708s)
hex dump (first 32 bytes):
70 6e 70 20 30 30 3a 30 31 00 d2 f6 fa 00 0b c1 pnp 00:01.......
00 00 00 00 04 aa dc f6 2c 00 00 00 01 00 00 00 ........,.......
backtrace:
[<c145d4eb>] kmemleak_alloc+0x27/0x4d
[<c10ad53f>] kmem_cache_alloc+0xa3/0xd4
[<c123040b>] reserve_range+0x3b/0x13f
[<c1230597>] system_pnp_probe+0x88/0xb0
[<c122b0f7>] pnp_device_probe+0x67/0xaf
[<c12d5246>] driver_probe_device+0x5b/0x148
[<c12d539a>] __driver_attach+0x67/0x69
[<c12d4c33>] bus_for_each_dev+0x46/0x64
[<c12d512c>] driver_attach+0x19/0x1b
[<c12d46f5>] bus_add_driver+0x17a/0x225
[<c12d55b8>] driver_register+0x65/0x110
[<c122af44>] pnp_register_driver+0x17/0x19
[<c1647a91>] pnp_system_init+0xd/0xf
[<c1001114>] do_one_initcall+0x27/0x178
[<c162b357>] kernel_init+0x129/0x1c7
[<c10238b6>] kernel_thread_helper+0x6/0x10

I scanned through both call sites briefly but didn't find anything obvious.

The second class of leaks seems to be related to kobjects:

unreferenced object 0xf6951920 (size 32):
comm "swapper", pid 1, jiffies 4294892614 (age 57258.656s)
hex dump (first 32 bytes):
63 70 75 69 64 6c 65 00 2f 76 69 72 74 75 61 6c cpuidle./virtual
2f 67 72 61 70 68 69 63 73 2f 66 62 63 6f 6e 00 /graphics/fbcon.
backtrace:
[<c11e33c6>] kvasprintf+0x2a/0x47
[<c11db5d7>] kobject_set_name_vargs+0x17/0x52
[<c11db629>] kobject_add_varg+0x17/0x41
[<c11db67a>] kobject_init_and_add+0x27/0x2d
[<c1389b0c>] cpuidle_add_sysfs+0x3e/0x56
[<c138944e>] __cpuidle_register_device+0xfb/0x116
[<c13895fc>] cpuidle_register_device+0x18/0x54
[<c1645397>] intel_idle_init+0x2b9/0x327
[<c1001114>] do_one_initcall+0x27/0x178
[<c162b357>] kernel_init+0x129/0x1c7
[<c10238b6>] kernel_thread_helper+0x6/0x10
[<ffffffff>] 0xffffffff

unreferenced object 0xf60045c0 (size 32):
comm "swapper", pid 1, jiffies 4294893885 (age 57253.572s)
hex dump (first 32 bytes):
30 00 64 4b bc a3 bc a3 80 f5 80 f5 a7 15 a7 15 0.dK............
34 07 34 07 69 4f 69 4f f4 47 f4 47 ef 27 ef 27 4.4.iOiO.G.G.'.'
backtrace:
[<c145d4eb>] kmemleak_alloc+0x27/0x4d
[<c10adb0c>] __kmalloc+0xd4/0x10d
[<c11e33c6>] kvasprintf+0x2a/0x47
[<c11db5d7>] kobject_set_name_vargs+0x17/0x52
[<c11db629>] kobject_add_varg+0x17/0x41
[<c11db6ac>] kobject_add+0x2c/0x54
[<c138ad14>] add_sysfs_fw_map_entry+0x43/0x7c
[<c164f00f>] memmap_init+0x16/0x30
[<c1001114>] do_one_initcall+0x27/0x178
[<c162b357>] kernel_init+0x129/0x1c7
[<c10238b6>] kernel_thread_helper+0x6/0x10
[<ffffffff>] 0xffffffff

The third class of leaks is related to drm_setversion():

unreferenced object 0xf6b10620 (size 32):
comm "X", pid 2268, jiffies 4294894722 (age 57250.228s)
hex dump (first 32 bytes):
6e 6f 75 76 65 61 75 40 70 63 69 3a 30 30 30 30 nouveau@pci:0000
3a 30 35 3a 30 30 2e 30 00 00 00 00 00 00 00 00 :05:00.0........
backtrace:
[<c145d4eb>] kmemleak_alloc+0x27/0x4d
[<c10adb0c>] __kmalloc+0xd4/0x10d
[<c125315e>] drm_setversion+0x140/0x1bf
[<c12514f2>] drm_ioctl+0x258/0x3d7
[<c10bdd42>] vfs_ioctl+0x27/0x9b
[<c10bdee2>] do_vfs_ioctl+0x66/0x54b
[<c10be3fa>] sys_ioctl+0x33/0x4f
[<c102339c>] sysenter_do_call+0x12/0x2c
[<ffffffff>] 0xffffffff

for which I wasn't able to find the allocation call-site. Maybe Zeno
has some out-of-tree DRM module?

The fourth class of leaks is related to per-CPU allocations in the block layer:

unreferenced object 0xf6681400 (size 1024):
comm "async/2", pid 1307, jiffies 4294894138 (age 57252.564s)
hex dump (first 32 bytes):
80 87 ff ff c4 ff ff ff c4 ff ff ff c4 ff ff ff ................
fc ff ff ff fc ff ff ff fc ff ff ff fc ff ff ff ................
backtrace:
[<c145d4eb>] kmemleak_alloc+0x27/0x4d
[<c10adb0c>] __kmalloc+0xd4/0x10d
[<c10ae982>] pcpu_mem_alloc+0x18/0x3a
[<c10af239>] pcpu_extend_area_map+0x1a/0xad
[<c10af578>] pcpu_alloc+0x2ac/0x82b
[<c10afb10>] __alloc_percpu+0xa/0xc
[<c11d4518>] alloc_disk_node+0x2e/0xbf
[<c11d45b6>] alloc_disk+0xd/0xf
[<c130260c>] sd_probe+0x54/0x298
[<c12d5246>] driver_probe_device+0x5b/0x148
[<c12d53ca>] __device_attach+0x2e/0x32
[<c12d49f3>] bus_for_each_drv+0x46/0x64
[<c12d5449>] device_attach+0x5c/0x60
[<c12d484d>] bus_probe_device+0x1a/0x30
[<c12d358a>] device_add+0x448/0x509
[<c12fb881>] scsi_sysfs_add_sdev+0x54/0x212

for which I didn't find anything obvious that could explain it.

I suspect most of the reports are false positives. Catalin, what do
you make out of them?

Pekka
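
A report like the one above can be sorted into classes mechanically by looking at the first backtrace frame that is not part of the allocator. The following Python sketch illustrates the idea on two shortened entries from this thread. The set of "generic" frames and the names `first_caller` and `group_leaks` are assumptions made for this example, not part of kmemleak.

```python
import re
from collections import Counter

# Frames that belong to kmemleak or the allocators and so do not identify
# the code that requested the memory. This list is an assumption and is
# certainly incomplete.
GENERIC_FRAMES = {"kmemleak_alloc", "kmem_cache_alloc", "__kmalloc", "kvasprintf"}

# Matches backtrace lines such as "[<c145d4eb>] kmemleak_alloc+0x27/0x4d".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def first_caller(entry: str) -> str:
    """Return the first backtrace symbol that is not a generic allocator frame."""
    for name in FRAME_RE.findall(entry):
        if name not in GENERIC_FRAMES:
            return name
    return "<unknown>"


def group_leaks(report: str) -> Counter:
    """Count kmemleak report entries per first non-generic caller."""
    entries = [e for e in report.split("unreferenced object ") if e.strip()]
    return Counter(first_caller(e) for e in entries)


# Two entries from the thread, shortened to the relevant frames.
SAMPLE = """\
unreferenced object 0xf6d80740 (size 64):
  comm "swapper", pid 1, jiffies 4294892590 (age 57258.752s)
  backtrace:
    [<c145d4eb>] kmemleak_alloc+0x27/0x4d
    [<c10ad53f>] kmem_cache_alloc+0xa3/0xd4
    [<c163b782>] __reserve_region_with_split+0x29/0x149
unreferenced object 0xf6b10620 (size 32):
  comm "X", pid 2268, jiffies 4294894722 (age 57250.228s)
  backtrace:
    [<c145d4eb>] kmemleak_alloc+0x27/0x4d
    [<c10adb0c>] __kmalloc+0xd4/0x10d
    [<c125315e>] drm_setversion+0x140/0x1bf
    [<c12514f2>] drm_ioctl+0x258/0x3d7
"""

leak_classes = group_leaks(SAMPLE)
```

Here `leak_classes` maps `__reserve_region_with_split` and `drm_setversion` to one entry each. Grouping this way only organizes the report; it cannot tell whether an entry is a real leak or a false positive.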

2010-07-14 08:28:05

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

Dear Pekka

On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>> I got a new Intel core-8 i7 processor.
>>
>> I am on kernel uname -a
>>
>> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>>
>> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>> are at 100% CPU usage and my machine really lags and hangs and is not
>> useable anymore. Some random process just grabs a bunch CPUs according
>> to htop.
>
> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
> likely the source of these pauses.

Shall I disable that? I will do that and try again.

>> I am attaching you the file from /sys/kernel/debug/kmemleak
>
> Zeno, can you post your dmesg and .config, please?

Sure, see attached files.

> The third class of leaks is related to drm_setversion():
>
> unreferenced object 0xf6b10620 (size 32):
> comm "X", pid 2268, jiffies 4294894722 (age 57250.228s)
> hex dump (first 32 bytes):
> 6e 6f 75 76 65 61 75 40 70 63 69 3a 30 30 30 30 nouveau@pci:0000
> 3a 30 35 3a 30 30 2e 30 00 00 00 00 00 00 00 00 :05:00.0........
> backtrace:
> [<c145d4eb>] kmemleak_alloc+0x27/0x4d
> [<c10adb0c>] __kmalloc+0xd4/0x10d
> [<c125315e>] drm_setversion+0x140/0x1bf
> [<c12514f2>] drm_ioctl+0x258/0x3d7
> [<c10bdd42>] vfs_ioctl+0x27/0x9b
> [<c10bdee2>] do_vfs_ioctl+0x66/0x54b
> [<c10be3fa>] sys_ioctl+0x33/0x4f
> [<c102339c>] sysenter_do_call+0x12/0x2c
> [<ffffffff>] 0xffffffff
>
> for which I wasn't able to find the allocation call-site. Maybe Zeno
> has some out-of-tree DRM module?

I am using the in-kernel nouveau driver, as I have an Nvidia graphics card.

05:00.0 VGA compatible controller: nVidia Corporation G98 [GeForce
8400 GS] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 8321
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
I/O ports at ec00 [size=128]
[virtual] Expansion ROM at fb000000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel <?>
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information <?>
Kernel driver in use: nouveau

Best
Zeno


Attachments:
.config (62.05 kB)
dmesg (15.11 kB)

2010-07-14 08:34:44

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 10:31 AM, Damien Wyart <[email protected]> wrote:

>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>> > I got a new Intel core-8 i7 processor.
>
>> > I am on kernel uname -a
>
>> > Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>> > Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>
>> > Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>> > are at 100% CPU usage and my machine really lags and hangs and is not
>> > useable anymore. Some random process just grabs a bunch CPUs according
>> > to htop.
>
> * Pekka Enberg <[email protected]> [2010-07-14 11:05]:
>> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
>> likely the source of these pauses.
>
> I am seeing the same problem with a Core i7 920 and 2.6.35-rc5, and I do
> not have CONFIG_DEBUG_KMEMLEAK enabled, so I think this is not related.
>
> I do not see anything special in the logs, just the load becoming mad
> and almost preventing ssh access. I've been seeing that since the first
> 2.6.35 rc I tested (-rc2 or -rc3, I don't remember) and I did not have
> time to report it before but I was surprised nobody else did. No problem
> with 2.6.34 and 2.6.34.1.

Same here. The last build I tested was 2.6.34-rc7. No problems
there. No CPU jumps out of nowhere.

It is as if some application, e.g. htop, all of a sudden uses 400% CPU.

Best
Zeno

2010-07-14 08:38:33

by Pekka Enberg

Subject: Re: kmemleak, cpu usage jump out of nowhere

Zeno Davatz wrote:
> On Wed, Jul 14, 2010 at 10:31 AM, Damien Wyart <[email protected]> wrote:
>
>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>>>> I got a new Intel core-8 i7 processor.
>>>> I am on kernel uname -a
>>>> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>>>> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>>>> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>>>> are at 100% CPU usage and my machine really lags and hangs and is not
>>>> useable anymore. Some random process just grabs a bunch CPUs according
>>>> to htop.
>> * Pekka Enberg <[email protected]> [2010-07-14 11:05]:
>>> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
>>> likely the source of these pauses.
>> I am seeing the same problem with a Core i7 920 and 2.6.35-rc5, and I do
>> not have CONFIG_DEBUG_KMEMLEAK enabled, so I think this is not related.
>>
>> I do not see anything special in the logs, just the load becoming mad
>> and almost preventing ssh access. I've been seeing that since the first
>> 2.6.35 rc I tested (-rc2 or -rc3, I don't remember) and I did not have
>> time to report it before but I was surprised nobody else did. No problem
>> with 2.6.34 and 2.6.34.1.
>
> same with me. My last build I tested was 2.6.34-rc7. No problems
> there. No CPU jumps out of nowhere.
>
> It is like any application all of a sudden use 400% CPU i.e. htop.

Interesting. Let's CC some scheduler folks for help.

Pekka

2010-07-14 08:40:05

by Damien Wyart

Subject: Re: kmemleak, cpu usage jump out of nowhere

Hi,

> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
> > I got a new Intel core-8 i7 processor.

> > I am on kernel uname -a

> > Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
> > Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux

> > Sometimes in the middle of nowhere all of a sudden all of my 8-cores
> > are at 100% CPU usage and my machine really lags and hangs and is not
> > useable anymore. Some random process just grabs a bunch CPUs according
> > to htop.

* Pekka Enberg <[email protected]> [2010-07-14 11:05]:
> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
> likely the source of these pauses.

I am seeing the same problem with a Core i7 920 and 2.6.35-rc5, and I do
not have CONFIG_DEBUG_KMEMLEAK enabled, so I think this is not related.

I do not see anything special in the logs, just the load becoming mad
and almost preventing ssh access. I've been seeing that since the first
2.6.35 rc I tested (-rc2 or -rc3, I don't remember) and I did not have
time to report it before but I was surprised nobody else did. No problem
with 2.6.34 and 2.6.34.1.

--
Damien

2010-07-14 08:54:24

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 10:38 AM, Pekka Enberg <[email protected]> wrote:
> Zeno Davatz wrote:
>>
>> On Wed, Jul 14, 2010 at 10:31 AM, Damien Wyart <[email protected]>
>> wrote:
>>
>>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>>>>>
>>>>> I got a new Intel core-8 i7 processor.
>>>>> I am on kernel uname -a
>>>>> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>>>>> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>>>>> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>>>>> are at 100% CPU usage and my machine really lags and hangs and is not
>>>>> useable anymore. Some random process just grabs a bunch CPUs according
>>>>> to htop.
>>>
>>> * Pekka Enberg <[email protected]> [2010-07-14 11:05]:
>>>>
>>>> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
>>>> likely the source of these pauses.
>>>
>>> I am seeing the same problem with a Core i7 920 and 2.6.35-rc5, and I do
>>> not have CONFIG_DEBUG_KMEMLEAK enabled, so I think this is not related.
>>>
>>> I do not see anything special in the logs, just the load becoming mad
>>> and almost preventing ssh access. I've been seeing that since the first
>>> 2.6.35 rc I tested (-rc2 or -rc3, I don't remember) and I did not have
>>> time to report it before but I was surprised nobody else did. No problem
>>> with 2.6.34 and 2.6.34.1.
>>
>> same with me. My last build I tested was 2.6.34-rc7. No problems
>> there. No CPU jumps out of nowhere.
>>
>> It is like any application all of a sudden use 400% CPU i.e. htop.
>
> Interesting. Let's CC some scheduler folks for help.

One time it is gdm, another time firefox-bin, another time htop. It's
really odd. Nothing crashes; one of those just uses lots of CPU. The
machine gets really slow, then calms down again and everything is back
to normal.

;)

Maybe a bad temper?

Best
Zeno

2010-07-14 08:57:47

by Pekka Enberg

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 11:54 AM, Zeno Davatz <[email protected]> wrote:
> On Wed, Jul 14, 2010 at 10:38 AM, Pekka Enberg <[email protected]> wrote:
>> Zeno Davatz wrote:
>>>
>>> On Wed, Jul 14, 2010 at 10:31 AM, Damien Wyart <[email protected]>
>>> wrote:
>>>
>>>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>>>>>>
>>>>>> I got a new Intel core-8 i7 processor.
>>>>>> I am on kernel uname -a
>>>>>> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>>>>>> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>>>>>> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>>>>>> are at 100% CPU usage and my machine really lags and hangs and is not
>>>>>> useable anymore. Some random process just grabs a bunch CPUs according
>>>>>> to htop.
>>>>
>>>> * Pekka Enberg <[email protected]> [2010-07-14 11:05]:
>>>>>
>>>>> Why did you enable CONFIG_DEBUG_KMEMLEAK? Memory leak scanning is
>>>>> likely the source of these pauses.
>>>>
>>>> I am seeing the same problem with a Core i7 920 and 2.6.35-rc5, and I do
>>>> not have CONFIG_DEBUG_KMEMLEAK enabled, so I think this is not related.
>>>>
>>>> I do not see anything special in the logs, just the load becoming mad
>>>> and almost preventing ssh access. I've been seeing that since the first
>>>> 2.6.35 rc I tested (-rc2 or -rc3, I don't remember) and I did not have
>>>> time to report it before but I was surprised nobody else did. No problem
>>>> with 2.6.34 and 2.6.34.1.
>>>
>>> same with me. My last build I tested was 2.6.34-rc7. No problems
>>> there. No CPU jumps out of nowhere.
>>>
>>> It is like any application all of a sudden use 400% CPU i.e. htop.
>>
>> Interesting. Let's CC some scheduler folks for help.
>
> One time it is gdm, another time firefox-bin, another time htop. It's
> really odd. Nothing crashes; one of those just uses lots of CPU. The
> machine gets really slow, then calms down again and everything is back
> to normal.

That's the part that makes me think it's scheduler and/or cpufreq related.

> ;)
>
> Maybe a bad temper?

Yeah, maybe Tux is having a bad day. :-)

2010-07-14 09:51:23

by Catalin Marinas

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
> > On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:

> >> I am attaching you the file from /sys/kernel/debug/kmemleak
> >
> > Zeno, can you post your dmesg and .config, please?
>
> Sure, see attached files.

It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
need to post it again on the list).


kmemleak: Add support for NO_BOOTMEM configurations

From: Catalin Marinas <[email protected]>

With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
friends use the early_res functions for memory management when
NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
corresponding code paths for bootmem allocations.

Signed-off-by: Catalin Marinas <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
---
mm/bootmem.c | 2 ++
mm/page_alloc.c | 1 +
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 58c66cc..0747f68 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -435,6 +435,7 @@ void __init free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
 {
 #ifdef CONFIG_NO_BOOTMEM
 	free_early(physaddr, physaddr + size);
+	kmemleak_free_part(__va(physaddr), size);
 #else
 	unsigned long start, end;

@@ -460,6 +461,7 @@ void __init free_bootmem(unsigned long addr, unsigned long size)
 {
 #ifdef CONFIG_NO_BOOTMEM
 	free_early(addr, addr + size);
+	kmemleak_free_part(__va(addr), size);
 #else
 	unsigned long start, end;

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..f29f00b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3659,6 +3659,7 @@ void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
 	ptr = phys_to_virt(addr);
 	memset(ptr, 0, size);
 	reserve_early_without_check(addr, addr + size, "BOOTMEM");
+	kmemleak_alloc(ptr, size, 1, 0);
 	return ptr;
 }



--
Catalin

2010-07-14 09:56:07

by Pekka Enberg

Subject: Re: kmemleak, cpu usage jump out of nowhere

Catalin Marinas wrote:
> On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
>> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>
>>>> I am attaching you the file from /sys/kernel/debug/kmemleak
>>> Zeno, can you post your dmesg and .config, please?
>> Sure, see attached files.
>
> It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
> need to post it again on the list).
>
>
> kmemleak: Add support for NO_BOOTMEM configurations
>
> From: Catalin Marinas <[email protected]>
>
> With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
> friends use the early_res functions for memory management when
> NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
> corresponding code paths for bootmem allocations.
>
> Signed-off-by: Catalin Marinas <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: H. Peter Anvin <[email protected]>

Makes sense.

Acked-by: Pekka Enberg <[email protected]>

Zeno, this should fix the kmemleak false positives but not the big
pauses you're seeing.

2010-07-14 09:57:42

by Catalin Marinas

Subject: Re: kmemleak, cpu usage jump out of nowhere

Zeno Davatz <[email protected]> wrote:
> I got a new Intel core-8 i7 processor.
>
> I am on kernel uname -a
>
> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>
> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
> are at 100% CPU usage and my machine really lags and hangs and is not
> useable anymore. Some random process just grabs a bunch CPUs according
> to htop.
>
> dmesg tell me that
>
> kmemleak: 38 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

These may be related to the NO_BOOTMEM configuration (I sent a patch in
a separate reply).

But even when kmemleak scans the memory, it only uses a single thread
and you should only see a single CPU going to 100%. I don't think
kmemleak scanning can explain why all the 8 cores are going up to 100%.

--
Catalin
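
One way to check the single-thread argument is to compare two snapshots of /proc/stat and see how many CPUs were busy in between. The Python sketch below does this on snapshot text passed in as strings. The function names and the 90% threshold are assumptions for illustration; the field order (user, nice, system, idle, iowait, irq, softirq, steal) follows the proc(5) manual page.

```python
def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-CPU lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu".
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(value) for value in parts[1:9]]  # user .. steal
        # Count idle and iowait (when present) as "not busy".
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        result[parts[0]] = (total - idle, total)
    return result


def busy_cpus(before, after, threshold=0.9):
    """List CPUs whose busy share between two snapshots reaches the threshold."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    busy = []
    for cpu in sorted(second, key=lambda name: int(name[3:])):
        if cpu not in first:
            continue
        delta_busy = second[cpu][0] - first[cpu][0]
        delta_total = second[cpu][1] - first[cpu][1]
        if delta_total > 0 and delta_busy / delta_total >= threshold:
            busy.append(cpu)
    return busy


# Made-up snapshots: cpu0 is fully busy in the interval, cpu1 is mostly idle.
BEFORE = "cpu 200 0 0 800 0 0 0 0\ncpu0 100 0 0 400 0 0 0 0\ncpu1 100 0 0 400 0 0 0 0\n"
AFTER = "cpu 305 0 0 895 0 0 0 0\ncpu0 200 0 0 400 0 0 0 0\ncpu1 105 0 0 495 0 0 0 0\n"

hot = busy_cpus(BEFORE, AFTER)
```

With these made-up snapshots `hot` contains only `cpu0`. On a real system the two strings would be read from /proc/stat a second or so apart. A single busy CPU would fit one scanning thread, while all eight being busy would point elsewhere, as argued above.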

2010-07-14 10:00:07

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 11:55 AM, Pekka Enberg <[email protected]> wrote:
> Catalin Marinas wrote:
>>
>> On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
>>>
>>> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]>
>>> wrote:
>>>>
>>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>>
>>>>> I am attaching you the file from /sys/kernel/debug/kmemleak
>>>>
>>>> Zeno, can you post your dmesg and .config, please?
>>>
>>> Sure, see attached files.
>>
>> It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
>> need to post it again on the list).
>>
>>
>> kmemleak: Add support for NO_BOOTMEM configurations
>>
>> From: Catalin Marinas <[email protected]>
>>
>> With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
>> friends use the early_res functions for memory management when
>> NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
>> corresponding code paths for bootmem allocations.
>>
>> Signed-off-by: Catalin Marinas <[email protected]>
>> Cc: Yinghai Lu <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>
> Makes sense.
>
> Acked-by: Pekka Enberg <[email protected]>
>
> Zeno, this should fix the kmemleak false positives but not the big pauses
> you're seeing.

Thanks for this detailed info, Pekka! I will not apply the patch at the
moment. Will it be in the next RC from Linus? Or do you recommend
that I apply it?

What I want is to tame the temper of Tux and restrict him from
eating my CPU-donuts (cores) at random. I need them for other
processes. ;)

Best
Zeno

2010-07-14 10:04:13

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

Dear Catalin

On Wed, Jul 14, 2010 at 11:57 AM, Catalin Marinas
<[email protected]> wrote:
> Zeno Davatz <[email protected]> wrote:
>> I got a new Intel core-8 i7 processor.
>>
>> I am on kernel uname -a
>>
>> Linux zenogentoo 2.6.35-rc5 #97 SMP Tue Jul 13 16:13:25 CEST 2010 i686
>> Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
>>
>> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>> are at 100% CPU usage and my machine really lags and hangs and is not
>> useable anymore. Some random process just grabs a bunch CPUs according
>> to htop.
>>
>> dmesg tell me that
>>
>> kmemleak: 38 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>> kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>> kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>
> These may be related to the NO_BOOTMEM configuration (I sent a patch in
> a separate reply).
>
> But even when kmemleak scans the memory, it only uses a single thread
> and you should only see a single CPU going to 100%. I don't think
> kmemleak scanning can explain why all the 8 cores are going up to 100%.

I am doing:

/usr/src/my2.6> sudo patch -p1 < patch_catalin
patching file mm/bootmem.c
Hunk #1 FAILED at 435.
Hunk #2 FAILED at 461.
2 out of 2 hunks FAILED -- saving rejects to file mm/bootmem.c.rej
patching file mm/page_alloc.c
Hunk #1 FAILED at 3659.
1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej

Any hints why it won't apply? Will this patch be in the next RC?

Best
Zeno
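
When every hunk is rejected like this, it can help to list which file and line range each hunk expects, and then compare those ranges with the local tree. The Python sketch below extracts that information from the headers of a unified diff. The function name `list_hunks` is made up for this example, and the sketch says nothing about why a particular hunk fails.

```python
import re

# Matches hunk headers such as "@@ -435,6 +435,7 @@ void __init ...".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def list_hunks(diff_text):
    """Return (file, old_start, old_length) for every hunk in a unified diff."""
    hunks = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/mm/bootmem.c" -> "mm/bootmem.c", mirroring patch -p1.
            current_file = line[4:].split("/", 1)[-1].strip()
            continue
        match = HUNK_RE.match(line)
        if match and current_file:
            old_start = int(match.group(1))
            old_length = int(match.group(2) or 1)  # length defaults to 1
            hunks.append((current_file, old_start, old_length))
    return hunks


# File and hunk headers taken from the patch posted in this thread.
PATCH_HEADERS = """\
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -435,6 +435,7 @@ void __init free_bootmem_node(pg_data_t *pgdat, unsigned long physaddr,
@@ -460,6 +461,7 @@ void __init free_bootmem(unsigned long addr, unsigned long size)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3659,6 +3659,7 @@ void * __init __alloc_memory_core_early(int nid, u64 size, u64 align,
"""

expected = list_hunks(PATCH_HEADERS)
```

`expected` then holds three entries: lines 435 and 460 of mm/bootmem.c and line 3659 of mm/page_alloc.c, each with six lines of old context. If the local files look different at those positions, the patch was made against another tree or was altered in transit.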

2010-07-14 11:54:11

by Catalin Marinas

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, 2010-07-14 at 11:04 +0100, Zeno Davatz wrote:
> On Wed, Jul 14, 2010 at 11:57 AM, Catalin Marinas
> <[email protected]> wrote:
> > Zeno Davatz <[email protected]> wrote:
> >> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
> >> are at 100% CPU usage and my machine really lags and hangs and is not
> >> useable anymore. Some random process just grabs a bunch CPUs according
> >> to htop.
> >
> > These may be related to the NO_BOOTMEM configuration (I sent a patch in
> > a separate reply).
> >
> > But even when kmemleak scans the memory, it only uses a single thread
> > and you should only see a single CPU going to 100%. I don't think
> > kmemleak scanning can explain why all the 8 cores are going up to 100%.
>
> I am doing:
>
> /usr/src/my2.6> sudo patch -p1 < patch_catalin
> patching file mm/bootmem.c
> Hunk #1 FAILED at 435.
> Hunk #2 FAILED at 461.
> 2 out of 2 hunks FAILED -- saving rejects to file mm/bootmem.c.rej
> patching file mm/page_alloc.c
> Hunk #1 FAILED at 3659.
> 1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej
>
> Any hints why it won't apply? Will this patch be in the next RC?

The patch is against 2.6.35-rc4. I'll send it to Linus and hopefully it
will get merged during rc.

BTW, you can disable kmemleak scanning by doing:

# echo scan=off > /sys/kernel/debug/kmemleak

Do you still get that high CPU usage?

--
Catalin
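
The same control file can also be driven from a script. The Python sketch below wraps the `scan=off` command shown above and its counterpart `scan=on`, which the kernel's kmemleak documentation also lists. The helper names and the `path` parameter are assumptions for this example; the parameter exists so the code can be tried on an ordinary file.

```python
from pathlib import Path

# Location of the kmemleak control file, as quoted in this thread.
KMEMLEAK = Path("/sys/kernel/debug/kmemleak")


def set_scanning(enabled: bool, path: Path = KMEMLEAK) -> str:
    """Write scan=on or scan=off to the kmemleak control file."""
    command = "scan=on" if enabled else "scan=off"
    with open(path, "w") as control:
        control.write(command)
    return command


def read_report(path: Path = KMEMLEAK) -> str:
    """Return the current contents of the kmemleak file (the leak report)."""
    return Path(path).read_text()
```

On a real system, `set_scanning(False)` needs root privileges, a mounted debugfs and a kernel built with CONFIG_DEBUG_KMEMLEAK. It only stops the periodic scan thread; it does not remove kmemleak's other overhead.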

2010-07-14 11:59:07

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, Jul 14, 2010 at 1:54 PM, Catalin Marinas
<[email protected]> wrote:
> On Wed, 2010-07-14 at 11:04 +0100, Zeno Davatz wrote:
>> On Wed, Jul 14, 2010 at 11:57 AM, Catalin Marinas
>> <[email protected]> wrote:
>> > Zeno Davatz <[email protected]> wrote:
>> >> Sometimes in the middle of nowhere all of a sudden all of my 8-cores
>> >> are at 100% CPU usage and my machine really lags and hangs and is not
>> >> useable anymore. Some random process just grabs a bunch CPUs according
>> >> to htop.
>> >
>> > These may be related to the NO_BOOTMEM configuration (I sent a patch in
>> > a separate reply).
>> >
>> > But even when kmemleak scans the memory, it only uses a single thread
>> > and you should only see a single CPU going to 100%. I don't think
>> > kmemleak scanning can explain why all the 8 cores are going up to 100%.
>>
>> I am doing:
>>
>> /usr/src/my2.6> sudo patch -p1 < patch_catalin
>> patching file mm/bootmem.c
>> Hunk #1 FAILED at 435.
>> Hunk #2 FAILED at 461.
>> 2 out of 2 hunks FAILED -- saving rejects to file mm/bootmem.c.rej
>> patching file mm/page_alloc.c
>> Hunk #1 FAILED at 3659.
>> 1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej
>>
>> Any hints why it won't apply? Will this patch be in the next RC?
>
> The patch is against 2.6.35-rc4. I'll send it to Linus and hopefully it
> will get merged during rc.
>
> BTW, you can disable kmemleak scanning by doing:
>
> # echo scan=off > /sys/kernel/debug/kmemleak

Thank you for the hint!

> Do you still get that high CPU usage?

Not at the moment. I disabled CONFIG_NO_BOOTMEM and rebooted into the
new bzImage. No "CPU-bad-mood" at the moment.

Best
Zeno

2010-07-15 15:02:12

by Catalin Marinas

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Wed, 2010-07-14 at 10:55 +0100, Pekka Enberg wrote:
> Catalin Marinas wrote:
> > On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
> >> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
> >>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
> >
> >>>> I am attaching you the file from /sys/kernel/debug/kmemleak
> >>> Zeno, can you post your dmesg and .config, please?
> >> Sure, see attached files.
> >
> > It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
> > need to post it again on the list).
> >
> >
> > kmemleak: Add support for NO_BOOTMEM configurations
> >
> > From: Catalin Marinas <[email protected]>
> >
> > With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
> > friends use the early_res functions for memory management when
> > NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
> > corresponding code paths for bootmem allocations.
> >
> > Signed-off-by: Catalin Marinas <[email protected]>
> > Cc: Yinghai Lu <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
>
> Makes sense.
>
> Acked-by: Pekka Enberg <[email protected]>

I'll post an updated patch since I missed a callback. I've been testing
it since yesterday and it seems OK.

Thanks.

--
Catalin

2010-07-15 15:15:43

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

Dear Catalin

On Thu, Jul 15, 2010 at 4:58 PM, Catalin Marinas
<[email protected]> wrote:
> On Wed, 2010-07-14 at 10:55 +0100, Pekka Enberg wrote:
>> Catalin Marinas wrote:
>> > On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
>> >> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
>> >>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>> >
>> >>>> I am attaching you the file from /sys/kernel/debug/kmemleak
>> >>> Zeno, can you post your dmesg and .config, please?
>> >> Sure, see attached files.
>> >
>> > It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
>> > need to post it again on the list).
>> >
>> >
>> > kmemleak: Add support for NO_BOOTMEM configurations
>> >
>> > From: Catalin Marinas <[email protected]>
>> >
>> > With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
>> > friends use the early_res functions for memory management when
>> > NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
>> > corresponding code paths for bootmem allocations.
>> >
>> > Signed-off-by: Catalin Marinas <[email protected]>
>> > Cc: Yinghai Lu <[email protected]>
>> > Cc: H. Peter Anvin <[email protected]>
>>
>> Makes sense.
>>
>> Acked-by: Pekka Enberg <[email protected]>
>
> I'll post an updated patch since I missed a callback. I've been testing
> it since yesterday and seems ok.

Since I disabled CONFIG_NO_BOOTMEM, I have not had any more hangs or random
"bad moods" in which all 8 cores of my machine suddenly jump to 100% usage.

Best
Zeno

2010-07-15 15:55:12

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

Zeno Davatz wrote:
> Dear Catalin
>
> On Thu, Jul 15, 2010 at 4:58 PM, Catalin Marinas
> <[email protected]> wrote:
>> On Wed, 2010-07-14 at 10:55 +0100, Pekka Enberg wrote:
>>> Catalin Marinas wrote:
>>>> On Wed, 2010-07-14 at 09:27 +0100, Zeno Davatz wrote:
>>>>> On Wed, Jul 14, 2010 at 10:05 AM, Pekka Enberg <[email protected]> wrote:
>>>>>> On Wed, Jul 14, 2010 at 9:12 AM, Zeno Davatz <[email protected]> wrote:
>>>>>>> I am attaching you the file from /sys/kernel/debug/kmemleak
>>>>>> Zeno, can you post your dmesg and .config, please?
>>>>> Sure, see attached files.
>>>> It looks like NO_BOOTMEM is enabled. You can try the attached patch (I
>>>> need to post it again on the list).
>>>>
>>>>
>>>> kmemleak: Add support for NO_BOOTMEM configurations
>>>>
>>>> From: Catalin Marinas <[email protected]>
>>>>
>>>> With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
>>>> friends use the early_res functions for memory management when
>>>> NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
>>>> corresponding code paths for bootmem allocations.
>>>>
>>>> Signed-off-by: Catalin Marinas <[email protected]>
>>>> Cc: Yinghai Lu <[email protected]>
>>>> Cc: H. Peter Anvin <[email protected]>
>>> Makes sense.
>>>
>>> Acked-by: Pekka Enberg <[email protected]>
>> I'll post an updated patch since I missed a callback. I've been testing
>> it since yesterday and seems ok.
>
> I also did not have anymore hangs and random bad moods of my CPUs that
> all of a sudden grab 100% of all 8 cores of my CPU power across my
> machine since I disabled
>
> CONFIG_NO_BOOTMEM:

Interesting. Damien, does disabling CONFIG_NO_BOOTMEM fix your problem too?

2010-07-15 16:28:11

by Damien Wyart

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

Hello,

> > I also did not have anymore hangs and random bad moods of my CPUs
> > that all of a sudden grab 100% of all 8 cores of my CPU power across
> > my machine since I disabled
> > CONFIG_NO_BOOTMEM:

* Pekka Enberg <[email protected]> [2010-07-15 18:54]:
> Interesting. Damien, does disabling CONFIG_NO_BOOTMEM fix your problem too?

I will test in the coming hours, and report back tomorrow... Just
recompiled 2.6.35-rc5-git1 with this option disabled.

--
Damien

2010-07-15 19:17:09

by Damien Wyart

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

> > > I also did not have anymore hangs and random bad moods of my CPUs
> > > that all of a sudden grab 100% of all 8 cores of my CPU power across
> > > my machine since I disabled
> > > CONFIG_NO_BOOTMEM:

> * Pekka Enberg <[email protected]> [2010-07-15 18:54]:
> > Interesting. Damien, does disabling CONFIG_NO_BOOTMEM fix your problem too?

> I will test in the coming hours, and report back tomorrow... Just
> recompiled 2.6.35-rc5-git1 with this option disabled.

For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
with the option and rc5 the problem was happening quite quickly after
boot and normal use of the machine. So it seems I can confirm what Zeno
has seen and I hope this will give a hint to debug the problem. I guess
this has not been reported that much because many testers might not have
enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
benchmark with a kernel having this option enabled?

--
Damien

2010-07-15 19:50:09

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 10:16 PM, Damien Wyart <[email protected]> wrote:
>> > > I also did not have anymore hangs and random bad moods of my CPUs
>> > > that all of a sudden grab 100% of all 8 cores of my CPU power across
>> > > my machine since I disabled
>> > > CONFIG_NO_BOOTMEM:
>
>> * Pekka Enberg <[email protected]> [2010-07-15 18:54]:
>> > Interesting. Damien, does disabling CONFIG_NO_BOOTMEM fix your problem too?
>
>> I will test in the coming hours, and report back tomorrow... Just
>> recompiled 2.6.35-rc5-git1 with this option disabled.
>
> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
> with the option and rc5 the problem was happening quite quickly after
> boot and normal use of the machine. So it seems I can confirm what Zeno
> has seen and I hope this will give a hint to debug the problem. I guess
> this has not been reported that much because many testers might not have
> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
> benchmark with a kernel having this option enabled?

To be honest, the bug is a bit odd. It's related to boot-time memory
allocator changes but yet it seems to manifest itself as a scheduling
problem. So if you have some spare time and want to speed up the
debugging process, please test v2.6.34 and v2.6.35-rc1 with
CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
if you can identify the offending commit with "git bisect."

Pekka

2010-07-15 20:00:17

by Damien Wyart

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

> > For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
> > with the option and rc5 the problem was happening quite quickly after
> > boot and normal use of the machine. So it seems I can confirm what Zeno
> > has seen and I hope this will give a hint to debug the problem. I guess
> > this has not been reported that much because many testers might not have
> > enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
> > benchmark with a kernel having this option enabled?

* Pekka Enberg <[email protected]> [2010-07-15 22:50]:
> To be honest, the bug is a bit odd. It's related to boot-time memory
> allocator changes but yet it seems to manifest itself as a scheduling
> problem. So if you have some spare time and want to speed up the
> debugging process, please test v2.6.34 and v2.6.35-rc1 with
> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
> if you can identify the offending commit with "git bisect."

Not sure I will have enough time in the coming days (doing that remotely
is fishy since ssh access is almost stuck when the problem occurs); if
Zeno can and would like to do it, maybe this could be done faster.

As the scheduler is now very well instrumented (many debugging features
are available), reproducing the bug on a test platform (it happens quite
quickly for me) might also give some hints. So testers, if you have
time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
if you can reproduce the problem!

--
Damien

2010-07-15 20:38:45

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:

>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>> with the option and rc5 the problem was happening quite quickly after
>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>> has seen and I hope this will give a hint to debug the problem. I guess
>>> this has not been reported that much because many testers might not have
>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>> benchmark with a kernel having this option enabled?
>
> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>> To be honest, the bug is a bit odd. It's related to boot-time memory
>> allocator changes but yet it seems to manifest itself as a scheduling
>> problem. So if you have some spare time and want to speed up the
>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>> if you can identify the offending commit with "git bisect."
>
> Not sure I will have enough time in the coming days (doing that remotely
> is fishy since ssh access is almost stuck when the problem occurs); if
> Zeno can and would like to do it, maybe this could be done faster.
>
> As the scheduler is now very well instrumented (many debugging features
> are available), reproducing the bug on a test platform (it happens quite
> quickly for me) might also give some hints. So testers, if you have
> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
> if you can reproduce the problem!

Will try to do so. Can you point me to the git bisect howto with the versions you want?

Best
Zeno

2010-07-15 20:50:17

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 11:38 PM, Zeno Davatz <[email protected]> wrote:
> On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:
>
>>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>>> with the option and rc5 the problem was happening quite quickly after
>>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>>> has seen and I hope this will give a hint to debug the problem. I guess
>>>> this has not been reported that much because many testers might not have
>>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>>> benchmark with a kernel having this option enabled?
>>
>> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>>> To be honest, the bug is a bit odd. It's related to boot-time memory
>>> allocator changes but yet it seems to manifest itself as a scheduling
>>> problem. So if you have some spare time and want to speed up the
>>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>>> if you can identify the offending commit with "git bisect."
>>
>> Not sure I will have enough time in the coming days (doing that remotely
>> is fishy since ssh access is almost stuck when the problem occurs); if
>> Zeno can and would like to do it, maybe this could be done faster.
>>
>> As the scheduler is now very well instrumented (many debugging features
>> are available), reproducing the bug on a test platform (it happens quite
>> quickly for me) might also give some hints. So testers, if you have
>> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>> if you can reproduce the problem!
>
> Will try to do so. Can you point me to the git bisect howto with the versions you want.

Cool. So like I said, you first want to test 2.6.34 to find a known
good version. Please remember to make sure you have CONFIG_NO_BOOTMEM
enabled. You can also try to speed up the process by testing
2.6.35-rc1 which is likely to include the offending commit. That's not
strictly necessary as long as you are sure that you have some
2.6.35-rc kernel that's bad.

After that, bisecting is as simple as:

git bisect start
git bisect good v2.6.34
git bisect bad v2.6.35-rc1 # or some other kernel you know to be bad
<compile, boot, and try to trigger the problem>

then

git bisect bad # if you were able to trigger the problem

or

git bisect good # if the problem doesn't exist

git will then find the next revision to test after which you do

<compile, boot, and try to trigger the problem>

and repeat the "git bisect good/bad" step until git tells you it has
found the offending commit.
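
To get a feel for how many compile/boot cycles the loop above takes: each
good/bad answer roughly halves the remaining range, so the count grows with
the base-2 logarithm of the number of commits between the two endpoints. A
rough sketch; the function name bisect_steps and the figure of 8000 commits
are made-up illustrations, not the real distance between the two tags, and
git's own estimate can differ by a step.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many bisection rounds are needed to
# narrow $1 candidate commits down to a single one (ceiling of log2).
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))    # each round keeps at most half, rounded up
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Usage:
#   bisect_steps 8000    # prints 13
```

Every "git bisect skip" adds rounds on top of this, since a skipped commit
gives git no information about which half to keep.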

There's more information on the git bisect man pages:

http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html

Let me know if you need more help with this.

Pekka

2010-07-15 20:52:31

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 11:00 PM, Damien Wyart <[email protected]> wrote:
>> > For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>> > with the option and rc5 the problem was happening quite quickly after
>> > boot and normal use of the machine. So it seems I can confirm what Zeno
>> > has seen and I hope this will give a hint to debug the problem. I guess
>> > this has not been reported that much because many testers might not have
>> > enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>> > benchmark with a kernel having this option enabled?
>
> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>> To be honest, the bug is a bit odd. It's related to boot-time memory
>> allocator changes but yet it seems to manifest itself as a scheduling
>> problem. So if you have some spare time and want to speed up the
>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>> if you can identify the offending commit with "git bisect."
>
> Not sure I will have enough time in the coming days (doing that remotely
> is fishy since ssh access is almost stuck when the problem occurs); if
> Zeno can and would like to do it, maybe this could be done faster.
>
> As the scheduler is now very well instrumented (many debugging features
> are available), reproducing the bug on a test platform (it happens quite
> quickly for me) might also give some hints. So testers, if you have
> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
> if you can reproduce the problem!

Yeah, there's a "perf sched" tool available for that:

http://lwn.net/Articles/353295/

The only problem is that we'd need a scheduler hacker to decipher the
report and all of them seem to be missing at the moment (probably at
OLS). Anyway, like I said, git bisect will probably speed up the
debugging process, that's all.

Pekka

2010-07-15 20:57:21

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On 15.07.2010 at 22:50, Pekka Enberg <[email protected]> wrote:

> On Thu, Jul 15, 2010 at 11:38 PM, Zeno Davatz <[email protected]> wrote:
>> On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:
>>
>>>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>>>> with the option and rc5 the problem was happening quite quickly after
>>>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>>>> has seen and I hope this will give a hint to debug the problem. I guess
>>>>> this has not been reported that much because many testers might not have
>>>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>>>> benchmark with a kernel having this option enabled?
>>>
>>> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>>>> To be honest, the bug is a bit odd. It's related to boot-time memory
>>>> allocator changes but yet it seems to manifest itself as a scheduling
>>>> problem. So if you have some spare time and want to speed up the
>>>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>>>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>>>> if you can identify the offending commit with "git bisect."
>>>
>>> Not sure I will have enough time in the coming days (doing that remotely
>>> is fishy since ssh access is almost stuck when the problem occurs); if
>>> Zeno can and would like to do it, maybe this could be done faster.
>>>
>>> As the scheduler is now very well instrumented (many debugging features
>>> are available), reproducing the bug on a test platform (it happens quite
>>> quickly for me) might also give some hints. So testers, if you have
>>> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>>> if you can reproduce the problem!
>>
>> Will try to do so. Can you point me to the git bisect howto with the versions you want.
>
> Cool. So like I said, you first want to test 2.6.34 to find a known
> good version. Please remember to make sure you have CONFIG_NO_BOOTMEM
> enabled. You can also try to speed up the process by testing
> 2.6.35-rc1 which is likely to include the offending commit. That's not
> strictly necessary as long as you are sure that you have some
> 2.6.35-rc kernel that's bad.
>
> After that, bisecting is as simple as:
>
> git bisect start
> git bisect good v2.6.34
> git bisect bad v2.6.35-rc1 # or some other kernel you know to be bad
> <compile, boot, and try to trigger the problem>
>
> then
>
> git bisect bad # if you were able to trigger the problem
>
> or
>
> git bisect good # if the problem doesn't exist
>
> git will then find the next revision to test after which you do
>
> <compile, boot, and try to trigger the problem>
>
> and repeat the "git bisect good/bad" step until git tells you it has
> found the offending commit.
>
> There's more information on the git bisect man pages:
>
> http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
>
> Let me know if you need more help with this.

Ok, thanks for the guidance, will start some time tomorrow. Hope to make it in the morning.

Best
Zeno

2010-07-16 07:12:51

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 10:50 PM, Pekka Enberg <[email protected]> wrote:
> On Thu, Jul 15, 2010 at 11:38 PM, Zeno Davatz <[email protected]> wrote:
>> On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:
>>
>>>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>>>> with the option and rc5 the problem was happening quite quickly after
>>>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>>>> has seen and I hope this will give a hint to debug the problem. I guess
>>>>> this has not been reported that much because many testers might not have
>>>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>>>> benchmark with a kernel having this option enabled?
>>>
>>> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>>>> To be honest, the bug is a bit odd. It's related to boot-time memory
>>>> allocator changes but yet it seems to manifest itself as a scheduling
>>>> problem. So if you have some spare time and want to speed up the
>>>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>>>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>>>> if you can identify the offending commit with "git bisect."
>>>
>>> Not sure I will have enough time in the coming days (doing that remotely
>>> is fishy since ssh access is almost stuck when the problem occurs); if
>>> Zeno can and would like to do it, maybe this could be done faster.
>>>
>>> As the scheduler is now very well instrumented (many debugging features
>>> are available), reproducing the bug on a test platform (it happens quite
>>> quickly for me) might also give some hints. So testers, if you have
>>> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>>> if you can reproduce the problem!
>>
>> Will try to do so. Can you point me to the git bisect howto with the versions you want.
>
> Cool. So like I said, you first want to test 2.6.34 to find a known
> good version. Please remember to make sure you have CONFIG_NO_BOOTMEM
> enabled. You can also try to speed up the process by testing
> 2.6.35-rc1 which is likely to include the offending commit. That's not
> strictly necessary as long as you are sure that you have some
> 2.6.35-rc kernel that's bad.
>
> After that, bisecting is as simple as:
>
> git bisect start
> git bisect good v2.6.34
> git bisect bad v2.6.35-rc1 # or some other kernel you know to be bad
> <compile, boot, and try to trigger the problem>
>
> then
>
> git bisect bad # if you were able to trigger the problem
>
> or
>
> git bisect good # if the problem doesn't exist
>
> git will then find the next revision to test after which you do
>
> <compile, boot, and try to trigger the problem>
>
> and repeat the "git bisect good/bad" step until git tells you it has
> found the offending commit.
>
> There's more information on the git bisect man pages:
>
> http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
>
> Let me know if you need more help with this.

OK, something sure is wrong with 2.6.34-rc8: I could not boot it after
doing the usual git bisect step, copying the bzImage and then running lilo -v.

http://www.flickr.com/photos/zrr/4798077725/

I am gonna continue bisecting. 2.6.34-rc7 is fine. No CPU eaters around.

Best
Zeno

2010-07-16 07:29:45

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 10:50 PM, Pekka Enberg <[email protected]> wrote:
> On Thu, Jul 15, 2010 at 11:38 PM, Zeno Davatz <[email protected]> wrote:
>> On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:
>>
>>>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>>>> with the option and rc5 the problem was happening quite quickly after
>>>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>>>> has seen and I hope this will give a hint to debug the problem. I guess
>>>>> this has not been reported that much because many testers might not have
>>>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>>>> benchmark with a kernel having this option enabled?
>>>
>>> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>>>> To be honest, the bug is a bit odd. It's related to boot-time memory
>>>> allocator changes but yet it seems to manifest itself as a scheduling
>>>> problem. So if you have some spare time and want to speed up the
>>>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>>>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>>>> if you can identify the offending commit with "git bisect."
>>>
>>> Not sure I will have enough time in the coming days (doing that remotely
>>> is fishy since ssh access is almost stuck when the problem occurs); if
>>> Zeno can and would like to do it, maybe this could be done faster.
>>>
>>> As the scheduler is now very well instrumented (many debugging features
>>> are available), reproducing the bug on a test platform (it happens quite
>>> quickly for me) might also give some hints. So testers, if you have
>>> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>>> if you can reproduce the problem!
>>
>> Will try to do so. Can you point me to the git bisect howto with the versions you want.
>
> Cool. So like I said, you first want to test 2.6.34 to find a known
> good version. Please remember to make sure you have CONFIG_NO_BOOTMEM
> enabled. You can also try to speed up the process by testing
> 2.6.35-rc1 which is likely to include the offending commit. That's not
> strictly necessary as long as you are sure that you have some
> 2.6.35-rc kernel that's bad.
>
> After that, bisecting is as simple as:
>
> git bisect start
> git bisect good v2.6.34
> git bisect bad v2.6.35-rc1 # or some other kernel you know to be bad
> <compile, boot, and try to trigger the problem>
>
> then
>
> git bisect bad # if you were able to trigger the problem
>
> or
>
> git bisect good # if the problem doesn't exist
>
> git will then find the next revision to test after which you do
>
> <compile, boot, and try to trigger the problem>
>
> and repeat the "git bisect good/bad" step until git tells you it has
> found the offending commit.
>
> There's more information on the git bisect man pages:
>
> http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
>
> Let me know if you need more help with this.

This one also causes a panic:

http://www.flickr.com/photos/zrr/4798092747/in/photostream/

but this version boots just fine again:

Linux zenogentoo 2.6.34-05459-gac3ee84 #102 SMP Fri Jul 16 09:22:25
CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel
GNU/Linux

Best
Zeno

2010-07-16 07:37:53

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, Jul 15, 2010 at 10:50 PM, Pekka Enberg <[email protected]> wrote:
> On Thu, Jul 15, 2010 at 11:38 PM, Zeno Davatz <[email protected]> wrote:
>> On 15.07.2010 at 22:00, Damien Wyart <[email protected]> wrote:
>>
>>>>> For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled;
>>>>> with the option and rc5 the problem was happening quite quickly after
>>>>> boot and normal use of the machine. So it seems I can confirm what Zeno
>>>>> has seen and I hope this will give a hint to debug the problem. I guess
>>>>> this has not been reported that much because many testers might not have
>>>>> enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>>>>> benchmark with a kernel having this option enabled?
>>>
>>> * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>>>> To be honest, the bug is a bit odd. It's related to boot-time memory
>>>> allocator changes but yet it seems to manifest itself as a scheduling
>>>> problem. So if you have some spare time and want to speed up the
>>>> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>>>> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>>>> if you can identify the offending commit with "git bisect."
>>>
>>> Not sure I will have enough time in the coming days (doing that remotely
>>> is fishy since ssh access is almost stuck when the problem occurs); if
>>> Zeno can and would like to do it, maybe this could be done faster.
>>>
>>> As the scheduler is now very well instrumented (many debugging features
>>> are available), reproducing the bug on a test platform (it happens quite
>>> quickly for me) might also give some hints. So testers, if you have
>>> time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>>> if you can reproduce the problem!
>>
>> Will try to do so. Can you point me to the git bisect howto with the versions you want.
>
> Cool. So like I said, you first want to test 2.6.34 to find a known
> good version. Please remember to make sure you have CONFIG_NO_BOOTMEM
> enabled. You can also try to speed up the process by testing
> 2.6.35-rc1 which is likely to include the offending commit. That's not
> strictly necessary as long as you are sure that you have some
> 2.6.35-rc kernel that's bad.
>
> After that, bisecting is as simple as:
>
> git bisect start
> git bisect good v2.6.34
> git bisect bad v2.6.35-rc1 # or some other kernel you know to be bad
> <compile, boot, and try to trigger the problem>
>
> then
>
> git bisect bad # if you were able to trigger the problem
>
> or
>
> git bisect good # if the problem doesn't exist
>
> git will then find the next revision to test after which you do
>
> <compile, boot, and try to trigger the problem>
>
> and repeat the "git bisect good/bad" step until git tells you it has
> found the offending commit.
>
> There's more information on the git bisect man pages:
>
> http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
>
> Let me know if you need more help with this.

The next RC again hangs on me:

http://www.flickr.com/photos/zrr/4798744700/sizes/l/

Regards
Zeno

2010-07-16 07:50:42

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, Jul 16, 2010 at 10:37 AM, Zeno Davatz <[email protected]> wrote:
>> Let me know if you need more help with this.
>
> The next RC again hangs on me:
>
> http://www.flickr.com/photos/zrr/4798744700/sizes/l/

Doesn't look like a kernel bug to me. Maybe some Gentoo person knows
better but the 'root' parameter you pass to the kernel in your lilo
configuration looks a little strange. You should try passing
"/dev/sdXX" to it where XX is whatever partition your root filesystems
is on.

2010-07-16 09:17:59

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, Jul 16, 2010 at 9:50 AM, Pekka Enberg <[email protected]> wrote:
> On Fri, Jul 16, 2010 at 10:37 AM, Zeno Davatz <[email protected]> wrote:
>>> Let me know if you need more help with this.
>>
>> The next RC again hangs on me:
>>
>> http://www.flickr.com/photos/zrr/4798744700/sizes/l/
>
> Doesn't look like a kernel bug to me. Maybe some Gentoo person knows
> better but the 'root' parameter you pass to the kernel in your lilo
> configuration looks a little strange. You should try passing
> "/dev/sdXX" to it where XX is whatever partition your root filesystem
> is on.

This version has some problem with the DRM but no CPU eater yet.

http://www.flickr.com/photos/zrr/4798885756/

This version boots again just fine:

Linux zenogentoo 2.6.34-rc5-00059-gc2b4127 #105 SMP Fri Jul 16
11:13:21 CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
GenuineIntel GNU/Linux

As I understand I am bisecting upwards. Every time it does not boot
correctly I do

git bisect bad after the next boot.

Best
Zeno

2010-07-16 09:32:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

Hi Zeno,

Zeno Davatz wrote:
> This version has some problem with the DRM but no CPU eater yet.
>
> http://www.flickr.com/photos/zrr/4798885756/
>
> This version boots again just fine:
>
> Linux zenogentoo 2.6.34-rc5-00059-gc2b4127 #105 SMP Fri Jul 16
> 11:13:21 CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
> GenuineIntel GNU/Linux

You're going in the wrong direction. If 2.6.34-rc7 works just fine,
you shouldn't be testing 2.6.34-rc5.

>
> As I understand I am bisecting upwards. Every time it does not boot
> correctly I do
>
> git bisect bad after the next boot.

No, you should only do "git bisect bad" if you find a CPU eater and "git
bisect good" if you don't. For the non-booting kernels you should do
"git bisect skip"; otherwise git gets confused as we can see here.

Did you test v2.6.35-rc1? Does it have the CPU eater problem? If yes,
please just reset your bisection

git bisect reset
git bisect start
git bisect good v2.6.34-rc7
git bisect bad v2.6.35-rc1

and use 'git bisect skip' for kernels that don't boot or build.
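
The rule above can be written down as a small lookup so it is harder to mix
up in the middle of a long session. This is just a sketch; the function name
bisect_verdict and the outcome labels (cpu_eater, clean, no_boot, no_build)
are invented for illustration. It only prints the command to run.

```shell
#!/bin/sh
# Hypothetical helper: map what was observed on the test kernel to the
# matching git bisect command.
bisect_verdict() {
    case "$1" in
        cpu_eater)        echo "git bisect bad" ;;   # the 100% CPU problem showed up
        clean)            echo "git bisect good" ;;  # booted and no CPU eater seen
        no_boot|no_build) echo "git bisect skip" ;;  # cannot be judged either way
        *)                echo "unknown outcome: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   bisect_verdict no_boot    # prints: git bisect skip
```

Marking a non-booting kernel "bad" tells git that the CPU problem is present
in that commit, which is why the earlier bisection drifted back to
2.6.34-rc5.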

Pekka

2010-07-16 09:42:41

by Zeno Davatz

[permalink] [raw]
Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, Jul 16, 2010 at 11:32 AM, Pekka Enberg <[email protected]> wrote:

> No, you should only do "git bisect bad" if you find a CPU eater and "git
> bisect good" if you don't. For the non-booting kernels you should do "git
> bisect skip"; otherwise git gets confused as we can see here.
>
> Did you test v2.6.35-rc1? Does it have the CPU eater problem? If yes, please
> just reset your bisection
>
> git bisect reset
> git bisect start
> git bisect good v2.6.34-rc7
> git bisect bad v2.6.35-rc1
>
> and use 'git bisect skip' for kernels that don't boot or build.

OK, I did the above and rebooted into the new bzImage:

This version looks fine.

Linux zenogentoo 2.6.34-04401-gf896546 #106 SMP Fri Jul 16 11:37:04
CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel
GNU/Linux

Best
Zeno

2010-07-16 09:47:42

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, Jul 16, 2010 at 11:32 AM, Pekka Enberg <[email protected]> wrote:

> Zeno Davatz wrote:
>>
>> This version has some problem with the DRM but no CPU eater yet.
>>
>> http://www.flickr.com/photos/zrr/4798885756/
>>
>> This version boots again just fine:
>>
>> Linux zenogentoo 2.6.34-rc5-00059-gc2b4127 #105 SMP Fri Jul 16
>> 11:13:21 CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
>> GenuineIntel GNU/Linux
>
> You're going in the wrong direction. If 2.6.34-rc7 works just fine, you
> shouldn't be testing 2.6.34-rc5.
>
>>
>> As I understand I am bisecting upwards. Every time it does not boot
>> correctly I do
>>
>> git bisect bad after the next boot.
>
> No, you should only do "git bisect bad" if you find a CPU eater and "git
> bisect good" if you don't. For the non-booting kernels you should do "git
> bisect skip"; otherwise git gets confused as we can see here.
>
> Did you test v2.6.35-rc1? Does it have the CPU eater problem? If yes, please
> just reset your bisection
>
> git bisect reset
> git bisect start
> git bisect good v2.6.34-rc7
> git bisect bad v2.6.35-rc1
>
> and use 'git bisect skip' for kernels that don't boot or build.

This one looks good too:

Linux zenogentoo 2.6.34-06562-gd79df0b #107 SMP Fri Jul 16 11:44:36
CEST 2010 i686 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel
GNU/Linux

Best
Zeno

2010-07-16 18:32:39

by Yinghai Lu

Subject: Re: kmemleak, cpu usage jump out of nowhere

It seems that you are using a 32-bit kernel. Please check whether this one helps.

Thanks

Yinghai

Subject: [PATCH -v3] x86,mm: fix 32bit numa sparsemem

Borislav Petkov <[email protected]> reported that his 32-bit NUMA system has a problem:

[ 0.000000] Reserving total of 4c00 pages for numa KVA remap
[ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
[ 0.000000] max_pfn = 238000
[ 0.000000] 8202MB HIGHMEM available.
[ 0.000000] 885MB LOWMEM available.
[ 0.000000] mapped low ram: 0 - 375fe000
[ 0.000000] low ram: 0 - 375fe000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
[ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
[ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
[ 0.000000] BUG: unable to handle kernel paging request at 40000000
[ 0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
[ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
...
[ 0.000000] Call Trace:
[ 0.000000] [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
[ 0.000000] [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
[ 0.000000] [<c2c9149e>] ? sparse_init+0x1dc/0x499
[ 0.000000] [<c2c79118>] ? paging_init+0x168/0x1df
[ 0.000000] [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb

It looks like it allocates too high an address for bootmem.

Try to cut the limit with get_max_mapped().

-v3: make alloc_bootmem_node able to fall back to another node,
just like the old alloc_bootmem_node did.

This patch is needed for 2.6.34 and 2.6.35.

Reported-by: Borislav Petkov <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: [email protected]

---
 mm/bootmem.c    |   24 ++++++++++++++++++++----
 mm/page_alloc.c |    3 +++
 2 files changed, 23 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -3634,6 +3634,9 @@ void * __init __alloc_memory_core_early(
 	int i;
 	void *ptr;

+	if (limit > get_max_mapped())
+		limit = get_max_mapped();
+
 	/* need to go over early_node_map to find out good range for node */
 	for_each_active_range_index_in_nid(i, nid) {
 		u64 addr;
Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -833,15 +833,24 @@ static void * __init ___alloc_bootmem_no
 void * __init __alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
 				   unsigned long align, unsigned long goal)
 {
+	void *ptr;
+
 	if (WARN_ON_ONCE(slab_is_available()))
 		return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);

 #ifdef CONFIG_NO_BOOTMEM
-	return __alloc_memory_core_early(pgdat->node_id, size, align,
+	ptr = __alloc_memory_core_early(pgdat->node_id, size, align,
+					 goal, -1ULL);
+	if (ptr)
+		return ptr;
+
+	ptr = __alloc_memory_core_early(MAX_NUMNODES, size, align,
 					 goal, -1ULL);
 #else
-	return ___alloc_bootmem_node(pgdat->bdata, size, align, goal, 0);
+	ptr = ___alloc_bootmem_node(pgdat->bdata, size, align, goal, 0);
 #endif
+
+	return ptr;
 }

 void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
@@ -977,14 +986,21 @@ void * __init __alloc_bootmem_low(unsign
 void * __init __alloc_bootmem_low_node(pg_data_t *pgdat, unsigned long size,
 				       unsigned long align, unsigned long goal)
 {
+	void *ptr;
+
 	if (WARN_ON_ONCE(slab_is_available()))
 		return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);

 #ifdef CONFIG_NO_BOOTMEM
-	return __alloc_memory_core_early(pgdat->node_id, size, align,
+	ptr = __alloc_memory_core_early(pgdat->node_id, size, align,
+				goal, ARCH_LOW_ADDRESS_LIMIT);
+	if (ptr)
+		return ptr;
+	ptr = __alloc_memory_core_early(MAX_NUMNODES, size, align,
 				goal, ARCH_LOW_ADDRESS_LIMIT);
 #else
-	return ___alloc_bootmem_node(pgdat->bdata, size, align,
+	ptr = ___alloc_bootmem_node(pgdat->bdata, size, align,
 				goal, ARCH_LOW_ADDRESS_LIMIT);
 #endif
+	return ptr;
 }

2010-07-16 20:30:01

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere


Am 16.07.2010 um 20:27 schrieb Yinghai Lu <[email protected]>:

> It seems that you are using a 32-bit kernel. Please check whether this one helps.

Thanks! What RC should I patch? 2.6.35-rc5?

Best
Zeno

2010-07-16 21:03:43

by Yinghai Lu

Subject: Re: kmemleak, cpu usage jump out of nowhere

On 07/16/2010 01:29 PM, Zeno Davatz wrote:
>
> Am 16.07.2010 um 20:27 schrieb Yinghai Lu <[email protected]>:
>
>> It seems that you are using a 32-bit kernel. Please check whether this one helps.
>
> Thanks! What RC should I patch? 2.6.35-rc5?

current linus tree or 2.6.35-rc5.

Thanks

Yinghai

2010-07-17 08:46:30

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, Jul 16, 2010 at 10:59 PM, Yinghai Lu <[email protected]> wrote:
> On 07/16/2010 01:29 PM, Zeno Davatz wrote:
>>
>> Am 16.07.2010 um 20:27 schrieb Yinghai Lu <[email protected]>:
>>
>>> It seems that you are using a 32-bit kernel. Please check whether this one helps.
>>
>> Thanks! What RC should I patch? 2.6.35-rc5?
>
> current linus tree or 2.6.35-rc5.

Tried to patch 2.6.35-rc5 but I get:

/usr/src/my2.6> sudo patch -p1 < patch_yinghai
patching file mm/page_alloc.c
Hunk #1 FAILED at 3634.
1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej
patching file mm/bootmem.c
Hunk #1 FAILED at 833.
Hunk #2 FAILED at 986.
2 out of 2 hunks FAILED -- saving rejects to file mm/bootmem.c.rej

Any hints?

It seems that 2.6.35-rc1 does not have the CPU eater problem. I am still
running it without any trouble (though I was away from the
computer for some time).

Best
Zeno

2010-08-03 09:06:44

by Peter Zijlstra

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Thu, 2010-07-15 at 23:52 +0300, Pekka Enberg wrote:
> On Thu, Jul 15, 2010 at 11:00 PM, Damien Wyart <[email protected]> wrote:
> >> > For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled ;
> >> > with the option and rc5 the problem was happening quite quickly after
> >> > boot and normal use of the machine. So it seems I can confirm what Zeno
> >> > has seen and I hope this will give a hint to debug the problem. I guess
> >> > this has not been reported that much because many testers might not have
> >> > enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
> >> > benchmark with a kernel having this option enabled?
> >
> > * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
> >> To be honest, the bug is a bit odd. It's related to boot-time memory
> >> allocator changes but yet it seems to manifest itself as a scheduling
> >> problem. So if you have some spare time and want to speed up the
> >> debugging process, please test v2.6.34 and v2.6.35-rc1 with
> >> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
> >> if you can identify the offending commit with "git bisect."
> >
> > Not sure I will have enough time in the coming days (doing that remotely
> > is fishy since ssh access is almost stuck when the problem occurs); if
> > Zeno can and would like to do it, maybe this could be done faster.
> >
> > As the scheduler is now very well instrumented (many debugging features
> > are available), reproducing the bug on a test platform (it happens quite
> > quickly for me) might also give some hints. So testers, if you have
> > time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
> > if you can reproduce the problem!
>
> Yeah, there's "perf sched" tool available for that:
>
> http://lwn.net/Articles/353295/
>
> The only problem is that we'd need a scheduler hacker to decipher the
> report and all of them seem to be missing at the moment (probably at
> OLS). Anyway, like I said, git bisect will probably speed up the
> debugging process, that's all.

Vacation.. but now I'm back ;-)

Even something simple as: perf top -r 1 (make sure you're root in order
to run with real-time prios) could give a clue as to what is consuming
all your cpu-time.

Or did the issue get sorted already?

2010-08-03 09:11:11

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Tue, Aug 3, 2010 at 11:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2010-07-15 at 23:52 +0300, Pekka Enberg wrote:
>> On Thu, Jul 15, 2010 at 11:00 PM, Damien Wyart <[email protected]> wrote:
>> >> > For now, I can't reproduce the problem with CONFIG_NO_BOOTMEM disabled ;
>> >> > with the option and rc5 the problem was happening quite quickly after
>> >> > boot and normal use of the machine. So it seems I can confirm what Zeno
>> >> > has seen and I hope this will give a hint to debug the problem. I guess
>> >> > this has not been reported that much because many testers might not have
>> >> > enabled CONFIG_NO_BOOTMEM... Maybe the scheduler folks could test their
>> >> > benchmark with a kernel having this option enabled?
>> >
>> > * Pekka Enberg <[email protected]> [2010-07-15 22:50]:
>> >> To be honest, the bug is a bit odd. It's related to boot-time memory
>> >> allocator changes but yet it seems to manifest itself as a scheduling
>> >> problem. So if you have some spare time and want to speed up the
>> >> debugging process, please test v2.6.34 and v2.6.35-rc1 with
>> >> CONFIG_NO_BOOTMEM and if former is good and latter is bad, try to see
>> >> if you can identify the offending commit with "git bisect."
>> >
>> > Not sure I will have enough time in the coming days (doing that remotely
>> > is fishy since ssh access is almost stuck when the problem occurs); if
>> > Zeno can and would like to do it, maybe this could be done faster.
>> >
>> > As the scheduler is now very well instrumented (many debugging features
>> > are available), reproducing the bug on a test platform (it happens quite
>> > quickly for me) might also give some hints. So testers, if you have
>> > time, please test 2.6.35-rc5 with CONFIG_NO_BOOTMEM on a Core i7 and see
>> > if you can reproduce the problem!
>>
>> Yeah, there's "perf sched" tool available for that:
>>
>> http://lwn.net/Articles/353295/
>>
>> The only problem is that we'd need a scheduler hacker to decipher the
>> report and all of them seem to be missing at the moment (probably at
>> OLS). Anyway, like I said, git bisect will probably speed up the
>> debugging process, that's all.
>
> Vacation.. but now I'm back ;-)
>
> Even something simple as: perf top -r 1 (make sure you're root in order
> to run with real-time prios) could give a clue as to what is consuming
> all your cpu-time.
>
> Or did the issue get sorted already?

Thank you for the hint.

I am on 2.6.35 now and all seems to be fine again.

Best
Zeno

2010-08-03 09:15:37

by Damien Wyart

Subject: Re: kmemleak, cpu usage jump out of nowhere

> > Vacation.. but now I'm back ;-)
> >
> > Even something simple as: perf top -r 1 (make sure you're root in order
> > to run with real-time prios) could give a clue as to what is consuming
> > all your cpu-time.
> >
> > Or did the issue get sorted already?
>
> Thank you for the hint.
>
> I am on 2.6.35 now and all seems to be fine again.

Are you 100% sure you compiled it with CONFIG_NO_BOOTMEM enabled?

I did not test 2.6.35 yet but I did not see anything related to this bug
committed since the discussion so I am very surprised the problem disappeared by
itself...

Will be on vacation very soon, so not sure I will have time to test 2.6.35
before leaving.

Damien

2010-08-03 09:18:14

by Zeno Davatz

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Tue, Aug 3, 2010 at 11:15 AM, <[email protected]> wrote:
>> > Vacation.. but now I'm back ;-)
>> >
>> > Even something simple as: perf top -r 1 (make sure you're root in order
>> > to run with real-time prios) could give a clue as to what is consuming
>> > all your cpu-time.
>> >
>> > Or did the issue get sorted already?
>>
>> Thank you for the hint.
>>
>> I am on 2.6.35 now and all seems to be fine again.
>
> Are you 100% sure you compiled it with CONFIG_NO_BOOTMEM enabled?
>
> I did not test 2.6.35 yet but I did not see anything related to this bug
> committed since the discussion so I am very surprised the problem disappeared by
> itself...
>
> Will be on vacation very soon, so not sure I will have time to test 2.6.35
> before leaving.

Yes: I got:

# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set

in my .config.

Linux zenogentoo 2.6.35 #122 SMP Mon Aug 2 10:26:05 CEST 2010 i686
Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux

Best
Zeno
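
The check shown in this message can also be scripted. The following is a minimal sketch, assuming the kernel build tree's .config is in the current directory; the helper name config_enabled is made up for this illustration and is not a kernel or distribution tool.

```shell
#!/bin/sh
# config_enabled FILE OPTION
# Succeeds only if OPTION is set to "y" in the given kernel config file.
# A line such as "# CONFIG_FOO is not set" does not count as enabled.
# (Hypothetical helper name, used only for this sketch.)
config_enabled() {
    grep -q "^$2=y\$" "$1"
}

# Example: report on the option discussed in this thread,
# if a .config file exists in the current directory.
if [ -f .config ]; then
    if config_enabled .config CONFIG_NO_BOOTMEM; then
        echo "CONFIG_NO_BOOTMEM is enabled"
    else
        echo "CONFIG_NO_BOOTMEM is not enabled"
    fi
fi
```

Note that this only inspects the build tree. To check the kernel that is actually running, the same test could be applied to the output of zcat /proc/config.gz, which is available only when the kernel was built with CONFIG_IKCONFIG_PROC.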

2010-08-20 09:40:10

by Damien Wyart

Subject: Re: kmemleak, cpu usage jump out of nowhere

Hi,

> > > Vacation.. but now I'm back ;-)

> > > Even something simple as: perf top -r 1 (make sure you're root in
> > > order to run with real-time prios) could give a clue as to what is
> > > consuming all your cpu-time.

> > > Or did the issue get sorted already?

> > Thank you for the hint.

> > I am on 2.6.35 now and all seems to be fine again.

> Are you 100% sure you compiled it with CONFIG_NO_BOOTMEM enabled?

> I did not test 2.6.35 yet but I did not see anything related to this
> bug committed since the discussion so I am very surprised the problem
> disappeared by itself...

> Will be on vacation very soon, so not sure I will have time to test 2.6.35
> before leaving.

After a few days of running 2.6.35.2 without problem, I got the same
huge slowness for a few tens of seconds yesterday. Did not have time to
run perf (and the system was almost unresponsive), but I will try to do
so if the problem occurs again. Anyway, even if less frequent than
during the -rcs, and as nothing had been committed to fix it before
final, the problem is still there...

It seems that the problem occurs after some CPU intensive tasks have
been run for some time (ie compiling a kernel) and only on Core i7
machines with NO_BOOTMEM. I am surprised so few people reported it and
that it has not been seen on test machines running CPU/scheduler
benchmark tools.

--
Damien Wyart

2010-08-20 09:40:52

by Peter Zijlstra

Subject: Re: kmemleak, cpu usage jump out of nowhere

On Fri, 2010-08-20 at 11:32 +0200, Damien Wyart wrote:

> After a few days of running 2.6.35.2 without problem,

> I am surprised so few people reported it and
> that it has not been seen on test machines running CPU/scheduler
> benchmark tools.
>
My machines are lucky if a kernel has hours of runtime, days almost
never happens, there's always the next kernel to test ;-)

But yeah, you'd expect more people to run into something like this..

Most odd thing..