2011-04-24 18:22:17

by Bruno Prémont

Subject: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On an older system I've been running Gentoo's revdep-rebuild to check
for system linking/*.la consistency, and after doing most of the work
the system more or less starved, just complaining about stuck tasks now
and then.
The memory usage graph as seen from userspace showed a sudden, quick
increase in memory usage, though only a very few MB were swapped out
(cf. attached RRD graph).


Sample soft-lockup trace:
[ 3683.000651] BUG: soft lockup - CPU#0 stuck for 64s! [kswapd0:365]
[ 3683.000720] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer pcspkr snd snd_page_alloc
[ 3683.001331] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer pcspkr snd snd_page_alloc
[ 3683.001941]
[ 3683.001988] Pid: 365, comm: kswapd0 Tainted: G W 2.6.39-rc4-jupiter-00187-g686c4cb #1 NVIDIA Corporation. nFORCE-MCP/MS-6373
[ 3683.002163] EIP: 0060:[<c1092cd0>] EFLAGS: 00000246 CPU: 0
[ 3683.002221] EIP is at shrink_inactive_list+0x110/0x340
[ 3683.002272] EAX: 00000000 EBX: c1546620 ECX: 00000001 EDX: 00000000
[ 3683.002324] ESI: dd587f78 EDI: dd587e68 EBP: dd587e88 ESP: dd587e38
[ 3683.002376] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 3683.002427] Process kswapd0 (pid: 365, ti=dd586000 task=dd532700 task.ti=dd586000)
[ 3683.002486] Stack:
[ 3683.002532] dd587e78 00000001 00000000 00000001 00000054 dd532700 c15468d8 c154688c
[ 3683.002871] c15468dc 00000000 dd587e5c 00000000 dd587e68 dd587e68 dd587e6c 0000007b
[ 3683.003212] 00000000 00000002 00000000 00000000 dd587f14 c10932a0 00000002 00000001
[ 3683.003548] Call Trace:
[ 3683.003602] [<c10932a0>] shrink_zone+0x3a0/0x500
[ 3683.003653] [<c1093b4e>] kswapd+0x74e/0x8d0
[ 3683.003705] [<c1093400>] ? shrink_zone+0x500/0x500
[ 3683.003757] [<c1047794>] kthread+0x74/0x80
[ 3683.003807] [<c1047720>] ? kthreadd+0xb0/0xb0
[ 3683.003864] [<c13918b6>] kernel_thread_helper+0x6/0xd
[ 3683.003913] Code: 0e 04 74 5f 89 da 2b 93 e0 02 00 00 c1 fa 02 69 d2 95 fa 53 ea 01 04 95 e8 55 52 c1 8b 55 d4 85 d2 75 5f fb c7 45 dc 00 00 00 00 <8b> 45 dc 83 c4 44 5b 5e 5f c9 c3 90 8d 74 26 00 83 7d 08 09 0f
[ 3683.006264] Call Trace:
[ 3683.006312] [<c10932a0>] shrink_zone+0x3a0/0x500
[ 3683.006379] [<c1093b4e>] kswapd+0x74e/0x8d0
[ 3683.006430] [<c1093400>] ? shrink_zone+0x500/0x500
[ 3683.006480] [<c1047794>] kthread+0x74/0x80
[ 3683.006529] [<c1047720>] ? kthreadd+0xb0/0xb0
[ 3683.006580] [<c13918b6>] kernel_thread_helper+0x6/0xd



A meminfo dump (sysrq + m) in between the traces produced the following
weird-looking information:
[ 6605.162375] Mem-Info:
[ 6605.162421] DMA per-cpu:
[ 6605.162468] CPU 0: hi: 0, btch: 1 usd: 0
[ 6605.162517] Normal per-cpu:
[ 6605.162565] CPU 0: hi: 186, btch: 31 usd: 56
[ 6605.162618] active_anon:56 inactive_anon:86 isolated_anon:32
[ 6605.162620] active_file:1114 inactive_file:1101 isolated_file:32
[ 6605.162623] unevictable:8 dirty:0 writeback:0 unstable:0
[ 6605.162624] free:1195 slab_reclaimable:14698 slab_unreclaimable:34339
[ 6605.162626] mapped:135 shmem:0 pagetables:87 bounce:0
[ 6605.162875] DMA free:1936kB min:88kB low:108kB high:132kB active_anon:4kB inactive_anon:4294967292kB active_file:4294967284kB inactive_file:4294967252kB unevictable:0kB isolated(anon):4kB isolated(file):56kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:180kB slab_unreclaimable:4688kB kernel_stack:9104kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2 all_unreclaimable? no
[ 6605.163011] lowmem_reserve[]: 0 460 460
[ 6605.163207] Normal free:2836kB min:2696kB low:3368kB high:4044kB active_anon:220kB inactive_anon:348kB active_file:4468kB inactive_file:4448kB unevictable:32kB isolated(anon):124kB isolated(file):72kB present:471360kB mlocked:32kB dirty:0kB writeback:0kB mapped:540kB shmem:0kB slab_reclaimable:58612kB slab_unreclaimable:132676kB kernel_stack:256640kB pagetables:348kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:871 all_unreclaimable? no
[ 6605.163337] lowmem_reserve[]: 0 0 0
[ 6605.163526] DMA: 96*4kB 178*8kB 8*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1936kB
[ 6605.164012] Normal: 539*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2836kB
[ 6605.164500] 2360 total pagecache pages
[ 6605.164547] 113 pages in swap cache
[ 6605.164595] Swap cache stats: add 135546, delete 135433, find 121976/158215
[ 6605.164647] Free swap = 518472kB
[ 6605.164694] Total swap = 524284kB
[ 6605.170001] 122848 pages RAM
[ 6605.170001] 2699 pages reserved
[ 6605.170001] 2129 pages shared
[ 6605.170001] 83744 pages non-shared
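
A note on the "weird looking" part: 4294967292 is 2^32 - 4, so several
DMA-zone LRU counters above have gone slightly negative and are being
printed as unsigned 32-bit values. A minimal userspace sketch of that
wraparound (illustration only, not kernel code):

#include <stdio.h>

int main(void)
{
	/* hypothetical per-zone counter: one more decrement than increment */
	unsigned long nr_pages = 0;
	nr_pages -= 1;
	/* on a 32-bit build this prints "inactive_anon:4294967292kB" (-4kB) */
	printf("inactive_anon:%lukB\n", nr_pages * 4);
	return 0;
}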


dmesg output (with part of the soft-lockup traces stripped to keep the
log small) and config are attached.

That state suddenly "fixed" itself when revdep-rebuild ended up in
STOPPED state (lots of minutes after hitting CTRL+Z on its terminal),
though the system didn't survive very long after that (it died in the
network stack while handling an RX IRQ for an IPv6 packet).

The system has mostly reiserfs filesystems mounted (one ext3, but that
one should not have been touched).


Previously I've been running kernels up to 2.6.38 without seeing this
kind of issue; this was the first run of a 2.6.39-rc kernel on this
system.

If there is some more info I can provide, please tell!
Bruno


Attachments:
vm-madness.config (16.29 kB)
vm-madness.dmesg (48.76 kB)
memory-exhausted.png (10.73 kB)

2011-04-24 21:59:45

by Bruno Prémont

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Sun, 24 April 2011 Bruno Prémont <[email protected]> wrote:
> On an older system I've been running Gentoo's revdep-rebuild to check
> for system linking/*.la consistency and after doing most of the work the
> system starved more or less, just complaining about stuck tasks now and
> then.
> Memory usage graph as seen from userspace showed sudden quick increase of
> memory usage though only a very few MB were swapped out (c.f. attached RRD
> graph).

Seems I've hit it once again (though this time it was detected before
the system fully stalled trying, without success, to reclaim memory).

This time it was during simple compiling...
Gathered info below:

/proc/meminfo:
MemTotal: 480660 kB
MemFree: 64948 kB
Buffers: 10304 kB
Cached: 6924 kB
SwapCached: 4220 kB
Active: 11100 kB
Inactive: 15732 kB
Active(anon): 4732 kB
Inactive(anon): 4876 kB
Active(file): 6368 kB
Inactive(file): 10856 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 524284 kB
SwapFree: 456432 kB
Dirty: 80 kB
Writeback: 0 kB
AnonPages: 6268 kB
Mapped: 2604 kB
Shmem: 4 kB
Slab: 250632 kB
SReclaimable: 51144 kB
SUnreclaim: 199488 kB <--- looks big as well...
KernelStack: 131032 kB <--- what???
PageTables: 920 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 764612 kB
Committed_AS: 132632 kB
VmallocTotal: 548548 kB
VmallocUsed: 18500 kB
VmallocChunk: 525952 kB
AnonHugePages: 0 kB
DirectMap4k: 32704 kB
DirectMap4M: 458752 kB

sysrq+m:
[ 3908.107287] SysRq : Show Memory
[ 3908.109324] Mem-Info:
[ 3908.111266] DMA per-cpu:
[ 3908.113164] CPU 0: hi: 0, btch: 1 usd: 0
[ 3908.115061] Normal per-cpu:
[ 3908.116914] CPU 0: hi: 186, btch: 31 usd: 172
[ 3908.117253] active_anon:1989 inactive_anon:2057 isolated_anon:0
[ 3908.117253] active_file:1762 inactive_file:1841 isolated_file:0
[ 3908.117253] unevictable:8 dirty:0 writeback:0 unstable:0
[ 3908.117253] free:15704 slab_reclaimable:12672 slab_unreclaimable:49606
[ 3908.117253] mapped:518 shmem:0 pagetables:214 bounce:0
[ 3908.117253] DMA free:1936kB min:88kB low:108kB high:132kB active_anon:84kB inactive_anon:128kB active_file:4kB inactive_file:68kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:140kB slab_unreclaimable:4960kB kernel_stack:8592kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:467 all_unreclaimable? yes
[ 3908.117253] lowmem_reserve[]: 0 460 460
[ 3908.117253] Normal free:60880kB min:2696kB low:3368kB high:4044kB active_anon:7872kB inactive_anon:8100kB active_file:7044kB inactive_file:7296kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:471360kB mlocked:32kB dirty:0kB writeback:0kB mapped:2072kB shmem:0kB slab_reclaimable:50548kB slab_unreclaimable:193472kB kernel_stack:122384kB pagetables:856kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3908.117253] lowmem_reserve[]: 0 0 0
[ 3908.117253] DMA: 52*4kB 216*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1936kB
[ 3908.117253] Normal: 14858*4kB 181*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 60880kB
[ 3908.117253] 5093 total pagecache pages
[ 3908.117253] 1490 pages in swap cache
[ 3908.117253] Swap cache stats: add 55685, delete 54195, find 25271/28670
[ 3908.117253] Free swap = 458944kB
[ 3908.117253] Total swap = 524284kB
[ 3908.117253] 122848 pages RAM
[ 3908.117253] 2699 pages reserved
[ 3908.117253] 4346 pages shared
[ 3908.117253] 84248 pages non-shared
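
Doing the arithmetic on the Normal zone line above: slab_reclaimable
(50548kB) + slab_unreclaimable (193472kB) + kernel_stack (122384kB) =
366404kB out of 471360kB present, i.e. roughly 78% of lowmem is pinned
in kernel allocations, which is why reclaim has almost nothing left to
work with.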

ps auxf:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S 22:39 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 22:39 0:00 \_ [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S 22:39 0:00 \_ [kworker/u:0]
root 6 0.0 0.0 0 0 ? R 22:39 0:01 \_ [rcu_kthread]
root 7 0.0 0.0 0 0 ? R 22:39 0:00 \_ [watchdog/0]
root 8 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [khelper]
root 138 0.0 0.0 0 0 ? S 22:39 0:00 \_ [sync_supers]
root 140 0.0 0.0 0 0 ? S 22:39 0:00 \_ [bdi-default]
root 142 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [kblockd]
root 230 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [ata_sff]
root 237 0.0 0.0 0 0 ? S 22:39 0:00 \_ [khubd]
root 365 0.0 0.0 0 0 ? S 22:39 0:01 \_ [kswapd0]
root 429 0.0 0.0 0 0 ? S 22:39 0:00 \_ [fsnotify_mark]
root 438 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [xfs_mru_cache]
root 439 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [xfslogd]
root 440 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [xfsdatad]
root 441 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [xfsconvertd]
root 497 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_0]
root 500 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_1]
root 514 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_2]
root 517 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_3]
root 521 0.0 0.0 0 0 ? S 22:39 0:00 \_ [kworker/u:5]
root 530 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_4]
root 533 0.0 0.0 0 0 ? S 22:39 0:00 \_ [scsi_eh_5]
root 585 0.0 0.0 0 0 ? S< 22:39 0:00 \_ [kpsmoused]
root 659 0.0 0.0 0 0 ? S< 22:40 0:00 \_ [reiserfs]
root 1436 0.0 0.0 0 0 ? S 22:40 0:00 \_ [flush-8:0]
root 1642 0.0 0.0 0 0 ? S< 22:40 0:00 \_ [rpciod]
root 1643 0.0 0.0 0 0 ? S< 22:40 0:00 \_ [nfsiod]
root 1647 0.0 0.0 0 0 ? S 22:40 0:00 \_ [lockd]
root 21739 0.0 0.0 0 0 ? S< 23:05 0:00 \_ [ttm_swap]
root 1760 0.0 0.0 0 0 ? S 23:22 0:00 \_ [kworker/0:2]
root 13497 0.0 0.0 0 0 ? S 23:27 0:00 \_ [kworker/0:0]
root 14071 0.0 0.0 0 0 ? S 23:36 0:00 \_ [kworker/0:3]
root 15923 0.0 0.0 0 0 ? S 23:44 0:00 \_ [flush-0:18]
root 15924 0.0 0.0 0 0 ? S 23:44 0:00 \_ [flush-0:19]
root 15925 0.0 0.0 0 0 ? S 23:44 0:00 \_ [flush-0:20]
root 15926 0.0 0.0 0 0 ? S 23:44 0:00 \_ [flush-0:21]
root 1 0.0 0.0 1740 72 ? Ss 22:39 0:00 init [3]
root 759 0.0 0.0 2228 8 ? S<s 22:40 0:00 /sbin/udevd --daemon
root 1723 0.0 0.0 2224 8 ? S< 22:45 0:00 \_ /sbin/udevd --daemon
root 1327 0.0 0.0 4876 8 tty2 Ss+ 22:40 0:00 -bash
root 6122 0.4 0.0 34204 8 tty2 TN 23:24 0:07 \_ /usr/bin/python2.7 /usr/bin/emerge --oneshot media-gfx/gimp
portage 27988 0.0 0.0 5928 8 tty2 TN 23:28 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
portage 28231 0.0 0.0 6064 8 tty2 TN 23:28 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
portage 28245 0.0 0.0 4880 8 tty2 TN 23:28 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild-helpers/emake
portage 28250 0.0 0.0 3860 8 tty2 TN 23:28 0:00 \_ make -j2
portage 28251 0.0 0.0 3864 8 tty2 TN 23:28 0:00 \_ make all-recursive
portage 28252 0.0 0.0 4752 8 tty2 TN 23:28 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \?
portage 12569 0.0 0.0 4752 8 tty2 TN 23:33 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \
portage 12570 0.0 0.0 3864 8 tty2 TN 23:33 0:00 \_ make all
portage 12571 0.0 0.0 4752 8 tty2 TN 23:33 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!
portage 15218 0.0 0.0 4752 8 tty2 TN 23:40 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* |
portage 15219 0.0 0.0 3884 8 tty2 TN 23:40 0:00 \_ make all
portage 15912 0.0 0.0 1924 8 tty2 TN 23:42 0:00 \_ /usr/i686-pc-linux-gnu/gcc-bin/4.4.5/i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.
portage 15913 0.0 0.0 12724 8 tty2 TN 23:42 0:00 | \_ /usr/libexec/gcc/i686-pc-linux-gnu/4.4.5/cc1 -quiet -I. -I../.. -I../.. -I../.
portage 15914 0.0 0.0 5284 8 tty2 TN 23:42 0:00 | \_ /usr/lib/gcc/i686-pc-linux-gnu/4.4.5/../../../../i686-pc-linux-gnu/bin/as -Qy
portage 15916 0.0 0.0 1924 8 tty2 TN 23:43 0:00 \_ /usr/i686-pc-linux-gnu/gcc-bin/4.4.5/i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.
portage 15917 0.0 0.0 11540 8 tty2 TN 23:43 0:00 \_ /usr/libexec/gcc/i686-pc-linux-gnu/4.4.5/cc1 -quiet -I. -I../.. -I../.. -I../.
portage 15918 0.0 0.0 5284 8 tty2 TN 23:43 0:00 \_ /usr/lib/gcc/i686-pc-linux-gnu/4.4.5/../../../../i686-pc-linux-gnu/bin/as -Qy
root 1328 0.0 0.0 4876 8 tty3 Ss+ 22:40 0:00 -bash
root 13085 0.7 0.0 34128 92 tty3 TN 23:34 0:06 \_ /usr/bin/python2.7 /usr/bin/emerge --oneshot collectd
portage 13877 0.0 0.0 5924 8 tty3 TN 23:36 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh prepare
portage 13899 0.0 0.0 5928 8 tty3 TN 23:36 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh prepare
portage 15904 0.0 0.0 5928 8 tty3 TN 23:42 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh prepare
portage 15911 0.0 0.0 4752 8 tty3 TN 23:42 0:00 \_ /bin/sh /usr/bin/autoconf --trace=AC_PROG_LIBTOOL
root 1329 0.0 0.0 1892 8 tty4 Ss+ 22:40 0:00 /sbin/agetty 38400 tty4 linux
root 1330 0.0 0.0 1892 8 tty5 Ss+ 22:40 0:00 /sbin/agetty 38400 tty5 linux
root 1331 0.0 0.0 1892 8 tty6 Ss+ 22:40 0:00 /sbin/agetty 38400 tty6 linux
root 1471 0.0 0.0 1928 8 ? Ss 22:40 0:00 dhcpcd -m 2 eth0
root 1512 0.0 0.0 5128 8 ? S 22:40 0:00 supervising syslog-ng
root 1513 0.0 0.0 5408 32 ? Ss 22:40 0:00 \_ /usr/sbin/syslog-ng
ntp 1537 0.0 0.0 4360 236 ? Ss 22:40 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u ntp:ntp
collectd 1555 0.1 0.2 45048 1032 ? SNLsl 22:40 0:05 /usr/sbin/collectd -P /var/run/collectd/collectd.pid -C /etc/collectd.conf
root 1613 0.0 0.0 2116 96 ? Ss 22:40 0:00 /sbin/rpcbind
root 1627 0.0 0.0 2188 8 ? Ss 22:40 0:00 /sbin/rpc.statd --no-notify
root 1687 0.0 0.0 4204 8 ? Ss 22:40 0:00 /usr/sbin/sshd
root 15929 0.1 0.0 7004 312 ? Ss 23:47 0:00 \_ sshd: root@pts/2
root 15931 0.0 0.1 4876 808 pts/2 Ss 23:47 0:00 \_ -bash
root 15949 0.0 0.2 4124 972 pts/2 R+ 23:50 0:00 \_ ps auxf
root 1715 0.0 0.0 1892 8 tty1 Ss+ 22:40 0:00 /sbin/agetty 38400 tty1 linux
root 1716 0.0 0.0 1892 8 ttyS0 Ss+ 22:40 0:00 /sbin/agetty 115200 ttyS0 vt100
root 28160 0.0 0.0 1944 8 ? Ss 23:21 0:00 /usr/sbin/gpm -m /dev/input/mice -t ps2

/proc/slabinfo:
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
squashfs_inode_cache 4240 4240 384 10 1 : tunables 0 0 0 : slabdata 424 424 0
nfs_direct_cache 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
nfs_read_data 72 72 448 9 1 : tunables 0 0 0 : slabdata 8 8 0
nfs_inode_cache 252 252 568 14 2 : tunables 0 0 0 : slabdata 18 18 0
rpc_inode_cache 36 36 448 9 1 : tunables 0 0 0 : slabdata 4 4 0
RAWv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
UDPLITEv6 0 0 672 12 2 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
tw_sock_TCPv6 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 12 12 1312 12 4 : tunables 0 0 0 : slabdata 1 1 0
mqueue_inode_cache 8 8 480 8 1 : tunables 0 0 0 : slabdata 1 1 0
xfs_inode 0 0 608 13 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_efd_item 0 0 288 14 1 : tunables 0 0 0 : slabdata 0 0 0
xfs_trans 0 0 224 18 1 : tunables 0 0 0 : slabdata 0 0 0
xfs_da_state 0 0 336 12 1 : tunables 0 0 0 : slabdata 0 0 0
xfs_log_ticket 0 0 176 23 1 : tunables 0 0 0 : slabdata 0 0 0
reiser_inode_cache 38160 38160 392 10 1 : tunables 0 0 0 : slabdata 3816 3816 0
configfs_dir_cache 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0
inotify_inode_mark 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 0 0 120 34 1 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 512 8 1 : tunables 0 0 0 : slabdata 0 0 0
UDP 16 16 512 8 1 : tunables 0 0 0 : slabdata 2 2 0
tw_sock_TCP 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
TCP 13 13 1184 13 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-128 12 12 2560 12 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-64 12 12 1280 12 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-32 12 12 640 12 2 : tunables 0 0 0 : slabdata 1 1 0
sgpool-16 12 12 320 12 1 : tunables 0 0 0 : slabdata 1 1 0
blkdev_queue 17 17 920 17 4 : tunables 0 0 0 : slabdata 1 1 0
blkdev_requests 27 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
blkdev_ioc 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
biovec-256 10 10 3072 10 8 : tunables 0 0 0 : slabdata 1 1 0
biovec-128 0 0 1536 10 4 : tunables 0 0 0 : slabdata 0 0 0
biovec-64 10 10 768 10 2 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 63 66 352 11 1 : tunables 0 0 0 : slabdata 6 6 0
skbuff_fclone_cache 11 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
file_lock_cache 39 39 104 39 1 : tunables 0 0 0 : slabdata 1 1 0
shmem_inode_cache 920 920 400 10 1 : tunables 0 0 0 : slabdata 92 92 0
proc_inode_cache 33216 33216 336 12 1 : tunables 0 0 0 : slabdata 2768 2768 0
sigqueue 28 28 144 28 1 : tunables 0 0 0 : slabdata 1 1 0
bdev_cache 13 18 448 9 1 : tunables 0 0 0 : slabdata 2 2 0
sysfs_dir_cache 13260 13260 48 85 1 : tunables 0 0 0 : slabdata 156 156 0
mnt_cache 99 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
inode_cache 3757 3757 312 13 1 : tunables 0 0 0 : slabdata 289 289 0
dentry 123232 123232 128 32 1 : tunables 0 0 0 : slabdata 3851 3851 0
buffer_head 2589 30003 56 73 1 : tunables 0 0 0 : slabdata 411 411 0
vm_area_struct 1792 1794 88 46 1 : tunables 0 0 0 : slabdata 39 39 0
mm_struct 68 76 416 19 2 : tunables 0 0 0 : slabdata 4 4 0
sighand_cache 85 96 1312 12 4 : tunables 0 0 0 : slabdata 8 8 0
task_struct 16410 16410 832 19 4 : tunables 0 0 0 : slabdata 885 885 0
anon_vma_chain 2352 2720 24 170 1 : tunables 0 0 0 : slabdata 16 16 0
anon_vma 1097 1190 24 170 1 : tunables 0 0 0 : slabdata 7 7 0
radix_tree_node 19019 19019 304 13 1 : tunables 0 0 0 : slabdata 1463 1463 0
idr_layer_cache 325 338 152 26 1 : tunables 0 0 0 : slabdata 13 13 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 8 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 8 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 8 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 12 12 8192 4 8 : tunables 0 0 0 : slabdata 3 3 0
kmalloc-4096 293 296 4096 8 8 : tunables 0 0 0 : slabdata 37 37 0
kmalloc-2048 597 606 2048 8 4 : tunables 0 0 0 : slabdata 78 78 0
kmalloc-1024 6399 6400 1024 8 2 : tunables 0 0 0 : slabdata 800 800 0
kmalloc-512 18558 18560 512 8 1 : tunables 0 0 0 : slabdata 2320 2320 0
kmalloc-256 56 64 256 16 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-128 1258587 1258592 128 32 1 : tunables 0 0 0 : slabdata 39331 39331 0
^^^^^^^^^^^^^^^
How may I find out who is using this many 128-byte
blocks? (one approach is sketched after the listing)
kmalloc-64 25086 25088 64 64 1 : tunables 0 0 0 : slabdata 392 392 0
kmalloc-32 9720 9728 32 128 1 : tunables 0 0 0 : slabdata 76 76 0
kmalloc-16 2542 4864 16 256 1 : tunables 0 0 0 : slabdata 19 19 0
kmalloc-8 3580 3584 8 512 1 : tunables 0 0 0 : slabdata 7 7 0
kmalloc-192 10925 10941 192 21 1 : tunables 0 0 0 : slabdata 521 521 0
kmalloc-96 63462 63462 96 42 1 : tunables 0 0 0 : slabdata 1511 1511 0
kmem_cache 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
kmem_cache_node 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0
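
As for the question marked in the listing above: assuming this kernel
uses SLUB with CONFIG_SLUB_DEBUG (a guess based on the TRACE output
shown later in the thread), the allocation call sites of a single cache
can be tracked and dumped, e.g.:

# boot with user-tracking enabled for the suspect cache (hypothetical
# invocation, adjust to your setup):
#   slub_debug=U,kmalloc-128
# then list allocation/free call sites with per-caller counts:
cat /sys/kernel/slab/kmalloc-128/alloc_calls
cat /sys/kernel/slab/kmalloc-128/free_calls

alloc_calls prints one line per unique caller, so a single call site
owning most of the ~1.25 million objects would stand out immediately.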

Bruno

2011-04-25 02:42:28

by KOSAKI Motohiro

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

> On Sun, 24 April 2011 Bruno Prémont <[email protected]> wrote:
> > On an older system I've been running Gentoo's revdep-rebuild to check
> > for system linking/*.la consistency and after doing most of the work the
> > system starved more or less, just complaining about stuck tasks now and
> > then.
> > Memory usage graph as seen from userspace showed sudden quick increase of
> > memory usage though only a very few MB were swapped out (c.f. attached RRD
> > graph).
>
> Seems I've hit it once again (though detected before system was fully
> stalled by trying to reclaim memory without success).
>
> This time it was during simple compiling...
> Gathered info below:
>
> /proc/meminfo:
> MemTotal: 480660 kB
> MemFree: 64948 kB
> Buffers: 10304 kB
> Cached: 6924 kB
> SwapCached: 4220 kB
> Active: 11100 kB
> Inactive: 15732 kB
> Active(anon): 4732 kB
> Inactive(anon): 4876 kB
> Active(file): 6368 kB
> Inactive(file): 10856 kB
> Unevictable: 32 kB
> Mlocked: 32 kB
> SwapTotal: 524284 kB
> SwapFree: 456432 kB
> Dirty: 80 kB
> Writeback: 0 kB
> AnonPages: 6268 kB
> Mapped: 2604 kB
> Shmem: 4 kB
> Slab: 250632 kB
> SReclaimable: 51144 kB
> SUnreclaim: 199488 kB <--- look big as well...
> KernelStack: 131032 kB <--- what???

KernelStack uses 8K bytes per thread, so your system should have
~16000 threads. But your ps only shows about 80 processes.
Hmm... stack leak?
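
For reference, the KernelStack counter only moves when a thread stack
is allocated or freed; approximately (kernel/fork.c, quoted from
memory, details may differ in 2.6.39):

static void account_kernel_stack(struct thread_info *ti, int account)
{
	struct zone *zone = page_zone(virt_to_page(ti));

	/* +1 from dup_task_struct() at fork time,
	 * -1 from free_task() at the final put_task_struct() */
	mod_zone_page_state(zone, NR_KERNEL_STACK, account);
}

131032 kB / 8 kB per stack is ~16400 stacks that were allocated but
never freed, which matches the task_struct line in the slabinfo above
(16410 objects): dead tasks apparently never get their final free.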


> PageTables: 920 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 764612 kB
> Committed_AS: 132632 kB
> VmallocTotal: 548548 kB
> VmallocUsed: 18500 kB
> VmallocChunk: 525952 kB
> AnonHugePages: 0 kB
> DirectMap4k: 32704 kB
> DirectMap4M: 458752 kB
>

2011-04-25 07:47:31

by Mike Frysinger

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
>> On Sun, 24 April 2011 Bruno Prémont wrote:
>> > On an older system I've been running Gentoo's revdep-rebuild to check
>> > for system linking/*.la consistency and after doing most of the work the
>> > system starved more or less, just complaining about stuck tasks now and
>> > then.
>> > Memory usage graph as seen from userspace showed sudden quick increase of
>> > memory usage though only a very few MB were swapped out (c.f. attached RRD
>> > graph).
>>
>> Seems I've hit it once again (though detected before system was fully
>> stalled by trying to reclaim memory without success).
>>
>> This time it was during simple compiling...
>> Gathered info below:
>>
>> /proc/meminfo:
>> MemTotal:         480660 kB
>> MemFree:           64948 kB
>> Buffers:           10304 kB
>> Cached:             6924 kB
>> SwapCached:         4220 kB
>> Active:            11100 kB
>> Inactive:          15732 kB
>> Active(anon):       4732 kB
>> Inactive(anon):     4876 kB
>> Active(file):       6368 kB
>> Inactive(file):    10856 kB
>> Unevictable:          32 kB
>> Mlocked:              32 kB
>> SwapTotal:        524284 kB
>> SwapFree:         456432 kB
>> Dirty:                80 kB
>> Writeback:             0 kB
>> AnonPages:          6268 kB
>> Mapped:             2604 kB
>> Shmem:                 4 kB
>> Slab:             250632 kB
>> SReclaimable:      51144 kB
>> SUnreclaim:       199488 kB   <--- look big as well...
>> KernelStack:      131032 kB   <--- what???
>
> KernelStack is used 8K bytes per thread. then, your system should have
> 16000 threads. but your ps only showed about 80 processes.
> Hmm... stack leak?

i might have a similar report for 2.6.39-rc4 (seems to be working fine
in 2.6.38.4), but for embedded Blackfin systems running gdbserver
processes over and over (so lots of short lived forks)

i wonder if you have a lot of zombies or otherwise unclaimed resources
? does `ps aux` show anything unusual ?
-mike

2011-04-25 09:17:21

by Bruno Prémont

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Mike Frysinger wrote:
> On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
> >> On Sun, 24 April 2011 Bruno Prémont wrote:
> >> > On an older system I've been running Gentoo's revdep-rebuild to check
> >> > for system linking/*.la consistency and after doing most of the work the
> >> > system starved more or less, just complaining about stuck tasks now and
> >> > then.
> >> > Memory usage graph as seen from userspace showed sudden quick increase of
> >> > memory usage though only a very few MB were swapped out (c.f. attached RRD
> >> > graph).
> >>
> >> Seems I've hit it once again (though detected before system was fully
> >> stalled by trying to reclaim memory without success).
> >>
> >> This time it was during simple compiling...
> >> Gathered info below:
> >>
> >> /proc/meminfo:
> >> MemTotal:         480660 kB
> >> MemFree:           64948 kB
> >> Buffers:           10304 kB
> >> Cached:             6924 kB
> >> SwapCached:         4220 kB
> >> Active:            11100 kB
> >> Inactive:          15732 kB
> >> Active(anon):       4732 kB
> >> Inactive(anon):     4876 kB
> >> Active(file):       6368 kB
> >> Inactive(file):    10856 kB
> >> Unevictable:          32 kB
> >> Mlocked:              32 kB
> >> SwapTotal:        524284 kB
> >> SwapFree:         456432 kB
> >> Dirty:                80 kB
> >> Writeback:             0 kB
> >> AnonPages:          6268 kB
> >> Mapped:             2604 kB
> >> Shmem:                 4 kB
> >> Slab:             250632 kB
> >> SReclaimable:      51144 kB
> >> SUnreclaim:       199488 kB   <--- look big as well...
> >> KernelStack:      131032 kB   <--- what???
> >
> > KernelStack is used 8K bytes per thread. then, your system should have
> > 16000 threads. but your ps only showed about 80 processes.
> > Hmm... stack leak?
>
> i might have a similar report for 2.6.39-rc4 (seems to be working fine
> in 2.6.38.4), but for embedded Blackfin systems running gdbserver
> processes over and over (so lots of short lived forks)
>
> i wonder if you have a lot of zombies or otherwise unclaimed resources
> ? does `ps aux` show anything unusual ?

I've not seen anything special (no big amount of threads behind my
roughly 80 processes, and even after the kernel oom-killed nearly all
processes the hogged memory was not freed. And no, there are no zombies
around).

Here it seems to happen when I run two intensive tasks in parallel, e.g.
(re)emerging gimp and running revdep-rebuild -pi in another terminal.
This produces a fork rate of about 100-300 per second.
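
For anyone trying to reproduce, a trivial fork loop in the same
ballpark (a hypothetical sketch, not the actual Gentoo workload):

#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	for (;;) {
		pid_t pid = fork();
		if (pid == 0)
			_exit(0);              /* child exits immediately */
		if (pid > 0)
			waitpid(pid, NULL, 0); /* reap it: no zombies */
		usleep(5000);                  /* ~200 forks per second */
	}
}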

Suddenly kmalloc-128 slabs stop being freed and things degrade.

Trying to trace some of the kmalloc-128 slab allocations I end up seeing
lots of allocations like this:

[ 1338.554429] TRACE kmalloc-128 alloc 0xc294ff00 inuse=30 fp=0xc294ff00
[ 1338.554434] Pid: 1573, comm: collectd Tainted: G W 2.6.39-rc4-jupiter-00187-g686c4cb #1
[ 1338.554437] Call Trace:
[ 1338.554442] [<c10aef47>] trace+0x57/0xa0
[ 1338.554447] [<c10b07b3>] alloc_debug_processing+0xf3/0x140
[ 1338.554452] [<c10b0972>] T.999+0x172/0x1a0
[ 1338.554455] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.554459] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.554464] [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
[ 1338.554468] [<c10c08b5>] ? path_put+0x15/0x20
[ 1338.554472] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.554476] [<c10b95d8>] get_empty_filp+0x58/0xc0
[ 1338.554481] [<c10c323f>] path_openat+0x1f/0x320
[ 1338.554485] [<c10a0a4e>] ? __access_remote_vm+0x19e/0x1d0
[ 1338.554490] [<c10c3620>] do_filp_open+0x30/0x80
[ 1338.554495] [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
[ 1338.554500] [<c10c16f8>] ? getname_flags+0x28/0xe0
[ 1338.554505] [<c10cd522>] ? alloc_fd+0x62/0xe0
[ 1338.554509] [<c10c1731>] ? getname_flags+0x61/0xe0
[ 1338.554514] [<c10b781d>] do_sys_open+0xed/0x1e0
[ 1338.554519] [<c10b7979>] sys_open+0x29/0x40
[ 1338.554524] [<c1391390>] sysenter_do_call+0x12/0x26
[ 1338.556764] TRACE kmalloc-128 alloc 0xc294ff80 inuse=31 fp=0xc294ff80
[ 1338.556774] Pid: 1332, comm: bash Tainted: G W 2.6.39-rc4-jupiter-00187-g686c4cb #1
[ 1338.556779] Call Trace:
[ 1338.556794] [<c10aef47>] trace+0x57/0xa0
[ 1338.556802] [<c10b07b3>] alloc_debug_processing+0xf3/0x140
[ 1338.556807] [<c10b0972>] T.999+0x172/0x1a0
[ 1338.556812] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.556817] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.556821] [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
[ 1338.556826] [<c10b95d8>] ? get_empty_filp+0x58/0xc0
[ 1338.556830] [<c10b95d8>] get_empty_filp+0x58/0xc0
[ 1338.556841] [<c121fca8>] ? tty_ldisc_deref+0x8/0x10
[ 1338.556849] [<c10c323f>] path_openat+0x1f/0x320
[ 1338.556857] [<c11e2b3e>] ? fbcon_cursor+0xfe/0x180
[ 1338.556863] [<c10c3620>] do_filp_open+0x30/0x80
[ 1338.556868] [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
[ 1338.556873] [<c10c5e8e>] ? do_vfs_ioctl+0x7e/0x580
[ 1338.556878] [<c10c16f8>] ? getname_flags+0x28/0xe0
[ 1338.556886] [<c10cd522>] ? alloc_fd+0x62/0xe0
[ 1338.556891] [<c10c1731>] ? getname_flags+0x61/0xe0
[ 1338.556898] [<c10b781d>] do_sys_open+0xed/0x1e0
[ 1338.556903] [<c10b7979>] sys_open+0x29/0x40
[ 1338.556913] [<c1391390>] sysenter_do_call+0x12/0x26

Collectd is a system monitoring daemon that counts processes, memory
usage and much more, reading lots of files under /proc every 10
seconds.
Maybe it opens a process-related file at a racy moment and thus
prevents the kmalloc-128 slabs and kernel stacks from being released?
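
One detail to keep in mind when reading those traces: the "TRACE
kmalloc-128 alloc" lines come from SLUB's per-cache tracing
(slub_debug=T style), and SLUB merges caches of compatible size and
flags. get_empty_filp() allocates struct file from the filp cache, so
its showing up under kmalloc-128 suggests the filp cache has been
merged into it - in which case many of those 128-byte objects may
simply be struct files that never see their final fput(). Merged
caches should be visible as symlinks, e.g. (hypothetical check):

ls -l /sys/kernel/slab/ | grep filp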

Replaying the scenario I'm at:
Slab: 43112 kB
SReclaimable: 25396 kB
SUnreclaim: 17716 kB
KernelStack: 16432 kB
PageTables: 1320 kB

with
kmalloc-256 55 64 256 16 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-128 66656 66656 128 32 1 : tunables 0 0 0 : slabdata 2083 2083 0
kmalloc-64 3902 3904 64 64 1 : tunables 0 0 0 : slabdata 61 61 0

(the compiling process trees are now SIGSTOPped so that the system does
not starve immediately and I can look around for information)

If I resume one of the compiling process trees, both KernelStack and
slab (kmalloc-128) usage increase quite quickly (and seem to never go
down again) - probably at the same rate as processes get born (no
matter when they end).

Bruno

> -mike

2011-04-25 09:25:15

by Pekka Enberg

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 12:17 PM, Bruno Prémont
<[email protected]> wrote:
> On Mon, 25 April 2011 Mike Frysinger wrote:
>> On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
>> >> On Sun, 24 April 2011 Bruno Prémont wrote:
>> >> > On an older system I've been running Gentoo's revdep-rebuild to check
>> >> > for system linking/*.la consistency and after doing most of the work the
>> >> > system starved more or less, just complaining about stuck tasks now and
>> >> > then.
>> >> > Memory usage graph as seen from userspace showed sudden quick increase of
>> >> > memory usage though only a very few MB were swapped out (c.f. attached RRD
>> >> > graph).
>> >>
>> >> Seems I've hit it once again (though detected before system was fully
>> >> stalled by trying to reclaim memory without success).
>> >>
>> >> This time it was during simple compiling...
>> >> Gathered info below:
>> >>
>> >> /proc/meminfo:
>> >> MemTotal:         480660 kB
>> >> MemFree:           64948 kB
>> >> Buffers:           10304 kB
>> >> Cached:             6924 kB
>> >> SwapCached:         4220 kB
>> >> Active:            11100 kB
>> >> Inactive:          15732 kB
>> >> Active(anon):       4732 kB
>> >> Inactive(anon):     4876 kB
>> >> Active(file):       6368 kB
>> >> Inactive(file):    10856 kB
>> >> Unevictable:          32 kB
>> >> Mlocked:              32 kB
>> >> SwapTotal:        524284 kB
>> >> SwapFree:         456432 kB
>> >> Dirty:                80 kB
>> >> Writeback:             0 kB
>> >> AnonPages:          6268 kB
>> >> Mapped:             2604 kB
>> >> Shmem:                 4 kB
>> >> Slab:             250632 kB
>> >> SReclaimable:      51144 kB
>> >> SUnreclaim:       199488 kB   <--- look big as well...
>> >> KernelStack:      131032 kB   <--- what???
>> >
>> > KernelStack is used 8K bytes per thread. then, your system should have
>> > 16000 threads. but your ps only showed about 80 processes.
>> > Hmm... stack leak?
>>
>> i might have a similar report for 2.6.39-rc4 (seems to be working fine
>> in 2.6.38.4), but for embedded Blackfin systems running gdbserver
>> processes over and over (so lots of short lived forks)
>>
>> i wonder if you have a lot of zombies or otherwise unclaimed resources
>> ?  does `ps aux` show anything unusual ?
>
> I've not seen anything special (no big amount of threads behind my about 80
> processes, even after kernel oom-killed nearly all processes the hogged
> memory has not been freed. And no, there are no zombies around).
>
> Here it seems to happened when I run 2 intensive tasks in parallel, e.g.
> (re)emerging gimp and running revdep-rebuild -pi in another terminal.
> This produces a fork rate of about 100-300 per second.
>
> Suddenly kmalloc-128 slabs stop being freed and things degrade.
>
> Trying to trace some of the kmalloc-128 slab allocations I end up seeing
> lots of allocations like this:
>
> [ 1338.554429] TRACE kmalloc-128 alloc 0xc294ff00 inuse=30 fp=0xc294ff00
> [ 1338.554434] Pid: 1573, comm: collectd Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> [ 1338.554437] Call Trace:
> [ 1338.554442]  [<c10aef47>] trace+0x57/0xa0
> [ 1338.554447]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> [ 1338.554452]  [<c10b0972>] T.999+0x172/0x1a0
> [ 1338.554455]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.554459]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.554464]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> [ 1338.554468]  [<c10c08b5>] ? path_put+0x15/0x20
> [ 1338.554472]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.554476]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> [ 1338.554481]  [<c10c323f>] path_openat+0x1f/0x320
> [ 1338.554485]  [<c10a0a4e>] ? __access_remote_vm+0x19e/0x1d0
> [ 1338.554490]  [<c10c3620>] do_filp_open+0x30/0x80
> [ 1338.554495]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> [ 1338.554500]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> [ 1338.554505]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> [ 1338.554509]  [<c10c1731>] ? getname_flags+0x61/0xe0
> [ 1338.554514]  [<c10b781d>] do_sys_open+0xed/0x1e0
> [ 1338.554519]  [<c10b7979>] sys_open+0x29/0x40
> [ 1338.554524]  [<c1391390>] sysenter_do_call+0x12/0x26
> [ 1338.556764] TRACE kmalloc-128 alloc 0xc294ff80 inuse=31 fp=0xc294ff80
> [ 1338.556774] Pid: 1332, comm: bash Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> [ 1338.556779] Call Trace:
> [ 1338.556794]  [<c10aef47>] trace+0x57/0xa0
> [ 1338.556802]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> [ 1338.556807]  [<c10b0972>] T.999+0x172/0x1a0
> [ 1338.556812]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.556817]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.556821]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> [ 1338.556826]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> [ 1338.556830]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> [ 1338.556841]  [<c121fca8>] ? tty_ldisc_deref+0x8/0x10
> [ 1338.556849]  [<c10c323f>] path_openat+0x1f/0x320
> [ 1338.556857]  [<c11e2b3e>] ? fbcon_cursor+0xfe/0x180
> [ 1338.556863]  [<c10c3620>] do_filp_open+0x30/0x80
> [ 1338.556868]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> [ 1338.556873]  [<c10c5e8e>] ? do_vfs_ioctl+0x7e/0x580
> [ 1338.556878]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> [ 1338.556886]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> [ 1338.556891]  [<c10c1731>] ? getname_flags+0x61/0xe0
> [ 1338.556898]  [<c10b781d>] do_sys_open+0xed/0x1e0
> [ 1338.556903]  [<c10b7979>] sys_open+0x29/0x40
> [ 1338.556913]  [<c1391390>] sysenter_do_call+0x12/0x26
>
> Collectd is system monitoring daemon that counts processes, memory
> usage an much more, reading lots of files under /proc every 10
> seconds.
> Maybe it opens a process related file at a racy moment and thus
> prevents the 128 slabs and kernel stacks from being released?
>
> Replaying the scenario I'm at:
> Slab:              43112 kB
> SReclaimable:      25396 kB
> SUnreclaim:        17716 kB
> KernelStack:       16432 kB
> PageTables:         1320 kB
>
> with
> kmalloc-256           55     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
> kmalloc-128        66656  66656    128   32    1 : tunables    0    0    0 : slabdata   2083   2083      0
> kmalloc-64          3902   3904     64   64    1 : tunables    0    0    0 : slabdata     61     61      0
>
> (and compiling process tree now SIGSTOPped in order to have system
> not starve immediately so I can look around for information)
>
> If I resume one of the compiling process trees both KernelStack and
> slab (kmalloc-128) usage increase quite quickly (and seems to never
> get down anymore) - probably at same rate as processes get born (no
> matter when they end).

Looks like it might be a leak in VFS. You could try kmemleak to narrow
it down some more. See Documentation/kmemleak.txt for details.
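
For reference, the basic workflow from Documentation/kmemleak.txt
(kernel built with CONFIG_DEBUG_KMEMLEAK):

mount -t debugfs nodev /sys/kernel/debug/
echo scan > /sys/kernel/debug/kmemleak   # trigger an immediate scan
cat /sys/kernel/debug/kmemleak           # dump the suspected leaks

kmemleak also scans on its own (every 10 minutes by default) and
announces new suspects in dmesg.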

Pekka

2011-04-25 10:34:59

by Bruno Prémont

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Pekka Enberg <[email protected]> wrote:

> On Mon, Apr 25, 2011 at 12:17 PM, Bruno Prémont
> <[email protected]> wrote:
> > On Mon, 25 April 2011 Mike Frysinger wrote:
> >> On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
> >> >> On Sun, 24 April 2011 Bruno Prémont wrote:
> >> >> > On an older system I've been running Gentoo's revdep-rebuild to check
> >> >> > for system linking/*.la consistency and after doing most of the work the
> >> >> > system starved more or less, just complaining about stuck tasks now and
> >> >> > then.
> >> >> > Memory usage graph as seen from userspace showed sudden quick increase of
> >> >> > memory usage though only a very few MB were swapped out (c.f. attached RRD
> >> >> > graph).
> >> >>
> >> >> Seems I've hit it once again (though detected before system was fully
> >> >> stalled by trying to reclaim memory without success).
> >> >>
> >> >> This time it was during simple compiling...
> >> >> Gathered info below:
> >> >>
> >> >> /proc/meminfo:
> >> >> MemTotal:         480660 kB
> >> >> MemFree:           64948 kB
> >> >> Buffers:           10304 kB
> >> >> Cached:             6924 kB
> >> >> SwapCached:         4220 kB
> >> >> Active:            11100 kB
> >> >> Inactive:          15732 kB
> >> >> Active(anon):       4732 kB
> >> >> Inactive(anon):     4876 kB
> >> >> Active(file):       6368 kB
> >> >> Inactive(file):    10856 kB
> >> >> Unevictable:          32 kB
> >> >> Mlocked:              32 kB
> >> >> SwapTotal:        524284 kB
> >> >> SwapFree:         456432 kB
> >> >> Dirty:                80 kB
> >> >> Writeback:             0 kB
> >> >> AnonPages:          6268 kB
> >> >> Mapped:             2604 kB
> >> >> Shmem:                 4 kB
> >> >> Slab:             250632 kB
> >> >> SReclaimable:      51144 kB
> >> >> SUnreclaim:       199488 kB   <--- look big as well...
> >> >> KernelStack:      131032 kB   <--- what???
> >> >
> >> > KernelStack is used 8K bytes per thread. then, your system should have
> >> > 16000 threads. but your ps only showed about 80 processes.
> >> > Hmm... stack leak?
> >>
> >> i might have a similar report for 2.6.39-rc4 (seems to be working fine
> >> in 2.6.38.4), but for embedded Blackfin systems running gdbserver
> >> processes over and over (so lots of short lived forks)
> >>
> >> i wonder if you have a lot of zombies or otherwise unclaimed resources
> >> ?  does `ps aux` show anything unusual ?
> >
> > I've not seen anything special (no big amount of threads behind my about 80
> > processes, even after kernel oom-killed nearly all processes the hogged
> > memory has not been freed. And no, there are no zombies around).
> >
> > Here it seems to happened when I run 2 intensive tasks in parallel, e.g.
> > (re)emerging gimp and running revdep-rebuild -pi in another terminal.
> > This produces a fork rate of about 100-300 per second.
> >
> > Suddenly kmalloc-128 slabs stop being freed and things degrade.
> >
> > Trying to trace some of the kmalloc-128 slab allocations I end up seeing
> > lots of allocations like this:
> >
> > [ 1338.554429] TRACE kmalloc-128 alloc 0xc294ff00 inuse=30 fp=0xc294ff00
> > [ 1338.554434] Pid: 1573, comm: collectd Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> > [ 1338.554437] Call Trace:
> > [ 1338.554442]  [<c10aef47>] trace+0x57/0xa0
> > [ 1338.554447]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> > [ 1338.554452]  [<c10b0972>] T.999+0x172/0x1a0
> > [ 1338.554455]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.554459]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.554464]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> > [ 1338.554468]  [<c10c08b5>] ? path_put+0x15/0x20
> > [ 1338.554472]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.554476]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> > [ 1338.554481]  [<c10c323f>] path_openat+0x1f/0x320
> > [ 1338.554485]  [<c10a0a4e>] ? __access_remote_vm+0x19e/0x1d0
> > [ 1338.554490]  [<c10c3620>] do_filp_open+0x30/0x80
> > [ 1338.554495]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> > [ 1338.554500]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> > [ 1338.554505]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> > [ 1338.554509]  [<c10c1731>] ? getname_flags+0x61/0xe0
> > [ 1338.554514]  [<c10b781d>] do_sys_open+0xed/0x1e0
> > [ 1338.554519]  [<c10b7979>] sys_open+0x29/0x40
> > [ 1338.554524]  [<c1391390>] sysenter_do_call+0x12/0x26
> > [ 1338.556764] TRACE kmalloc-128 alloc 0xc294ff80 inuse=31 fp=0xc294ff80
> > [ 1338.556774] Pid: 1332, comm: bash Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> > [ 1338.556779] Call Trace:
> > [ 1338.556794]  [<c10aef47>] trace+0x57/0xa0
> > [ 1338.556802]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> > [ 1338.556807]  [<c10b0972>] T.999+0x172/0x1a0
> > [ 1338.556812]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.556817]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.556821]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> > [ 1338.556826]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > [ 1338.556830]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> > [ 1338.556841]  [<c121fca8>] ? tty_ldisc_deref+0x8/0x10
> > [ 1338.556849]  [<c10c323f>] path_openat+0x1f/0x320
> > [ 1338.556857]  [<c11e2b3e>] ? fbcon_cursor+0xfe/0x180
> > [ 1338.556863]  [<c10c3620>] do_filp_open+0x30/0x80
> > [ 1338.556868]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> > [ 1338.556873]  [<c10c5e8e>] ? do_vfs_ioctl+0x7e/0x580
> > [ 1338.556878]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> > [ 1338.556886]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> > [ 1338.556891]  [<c10c1731>] ? getname_flags+0x61/0xe0
> > [ 1338.556898]  [<c10b781d>] do_sys_open+0xed/0x1e0
> > [ 1338.556903]  [<c10b7979>] sys_open+0x29/0x40
> > [ 1338.556913]  [<c1391390>] sysenter_do_call+0x12/0x26
> >
> > Collectd is system monitoring daemon that counts processes, memory
> > usage an much more, reading lots of files under /proc every 10
> > seconds.
> > Maybe it opens a process related file at a racy moment and thus
> > prevents the 128 slabs and kernel stacks from being released?
> >
> > Replaying the scenario I'm at:
> > Slab:              43112 kB
> > SReclaimable:      25396 kB
> > SUnreclaim:        17716 kB
> > KernelStack:       16432 kB
> > PageTables:         1320 kB
> >
> > with
> > kmalloc-256           55     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
> > kmalloc-128        66656  66656    128   32    1 : tunables    0    0    0 : slabdata   2083   2083      0
> > kmalloc-64          3902   3904     64   64    1 : tunables    0    0    0 : slabdata     61     61      0
> >
> > (and compiling process tree now SIGSTOPped in order to have system
> > not starve immediately so I can look around for information)
> >
> > If I resume one of the compiling process trees both KernelStack and
> > slab (kmalloc-128) usage increase quite quickly (and seems to never
> > get down anymore) - probably at same rate as processes get born (no
> > matter when they end).
>
> Looks like it might be a leak in VFS. You could try kmemleak to narrow
> it down some more. See Documentation/kmemleak.txt for details.

Hm, it seems not to be willing to let me run kmemleak... each time I
put on my load scenario I get "BUG: unable to handle kernel " on the
console as a last breath from the system (the rest of the trace never
shows up).

Going to try harder to get at least a complete trace...

Bruno

> Pekka

2011-04-25 11:42:00

by Bruno Prémont

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Bruno Prémont wrote:
> On Mon, 25 April 2011 Pekka Enberg wrote:
> > On Mon, Apr 25, 2011 at 12:17 PM, Bruno Prémont wrote:
> > > On Mon, 25 April 2011 Mike Frysinger wrote:
> > >> On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
> > >> >> On Sun, 24 April 2011 Bruno Prémont wrote:
> > >> >> > On an older system I've been running Gentoo's revdep-rebuild to check
> > >> >> > for system linking/*.la consistency and after doing most of the work the
> > >> >> > system starved more or less, just complaining about stuck tasks now and
> > >> >> > then.
> > >> >> > Memory usage graph as seen from userspace showed sudden quick increase of
> > >> >> > memory usage though only a very few MB were swapped out (c.f. attached RRD
> > >> >> > graph).
> > >> >>
> > >> >> Seems I've hit it once again (though detected before system was fully
> > >> >> stalled by trying to reclaim memory without success).
> > >> >>
> > >> >> This time it was during simple compiling...
> > >> >> Gathered info below:
> > >> >>
> > >> >> /proc/meminfo:
> > >> >> MemTotal:         480660 kB
> > >> >> MemFree:           64948 kB
> > >> >> Buffers:           10304 kB
> > >> >> Cached:             6924 kB
> > >> >> SwapCached:         4220 kB
> > >> >> Active:            11100 kB
> > >> >> Inactive:          15732 kB
> > >> >> Active(anon):       4732 kB
> > >> >> Inactive(anon):     4876 kB
> > >> >> Active(file):       6368 kB
> > >> >> Inactive(file):    10856 kB
> > >> >> Unevictable:          32 kB
> > >> >> Mlocked:              32 kB
> > >> >> SwapTotal:        524284 kB
> > >> >> SwapFree:         456432 kB
> > >> >> Dirty:                80 kB
> > >> >> Writeback:             0 kB
> > >> >> AnonPages:          6268 kB
> > >> >> Mapped:             2604 kB
> > >> >> Shmem:                 4 kB
> > >> >> Slab:             250632 kB
> > >> >> SReclaimable:      51144 kB
> > >> >> SUnreclaim:       199488 kB   <--- look big as well...
> > >> >> KernelStack:      131032 kB   <--- what???
> > >> >
> > >> > KernelStack is used 8K bytes per thread. then, your system should have
> > >> > 16000 threads. but your ps only showed about 80 processes.
> > >> > Hmm... stack leak?
> > >>
> > >> i might have a similar report for 2.6.39-rc4 (seems to be working fine
> > >> in 2.6.38.4), but for embedded Blackfin systems running gdbserver
> > >> processes over and over (so lots of short lived forks)
> > >>
> > >> i wonder if you have a lot of zombies or otherwise unclaimed resources
> > >> ?  does `ps aux` show anything unusual ?
> > >
> > > I've not seen anything special (no big amount of threads behind my about 80
> > > processes, even after kernel oom-killed nearly all processes the hogged
> > > memory has not been freed. And no, there are no zombies around).
> > >
> > > Here it seems to happened when I run 2 intensive tasks in parallel, e.g.
> > > (re)emerging gimp and running revdep-rebuild -pi in another terminal.
> > > This produces a fork rate of about 100-300 per second.
> > >
> > > Suddenly kmalloc-128 slabs stop being freed and things degrade.
> > >
> > > Trying to trace some of the kmalloc-128 slab allocations I end up seeing
> > > lots of allocations like this:
> > >
> > > [ 1338.554429] TRACE kmalloc-128 alloc 0xc294ff00 inuse=30 fp=0xc294ff00
> > > [ 1338.554434] Pid: 1573, comm: collectd Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> > > [ 1338.554437] Call Trace:
> > > [ 1338.554442]  [<c10aef47>] trace+0x57/0xa0
> > > [ 1338.554447]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> > > [ 1338.554452]  [<c10b0972>] T.999+0x172/0x1a0
> > > [ 1338.554455]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.554459]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.554464]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> > > [ 1338.554468]  [<c10c08b5>] ? path_put+0x15/0x20
> > > [ 1338.554472]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.554476]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> > > [ 1338.554481]  [<c10c323f>] path_openat+0x1f/0x320
> > > [ 1338.554485]  [<c10a0a4e>] ? __access_remote_vm+0x19e/0x1d0
> > > [ 1338.554490]  [<c10c3620>] do_filp_open+0x30/0x80
> > > [ 1338.554495]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> > > [ 1338.554500]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> > > [ 1338.554505]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> > > [ 1338.554509]  [<c10c1731>] ? getname_flags+0x61/0xe0
> > > [ 1338.554514]  [<c10b781d>] do_sys_open+0xed/0x1e0
> > > [ 1338.554519]  [<c10b7979>] sys_open+0x29/0x40
> > > [ 1338.554524]  [<c1391390>] sysenter_do_call+0x12/0x26
> > > [ 1338.556764] TRACE kmalloc-128 alloc 0xc294ff80 inuse=31 fp=0xc294ff80
> > > [ 1338.556774] Pid: 1332, comm: bash Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
> > > [ 1338.556779] Call Trace:
> > > [ 1338.556794]  [<c10aef47>] trace+0x57/0xa0
> > > [ 1338.556802]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
> > > [ 1338.556807]  [<c10b0972>] T.999+0x172/0x1a0
> > > [ 1338.556812]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.556817]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.556821]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
> > > [ 1338.556826]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
> > > [ 1338.556830]  [<c10b95d8>] get_empty_filp+0x58/0xc0
> > > [ 1338.556841]  [<c121fca8>] ? tty_ldisc_deref+0x8/0x10
> > > [ 1338.556849]  [<c10c323f>] path_openat+0x1f/0x320
> > > [ 1338.556857]  [<c11e2b3e>] ? fbcon_cursor+0xfe/0x180
> > > [ 1338.556863]  [<c10c3620>] do_filp_open+0x30/0x80
> > > [ 1338.556868]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
> > > [ 1338.556873]  [<c10c5e8e>] ? do_vfs_ioctl+0x7e/0x580
> > > [ 1338.556878]  [<c10c16f8>] ? getname_flags+0x28/0xe0
> > > [ 1338.556886]  [<c10cd522>] ? alloc_fd+0x62/0xe0
> > > [ 1338.556891]  [<c10c1731>] ? getname_flags+0x61/0xe0
> > > [ 1338.556898]  [<c10b781d>] do_sys_open+0xed/0x1e0
> > > [ 1338.556903]  [<c10b7979>] sys_open+0x29/0x40
> > > [ 1338.556913]  [<c1391390>] sysenter_do_call+0x12/0x26
> > >
> > > Collectd is system monitoring daemon that counts processes, memory
> > > usage an much more, reading lots of files under /proc every 10
> > > seconds.
> > > Maybe it opens a process related file at a racy moment and thus
> > > prevents the 128 slabs and kernel stacks from being released?
> > >
> > > Replaying the scenario I'm at:
> > > Slab:              43112 kB
> > > SReclaimable:      25396 kB
> > > SUnreclaim:        17716 kB
> > > KernelStack:       16432 kB
> > > PageTables:         1320 kB
> > >
> > > with
> > > kmalloc-256           55     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
> > > kmalloc-128        66656  66656    128   32    1 : tunables    0    0    0 : slabdata   2083   2083      0
> > > kmalloc-64          3902   3904     64   64    1 : tunables    0    0    0 : slabdata     61     61      0
> > >
> > > (and compiling process tree now SIGSTOPped in order to have system
> > > not starve immediately so I can look around for information)
> > >
> > > If I resume one of the compiling process trees both KernelStack and
> > > slab (kmalloc-128) usage increase quite quickly (and seems to never
> > > get down anymore) - probably at same rate as processes get born (no
> > > matter when they end).
> >
> > Looks like it might be a leak in VFS. You could try kmemleak to narrow
> > it down some more. See Documentation/kmemleak.txt for details.
>
> Hm, seems not to be willing to let me run kmemleak... each time I put
> on my load scenario I get "BUG: unable to handle kernel " on console
> as a last breath from the system. (the rest of the trace never shows up)
>
> Going to try harder to get at least a complete trace...

After many attempts I got something from kmemleak (running on VESAfb
instead of vgacon or nouveau KMS), netconsole disabled.
For the crashes my screen is just too small to display the interesting
part of the trace (maybe I can get it via serial console on a later
attempt).

What kmemleak finds looks very repetitive:
unreferenced object 0xdd294540 (size 328):
  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
  hex dump (first 32 bytes):
    40 57 d2 dc 00 00 00 00 00 00 00 00 00 00 00 00  @W..............
    00 00 00 00 00 00 00 00 6d 41 00 00 00 00 00 00  ........mA......
  backtrace:
    [<c138aae7>] kmemleak_alloc+0x27/0x50
    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
    [<c10f452e>] proc_alloc_inode+0x1e/0x90
    [<c10cd0ec>] alloc_inode+0x1c/0x80
    [<c10cd162>] new_inode+0x12/0x40
    [<c10f54bc>] proc_pid_make_inode+0xc/0xa0
    [<c10f6835>] proc_pident_instantiate+0x15/0xa0
    [<c10f693d>] proc_pident_lookup+0x7d/0xb0
    [<c10f69a7>] proc_tgid_base_lookup+0x17/0x20
    [<c10c1f52>] d_alloc_and_lookup+0x32/0x60
    [<c10c23b4>] do_lookup+0xa4/0x250
    [<c10c3653>] do_last+0xe3/0x700
    [<c10c4882>] path_openat+0x92/0x320
    [<c10c4bf0>] do_filp_open+0x30/0x80
    [<c10b8ded>] do_sys_open+0xed/0x1e0
    [<c10b8f49>] sys_open+0x29/0x40
unreferenced object 0xdd0fa180 (size 128):
  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
  hex dump (first 32 bytes):
    1c c0 00 00 04 00 00 00 00 00 00 00 00 02 20 00  .............. .
    00 5e 24 dd 65 f6 12 00 03 00 00 00 a4 a1 0f dd  .^$.e...........
  backtrace:
    [<c138aae7>] kmemleak_alloc+0x27/0x50
    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
    [<c10cb95e>] d_alloc+0x1e/0x180
    [<c10f5027>] proc_fill_cache+0xd7/0x140
    [<c10f7b27>] proc_task_readdir+0x237/0x300
    [<c10c7cf4>] vfs_readdir+0x84/0xa0
    [<c10c7d74>] sys_getdents64+0x64/0xb0
    [<c13945d0>] sysenter_do_call+0x12/0x26
    [<ffffffff>] 0xffffffff
unreferenced object 0xdd294690 (size 328):
  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
  hex dump (first 32 bytes):
    40 57 d2 dc 00 00 00 00 00 00 00 00 00 00 00 00  @W..............
    00 00 00 00 00 00 00 00 6d 41 00 00 00 00 00 00  ........mA......
  backtrace:
    [<c138aae7>] kmemleak_alloc+0x27/0x50
    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
    [<c10f452e>] proc_alloc_inode+0x1e/0x90
    [<c10cd0ec>] alloc_inode+0x1c/0x80
    [<c10cd162>] new_inode+0x12/0x40
    [<c10f54bc>] proc_pid_make_inode+0xc/0xa0
    [<c10f6791>] proc_task_instantiate+0x11/0xa0
    [<c10f506d>] proc_fill_cache+0x11d/0x140
    [<c10f7b27>] proc_task_readdir+0x237/0x300
    [<c10c7cf4>] vfs_readdir+0x84/0xa0
    [<c10c7d74>] sys_getdents64+0x64/0xb0
    [<c13945d0>] sysenter_do_call+0x12/0x26
    [<ffffffff>] 0xffffffff
unreferenced object 0xdd22df80 (size 128):
  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
  hex dump (first 32 bytes):
    1c c0 00 00 04 00 00 00 00 00 00 00 00 02 20 00  .............. .
    80 2c 13 dd 23 c5 6f d6 06 00 00 00 a4 df 22 dd  .,..#.o.......".
  backtrace:
    [<c138aae7>] kmemleak_alloc+0x27/0x50
    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
    [<c10cb95e>] d_alloc+0x1e/0x180
    [<c10c1f40>] d_alloc_and_lookup+0x20/0x60
    [<c10c23b4>] do_lookup+0xa4/0x250
    [<c10c3653>] do_last+0xe3/0x700
    [<c10c4882>] path_openat+0x92/0x320
    [<c10c4bf0>] do_filp_open+0x30/0x80
    [<c10b8ded>] do_sys_open+0xed/0x1e0
    [<c10b8f49>] sys_open+0x29/0x40
    [<c13945d0>] sysenter_do_call+0x12/0x26
    [<ffffffff>] 0xffffffff
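
The two leaked sizes pair up: the 328-byte objects are proc inodes
(proc_alloc_inode) and the 128-byte ones the matching dentries
(d_alloc), all instantiated from collectd's lookup/readdir of
/proc/<pid> and /proc/<pid>/task. A userspace loop along these lines
(a hypothetical sketch, not collectd itself) should hammer exactly the
code paths in those backtraces:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	for (;;) {
		DIR *proc = opendir("/proc");
		struct dirent *de;
		if (!proc)
			return 1;
		while ((de = readdir(proc)) != NULL) {
			char path[280];
			DIR *task;
			if (!isdigit((unsigned char)de->d_name[0]))
				continue;
			snprintf(path, sizeof(path), "/proc/%s/task",
			         de->d_name);
			/* walks proc_task_readdir()/proc_fill_cache() */
			task = opendir(path);
			if (!task)
				continue;
			while (readdir(task) != NULL)
				;
			closedir(task);
		}
		closedir(proc);
	}
}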

All I could fetch from the system (300k compressed, expanding to ~16MB;
covers some portion of the announced ~6k entries):
http://homepage.internet.lu/BrunoP/jupiter.kmemleak.bz2

Bruno

2011-04-25 11:47:57

by Pekka Enberg

Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 2:41 PM, Bruno Prémont
<[email protected]> wrote:
> On Mon, 25 April 2011 Bruno Prémont wrote:
>> On Mon, 25 April 2011 Pekka Enberg wrote:
>> > On Mon, Apr 25, 2011 at 12:17 PM, Bruno Prémont wrote:
>> > > On Mon, 25 April 2011 Mike Frysinger wrote:
>> > >> On Sun, Apr 24, 2011 at 22:42, KOSAKI Motohiro wrote:
>> > >> >> On Sun, 24 April 2011 Bruno Prémont wrote:
>> > >> >> > On an older system I've been running Gentoo's revdep-rebuild to check
>> > >> >> > for system linking/*.la consistency and after doing most of the work the
>> > >> >> > system starved more or less, just complaining about stuck tasks now and
>> > >> >> > then.
>> > >> >> > Memory usage graph as seen from userspace showed sudden quick increase of
>> > >> >> > memory usage though only a very few MB were swapped out (c.f. attached RRD
>> > >> >> > graph).
>> > >> >>
>> > >> >> Seems I've hit it once again (though detected before system was fully
>> > >> >> stalled by trying to reclaim memory without success).
>> > >> >>
>> > >> >> This time it was during simple compiling...
>> > >> >> Gathered info below:
>> > >> >>
>> > >> >> /proc/meminfo:
>> > >> >> MemTotal:         480660 kB
>> > >> >> MemFree:           64948 kB
>> > >> >> Buffers:           10304 kB
>> > >> >> Cached:             6924 kB
>> > >> >> SwapCached:         4220 kB
>> > >> >> Active:            11100 kB
>> > >> >> Inactive:          15732 kB
>> > >> >> Active(anon):       4732 kB
>> > >> >> Inactive(anon):     4876 kB
>> > >> >> Active(file):       6368 kB
>> > >> >> Inactive(file):    10856 kB
>> > >> >> Unevictable:          32 kB
>> > >> >> Mlocked:              32 kB
>> > >> >> SwapTotal:        524284 kB
>> > >> >> SwapFree:         456432 kB
>> > >> >> Dirty:                80 kB
>> > >> >> Writeback:             0 kB
>> > >> >> AnonPages:          6268 kB
>> > >> >> Mapped:             2604 kB
>> > >> >> Shmem:                 4 kB
>> > >> >> Slab:             250632 kB
>> > >> >> SReclaimable:      51144 kB
>> > >> >> SUnreclaim:       199488 kB   <--- look big as well...
>> > >> >> KernelStack:      131032 kB   <--- what???
>> > >> >
>> > >> > KernelStack uses 8K bytes per thread, so your system should have
>> > >> > 16000 threads. But your ps only showed about 80 processes.
>> > >> > Hmm... stack leak?
>> > >>
>> > >> i might have a similar report for 2.6.39-rc4 (seems to be working fine
>> > >> in 2.6.38.4), but for embedded Blackfin systems running gdbserver
>> > >> processes over and over (so lots of short lived forks)
>> > >>
>> > >> i wonder if you have a lot of zombies or otherwise unclaimed resources ?
>> > >> does `ps aux` show anything unusual ?
>> > >
>> > > I've not seen anything special (no big amount of threads behind my roughly 80
>> > > processes; even after the kernel oom-killed nearly all processes, the hogged
>> > > memory was not freed. And no, there are no zombies around).
>> > >
>> > > Here it seems to happen when I run 2 intensive tasks in parallel, e.g.
>> > > (re)emerging gimp and running revdep-rebuild -pi in another terminal.
>> > > This produces a fork rate of about 100-300 per second.
>> > >
>> > > Suddenly kmalloc-128 slabs stop being freed and things degrade.
>> > >
>> > > Trying to trace some of the kmalloc-128 slab allocations I end up seeing
>> > > lots of allocations like this:
>> > >
>> > > [ 1338.554429] TRACE kmalloc-128 alloc 0xc294ff00 inuse=30 fp=0xc294ff00
>> > > [ 1338.554434] Pid: 1573, comm: collectd Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
>> > > [ 1338.554437] Call Trace:
>> > > [ 1338.554442]  [<c10aef47>] trace+0x57/0xa0
>> > > [ 1338.554447]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
>> > > [ 1338.554452]  [<c10b0972>] T.999+0x172/0x1a0
>> > > [ 1338.554455]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.554459]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.554464]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
>> > > [ 1338.554468]  [<c10c08b5>] ? path_put+0x15/0x20
>> > > [ 1338.554472]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.554476]  [<c10b95d8>] get_empty_filp+0x58/0xc0
>> > > [ 1338.554481]  [<c10c323f>] path_openat+0x1f/0x320
>> > > [ 1338.554485]  [<c10a0a4e>] ? __access_remote_vm+0x19e/0x1d0
>> > > [ 1338.554490]  [<c10c3620>] do_filp_open+0x30/0x80
>> > > [ 1338.554495]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
>> > > [ 1338.554500]  [<c10c16f8>] ? getname_flags+0x28/0xe0
>> > > [ 1338.554505]  [<c10cd522>] ? alloc_fd+0x62/0xe0
>> > > [ 1338.554509]  [<c10c1731>] ? getname_flags+0x61/0xe0
>> > > [ 1338.554514]  [<c10b781d>] do_sys_open+0xed/0x1e0
>> > > [ 1338.554519]  [<c10b7979>] sys_open+0x29/0x40
>> > > [ 1338.554524]  [<c1391390>] sysenter_do_call+0x12/0x26
>> > > [ 1338.556764] TRACE kmalloc-128 alloc 0xc294ff80 inuse=31 fp=0xc294ff80
>> > > [ 1338.556774] Pid: 1332, comm: bash Tainted: G        W   2.6.39-rc4-jupiter-00187-g686c4cb #1
>> > > [ 1338.556779] Call Trace:
>> > > [ 1338.556794]  [<c10aef47>] trace+0x57/0xa0
>> > > [ 1338.556802]  [<c10b07b3>] alloc_debug_processing+0xf3/0x140
>> > > [ 1338.556807]  [<c10b0972>] T.999+0x172/0x1a0
>> > > [ 1338.556812]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.556817]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.556821]  [<c10b0a52>] kmem_cache_alloc+0xb2/0x100
>> > > [ 1338.556826]  [<c10b95d8>] ? get_empty_filp+0x58/0xc0
>> > > [ 1338.556830]  [<c10b95d8>] get_empty_filp+0x58/0xc0
>> > > [ 1338.556841]  [<c121fca8>] ? tty_ldisc_deref+0x8/0x10
>> > > [ 1338.556849]  [<c10c323f>] path_openat+0x1f/0x320
>> > > [ 1338.556857]  [<c11e2b3e>] ? fbcon_cursor+0xfe/0x180
>> > > [ 1338.556863]  [<c10c3620>] do_filp_open+0x30/0x80
>> > > [ 1338.556868]  [<c10b0a30>] ? kmem_cache_alloc+0x90/0x100
>> > > [ 1338.556873]  [<c10c5e8e>] ? do_vfs_ioctl+0x7e/0x580
>> > > [ 1338.556878]  [<c10c16f8>] ? getname_flags+0x28/0xe0
>> > > [ 1338.556886]  [<c10cd522>] ? alloc_fd+0x62/0xe0
>> > > [ 1338.556891]  [<c10c1731>] ? getname_flags+0x61/0xe0
>> > > [ 1338.556898]  [<c10b781d>] do_sys_open+0xed/0x1e0
>> > > [ 1338.556903]  [<c10b7979>] sys_open+0x29/0x40
>> > > [ 1338.556913]  [<c1391390>] sysenter_do_call+0x12/0x26
>> > >
>> > > Collectd is a system monitoring daemon that counts processes, memory
>> > > usage and much more, reading lots of files under /proc every 10
>> > > seconds.
>> > > Maybe it opens a process-related file at a racy moment and thus
>> > > prevents the kmalloc-128 slabs and kernel stacks from being released?
>> > >
>> > > Replaying the scenario I'm at:
>> > > Slab:              43112 kB
>> > > SReclaimable:      25396 kB
>> > > SUnreclaim:        17716 kB
>> > > KernelStack:       16432 kB
>> > > PageTables:         1320 kB
>> > >
>> > > with
>> > > kmalloc-256           55     64    256   16    1 : tunables    0    0    0 : slabdata      4      4      0
>> > > kmalloc-128        66656  66656    128   32    1 : tunables    0    0    0 : slabdata   2083   2083      0
>> > > kmalloc-64          3902   3904     64   64    1 : tunables    0    0    0 : slabdata     61     61      0
>> > >
>> > > (the compiling process tree is now SIGSTOPped so that the system does
>> > > not starve immediately and I can look around for information)
>> > >
>> > > If I resume one of the compiling process trees, both KernelStack and
>> > > slab (kmalloc-128) usage increase quite quickly (and seem to never
>> > > go down anymore) - probably at the same rate as processes get born (no
>> > > matter when they end).
>> >
>> > Looks like it might be a leak in VFS. You could try kmemleak to narrow
>> > it down some more. See Documentation/kmemleak.txt for details.
>>
>> Hm, seems not to be willing to let me run kmemleak... each time I put
>> on my load scenario I get "BUG: unable to handle kernel " on console
>> as a last breath from the system. (the rest of the trace never shows up)
>>
>> Going to try harder to get at least a complete trace...
>
> After many attempts I got something from kmemleak (running on VESAfb
> instead of vgacon or nouveau KMS), netconsole disabled.
> As for the crashes, my screen is just too small to display the interesting
> part (maybe I can get it via serial console in a later attempt).
>
> What kmemleak finds does look very repetitive:
> unreferenced object 0xdd294540 (size 328):
>  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
>  hex dump (first 32 bytes):
>    40 57 d2 dc 00 00 00 00 00 00 00 00 00 00 00 00  @W..............
>    00 00 00 00 00 00 00 00 6d 41 00 00 00 00 00 00  ........mA......
>  backtrace:
>    [<c138aae7>] kmemleak_alloc+0x27/0x50
>    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
>    [<c10f452e>] proc_alloc_inode+0x1e/0x90
>    [<c10cd0ec>] alloc_inode+0x1c/0x80
>    [<c10cd162>] new_inode+0x12/0x40
>    [<c10f54bc>] proc_pid_make_inode+0xc/0xa0
>    [<c10f6835>] proc_pident_instantiate+0x15/0xa0
>    [<c10f693d>] proc_pident_lookup+0x7d/0xb0
>    [<c10f69a7>] proc_tgid_base_lookup+0x17/0x20
>    [<c10c1f52>] d_alloc_and_lookup+0x32/0x60
>    [<c10c23b4>] do_lookup+0xa4/0x250
>    [<c10c3653>] do_last+0xe3/0x700
>    [<c10c4882>] path_openat+0x92/0x320
>    [<c10c4bf0>] do_filp_open+0x30/0x80
>    [<c10b8ded>] do_sys_open+0xed/0x1e0
>    [<c10b8f49>] sys_open+0x29/0x40
> unreferenced object 0xdd0fa180 (size 128):
>  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
>  hex dump (first 32 bytes):
>    1c c0 00 00 04 00 00 00 00 00 00 00 00 02 20 00  .............. .
>    00 5e 24 dd 65 f6 12 00 03 00 00 00 a4 a1 0f dd  .^$.e...........
>  backtrace:
>    [<c138aae7>] kmemleak_alloc+0x27/0x50
>    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
>    [<c10cb95e>] d_alloc+0x1e/0x180
>    [<c10f5027>] proc_fill_cache+0xd7/0x140
>    [<c10f7b27>] proc_task_readdir+0x237/0x300
>    [<c10c7cf4>] vfs_readdir+0x84/0xa0
>    [<c10c7d74>] sys_getdents64+0x64/0xb0
>    [<c13945d0>] sysenter_do_call+0x12/0x26
>    [<ffffffff>] 0xffffffff
> unreferenced object 0xdd294690 (size 328):
>  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
>  hex dump (first 32 bytes):
>    40 57 d2 dc 00 00 00 00 00 00 00 00 00 00 00 00  @W..............
>    00 00 00 00 00 00 00 00 6d 41 00 00 00 00 00 00  ........mA......
>  backtrace:
>    [<c138aae7>] kmemleak_alloc+0x27/0x50
>    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
>    [<c10f452e>] proc_alloc_inode+0x1e/0x90
>    [<c10cd0ec>] alloc_inode+0x1c/0x80
>    [<c10cd162>] new_inode+0x12/0x40
>    [<c10f54bc>] proc_pid_make_inode+0xc/0xa0
>    [<c10f6791>] proc_task_instantiate+0x11/0xa0
>    [<c10f506d>] proc_fill_cache+0x11d/0x140
>    [<c10f7b27>] proc_task_readdir+0x237/0x300
>    [<c10c7cf4>] vfs_readdir+0x84/0xa0
>    [<c10c7d74>] sys_getdents64+0x64/0xb0
>    [<c13945d0>] sysenter_do_call+0x12/0x26
>    [<ffffffff>] 0xffffffff
> unreferenced object 0xdd22df80 (size 128):
>  comm "collectd", pid 1541, jiffies 4294940278 (age 699.510s)
>  hex dump (first 32 bytes):
>    1c c0 00 00 04 00 00 00 00 00 00 00 00 02 20 00  .............. .
>    80 2c 13 dd 23 c5 6f d6 06 00 00 00 a4 df 22 dd  .,..#.o.......".
>  backtrace:
>    [<c138aae7>] kmemleak_alloc+0x27/0x50
>    [<c10b0b28>] kmem_cache_alloc+0x88/0x120
>    [<c10cb95e>] d_alloc+0x1e/0x180
>    [<c10c1f40>] d_alloc_and_lookup+0x20/0x60
>    [<c10c23b4>] do_lookup+0xa4/0x250
>    [<c10c3653>] do_last+0xe3/0x700
>    [<c10c4882>] path_openat+0x92/0x320
>    [<c10c4bf0>] do_filp_open+0x30/0x80
>    [<c10b8ded>] do_sys_open+0xed/0x1e0
>    [<c10b8f49>] sys_open+0x29/0x40
>    [<c13945d0>] sysenter_do_call+0x12/0x26
>    [<ffffffff>] 0xffffffff
>
> All I could fetch from the system (300k compressed, expanding to ~16MB,
> covering some portion of the announced 6k entries):
> ?http://homepage.internet.lu/BrunoP/jupiter.kmemleak.bz2

VFS and procfs are all over the traces - I'm adding some more people
to CC. Btw, did you manage to grab any kmemleak-related crashes? It
would be good to get them fixed as well.

Pekka

2011-04-25 12:11:29

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Pekka Enberg wrote:
> On Mon, Apr 25, 2011 at 2:41 PM, Bruno Prémont wrote:
> >> Hm, seems not to be willing to let me run kmemleak... each time I put
> >> on my load scenario I get "BUG: unable to handle kernel " on console
> >> as a last breath from the system. (the rest of the trace never shows up)
> >>
> >> Going to try harder to get at least a complete trace...
> >
> > After many attempts I got something from kmemleak (running on VESAfb
> > instead of vgacon or nouveau KMS), netconsole disabled.
> > As for the crashes, my screen is just too small to display the interesting
> > part (maybe I can get it via serial console in a later attempt).
> >

...

> Btw, did you manage to grab any kmemleak-related crashes? It
> would be good to get them fixed as well.

(after plugging in a serial cable and hooking it up to minicom)
With the serial console I got the crash (unless more are waiting behind it):

[ 290.477295] cc1 used greatest stack depth: 4972 bytes left
[ 304.476261] cc1plus used greatest stack depth: 4916 bytes left
[ 314.573703] BUG: unable to handle kernel NULL pointer dereference at 00000001
[ 314.580013] IP: [<c10b0aea>] kmem_cache_alloc+0x4a/0x120
[ 314.580013] *pde = 00000000
[ 314.580013] Oops: 0000 [#1]
[ 314.580013] last sysfs file: /sys/devices/platform/w83627hf.656/temp3_input
[ 314.580013] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd pcspkr snd_page_alloc
[ 314.580013]
[ 314.580013] Pid: 2119, comm: configure Tainted: G W 2.6.39-rc4-jupiter-00187-g686c4cb #3 NVIDIA Corporation. nFORCE-MCP/MS-6373
[ 314.580013] EIP: 0060:[<c10b0aea>] EFLAGS: 00210246 CPU: 0
[ 314.580013] EIP is at kmem_cache_alloc+0x4a/0x120
[ 314.580013] EAX: ddc25718 EBX: dd406100 ECX: c10b75f9 EDX: 00000000
[ 314.580013] ESI: 00000001 EDI: 000112d0 EBP: db1ebe34 ESP: db1ebe08
[ 314.580013] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 314.580013] Process configure (pid: 2119, ti=db1ea000 task=db144d00 task.ti=db1ea000)
[ 314.580013] Stack:
[ 314.580013] dc688510 c6df1690 c6df16a4 c10b75f9 db1ebe4c 00200286 00000000 001aa464
[ 314.580013] 000000d0 00000001 db31a738 db1ebe68 c10b75f9 00000000 000000d0 dc688510
[ 314.580013] 00000010 db1ebe5c c138aae7 000000d0 dd406280 000000d0 db31a738 000000d0
[ 314.580013] Call Trace:
[ 314.580013] [<c10b75f9>] ? create_object+0x29/0x210
[ 314.580013] [<c10b75f9>] create_object+0x29/0x210
[ 314.580013] [<c138aae7>] ? kmemleak_alloc+0x27/0x50
[ 314.580013] [<c138aae7>] kmemleak_alloc+0x27/0x50
[ 314.580013] [<c10b0b28>] kmem_cache_alloc+0x88/0x120
[ 314.580013] [<c10a60a0>] ? anon_vma_fork+0x50/0xe0
[ 314.580013] [<c10a6022>] ? anon_vma_clone+0x82/0xb0
[ 314.580013] [<c10a60a0>] anon_vma_fork+0x50/0xe0
[ 314.580013] [<c102c411>] dup_mm+0x1d1/0x440
[ 314.580013] [<c102d11d>] copy_process+0x98d/0xcc0
[ 314.580013] [<c102d4a7>] do_fork+0x57/0x2e0
[ 314.580013] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 314.580013] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 314.580013] [<c1008b6f>] sys_clone+0x2f/0x40
[ 314.580013] [<c139469d>] ptregs_clone+0x15/0x38
[ 314.580013] [<c13945d0>] ? sysenter_do_call+0x12/0x26
[ 314.580013] Code: 0f 85 8b 00 00 00 8b 03 8b 50 04 89 55 f0 8b 30 85 f6 0f 84 97 00 00 00 8b 03 8b 10 39 d6 75 e8 8b 50 04 39 55 f0 75 e0 8b 53 14 <8b> 14 16 89 10 8b 55 f0 8b 03 42 89 50 04 85 f6 89 f8 0f 95 c2
[ 314.580013] EIP: [<c10b0aea>] kmem_cache_alloc+0x4a/0x120 SS:ESP 0068:db1ebe08
[ 314.580013] CR2: 0000000000000001
[ 315.060947] BUG: unable to handle kernel NULL pointer dereference at 00000001
[ 315.070927] IP: [<c10b0aea>] kmem_cache_alloc+0x4a/0x120
[ 315.070927] *pde = 00000000
[ 315.070927] Oops: 0000 [#2]
[ 315.070927] last sysfs file: /sys/devices/platform/w83627hf.656/temp3_input
[ 315.070927] Modules linked in: squashfs zlib_inflate nfs lockd nfs_acl sunrpc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd pcspkr snd_page_alloc
[ 315.070927]
[ 315.070927] Pid: 2119, comm: configure Tainted: G D W 2.6.39-rc4-jupiter-00187-g686c4cb #3 NVIDIA Corporation. nFORCE-MCP/MS-6373
[ 315.070927] EIP: 0060:[<c10b0aea>] EFLAGS: 00210046 CPU: 0
[ 315.070927] EIP is at kmem_cache_alloc+0x4a/0x120
[ 315.070927] EAX: ddc25718 EBX: dd406100 ECX: c10b75f9 EDX: 00000000
[ 315.070927] ESI: 00000001 EDI: 00011220 EBP: db1ebad0 ESP: db1ebaa4
[ 315.070927] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 315.070927] Process configure (pid: 2119, ti=db1ea000 task=db144d00 task.ti=db1ea000)
[ 315.070927] Stack:
[ 315.070927] 1d424de4 00000048 0060a459 00000000 49e0f2ff 00000007 0000000e 001aa464
[ 315.070927] 00000020 00000001 dc5d8630 db1ebb04 c10b75f9 00a8b6a4 00000000 69e595ce
[ 315.070927] 00000090 db144d00 db4248a0 db1ebb14 c1025d9f 00000020 dc5d8630 00000020
[ 315.070927] Call Trace:
[ 315.070927] [<c10b75f9>] create_object+0x29/0x210
[ 315.070927] [<c1025d9f>] ? check_preempt_wakeup+0xcf/0x160
[ 315.070927] [<c138aae7>] kmemleak_alloc+0x27/0x50
[ 315.070927] [<c10b0b28>] kmem_cache_alloc+0x88/0x120
[ 315.070927] [<c103c755>] __sigqueue_alloc+0x45/0xc0
[ 315.070927] [<c103d4cd>] T.792+0x9d/0x290
[ 315.070927] [<c103e234>] do_send_sig_info+0x44/0x60
[ 315.070927] [<c103e53a>] group_send_sig_info+0x3a/0x50
[ 315.070927] [<c103e60f>] kill_pid_info+0x2f/0x50
[ 315.070927] [<c1031843>] it_real_fn+0x33/0x80
[ 315.070927] [<c1031810>] ? alarm_setitimer+0x60/0x60
[ 315.070927] [<c104b1c4>] __run_hrtimer+0x64/0x1a0
[ 315.070927] [<c10503c5>] ? ktime_get+0x55/0xf0
[ 315.070927] [<c104b555>] hrtimer_interrupt+0x115/0x250
[ 315.070927] [<c104d415>] ? sched_clock_cpu+0x95/0x110
[ 315.070927] [<c10187a1>] smp_apic_timer_interrupt+0x41/0x80
[ 315.070927] [<c139417e>] apic_timer_interrupt+0x2a/0x30
[ 315.070927] [<c100519a>] ? oops_end+0x4a/0xb0
[ 315.070927] [<c101eb6e>] no_context+0xbe/0x150
[ 315.070927] [<c101ec8f>] __bad_area_nosemaphore+0x8f/0x130
[ 315.070927] [<c108b81d>] ? __alloc_pages_nodemask+0xdd/0x730
[ 315.070927] [<c105b222>] ? search_module_extables+0x62/0x80
[ 315.070927] [<c101ed42>] bad_area_nosemaphore+0x12/0x20
[ 315.070927] [<c101f2d1>] do_page_fault+0x2f1/0x3d0
[ 315.070927] [<c105a57b>] ? __module_text_address+0xb/0x50
[ 315.070927] [<c105a5c8>] ? is_module_text_address+0x8/0x10
[ 315.070927] [<c1045207>] ? __kernel_text_address+0x47/0x70
[ 315.070927] [<c1005441>] ? print_context_stack+0x41/0xb0
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c139436c>] error_code+0x58/0x60
[ 315.070927] [<c10b75f9>] ? create_object+0x29/0x210
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c10b0aea>] ? kmem_cache_alloc+0x4a/0x120
[ 315.070927] [<c10b75f9>] ? create_object+0x29/0x210
[ 315.070927] [<c10b75f9>] create_object+0x29/0x210
[ 315.070927] [<c138aae7>] ? kmemleak_alloc+0x27/0x50
[ 315.070927] [<c138aae7>] kmemleak_alloc+0x27/0x50
[ 315.070927] [<c10b0b28>] kmem_cache_alloc+0x88/0x120
[ 315.070927] [<c10a60a0>] ? anon_vma_fork+0x50/0xe0
[ 315.070927] [<c10a6022>] ? anon_vma_clone+0x82/0xb0
[ 315.070927] [<c10a60a0>] anon_vma_fork+0x50/0xe0
[ 315.070927] [<c102c411>] dup_mm+0x1d1/0x440
[ 315.070927] [<c102d11d>] copy_process+0x98d/0xcc0
[ 315.070927] [<c102d4a7>] do_fork+0x57/0x2e0
[ 315.070927] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 315.070927] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 315.070927] [<c1008b6f>] sys_clone+0x2f/0x40
[ 315.070927] [<c139469d>] ptregs_clone+0x15/0x38
[ 315.070927] [<c13945d0>] ? sysenter_do_call+0x12/0x26
[ 315.070927] Code: 0f 85 8b 00 00 00 8b 03 8b 50 04 89 55 f0 8b 30 85 f6 0f 84 97 00 00 00 8b 03 8b 10 39 d6 75 e8 8b 50 04 39 55 f0 75 e0 8b 53 14 <8b> 14 16 89 10 8b 55 f0 8b 03 42 89 50 04 85 f6 89 f8 0f 95 c2
[ 315.070927] EIP: [<c10b0aea>] kmem_cache_alloc+0x4a/0x120 SS:ESP 0068:db1ebaa4
[ 315.070927] CR2: 0000000000000001
[ 315.070927] ---[ end trace 009f60096033f2b2 ]---
[ 315.070927] Kernel panic - not syncing: Fatal exception in interrupt
[ 315.070927] Pid: 2119, comm: configure Tainted: G D W 2.6.39-rc4-jupiter-00187-g686c4cb #3
[ 315.070927] Call Trace:
[ 315.070927] [<c139244c>] panic+0x57/0x14c
[ 315.070927] [<c10051fb>] oops_end+0xab/0xb0
[ 315.070927] [<c101eb6e>] no_context+0xbe/0x150
[ 315.070927] [<c101ec8f>] __bad_area_nosemaphore+0x8f/0x130
[ 315.070927] [<c101ed42>] bad_area_nosemaphore+0x12/0x20
[ 315.070927] [<c101f234>] do_page_fault+0x254/0x3d0
[ 315.070927] [<c11e7a52>] ? bit_putcs+0x2a2/0x430
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c139436c>] error_code+0x58/0x60
[ 315.070927] [<c10b75f9>] ? create_object+0x29/0x210
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c10b0aea>] ? kmem_cache_alloc+0x4a/0x120
[ 315.070927] [<c10b75f9>] create_object+0x29/0x210
[ 315.070927] [<c1025d9f>] ? check_preempt_wakeup+0xcf/0x160
[ 315.070927] [<c138aae7>] kmemleak_alloc+0x27/0x50
[ 315.070927] [<c10b0b28>] kmem_cache_alloc+0x88/0x120
[ 315.070927] [<c103c755>] __sigqueue_alloc+0x45/0xc0
[ 315.070927] [<c103d4cd>] T.792+0x9d/0x290
[ 315.070927] [<c103e234>] do_send_sig_info+0x44/0x60
[ 315.070927] [<c103e53a>] group_send_sig_info+0x3a/0x50
[ 315.070927] [<c103e60f>] kill_pid_info+0x2f/0x50
[ 315.070927] [<c1031843>] it_real_fn+0x33/0x80
[ 315.070927] [<c1031810>] ? alarm_setitimer+0x60/0x60
[ 315.070927] [<c104b1c4>] __run_hrtimer+0x64/0x1a0
[ 315.070927] [<c10503c5>] ? ktime_get+0x55/0xf0
[ 315.070927] [<c104b555>] hrtimer_interrupt+0x115/0x250
[ 315.070927] [<c104d415>] ? sched_clock_cpu+0x95/0x110
[ 315.070927] [<c10187a1>] smp_apic_timer_interrupt+0x41/0x80
[ 315.070927] [<c139417e>] apic_timer_interrupt+0x2a/0x30
[ 315.070927] [<c100519a>] ? oops_end+0x4a/0xb0
[ 315.070927] [<c101eb6e>] no_context+0xbe/0x150
[ 315.070927] [<c101ec8f>] __bad_area_nosemaphore+0x8f/0x130
[ 315.070927] [<c108b81d>] ? __alloc_pages_nodemask+0xdd/0x730
[ 315.070927] [<c105b222>] ? search_module_extables+0x62/0x80
[ 315.070927] [<c101ed42>] bad_area_nosemaphore+0x12/0x20
[ 315.070927] [<c101f2d1>] do_page_fault+0x2f1/0x3d0
[ 315.070927] [<c105a57b>] ? __module_text_address+0xb/0x50
[ 315.070927] [<c105a5c8>] ? is_module_text_address+0x8/0x10
[ 315.070927] [<c1045207>] ? __kernel_text_address+0x47/0x70
[ 315.070927] [<c1005441>] ? print_context_stack+0x41/0xb0
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c139436c>] error_code+0x58/0x60
[ 315.070927] [<c10b75f9>] ? create_object+0x29/0x210
[ 315.070927] [<c101efe0>] ? vmalloc_sync_all+0x100/0x100
[ 315.070927] [<c10b0aea>] ? kmem_cache_alloc+0x4a/0x120
[ 315.070927] [<c10b75f9>] ? create_object+0x29/0x210
[ 315.070927] [<c10b75f9>] create_object+0x29/0x210
[ 315.070927] [<c138aae7>] ? kmemleak_alloc+0x27/0x50
[ 315.070927] [<c138aae7>] kmemleak_alloc+0x27/0x50
[ 315.070927] [<c10b0b28>] kmem_cache_alloc+0x88/0x120
[ 315.070927] [<c10a60a0>] ? anon_vma_fork+0x50/0xe0
[ 315.070927] [<c10a6022>] ? anon_vma_clone+0x82/0xb0
[ 315.070927] [<c10a60a0>] anon_vma_fork+0x50/0xe0
[ 315.070927] [<c102c411>] dup_mm+0x1d1/0x440
[ 315.070927] [<c102d11d>] copy_process+0x98d/0xcc0
[ 315.070927] [<c102d4a7>] do_fork+0x57/0x2e0
[ 315.070927] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 315.070927] [<c11c4cc4>] ? copy_to_user+0x34/0x130
[ 315.070927] [<c1008b6f>] sys_clone+0x2f/0x40
[ 315.070927] [<c139469d>] ptregs_clone+0x15/0x38
[ 315.070927] [<c13945d0>] ? sysenter_do_call+0x12/0x26

2011-04-25 12:16:47

by Tetsuo Handa

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

I don't know whether the code below is related to this bug. But...

static struct dentry *proc_pident_instantiate(struct inode *dir,
	struct dentry *dentry, struct task_struct *task, const void *ptr)
{
	const struct pid_entry *p = ptr;
	struct inode *inode;
	struct proc_inode *ei;
	struct dentry *error = ERR_PTR(-ENOENT);

	inode = proc_pid_make_inode(dir->i_sb, task);
	if (!inode)
		goto out;

	ei = PROC_I(inode);
	inode->i_mode = p->mode;
	if (S_ISDIR(inode->i_mode))
		inode->i_nlink = 2;	/* Use getattr to fix if necessary */
	if (p->iop)
		inode->i_op = p->iop;
	if (p->fop)
		inode->i_fop = p->fop;
	ei->op = p->op;
	d_set_d_op(dentry, &pid_dentry_operations);
	d_add(dentry, inode);
	/* Close the race of the process dying before we return the dentry */
	if (pid_revalidate(dentry, NULL))
		error = NULL;
out:
	return error;
}

proc_pid_make_inode() gets a ref on the task, but the return value of
pid_revalidate() (one of 0, 1, -ECHILD) may not be what the
'if (pid_revalidate(dentry, NULL))' check above expects. (-ECHILD is a new
return value introduced by LOOKUP_RCU.)

static int pid_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	struct inode *inode;
	struct task_struct *task;
	const struct cred *cred;

	if (nd && nd->flags & LOOKUP_RCU)
		return -ECHILD;

	inode = dentry->d_inode;
	task = get_proc_task(inode);

	if (task) {
		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
		    task_dumpable(task)) {
			rcu_read_lock();
			cred = __task_cred(task);
			inode->i_uid = cred->euid;
			inode->i_gid = cred->egid;
			rcu_read_unlock();
		} else {
			inode->i_uid = 0;
			inode->i_gid = 0;
		}
		inode->i_mode &= ~(S_ISUID | S_ISGID);
		security_task_to_inode(task, inode);
		put_task_struct(task);
		return 1;
	}
	d_drop(dentry);
	return 0;
}
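
To summarize the return-value contract (my reading of the code above,
condensed as a comment):

/*
 * pid_revalidate() returns:
 *   -ECHILD  only when nd && (nd->flags & LOOKUP_RCU)
 *    1       task still alive; inode uid/gid refreshed
 *    0       task gone; dentry dropped
 * proc_pident_instantiate() passes nd == NULL, so only 0 and 1 can
 * reach its 'if (pid_revalidate(dentry, NULL))' check.
 */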

2011-04-25 12:24:23

by Tetsuo Handa

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

Tetsuo Handa wrote:
> proc_pid_make_inode() gets a ref on the task, but the return value of
> pid_revalidate() (one of 0, 1, -ECHILD) may not be what the
> 'if (pid_revalidate(dentry, NULL))' check above expects. (-ECHILD is a new
> return value introduced by LOOKUP_RCU.)
Sorry, nd == NULL here, so it never returns -ECHILD.

2011-04-25 15:30:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 2:17 AM, Bruno Prémont
<[email protected]> wrote:
>
> Here it seems to happen when I run 2 intensive tasks in parallel, e.g.
> (re)emerging gimp and running revdep-rebuild -pi in another terminal.
> This produces a fork rate of about 100-300 per second.
>
> Suddenly kmalloc-128 slabs stop being freed and things degrade.

So everything seems to imply some kind of filesystem/vfs thing, but
let's try to gather a bit more information about exactly what it is.

Some of it also points to RCU freeing, but that "kmalloc-128" doesn't
really match my expectations. According to your slabinfo, it's not the
dentries.

One thing I'd ask you to do is to boot with the "slub_nomerge" kernel
command line switch. The SLUB "merge slab caches" thing may save some
memory, but it has been a disaster from every other standpoint - every
time there's a memory leak, it ends up making it very confusing to try
to figure things out.
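
That is, append it to the kernel line in the bootloader config; e.g.
(the image name and root= here are just placeholders):

  kernel /boot/vmlinuz-2.6.39-rc4 root=/dev/sda1 ro slub_nomerge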

For example, your traces seem to imply that the kmalloc-128 allocation
is actually the "filp" cache, but it has gotten merged with the
kmalloc-128 cache, so slabinfo doesn't actually show the right user.

(Pekka? This is a real _problem_. The whole "confused debugging" is
wasting a lot of people's time. Can we please try to get slabinfo
statistics to work right for the merged state. Or perhaps decide to just
not merge at all?)

As to why it has started to happen now: with the whole RCU lookup
thing, many more filesystem objects are RCU-free'd (dentries have been
for a long time, but now we have inodes and filp's too), and that may
end up delaying allocations sufficiently that you end up seeing
something that used to be borderline become a major problem.
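
The deferred-free pattern in question looks roughly like this (a
generic sketch with made-up names - obj, obj_cachep - not the actual
fs/ code):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj {
	struct rcu_head rcu;
	/* ... payload ... */
};

static struct kmem_cache *obj_cachep;

/* Runs only after a grace period, i.e. once no CPU can still be
 * looking at the object under rcu_read_lock(). */
static void obj_free_rcu(struct rcu_head *head)
{
	kmem_cache_free(obj_cachep, container_of(head, struct obj, rcu));
}

static void obj_free(struct obj *o)
{
	/* Queue for deferred freeing instead of freeing synchronously;
	 * under a fork/open-heavy load these can pile up. */
	call_rcu(&o->rcu, obj_free_rcu);
}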

Also, what's your kernel config, in particular wrt RCU? The RCU
freeing _should_ be self-limiting (if I recall correctly) and not let
infinite amounts of RCU work (ie pending freeing) accumulate, but
maybe something is broken. Do you have a UP kernel with TINY_RCU, for
example? Or maybe I'm just confused, and there's never any RCU
throttling at all. Paul?

Linus

2011-04-25 16:05:07

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Linus Torvalds wrote:
> On Mon, Apr 25, 2011 at 2:17 AM, Bruno Prémont wrote:
> >
> > Here it seems to happened when I run 2 intensive tasks in parallel, e.g.
> > (re)emerging gimp and running revdep-rebuild -pi in another terminal.
> > This produces a fork rate of about 100-300 per second.
> >
> > Suddenly kmalloc-128 slabs stop being freed and things degrade.
>
> So everything seems to imply some kind of filesystem/vfs thing, but
> let's try to gather a bit more information about exactly what it is.
>
> Some of it also points to RCU freeing, but that "kmalloc-128" doesn't
> really match my expectations. According to your slabinfo, it's not the
> dentries.
>
> One thing I'd ask you to do is to boot with the "slub_nomerge" kernel
> command line switch. The SLUB "merge slab caches" thing may save some
> memory, but it has been a disaster from every other standpoint - every
> time there's a memory leak, it ends up making it very confusing to try
> to figure things out.
>
> For example, your traces seem to imply that the kmalloc-128 allocation
> is actually the "filp" cache, but it has gotten merged with the
> kmalloc-128 cache, so slabinfo doesn't actually show the right user.

Redone with slub_nomerge cmdline switch.
Attached (for easy diffing):

slabinfo-2, meminfo-2: when memory use starts manifesting itself
(work triggering it being SIGSTOPped)

slabinfo-4, meminfo-4: info gathered again after sync && echo 2 > /proc/sys/vm/drop_caches
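
For reference, the drop_caches values (per Documentation/sysctl/vm.txt):

echo 1 > /proc/sys/vm/drop_caches   # free page cache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes only
echo 3 > /proc/sys/vm/drop_caches   # free both

so the echo 2 above only targets the dentry and inode caches.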

kmemleak reports 86681 new leaks between shortly after boot and -2 state.
(and 2348 additional ones between -2 and -4).

> (Pekka? This is a real _problem_. The whole "confused debugging" is
> wasting a lot of people's time. Can we please try to get slabinfo
> statistics to work right for the merged state. Or perhaps decide to just
> not merge at all?)
>
> As to why it has started to happen now: with the whole RCU lookup
> thing, many more filesystem objects are RCU-free'd (dentries have been
> for a long time, but now we have inodes and filp's too), and that may
> end up delaying allocations sufficiently that you end up seeing
> something that used to be borderline become a major problem.
>
> Also, what's your kernel config, in particular wrt RCU? The RCU
> freeing _should_ be self-limiting (if I recall correctly) and not let
> infinite amounts of RCU work (ie pending freeing) accumulate, but
> maybe something is broken. Do you have a UP kernel with TINY_RCU, for
> example?

Config was in the first message of the thread (but unfortunately not properly
labeled); attaching it again (to include the changes for debugging features).

Yes, it's uni-processor system, so SMP=n.
TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
compression)

Bruno


> Or maybe I'm just confused, and there's never any RCU throttling at
> all. Paul?
>
> Linus


Attachments:
(No filename) (2.83 kB)
config.gz (15.33 kB)
meminfo-2 (0.98 kB)
meminfo-4 (0.98 kB)
slabinfo-2 (15.69 kB)
slabinfo-4 (15.69 kB)
Download all attachments

2011-04-25 16:31:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

2011/4/25 Bruno Prémont <[email protected]>:
>
> kmemleak reports 86681 new leaks between shortly after boot and -2 state.
> (and 2348 additional ones between -2 and -4).

I wouldn't necessarily trust kmemleak with the whole RCU-freeing
thing. In your slubinfo reports, the kmemleak data itself also tends
to overwhelm everything else - none of it looks unreasonable per se.

That said, you clearly have a *lot* of filp entries. I wouldn't
consider it unreasonable, though, because depending on load those may
well be fine. Perhaps you really do have some application(s) that hold
thousands of files open. The default file limit is 1024 (I think), but
you can raise it, and some programs do end up opening tens of
thousands of files for filesystem scanning purposes.

That said, I would suggest simply trying a saner kernel configuration,
and seeing if that makes a difference:

> Yes, it's uni-processor system, so SMP=n.
> TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
> compression)

I'm not at all certain that TINY_RCU is appropriate for
general-purpose loads. I'd call it more of an "embedded low-performance
option".

The _real_ RCU implementation ("tree rcu") forces quiescent states
every few jiffies and has logic to handle "I've got tons of RCU
events, I really need to start handling them now". All of which I
think tiny-rcu lacks.

So right now I suspect that you have a situation where you just have a
simple load that just ends up never triggering any RCU cleanup, and
the tiny-rcu thing just keeps on gathering events and delays freeing
stuff almost arbitrarily long.

So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
behavior goes away. That would confirm the "it's just tinyrcu being
too dang stupid" hypothesis.
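
In .config terms that is roughly (fragment only; these are the
symbols in question):

CONFIG_PREEMPT=y
CONFIG_TREE_PREEMPT_RCU=y
# CONFIG_TINY_RCU is not set
# CONFIG_TINY_PREEMPT_RCU is not set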

Linus

2011-04-25 17:00:48

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Linus Torvalds wrote:
> 2011/4/25 Bruno Prémont <[email protected]>:
> >
> > kmemleak reports 86681 new leaks between shortly after boot and -2 state.
> > (and 2348 additional ones between -2 and -4).
>
> I wouldn't necessarily trust kmemleak with the whole RCU-freeing
> thing. In your slubinfo reports, the kmemleak data itself also tends
> to overwhelm everything else - none of it looks unreasonable per se.
>
> That said, you clearly have a *lot* of filp entries. I wouldn't
> consider it unreasonable, though, because depending on load those may
> well be fine. Perhaps you really do have some application(s) that hold
> thousands of files open. The default file limit is 1024 (I think), but
> you can raise it, and some programs do end up opening tens of
> thousands of files for filesystem scanning purposes.
>
> That said, I would suggest simply trying a saner kernel configuration,
> and seeing if that makes a difference:
>
> > Yes, it's uni-processor system, so SMP=n.
> > TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
> > compression)
>
> I'm not at all certain that TINY_RCU is appropriate for
> general-purpose loads. I'd call it more of an "embedded low-performance
> option".

Well, TINY_RCU is the only option when doing PREEMPT_VOLUNTARY on
SMP=n...

> The _real_ RCU implementation ("tree rcu") forces quiescent states
> every few jiffies and has logic to handle "I've got tons of RCU
> events, I really need to start handling them now". All of which I
> think tiny-rcu lacks.

Going to try it out (will take some time to compile), kmemleak disabled.

> So right now I suspect that you have a situation where you just have a
> simple load that just ends up never triggering any RCU cleanup, and
> the tiny-rcu thing just keeps on gathering events and delays freeing
> stuff almost arbitrarily long.

I hope tiny-rcu is not that broken... as it would mean driving any
PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
packages (and probably also just unpacking larger tarballs or running
things like du).
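
A trivial fork burner in that rate range would be something like this
(an illustrative stand-in, not the actual build workload):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		pid_t pid = fork();
		if (pid == 0)
			_exit(0);		/* child exits immediately */
		if (pid > 0)
			waitpid(pid, NULL, 0);
		usleep(5000);			/* roughly 200 forks per second */
	}
}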

And with the system doing nothing (except monitoring itself) memory usage
keeps increasing until it starves (well, it seems to keep
~20M free, pushing the processes it can to swap). Config is just the
result of make oldconfig from a working 2.6.38 kernel (answering the
defaults for new options).

Memory usage evolution graph in first message of this thread:
http://thread.gmane.org/gmane.linux.kernel.mm/61909/focus=1130480

Attached graph matching numbers of previous mail. (dropping caches was at
17:55, system idle since then)

Bruno


> So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
> behavior goes away. That would confirm the "it's just tinyrcu being
> too dang stupid" hypothesis.
>
> Linus


Attachments:
(No filename) (2.77 kB)
jupiter.png (8.62 kB)
Download all attachments

2011-04-25 17:19:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont
<[email protected]> wrote:
>
> I hope tiny-rcu is not that broken... as it would mean driving any
> PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> packages (and probably also just unpacking larger tarballs or running
> things like du).

I'm sure that TINYRCU can be fixed if it really is the problem.

So I just want to make sure that we know what the root cause of your
problem is. It's quite possible that it _is_ a real leak of filp or
something, but before possibly wasting time trying to figure that out,
let's see if your config is to blame.

> And with the system doing nothing (except monitoring itself) memory usage
> keeps increasing until it starves (well, it seems to keep
> ~20M free, pushing the processes it can to swap). Config is just the
> result of make oldconfig from a working 2.6.38 kernel (answering the
> defaults for new options).

How sure are you that the system really is idle? Quite frankly, the
constant growing doesn't really look idle to me.

> Attached graph matching numbers of previous mail. (dropping caches was at
> 17:55, system idle since then)

Nothing at all going on in 'ps' during that time? And what does
slabinfo say at that point now that kmemleak isn't dominating
everything else?

Linus

2011-04-25 17:21:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 10:10 AM, Linus Torvalds
<[email protected]> wrote:
>
> Nothing at all going on in 'ps' during that time? And what does
> slabinfo say at that point now that kmemleak isn't dominating
> everything else?

In particular, if it's filp-related, what does

ls /proc/*/fd

say? It should show any processes with lots of files very clearly.
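e.g. with something like (one ad-hoc variant, run as root so all
processes are visible):

for d in /proc/[0-9]*/fd; do
        echo "$(ls "$d" 2>/dev/null | wc -l) $d"
done | sort -rn | head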

Linus

2011-04-25 17:26:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 09:31:03AM -0700, Linus Torvalds wrote:
> 2011/4/25 Bruno Prémont <[email protected]>:
> >
> > kmemleak reports 86681 new leaks between shortly after boot and -2 state.
> > (and 2348 additional ones between -2 and -4).
>
> I wouldn't necessarily trust kmemleak with the whole RCU-freeing
> thing. In your slubinfo reports, the kmemleak data itself also tends
> to overwhelm everything else - none of it looks unreasonable per se.
>
> That said, you clearly have a *lot* of filp entries. I wouldn't
> consider it unreasonable, though, because depending on load those may
> well be fine. Perhaps you really do have some application(s) that hold
> thousands of files open. The default file limit is 1024 (I think), but
> you can raise it, and some programs do end up opening tens of
> thousands of files for filesystem scanning purposes.
>
> That said, I would suggest simply trying a saner kernel configuration,
> and seeing if that makes a difference:
>
> > Yes, it's uni-processor system, so SMP=n.
> > TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
> > compression)
>
> I'm not at all certain that TINY_RCU is appropriate for
> general-purpose loads. I'd call it more of an "embedded low-performance
> option".
>
> The _real_ RCU implementation ("tree rcu") forces quiescent states
> every few jiffies and has logic to handle "I've got tons of RCU
> events, I really need to start handling them now". All of which I
> think tiny-rcu lacks.
>
> So right now I suspect that you have a situation where you just have a
> simple load that just ends up never triggering any RCU cleanup, and
> the tiny-rcu thing just keeps on gathering events and delays freeing
> stuff almost arbitrarily long.
>
> So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
> behavior goes away. That would confirm the "it's just tinyrcu being
> too dang stupid" hypothesis.

CONFIG_TINY_RCU is a bit more stupid than CONFIG_TREE_RCU, and ditto
for both PREEMPT versions. CONFIG_TREE_RCU will throttle RCU callback
invocation. It defaults to invoking no more than 10 at a time, until
the number of outstanding callbacks on a given CPU exceeds 10,000,
at which point it goes into emergency mode and just processes all the
remaining callbacks.
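
In sketch form, the throttling looks roughly like this (illustrative
only; the real logic lives in kernel/rcutree.c with per-CPU state, and
the two numbers correspond to the blimit/qhimark tunables):

#include <stddef.h>

#define BLIMIT	10	/* normal per-pass callback budget */
#define QHIMARK	10000	/* backlog that triggers emergency mode */

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *);
};

/* Invoke queued callbacks; normally at most BLIMIT per pass, but once
 * qlen exceeds QHIMARK, the whole backlog is processed. */
static void rcu_do_batch_sketch(struct rcu_head **list, long qlen)
{
	long budget = (qlen > QHIMARK) ? qlen : BLIMIT;
	struct rcu_head *rh;

	while (budget-- > 0 && (rh = *list) != NULL) {
		*list = rh->next;
		rh->func(rh);	/* typically frees the enclosing object */
	}
}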

In contrast, the TINY versions always process all the remaining
callbacks. There are a few reasons for the difference:

1. The fact that TINY has but one CPU speeds up the grace periods
(particularly for synchronize_rcu() in CONFIG_TINY_RCU, which
is essentially a no-op), so that callbacks should be invoked in
a more timely manner.

2. There is only one CPU for TINY, so the scenarios where one
CPU keeps another CPU totally busy invoking RCU callbacks
cannot happen.

3. TINY is supposed to be TINY, so I figured I should add RCU
callback-throttling smarts when and if they proved to be needed.
(It is not clear to me that this problem means that more smarts
are needed, but if they are, I will of course add them.)

It is quite possible that some adjustments are needed in the defaults
for CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU due to the heavier
load from the tree-walking changes.

Thanx, Paul

2011-04-25 17:29:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 07:00:32PM +0200, Bruno Prémont wrote:
> On Mon, 25 April 2011 Linus Torvalds wrote:
> > 2011/4/25 Bruno Prémont <[email protected]>:
> > >
> > > kmemleak reports 86681 new leaks between shortly after boot and -2 state.
> > > (and 2348 additional ones between -2 and -4).
> >
> > I wouldn't necessarily trust kmemleak with the whole RCU-freeing
> > thing. In your slubinfo reports, the kmemleak data itself also tends
> > to overwhelm everything else - none of it looks unreasonable per se.
> >
> > That said, you clearly have a *lot* of filp entries. I wouldn't
> > consider it unreasonable, though, because depending on load those may
> > well be fine. Perhaps you really do have some application(s) that hold
> > thousands of files open. The default file limit is 1024 (I think), but
> > you can raise it, and some programs do end up opening tens of
> > thousands of files for filesystem scanning purposes.
> >
> > That said, I would suggest simply trying a saner kernel configuration,
> > and seeing if that makes a difference:
> >
> > > Yes, it's uni-processor system, so SMP=n.
> > > TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
> > > compression)
> >
> > I'm not at all certain that TINY_RCU is appropriate for
> > general-purpose loads. I'd call it more of an "embedded low-performance
> > option".
>
> Well, TINY_RCU is the only option when doing PREEMPT_VOLUNTARY on
> SMP=n...

You can either set SMP=y and NR_CPUS=1 or you can hand-edit
init/Kconfig to remove the dependency on SMP. Just change the

        depends on !PREEMPT && SMP

to:

        depends on !PREEMPT

This will work fine, especially for experimental purposes.
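
so the TREE_RCU entry ends up looking something like this (paraphrased
from memory, the surrounding choice block is omitted):

config TREE_RCU
	bool "Tree-based hierarchical RCU"
	depends on !PREEMPT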

> > The _real_ RCU implementation ("tree rcu") forces quiescent states
> > every few jiffies and has logic to handle "I've got tons of RCU
> > events, I really need to start handling them now". All of which I
> > think tiny-rcu lacks.
>
> Going to try it out (will take some time to compile), kmemleak disabled.
>
> > So right now I suspect that you have a situation where you just have a
> > simple load that just ends up never triggering any RCU cleanup, and
> > the tiny-rcu thing just keeps on gathering events and delays freeing
> > stuff almost arbitrarily long.
>
> I hope tiny-rcu is not that broken... as it would mean driving any
> PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> packages (and probably also just unpacking larger tarballs or running
> things like du).

If it is broken, I will fix it. ;-)

Thanx, Paul

> And with the system doing nothing (except monitoring itself) memory usage
> keeps increasing until it starves (well, it seems to keep
> ~20M free, pushing the processes it can to swap). Config is just the
> result of make oldconfig from a working 2.6.38 kernel (answering the
> defaults for new options).
>
> Memory usage evolution graph in first message of this thread:
> http://thread.gmane.org/gmane.linux.kernel.mm/61909/focus=1130480
>
> Attached graph matching numbers of previous mail. (dropping caches was at
> 17:55, system idle since then)
>
> Bruno
>
>
> > So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
> > behavior goes away. That would confirm the "it's just tinyrcu being
> > too dang stupid" hypothesis.
> >
> > Linus

2011-04-25 17:51:05

by Pekka Enberg

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 6:22 PM, Linus Torvalds
<[email protected]> wrote:
> (Pekka? This is a real _problem_. The whole "confused debugging" is
> wasting a lot of people's time. Can we please try to get slabinfo
> statistics to work right for the merged state. Or perhaps decide to just
> not merge at all?)

I sent proof-of-concept patches that hopefully fix SLUB statistics for
the merged case. Let's see what Christoph and David think of them and if
it's a dead end, I'd be inclined to rip out slab merging
completely....

Pekka

2011-04-25 18:13:29

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 7:29 PM, Paul E. McKenney
<[email protected]> wrote:
> On Mon, Apr 25, 2011 at 07:00:32PM +0200, Bruno Prémont wrote:
>> On Mon, 25 April 2011 Linus Torvalds wrote:
>> > 2011/4/25 Bruno Prémont <[email protected]>:
>> > >
>> > > kmemleak reports 86681 new leaks between shortly after boot and -2 state.
>> > > (and 2348 additional ones between -2 and -4).
>> >
>> > I wouldn't necessarily trust kmemleak with the whole RCU-freeing
>> > thing. In your slubinfo reports, the kmemleak data itself also tends
>> > to overwhelm everything else - none of it looks unreasonable per se.
>> >
>> > That said, you clearly have a *lot* of filp entries. I wouldn't
>> > consider it unreasonable, though, because depending on load those may
>> > well be fine. Perhaps you really do have some application(s) that hold
>> > thousands of files open. The default file limit is 1024 (I think), but
>> > you can raise it, and some programs do end up opening tens of
>> > thousands of files for filesystem scanning purposes.
>> >
>> > That said, I would suggest simply trying a saner kernel configuration,
>> > and seeing if that makes a difference:
>> >
>> > > Yes, it's uni-processor system, so SMP=n.
>> > > TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
>> > > compression)
>> >
>> > I'm not at all certain that TINY_RCU is appropriate for
>> > general-purpose loads. I'd call it more of an "embedded low-performance
>> > option".
>>
>> Well, TINY_RCU is the only option when doing PREEMPT_VOLUNTARY on
>> SMP=n...
>
> You can either set SMP=y and NR_CPUS=1 or you can hand-edit
> init/Kconfig to remove the dependency on SMP.  Just change the
>
>        depends on !PREEMPT && SMP
>
> to:
>
>        depends on !PREEMPT
>
> This will work fine, especially for experimental purposes.
>
>> > The _real_ RCU implementation ("tree rcu") forces quiescent states
>> > every few jiffies and has logic to handle "I've got tons of RCU
>> > events, I really need to start handling them now". All of which I
>> > think tiny-rcu lacks.
>>
>> Going to try it out (will take some time to compile), kmemleak disabled.
>>
>> > So right now I suspect that you have a situation where you just have a
>> > simple load that just ends up never triggering any RCU cleanup, and
>> > the tiny-rcu thing just keeps on gathering events and delays freeing
>> > stuff almost arbitrarily long.
>>
>> I hope tiny-rcu is not that broken... as it would mean driving any
>> PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
>> packages (and probably also just unpacking larger tarballs or running
>> things like du).
>
> If it is broken, I will fix it.  ;-)
>
>                                                        Thanx, Paul
>
>> And with the system doing nothing (except monitoring itself) memory usage
>> keeps increasing until it starves (well, it seems to keep
>> ~20M free, pushing the processes it can to swap). Config is just the
>> result of make oldconfig from a working 2.6.38 kernel (answering the
>> defaults for new options).
>>
>> Memory usage evolution graph in first message of this thread:
>> http://thread.gmane.org/gmane.linux.kernel.mm/61909/focus=1130480
>>
>> Attached graph matching numbers of previous mail. (dropping caches was at
>> 17:55, system idle since then)
>>
>> Bruno
>>
>>
>> > So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
>> > behavior goes away. That would confirm the "it's just tinyrcu being
>> > too dang stupid" hypothesis.
>> >
>> >                      Linus
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Hi,

I was playing with Debian's kernel-buildsystem for -rc4 with a
self-defined '686-up' so-called flavour.

Here I have a Banias Pentium-M (UP, *no* PAE) and still experimenting
with kernel-config options.

CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y

...is not possible with CONFIG_SMP=y

These settings are possible without hacking existing Kconfigs:

$ egrep 'M486|M686|X86_UP|CONFIG_SMP|NR_CPUS|PREEMPT|_RCU|_HIGHMEM|PAE'
debian/build/build_i386_none_686-up/.config
CONFIG_TREE_PREEMPT_RCU=y
# CONFIG_TINY_RCU is not set
# CONFIG_TINY_PREEMPT_RCU is not set
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=32
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_SMP is not set
# CONFIG_M486 is not set
CONFIG_M686=y
CONFIG_NR_CPUS=1
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_PREEMPT_TRACER is not set

But I also see these warnings:

.config:2106:warning: override: TREE_PREEMPT_RCU changes choice state
.config:2182:warning: override: PREEMPT changes choice state

Not sure how to interpret them, so I am a bit careful :-).

( Untested - not compiled yet! )

- Sedat -

2011-04-25 18:28:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 08:13:27PM +0200, Sedat Dilek wrote:
> On Mon, Apr 25, 2011 at 7:29 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Mon, Apr 25, 2011 at 07:00:32PM +0200, Bruno Prémont wrote:
> >> On Mon, 25 April 2011 Linus Torvalds wrote:
> >> > 2011/4/25 Bruno Prémont <[email protected]>:
> >> > >
> >> > > kmemleak reports 86681 new leaks between shortly after boot and -2 state.
> >> > > (and 2348 additional ones between -2 and -4).
> >> >
> >> > I wouldn't necessarily trust kmemleak with the whole RCU-freeing
> >> > thing. In your slubinfo reports, the kmemleak data itself also tends
> >> > to overwhelm everything else - none of it looks unreasonable per se.
> >> >
> >> > That said, you clearly have a *lot* of filp entries. I wouldn't
> >> > consider it unreasonable, though, because depending on load those may
> >> > well be fine. Perhaps you really do have some application(s) that hold
> >> > thousands of files open. The default file limit is 1024 (I think), but
> >> > you can raise it, and some programs do end up opening tens of
> >> > thousands of files for filesystem scanning purposes.
> >> >
> >> > That said, I would suggest simply trying a saner kernel configuration,
> >> > and seeing if that makes a difference:
> >> >
> >> > > Yes, it's uni-processor system, so SMP=n.
> >> > > TINY_RCU=y, PREEMPT_VOLUNTARY=y (whole /proc/config.gz attached keeping
> >> > > compression)
> >> >
> >> > I'm not at all certain that TINY_RCU is appropriate for
> >> > general-purpose loads. I'd call it more of an "embedded low-performance
> >> > option".
> >>
> >> Well, TINY_RCU is the only option when doing PREEMPT_VOLUNTARY on
> >> SMP=n...
> >
> > You can either set SMP=y and NR_CPUS=1 or you can hand-edit
> > init/Kconfig to remove the dependency on SMP. Just change the
> >
> >        depends on !PREEMPT && SMP
> >
> > to:
> >
> >        depends on !PREEMPT
> >
> > This will work fine, especially for experimental purposes.
> >
> >> > The _real_ RCU implementation ("tree rcu") forces quiescent states
> >> > every few jiffies and has logic to handle "I've got tons of RCU
> >> > events, I really need to start handling them now". All of which I
> >> > think tiny-rcu lacks.
> >>
> >> Going to try it out (will take some time to compile), kmemleak disabled.
> >>
> >> > So right now I suspect that you have a situation where you just have a
> >> > simple load that just ends up never triggering any RCU cleanup, and
> >> > the tiny-rcu thing just keeps on gathering events and delays freeing
> >> > stuff almost arbitrarily long.
> >>
> >> I hope tiny-rcu is not that broken... as it would mean driving any
> >> PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> >> packages (and probably also just unpacking larger tarballs or running
> >> things like du).
> >
> > If it is broken, I will fix it.  ;-)
> >
> >                                                        Thanx, Paul
> >
> >> And with the system doing nothing (except monitoring itself) memory usage
> >> keeps increasing until it starves (well, it seems to keep
> >> ~20M free, pushing the processes it can to swap). Config is just the
> >> result of make oldconfig from a working 2.6.38 kernel (answering the
> >> defaults for new options).
> >>
> >> Memory usage evolution graph in first message of this thread:
> >> http://thread.gmane.org/gmane.linux.kernel.mm/61909/focus=1130480
> >>
> >> Attached graph matching numbers of previous mail. (dropping caches was at
> >> 17:55, system idle since then)
> >>
> >> Bruno
> >>
> >>
> >> > So try CONFIG_PREEMPT and CONFIG_TREE_PREEMPT_RCU to see if the
> >> > behavior goes away. That would confirm the "it's just tinyrcu being
> >> > too dang stupid" hypothesis.
> >> >
> >> >                      Linus
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
>
> Hi,
>
> I was playing with Debian's kernel-buildsystem for -rc4 with a
> self-defined '686-up' so-called flavour.
>
> Here I have a Banias Pentium-M (UP, *no* PAE) and still experimenting
> with kernel-config options.
>
> CONFIG_X86_UP_APIC=y
> CONFIG_X86_UP_IOAPIC=y
>
> ...is not possible with CONFIG_SMP=y

Right, hence my advice to hand-edit init/Kconfig for experimental
purposes. Once that is done, you can select CONFIG_TREE_RCU with
CONFIG_SMP=n.

Thanx, Paul

> These settings are possible without hacking existing Kconfigs:
>
> $ egrep 'M486|M686|X86_UP|CONFIG_SMP|NR_CPUS|PREEMPT|_RCU|_HIGHMEM|PAE'
> debian/build/build_i386_none_686-up/.config
> CONFIG_TREE_PREEMPT_RCU=y
> # CONFIG_TINY_RCU is not set
> # CONFIG_TINY_PREEMPT_RCU is not set
> CONFIG_PREEMPT_RCU=y
> # CONFIG_RCU_TRACE is not set
> CONFIG_RCU_FANOUT=32
> # CONFIG_RCU_FANOUT_EXACT is not set
> # CONFIG_TREE_RCU_TRACE is not set
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_SMP is not set
> # CONFIG_M486 is not set
> CONFIG_M686=y
> CONFIG_NR_CPUS=1
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_X86_UP_APIC=y
> CONFIG_X86_UP_IOAPIC=y
> CONFIG_HIGHMEM4G=y
> # CONFIG_HIGHMEM64G is not set
> CONFIG_HIGHMEM=y
> CONFIG_DEBUG_PREEMPT=y
> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_DEBUG_HIGHMEM is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
> # CONFIG_PREEMPT_TRACER is not set
>
> But I also see these warnings:
>
> .config:2106:warning: override: TREE_PREEMPT_RCU changes choice state
> .config:2182:warning: override: PREEMPT changes choice state
>
> Not sure how to interpret them, so I am a bit careful :-).
>
> ( Untested - not compiled yet! )
>
> - Sedat -

2011-04-25 18:36:21

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 Linus Torvalds wrote:
> On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> >
> > I hope tiny-rcu is not that broken... as it would mean driving any
> > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > packages (and probably also just unpacking larger tarballs or running
> > things like du).
>
> I'm sure that TINYRCU can be fixed if it really is the problem.
>
> So I just want to make sure that we know what the root cause of your
> problem is. It's quite possible that it _is_ a real leak of filp or
> something, but before possibly wasting time trying to figure that out,
> let's see if your config is to blame.

With the changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
it yet.

When I was reproducing with TINYRCU things went normally for some time
until suddenly slabs stopped being freed.

> > And with the system doing nothing (except monitoring itself) memory usage
> > keeps increasing until it starves (well, it seems to keep ~20M free,
> > pushing what processes it can to swap). The config is just make oldconfig
> > from the working 2.6.38 kernel (answering the default for new options)
>
> How sure are you that the system really is idle? Quite frankly, the
> constant growing doesn't really look idle to me.

Except for the SIGSTOPed build there is not much left: collectd running in
the background (it polls /proc for process counts, fork rate, memory usage,
... opening, reading, closing the files -- scanning every 10 seconds), and
slabtop on one terminal.

CPU activity was near-zero with 10%-20% spikes of system use every 10
minutes and io-wait when all cache had been pushed out.

> > Attached graph matching numbers of previous mail. (dropping caches was at
> > 17:55, system idle since then)
>
> Nothing at all going on in 'ps' during that time? And what does
> slabinfo say at that point now that kmemleak isn't dominating
> everything else?

ps definitely does not show anything special, 30 or so userspace processes.
Didn't check ls /proc/*/fd though. Will do at next occurrence.


Going to test further with various PREEMPT and RCU selections. Will report
back as I progress (but won't have much time tomorrow).

Bruno

2011-04-25 19:16:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 08:36:06PM +0200, Bruno Prémont wrote:
> On Mon, 25 April 2011 Linus Torvalds wrote:
> > On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> > >
> > > I hope tiny-rcu is not that broken... as it would mean driving any
> > > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > > packages (and probably also just unpacking larger tarballs or running
> > > things like du).
> >
> > I'm sure that TINYRCU can be fixed if it really is the problem.
> >
> > So I just want to make sure that we know what the root cause of your
> > problem is. It's quite possible that it _is_ a real leak of filp or
> > something, but before possibly wasting time trying to figure that out,
> > let's see if your config is to blame.
>
> With the changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
> it yet.
>
> When I was reproducing with TINYRCU things went normally for some time
> until suddenly slabs stopped being freed.

Hmmm... If the system is responsive during this time, could you please
do the following after the slabs stop being freed?

ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'

Thanx, Paul

> > > And with the system doing nothing (except monitoring itself) memory usage
> > > keeps increasing until it starves (well, it seems to keep ~20M free,
> > > pushing what processes it can to swap). The config is just make oldconfig
> > > from the working 2.6.38 kernel (answering the default for new options)
> >
> > How sure are you that the system really is idle? Quite frankly, the
> > constant growing doesn't really look idle to me.
>
> Except for the SIGSTOPed build there is not much left: collectd running in
> the background (it polls /proc for process counts, fork rate, memory usage,
> ... opening, reading, closing the files -- scanning every 10 seconds), and
> slabtop on one terminal.
>
> CPU activity was near-zero with 10%-20% spikes of system use every 10
> minutes and io-wait when all cache had been pushed out.
>
> > > Attached graph matching numbers of previous mail. (dropping caches was at
> > > 17:55, system idle since then)
> >
> > Nothing at all going on in 'ps' during that time? And what does
> > slabinfo say at that point now that kmemleak isn't dominating
> > everything else?
>
> ps definitely does not show anything special, 30 or so userspace processes.
> Didn't check ls /proc/*/fd though. Will do at next occurrence.
>
>
> Going to test further with various PREEMPT and RCU selections. Will report
> back as I progress (but won't have much time tomorrow).
>
> Bruno

2011-04-25 21:10:34

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 April 2011 "Paul E. McKenney" wrote:
> On Mon, Apr 25, 2011 at 08:36:06PM +0200, Bruno Prémont wrote:
> > On Mon, 25 April 2011 Linus Torvalds wrote:
> > > On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> > > >
> > > > I hope tiny-rcu is not that broken... as it would mean driving any
> > > > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > > > packages (and probably also just unpacking larger tarballs or running
> > > > things like du).
> > >
> > > I'm sure that TINYRCU can be fixed if it really is the problem.
> > >
> > > So I just want to make sure that we know what the root cause of your
> > > problem is. It's quite possible that it _is_ a real leak of filp or
> > > something, but before possibly wasting time trying to figure that out,
> > > let's see if your config is to blame.
> >
> > With the changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
> > it yet.
> >
> > When I was reproducing with TINYRCU things went normally for some time
> > until suddenly slabs stopped being freed.
>
> Hmmm... If the system is responsive during this time, could you please
> do the following after the slabs stop being freed?
>
> ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'

Looks like tinyrcu is not innocent (or at least it makes the bug appear
much more easily).

With PREEMPT + TREE_PREEMPT_RCU the system was stable compiling for over
2 hours; after switching to TINY_RCU, the filp count started increasing
pretty early after compiling began.

All the relevant information is attached (PREEMPT+TINY_RCU): config.gz,
plus ps auxf, slabinfo and meminfo, each taken twice: once early (1-*),
once 30 minutes later (2-*).

ls -l /proc/*/fd produces 658 lines for the 1-* series of numbers, 300 for 2-*.

In both cases
ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'
returns the same information:
6 FF 1 1 R R 0 00:00:00 [rcu_kthread]


According to slabtop the filp count is increasing permanently (about
+1000 every 3 seconds), probably because of top (1s refresh rate) and
collectd (10s rate) scanning /proc (without top, about +300 every 10s).

Running something like `for ((X=0; X < 200; X++)); do /bin/true; done` causes
the pid, task_struct and signal_cache slab counts to increase by about 200,
but no zombies are left behind.
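
A quick way to watch that effect (a sketch; the grep pattern just matches
the slab names above):

$ grep -E '^(pid|task_struct|signal_cache|filp) ' /proc/slabinfo
$ for ((X=0; X < 200; X++)); do /bin/true; done
$ grep -E '^(pid|task_struct|signal_cache|filp) ' /proc/slabinfo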

1-*: taken a few minutes after starting the compile process, but after
having SIGSTOPed the compiling process tree.
2-*: about 30 minutes later, after killing the compile process tree,
running the above for loop multiple times, and closing most terminal
sessions (including top).

Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a
few did decrease. I don't know which ones are RCU-affected and which are
not.

Bruno


Attachments:
(No filename) (2.74 kB)
config.gz (15.34 kB)
1-meminfo (0.98 kB)
1-ps_auxf (22.80 kB)
1-slabinfo (15.48 kB)
2-meminfo (0.98 kB)
2-ps_auxf (4.62 kB)
2-slabinfo (15.48 kB)

2011-04-25 21:26:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 11:10:16PM +0200, Bruno Prémont wrote:
> On Mon, 25 April 2011 "Paul E. McKenney" wrote:
> > On Mon, Apr 25, 2011 at 08:36:06PM +0200, Bruno Prémont wrote:
> > > On Mon, 25 April 2011 Linus Torvalds wrote:
> > > > On Mon, Apr 25, 2011 at 10:00 AM, Bruno Prémont wrote:
> > > > >
> > > > > I hope tiny-rcu is not that broken... as it would mean driving any
> > > > > PREEMPT_NONE or PREEMPT_VOLUNTARY system out of memory when compiling
> > > > > packages (and probably also just unpacking larger tarballs or running
> > > > > things like du).
> > > >
> > > > I'm sure that TINYRCU can be fixed if it really is the problem.
> > > >
> > > > So I just want to make sure that we know what the root cause of your
> > > > problem is. It's quite possible that it _is_ a real leak of filp or
> > > > something, but before possibly wasting time trying to figure that out,
> > > > let's see if your config is to blame.
> > >
> > > With the changed config (PREEMPT=y, TREE_PREEMPT_RCU=y) I haven't reproduced
> > > it yet.
> > >
> > > When I was reproducing with TINYRCU things went normally for some time
> > > until suddenly slabs stopped being freed.
> >
> > Hmmm... If the system is responsive during this time, could you please
> > do the following after the slabs stop being freed?
> >
> > ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'
>
> Looks like tinyrcu is not innocent (or at least it makes the bug appear
> much more easily).
>
> With PREEMPT + TREE_PREEMPT_RCU the system was stable compiling for over
> 2 hours; after switching to TINY_RCU, the filp count started increasing
> pretty early after compiling began.
>
> All the relevant information is attached (PREEMPT+TINY_RCU): config.gz,
> plus ps auxf, slabinfo and meminfo, each taken twice: once early (1-*),
> once 30 minutes later (2-*).
>
> ls -l /proc/*/fd produces 658 lines for the 1-* series of numbers, 300 for 2-*.
>
> In both cases
> ps -eo pid,class,sched,rtprio,stat,state,sgi_p,cputime,cmd | grep '\[rcu'
> returns the same information:
> 6 FF 1 1 R R 0 00:00:00 [rcu_kthread]

So rcu_kthread is runnable at SCHED_FIFO priority 1, but not accumulating
any CPU time. Sedat has also seen this, and I have never been able to
reproduce it.

Anyone have any idea how this would happen?

(And if rcu_kthread isn't running, I would expect exactly the symptoms
you are seeing. I just don't understand why it isn't running if it
is runnable.)

Thanx, Paul

> According to slabtop the filp count is increasing permanently (about
> +1000 every 3 seconds), probably because of top (1s refresh rate) and
> collectd (10s rate) scanning /proc (without top, about +300 every 10s).
>
> Running something like `for ((X=0; X < 200; X++)); do /bin/true; done` causes
> the pid, task_struct and signal_cache slab counts to increase by about 200,
> but no zombies are left behind.
>
> 1-*: taken a few minutes after starting the compile process, but after
> having SIGSTOPed the compiling process tree.
> 2-*: about 30 minutes later, after killing the compile process tree,
> running the above for loop multiple times, and closing most terminal
> sessions (including top).
>
> Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a
> few did decrease. I don't know which ones are RCU-affected and which are
> not.
>
> Bruno


> MemTotal: 480420 kB
> MemFree: 175180 kB
> Buffers: 37604 kB
> Cached: 30436 kB
> SwapCached: 128 kB
> Active: 97532 kB
> Inactive: 77776 kB
> Active(anon): 51012 kB
> Inactive(anon): 55480 kB
> Active(file): 46520 kB
> Inactive(file): 22296 kB
> Unevictable: 32 kB
> Mlocked: 32 kB
> SwapTotal: 524284 kB
> SwapFree: 524156 kB
> Dirty: 16 kB
> Writeback: 0 kB
> AnonPages: 106300 kB
> Mapped: 12732 kB
> Shmem: 112 kB
> Slab: 67580 kB
> SReclaimable: 18596 kB
> SUnreclaim: 48984 kB
> KernelStack: 56352 kB
> PageTables: 1344 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 764492 kB
> Committed_AS: 173588 kB
> VmallocTotal: 548548 kB
> VmallocUsed: 8392 kB
> VmallocChunk: 534328 kB
> AnonHugePages: 0 kB
> DirectMap4k: 16320 kB
> DirectMap4M: 475136 kB

> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 2 0.0 0.0 0 0 ? S 22:14 0:00 [kthreadd]
> root 3 0.0 0.0 0 0 ? S 22:14 0:00 \_ [ksoftirqd/0]
> root 6 0.1 0.0 0 0 ? R 22:14 0:00 \_ [rcu_kthread]
> root 7 0.0 0.0 0 0 ? R 22:14 0:00 \_ [watchdog/0]
> root 8 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [khelper]
> root 138 0.0 0.0 0 0 ? S 22:14 0:00 \_ [sync_supers]
> root 140 0.0 0.0 0 0 ? S 22:14 0:00 \_ [bdi-default]
> root 142 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [kblockd]
> root 230 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [ata_sff]
> root 237 0.0 0.0 0 0 ? S 22:14 0:00 \_ [khubd]
> root 365 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kswapd0]
> root 464 0.0 0.0 0 0 ? S 22:14 0:00 \_ [fsnotify_mark]
> root 486 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfs_mru_cache]
> root 489 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfslogd]
> root 490 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfsdatad]
> root 491 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfsconvertd]
> root 554 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_0]
> root 559 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_1]
> root 573 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_2]
> root 576 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_3]
> root 579 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kworker/u:4]
> root 580 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kworker/u:5]
> root 589 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_4]
> root 592 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_5]
> root 655 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [kpsmoused]
> root 706 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [reiserfs]
> root 1485 0.1 0.0 0 0 ? S 22:14 0:01 \_ [kworker/0:3]
> root 1486 0.0 0.0 0 0 ? S 22:14 0:00 \_ [flush-8:0]
> root 1692 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [rpciod]
> root 1693 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [nfsiod]
> root 1697 0.0 0.0 0 0 ? S 22:14 0:00 \_ [lockd]
> root 26248 0.0 0.0 0 0 ? S 22:21 0:00 \_ [kworker/0:2]
> root 26445 0.0 0.0 0 0 ? S 22:21 0:00 \_ [kworker/0:4]
> root 1 0.3 0.1 1740 588 ? Ss 22:14 0:02 init [3]
> root 823 0.0 0.1 2132 824 ? S<s 22:14 0:00 /sbin/udevd --daemon
> root 1778 0.0 0.1 2128 696 ? S< 22:15 0:00 \_ /sbin/udevd --daemon
> root 1377 0.0 0.3 4876 1780 tty2 Ss 22:14 0:00 -bash
> root 3692 0.1 0.2 2276 988 tty2 S+ 22:18 0:00 \_ slabtop
> root 1378 0.0 0.3 4876 1768 tty3 Ss+ 22:14 0:00 -bash
> root 1781 1.4 6.1 34372 29736 tty3 TN 22:16 0:08 \_ /usr/bin/python2.7 /usr/bin/emerge --oneshot gimp
> portage 15556 0.0 0.5 5924 2696 tty3 TN 22:19 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
> portage 15655 0.0 0.4 6060 2200 tty3 TN 22:19 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
> portage 15662 0.0 0.3 4880 1560 tty3 TN 22:19 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild-helpers/emake
> portage 15667 0.0 0.1 3860 960 tty3 TN 22:19 0:00 \_ make -j2
> portage 15668 0.0 0.2 3864 992 tty3 TN 22:19 0:00 \_ make all-recursive
> portage 15669 0.0 0.2 4752 1420 tty3 TN 22:19 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='m4macros tools cursors themes po po-libgimp po-plug-ins po-python po-script-fu po-tips data desktop menus libgimpbase libgimpcolor libgimpmath libgimpconfig libgimpmodule libgimpthumb libgimpwidgets libgimp app modules plug-ins etc devel-docs docs'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (CDPATH="${ZSH_VERSION+.}:" && cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 31137 0.0 0.1 4752 740 tty3 TN 22:22 0:00 \_ /bin/sh -c fail= failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='m4macros tools cursors themes po po-libgimp po-plug-ins po-python po-script-fu po-tips data desktop menus libgimpbase libgimpcolor libgimpmath libgimpconfig libgimpmodule libgimpthumb libgimpwidgets libgimp app modules plug-ins etc devel-docs docs'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (CDPATH="${ZSH_VERSION+.}:" && cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 31138 0.0 0.2 3992 1164 tty3 TN 22:22 0:00 \_ make all
> portage 601 0.0 0.3 5012 1676 tty3 TN 22:22 0:00 \_ /bin/sh ../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.. -I.. -pthread -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -O2 -march=athlon-xp -pipe -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wold-style-definition -MT gimpcolordisplaystack.lo -MD -MP -MF .deps/gimpcolordisplaystack.Tpo -c -o gimpcolordisplaystack.lo gimpcolordisplaystack.c
> portage 616 0.0 0.1 1924 536 tty3 TN 22:22 0:00 | \_ /usr/i686-pc-linux-gnu/gcc-bin/4.4.5/i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.. -I.. -pthread -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -O2 -march=athlon-xp -pipe -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wold-style-definition -MT gimpcolordisplaystack.lo -MD -MP -MF .deps/gimpcolordisplaystack.Tpo -c gimpcolordisplaystack.c -fPIC -DPIC -o .libs/gimpcolordisplaystack.o
> portage 617 0.4 4.5 27296 21728 tty3 TN 22:22 0:00 | \_ /usr/libexec/gcc/i686-pc-linux-gnu/4.4.5/cc1 -quiet -I. -I.. -I.. -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -MD .libs/gimpcolordisplaystack.d -MF .deps/gimpcolordisplaystack.Tpo -MP -MT gimpcolordisplaystack.lo -D_REENTRANT -DHAVE_CONFIG_H -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -DPIC gimpcolordisplaystack.c -D_FORTIFY_SOURCE=2 -quiet -dumpbase gimpcolordisplaystack.c -march=athlon-xp -auxbase-strip .libs/gimpcolordisplaystack.o -O2 -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wo
> ld-style-definition -fPIC -o -
> portage 619 0.0 0.6 5284 3128 tty3 TN 22:22 0:00 | \_ /usr/lib/gcc/i686-pc-linux-gnu/4.4.5/../../../../i686-pc-linux-gnu/bin/as -Qy -o .libs/gimpcolordisplaystack.o -
> portage 632 0.0 0.3 5012 1672 tty3 TN 22:22 0:00 \_ /bin/sh ../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.. -I.. -pthread -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -O2 -march=athlon-xp -pipe -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wold-style-definition -MT gimpenumwidgets.lo -MD -MP -MF .deps/gimpenumwidgets.Tpo -c -o gimpenumwidgets.lo gimpenumwidgets.c
> portage 647 0.0 0.1 1924 536 tty3 TN 22:22 0:00 \_ /usr/i686-pc-linux-gnu/gcc-bin/4.4.5/i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I.. -I.. -pthread -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -O2 -march=athlon-xp -pipe -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wold-style-definition -MT gimpenumwidgets.lo -MD -MP -MF .deps/gimpenumwidgets.Tpo -c gimpenumwidgets.c -fPIC -DPIC -o .libs/gimpenumwidgets.o
> portage 648 0.1 2.3 19448 11284 tty3 TN 22:22 0:00 \_ /usr/libexec/gcc/i686-pc-linux-gnu/4.4.5/cc1 -quiet -I. -I.. -I.. -I/usr/include/gtk-2.0 -I/usr/lib/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/gdk-pixbuf-2.0 -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng14 -I/usr/include/libdrm -I/usr/include -MD .libs/gimpenumwidgets.d -MF .deps/gimpenumwidgets.Tpo -MP -MT gimpenumwidgets.lo -D_REENTRANT -DHAVE_CONFIG_H -DG_LOG_DOMAIN="LibGimpWidgets" -DGIMP_DISABLE_DEPRECATED -DGDK_MULTIHEAD_SAFE -DGTK_MULTIHEAD_SAFE -DPIC gimpenumwidgets.c -D_FORTIFY_SOURCE=2 -quiet -dumpbase gimpenumwidgets.c -march=athlon-xp -auxbase-strip .libs/gimpenumwidgets.o -O2 -Wall -Wdeclaration-after-statement -Wmissing-prototypes -Wmissing-declarations -Winit-self -Wpointer-arith -Wold-style-definition -fPIC -o -
> portage 649 0.0 0.6 5284 3116 tty3 TN 22:22 0:00 \_ /usr/lib/gcc/i686-pc-linux-gnu/4.4.5/../../../../i686-pc-linux-gnu/bin/as -Qy -o .libs/gimpenumwidgets.o -
> root 1379 0.0 0.3 4876 1728 tty4 Ss+ 22:14 0:00 -bash
> root 4015 1.2 6.1 34176 29364 tty4 TN 22:18 0:06 \_ /usr/bin/python2.7 /usr/bin/emerge --oneshot libetpan
> portage 7306 0.0 0.3 5136 1864 tty4 TN 22:18 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
> portage 7463 0.0 0.5 6132 2460 tty4 TN 22:18 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild.sh compile
> portage 19334 0.0 0.3 4876 1556 tty4 TN 22:19 0:00 \_ /bin/bash /usr/lib/portage/bin/ebuild-helpers/emake
> portage 19339 0.0 0.2 3848 1032 tty4 TN 22:19 0:00 \_ make -j2
> portage 19736 0.0 0.2 3860 972 tty4 TN 22:19 0:00 \_ make all-recursive
> portage 19737 0.0 0.2 4748 1404 tty4 TN 22:19 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='build-windows include src tests doc'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 19747 0.0 0.1 4748 696 tty4 TN 22:19 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='build-windows include src tests doc'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 19748 0.0 0.2 3848 1052 tty4 TN 22:19 0:00 \_ make all
> portage 19749 0.0 0.2 3848 980 tty4 TN 22:19 0:00 \_ make all-recursive
> portage 19750 0.0 0.2 4748 1404 tty4 TN 22:19 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='bsd data-types low-level driver main engine'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 23219 0.0 0.1 4748 696 tty4 TN 22:20 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='bsd data-types low-level driver main engine'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 23220 0.0 0.2 3840 1040 tty4 TN 22:20 0:00 \_ make all
> portage 23225 0.0 0.2 3860 968 tty4 TN 22:20 0:00 \_ make all-recursive
> portage 23227 0.0 0.2 4748 1404 tty4 TN 22:20 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='imap imf maildir mbox mh mime nntp pop3 smtp feed'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 337 0.0 0.1 4748 696 tty4 TN 22:22 0:00 \_ /bin/sh -c failcom='exit 1'; \?for f in x $MAKEFLAGS; do \? case $f in \? *=* | --[!k]*);; \? *k*) failcom='fail=yes';; \? esac; \?done; \?dot_seen=no; \?target=`echo all-recursive | sed s/-recursive//`; \?list='imap imf maildir mbox mh mime nntp pop3 smtp feed'; for subdir in $list; do \? echo "Making $target in $subdir"; \? if test "$subdir" = "."; then \? dot_seen=yes; \? local_target="$target-am"; \? else \? local_target="$target"; \? fi; \? (cd $subdir && make $local_target) \? || eval $failcom; \?done; \?if test "$dot_seen" = "no"; then \? make "$target-am" || exit 1; \?fi; test -z "$fail"
> portage 338 0.0 0.2 3944 1064 tty4 TN 22:22 0:00 \_ make all
> portage 342 0.0 0.2 3844 1004 tty4 TN 22:22 0:00 \_ make all-am
> portage 653 0.0 0.4 5532 2176 tty4 TN 22:22 0:00 \_ /bin/sh ../../../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_content.lo -MD -MP -MF .deps/mailmime_content.Tpo -c -o mailmime_content.lo mailmime_content.c
> portage 927 0.0 0.1 1920 532 tty4 TN 22:22 0:00 | \_ /usr/i686-pc-linux-gnu/gcc-bin/4.4.5/i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_content.lo -MD -MP -MF .deps/mailmime_content.Tpo -c mailmime_content.c -o mailmime_content.o
> portage 930 0.0 0.7 11692 3620 tty4 TN 22:22 0:00 | \_ /usr/libexec/gcc/i686-pc-linux-gnu/4.4.5/cc1 -quiet -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -MD mailmime_content.d -MF .deps/mailmime_content.Tpo -MP -MT mailmime_content.lo -DHAVE_CONFIG_H -DDEBUG -D_REENTRANT mailmime_content.c -D_FORTIFY_SOURCE=2 -quiet -dumpbase mailmime_content.c -march=athlon-xp -auxbase-strip mailmime_content.o -g -O2 -O2 -W -Wall -o -
> portage 932 0.0 0.6 5280 3108 tty4 TN 22:22 0:00 | \_ /usr/lib/gcc/i686-pc-linux-gnu/4.4.5/../../../../i686-pc-linux-gnu/bin/as -Qy -o mailmime_content.o -
> portage 891 0.0 0.4 5400 2156 tty4 TN 22:22 0:00 \_ /bin/sh ../../../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_disposition.lo -MD -MP -MF .deps/mailmime_disposition.Tpo -c -o mailmime_disposition.lo mailmime_disposition.c
> portage 938 0.0 0.2 5400 1244 tty4 TN 22:22 0:00 \_ /bin/sh ../../../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_disposition.lo -MD -MP -MF .deps/mailmime_disposition.Tpo -c -o mailmime_disposition.lo mailmime_disposition.c
> portage 939 0.0 0.1 5400 944 tty4 TN 22:22 0:00 \_ /bin/sh ../../../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_disposition.lo -MD -MP -MF .deps/mailmime_disposition.Tpo -c -o mailmime_disposition.lo mailmime_disposition.c
> portage 940 0.0 0.1 5400 944 tty4 TN 22:22 0:00 \_ /bin/sh ../../../libtool --tag=CC --mode=compile i686-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../.. -I../../../include -I../../../src/low-level/imf -I../../../src/data-types -DDEBUG -D_REENTRANT -O2 -march=athlon-xp -pipe -O2 -g -W -Wall -MT mailmime_disposition.lo -MD -MP -MF .deps/mailmime_disposition.Tpo -c -o mailmime_disposition.lo mailmime_disposition.c
> root 1380 0.0 0.3 4876 1728 tty5 Ss 22:14 0:00 -bash
> root 1792 2.6 0.2 2420 1156 tty5 S+ 22:16 0:15 \_ top
> root 1381 0.0 0.1 1892 768 tty6 Ss+ 22:14 0:00 /sbin/agetty 38400 tty6 linux
> root 1521 0.0 0.0 1928 356 ? Ss 22:14 0:00 dhcpcd -m 2 eth0
> root 1562 0.0 0.1 5128 544 ? S 22:14 0:00 supervising syslog-ng
> root 1563 0.0 0.4 5408 1968 ? Ss 22:14 0:00 \_ /usr/sbin/syslog-ng
> ntp 1587 0.0 0.2 4360 1352 ? Ss 22:14 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u ntp:ntp
> collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:04 /usr/sbin/collectd -P /var/run/collectd/collectd.pid -C /etc/collectd.conf
> root 1623 0.0 0.1 1944 508 ? Ss 22:14 0:00 /usr/sbin/gpm -m /dev/input/mice -t ps2
> root 1663 0.0 0.1 2116 760 ? Ss 22:14 0:00 /sbin/rpcbind
> root 1677 0.0 0.2 2188 968 ? Ss 22:14 0:00 /sbin/rpc.statd --no-notify
> root 1737 0.0 0.2 4204 988 ? Ss 22:15 0:00 /usr/sbin/sshd
> root 942 0.0 0.4 6872 2252 ? Ss 22:23 0:00 \_ sshd: root@pts/2
> root 944 0.0 0.3 4876 1780 pts/2 Ss 22:23 0:00 \_ -bash
> root 961 0.0 0.2 4124 964 pts/2 R+ 22:26 0:00 \_ ps auxf
> root 1766 0.0 0.1 1892 780 tty1 Ss+ 22:15 0:00 /sbin/agetty 38400 tty1 linux
> root 1767 0.0 0.1 1892 784 ttyS0 Ss+ 22:15 0:00 /sbin/agetty 115200 ttyS0 vt100

> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> squashfs_inode_cache 1900 1900 384 10 1 : tunables 0 0 0 : slabdata 190 190 0
> nfs_direct_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> nfs_write_data 40 40 480 8 1 : tunables 0 0 0 : slabdata 5 5 0
> nfs_read_data 36 36 448 9 1 : tunables 0 0 0 : slabdata 4 4 0
> nfs_inode_cache 70 70 576 14 2 : tunables 0 0 0 : slabdata 5 5 0
> nfs_page 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> rpc_buffers 15 15 2080 15 8 : tunables 0 0 0 : slabdata 1 1 0
> rpc_tasks 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> rpc_inode_cache 36 36 448 9 1 : tunables 0 0 0 : slabdata 4 4 0
> fib6_nodes 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> ip6_dst_cache 29 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> ndisc_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 672 12 2 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> tw_sock_TCPv6 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
> request_sock_TCPv6 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> TCPv6 12 12 1312 12 4 : tunables 0 0 0 : slabdata 1 1 0
> aoe_bufs 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> scsi_sense_cache 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> scsi_cmd_cache 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> sd_ext_cdb 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0
> cfq_io_context 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> cfq_queue 41 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> mqueue_inode_cache 8 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
> xfs_buf 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> fstrm_item 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_mru_cache_elem 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ili 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_inode 0 0 608 13 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_efi_item 0 0 296 13 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_efd_item 0 0 296 13 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_buf_item 0 0 184 22 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_log_item_desc 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_trans 0 0 240 17 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ifork 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_dabuf 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_da_state 0 0 352 11 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_btree_cur 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_bmap_free_item 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_log_ticket 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ioend 51 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
> reiser_inode_cache 12780 12780 400 10 1 : tunables 0 0 0 : slabdata 1278 1278 0
> configfs_dir_cache 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> kioctx 0 0 224 18 1 : tunables 0 0 0 : slabdata 0 0 0
> kiocb 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> inotify_event_private_data 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0
> inotify_inode_mark 46 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
> fasync_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> khugepaged_mm_slot 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> nsproxy 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> posix_timers_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> uid_cache 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 25 27 448 9 1 : tunables 0 0 0 : slabdata 3 3 0
> UDP-Lite 0 0 544 15 2 : tunables 0 0 0 : slabdata 0 0 0
> tcp_bind_bucket 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0
> inet_peer_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_fib_trie 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_fib_alias 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_dst_cache 50 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> arp_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> RAW 8 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
> UDP 15 15 544 15 2 : tunables 0 0 0 : slabdata 1 1 0
> tw_sock_TCP 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> request_sock_TCP 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> TCP 13 13 1184 13 4 : tunables 0 0 0 : slabdata 1 1 0
> eventpoll_pwq 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> eventpoll_epi 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> sgpool-128 12 12 2592 12 8 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-64 12 12 1312 12 4 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-32 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-16 11 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-8 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> scsi_data_buffer 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> blkdev_queue 17 17 936 17 4 : tunables 0 0 0 : slabdata 1 1 0
> blkdev_requests 26 36 224 18 1 : tunables 0 0 0 : slabdata 2 2 0
> blkdev_ioc 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0
> fsnotify_event_holder 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> fsnotify_event 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
> bio-0 27 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> biovec-256 10 10 3104 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 0 0 1568 10 4 : tunables 0 0 0 : slabdata 0 0 0
> biovec-64 10 10 800 10 2 : tunables 0 0 0 : slabdata 1 1 0
> biovec-16 18 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
> sock_inode_cache 70 77 352 11 1 : tunables 0 0 0 : slabdata 7 7 0
> skbuff_fclone_cache 11 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
> skbuff_head_cache 511 546 192 21 1 : tunables 0 0 0 : slabdata 26 26 0
> file_lock_cache 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
> shmem_inode_cache 894 910 408 10 1 : tunables 0 0 0 : slabdata 91 91 0
> Acpi-Operand 949 949 56 73 1 : tunables 0 0 0 : slabdata 13 13 0
> Acpi-ParseExt 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-Parse 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-State 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-Namespace 612 612 40 102 1 : tunables 0 0 0 : slabdata 6 6 0
> proc_inode_cache 4393 4393 344 23 2 : tunables 0 0 0 : slabdata 191 191 0
> sigqueue 32 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> bdev_cache 13 18 448 9 1 : tunables 0 0 0 : slabdata 2 2 0
> sysfs_dir_cache 13696 13696 64 64 1 : tunables 0 0 0 : slabdata 214 214 0
> mnt_cache 50 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> filp 209184 209184 128 32 1 : tunables 0 0 0 : slabdata 6537 6537 0
> inode_cache 3972 3972 320 12 1 : tunables 0 0 0 : slabdata 331 331 0
> dentry 35700 35700 144 28 1 : tunables 0 0 0 : slabdata 1275 1275 0
> names_cache 7 7 4128 7 8 : tunables 0 0 0 : slabdata 1 1 0
> buffer_head 13166 37856 72 56 1 : tunables 0 0 0 : slabdata 676 676 0
> vm_area_struct 2508 2535 104 39 1 : tunables 0 0 0 : slabdata 65 65 0
> mm_struct 68 72 448 9 1 : tunables 0 0 0 : slabdata 8 8 0
> fs_cache 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
> files_cache 4240 4242 192 21 1 : tunables 0 0 0 : slabdata 202 202 0
> signal_cache 7040 7040 512 8 1 : tunables 0 0 0 : slabdata 880 880 0
> sighand_cache 102 108 1312 12 4 : tunables 0 0 0 : slabdata 9 9 0
> task_xstate 350 350 576 14 2 : tunables 0 0 0 : slabdata 25 25 0
> task_struct 7049 7049 832 19 4 : tunables 0 0 0 : slabdata 371 371 0
> cred_jar 18496 18496 128 32 1 : tunables 0 0 0 : slabdata 578 578 0
> anon_vma_chain 2371 2448 40 102 1 : tunables 0 0 0 : slabdata 24 24 0
> anon_vma 1432 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
> pid 7104 7104 64 64 1 : tunables 0 0 0 : slabdata 111 111 0
> radix_tree_node 6422 6422 312 13 1 : tunables 0 0 0 : slabdata 494 494 0
> idr_layer_cache 273 275 160 25 1 : tunables 0 0 0 : slabdata 11 11 0
> dma-kmalloc-8192 0 0 8208 3 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4112 7 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2064 15 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1040 15 4 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-512 0 0 528 15 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-256 0 0 272 15 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-128 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-64 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-32 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-16 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-192 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-96 0 0 112 36 1 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc-8192 12 12 8208 3 8 : tunables 0 0 0 : slabdata 4 4 0
> kmalloc-4096 300 301 4112 7 8 : tunables 0 0 0 : slabdata 43 43 0
> kmalloc-2048 556 570 2064 15 8 : tunables 0 0 0 : slabdata 38 38 0
> kmalloc-1024 2984 2985 1040 15 4 : tunables 0 0 0 : slabdata 199 199 0
> kmalloc-512 431 435 528 15 2 : tunables 0 0 0 : slabdata 29 29 0
> kmalloc-256 44 45 272 15 1 : tunables 0 0 0 : slabdata 3 3 0
> kmalloc-128 336 336 144 28 1 : tunables 0 0 0 : slabdata 12 12 0
> kmalloc-64 3822 3825 80 51 1 : tunables 0 0 0 : slabdata 75 75 0
> kmalloc-32 4505 4505 48 85 1 : tunables 0 0 0 : slabdata 53 53 0
> kmalloc-16 2363 5248 32 128 1 : tunables 0 0 0 : slabdata 41 41 0
> kmalloc-8 3569 3570 24 170 1 : tunables 0 0 0 : slabdata 21 21 0
> kmalloc-192 133 133 208 19 1 : tunables 0 0 0 : slabdata 7 7 0
> kmalloc-96 1008 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> kmem_cache 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0

> MemTotal: 480420 kB
> MemFree: 233396 kB
> Buffers: 38816 kB
> Cached: 34944 kB
> SwapCached: 128 kB
> Active: 53088 kB
> Inactive: 28216 kB
> Active(anon): 1844 kB
> Inactive(anon): 4924 kB
> Active(file): 51244 kB
> Inactive(file): 23292 kB
> Unevictable: 32 kB
> Mlocked: 32 kB
> SwapTotal: 524284 kB
> SwapFree: 524156 kB
> Dirty: 32 kB
> Writeback: 0 kB
> AnonPages: 6580 kB
> Mapped: 5456 kB
> Shmem: 112 kB
> Slab: 97772 kB
> SReclaimable: 19920 kB
> SUnreclaim: 77852 kB
> KernelStack: 62800 kB
> PageTables: 460 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 764492 kB
> Committed_AS: 56340 kB
> VmallocTotal: 548548 kB
> VmallocUsed: 8392 kB
> VmallocChunk: 534328 kB
> AnonHugePages: 0 kB
> DirectMap4k: 16320 kB
> DirectMap4M: 475136 kB

> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 2 0.0 0.0 0 0 ? S 22:14 0:00 [kthreadd]
> root 3 0.0 0.0 0 0 ? S 22:14 0:00 \_ [ksoftirqd/0]
> root 6 0.0 0.0 0 0 ? R 22:14 0:00 \_ [rcu_kthread]
> root 7 0.0 0.0 0 0 ? R 22:14 0:00 \_ [watchdog/0]
> root 8 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [khelper]
> root 138 0.0 0.0 0 0 ? S 22:14 0:00 \_ [sync_supers]
> root 140 0.0 0.0 0 0 ? S 22:14 0:00 \_ [bdi-default]
> root 142 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [kblockd]
> root 230 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [ata_sff]
> root 237 0.0 0.0 0 0 ? S 22:14 0:00 \_ [khubd]
> root 365 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kswapd0]
> root 464 0.0 0.0 0 0 ? S 22:14 0:00 \_ [fsnotify_mark]
> root 486 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfs_mru_cache]
> root 489 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfslogd]
> root 490 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfsdatad]
> root 491 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [xfsconvertd]
> root 554 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_0]
> root 559 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_1]
> root 573 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_2]
> root 576 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_3]
> root 579 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kworker/u:4]
> root 580 0.0 0.0 0 0 ? S 22:14 0:00 \_ [kworker/u:5]
> root 589 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_4]
> root 592 0.0 0.0 0 0 ? S 22:14 0:00 \_ [scsi_eh_5]
> root 655 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [kpsmoused]
> root 706 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [reiserfs]
> root 1486 0.0 0.0 0 0 ? S 22:14 0:00 \_ [flush-8:0]
> root 1692 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [rpciod]
> root 1693 0.0 0.0 0 0 ? S< 22:14 0:00 \_ [nfsiod]
> root 1697 0.0 0.0 0 0 ? S 22:15 0:00 \_ [lockd]
> root 976 0.0 0.0 0 0 ? S 22:30 0:00 \_ [kworker/0:0]
> root 1004 0.0 0.0 0 0 ? S 22:38 0:00 \_ [kworker/0:1]
> root 1 0.1 0.1 1740 588 ? Ss 22:14 0:02 init [3]
> root 823 0.0 0.1 2132 824 ? S<s 22:14 0:00 /sbin/udevd --daemon
> root 1778 0.0 0.1 2128 696 ? S< 22:15 0:00 \_ /sbin/udevd --daemon
> root 1377 0.0 0.3 4876 1780 tty2 Ss 22:14 0:00 -bash
> root 1145 0.0 0.2 2276 988 tty2 S+ 22:40 0:00 \_ slabtop
> root 1381 0.0 0.1 1892 768 tty6 Ss+ 22:14 0:00 /sbin/agetty 38400 tty6 linux
> root 1521 0.0 0.0 1928 356 ? Ss 22:14 0:00 dhcpcd -m 2 eth0
> root 1562 0.0 0.1 5128 544 ? S 22:14 0:00 supervising syslog-ng
> root 1563 0.0 0.4 5408 1968 ? Ss 22:14 0:00 \_ /usr/sbin/syslog-ng
> ntp 1587 0.0 0.2 4360 1352 ? Ss 22:14 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u ntp:ntp
> collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14 /usr/sbin/collectd -P /var/run/collectd/collectd.pid -C /etc/collectd.conf
> root 1623 0.0 0.1 1944 508 ? Ss 22:14 0:00 /usr/sbin/gpm -m /dev/input/mice -t ps2
> root 1663 0.0 0.1 2116 760 ? Ss 22:14 0:00 /sbin/rpcbind
> root 1677 0.0 0.2 2188 968 ? Ss 22:14 0:00 /sbin/rpc.statd --no-notify
> root 1737 0.0 0.2 4204 988 ? Ss 22:15 0:00 /usr/sbin/sshd
> root 942 0.0 0.4 7004 2264 ? Ss 22:23 0:00 \_ sshd: root@pts/2
> root 944 0.0 0.3 4876 1812 pts/2 Ss 22:23 0:00 \_ -bash
> root 1791 0.0 0.1 4124 960 pts/2 R+ 22:53 0:00 \_ ps auxf
> root 1766 0.0 0.1 1892 780 tty1 Ss+ 22:15 0:00 /sbin/agetty 38400 tty1 linux
> root 1767 0.0 0.1 1892 784 ttyS0 Ss+ 22:15 0:00 /sbin/agetty 115200 ttyS0 vt100
> root 982 0.0 0.1 1892 784 tty5 Ss+ 22:38 0:00 /sbin/agetty 38400 tty5 linux
> root 1011 0.0 0.3 4876 1748 tty3 Ss+ 22:38 0:00 -bash
> root 1126 0.0 0.1 1892 780 tty4 Ss+ 22:38 0:00 /sbin/agetty 38400 tty4 linux

> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> squashfs_inode_cache 1920 1920 384 10 1 : tunables 0 0 0 : slabdata 192 192 0
> nfs_direct_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> nfs_write_data 40 40 480 8 1 : tunables 0 0 0 : slabdata 5 5 0
> nfs_read_data 36 36 448 9 1 : tunables 0 0 0 : slabdata 4 4 0
> nfs_inode_cache 70 70 576 14 2 : tunables 0 0 0 : slabdata 5 5 0
> nfs_page 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> rpc_buffers 15 15 2080 15 8 : tunables 0 0 0 : slabdata 1 1 0
> rpc_tasks 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> rpc_inode_cache 36 36 448 9 1 : tunables 0 0 0 : slabdata 4 4 0
> fib6_nodes 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> ip6_dst_cache 29 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> ndisc_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 672 12 2 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> tw_sock_TCPv6 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
> request_sock_TCPv6 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> TCPv6 12 12 1312 12 4 : tunables 0 0 0 : slabdata 1 1 0
> aoe_bufs 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> scsi_sense_cache 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> scsi_cmd_cache 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> sd_ext_cdb 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0
> cfq_io_context 153 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
> cfq_queue 36 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> mqueue_inode_cache 8 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
> xfs_buf 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> fstrm_item 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_mru_cache_elem 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ili 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_inode 0 0 608 13 2 : tunables 0 0 0 : slabdata 0 0 0
> xfs_efi_item 0 0 296 13 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_efd_item 0 0 296 13 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_buf_item 0 0 184 22 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_log_item_desc 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_trans 0 0 240 17 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ifork 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_dabuf 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_da_state 0 0 352 11 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_btree_cur 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_bmap_free_item 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_log_ticket 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> xfs_ioend 51 51 80 51 1 : tunables 0 0 0 : slabdata 1 1 0
> reiser_inode_cache 13050 13050 400 10 1 : tunables 0 0 0 : slabdata 1305 1305 0
> configfs_dir_cache 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> kioctx 0 0 224 18 1 : tunables 0 0 0 : slabdata 0 0 0
> kiocb 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> inotify_event_private_data 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0
> inotify_inode_mark 46 46 88 46 1 : tunables 0 0 0 : slabdata 1 1 0
> fasync_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> khugepaged_mm_slot 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> nsproxy 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> posix_timers_cache 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> uid_cache 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 25 27 448 9 1 : tunables 0 0 0 : slabdata 3 3 0
> UDP-Lite 0 0 544 15 2 : tunables 0 0 0 : slabdata 0 0 0
> tcp_bind_bucket 128 128 32 128 1 : tunables 0 0 0 : slabdata 1 1 0
> inet_peer_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_fib_trie 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_fib_alias 102 102 40 102 1 : tunables 0 0 0 : slabdata 1 1 0
> ip_dst_cache 50 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> arp_cache 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> RAW 8 8 512 8 1 : tunables 0 0 0 : slabdata 1 1 0
> UDP 15 15 544 15 2 : tunables 0 0 0 : slabdata 1 1 0
> tw_sock_TCP 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> request_sock_TCP 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
> TCP 13 13 1184 13 4 : tunables 0 0 0 : slabdata 1 1 0
> eventpoll_pwq 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> eventpoll_epi 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> sgpool-128 12 12 2592 12 8 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-64 12 12 1312 12 4 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-32 12 12 672 12 2 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-16 11 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
> sgpool-8 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
> scsi_data_buffer 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> blkdev_queue 17 17 936 17 4 : tunables 0 0 0 : slabdata 1 1 0
> blkdev_requests 26 36 224 18 1 : tunables 0 0 0 : slabdata 2 2 0
> blkdev_ioc 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0
> fsnotify_event_holder 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> fsnotify_event 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
> bio-0 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> biovec-256 10 10 3104 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 0 0 1568 10 4 : tunables 0 0 0 : slabdata 0 0 0
> biovec-64 10 10 800 10 2 : tunables 0 0 0 : slabdata 1 1 0
> biovec-16 18 18 224 18 1 : tunables 0 0 0 : slabdata 1 1 0
> sock_inode_cache 70 77 352 11 1 : tunables 0 0 0 : slabdata 7 7 0
> skbuff_fclone_cache 11 11 352 11 1 : tunables 0 0 0 : slabdata 1 1 0
> skbuff_head_cache 517 567 192 21 1 : tunables 0 0 0 : slabdata 27 27 0
> file_lock_cache 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
> shmem_inode_cache 910 910 408 10 1 : tunables 0 0 0 : slabdata 91 91 0
> Acpi-Operand 949 949 56 73 1 : tunables 0 0 0 : slabdata 13 13 0
> Acpi-ParseExt 64 64 64 64 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-Parse 85 85 48 85 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-State 73 73 56 73 1 : tunables 0 0 0 : slabdata 1 1 0
> Acpi-Namespace 612 612 40 102 1 : tunables 0 0 0 : slabdata 6 6 0
> proc_inode_cache 6256 6256 344 23 2 : tunables 0 0 0 : slabdata 272 272 0
> sigqueue 25 25 160 25 1 : tunables 0 0 0 : slabdata 1 1 0
> bdev_cache 13 18 448 9 1 : tunables 0 0 0 : slabdata 2 2 0
> sysfs_dir_cache 13696 13696 64 64 1 : tunables 0 0 0 : slabdata 214 214 0
> mnt_cache 50 50 160 25 1 : tunables 0 0 0 : slabdata 2 2 0
> filp 422592 422592 128 32 1 : tunables 0 0 0 : slabdata 13206 13206 0
> inode_cache 3954 3972 320 12 1 : tunables 0 0 0 : slabdata 331 331 0
> dentry 39312 39312 144 28 1 : tunables 0 0 0 : slabdata 1404 1404 0
> names_cache 7 7 4128 7 8 : tunables 0 0 0 : slabdata 1 1 0
> buffer_head 13560 37856 72 56 1 : tunables 0 0 0 : slabdata 676 676 0
> vm_area_struct 862 1053 104 39 1 : tunables 0 0 0 : slabdata 27 27 0
> mm_struct 27 54 448 9 1 : tunables 0 0 0 : slabdata 6 6 0
> fs_cache 80 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
> files_cache 4325 4326 192 21 1 : tunables 0 0 0 : slabdata 206 206 0
> signal_cache 7848 7848 512 8 1 : tunables 0 0 0 : slabdata 981 981 0
> sighand_cache 64 108 1312 12 4 : tunables 0 0 0 : slabdata 9 9 0
> task_xstate 392 392 576 14 2 : tunables 0 0 0 : slabdata 28 28 0
> task_struct 7866 7866 832 19 4 : tunables 0 0 0 : slabdata 414 414 0
> cred_jar 21792 21792 128 32 1 : tunables 0 0 0 : slabdata 681 681 0
> anon_vma_chain 1033 1632 40 102 1 : tunables 0 0 0 : slabdata 16 16 0
> anon_vma 707 896 32 128 1 : tunables 0 0 0 : slabdata 7 7 0
> pid 7872 7872 64 64 1 : tunables 0 0 0 : slabdata 123 123 0
> radix_tree_node 6565 6565 312 13 1 : tunables 0 0 0 : slabdata 505 505 0
> idr_layer_cache 269 275 160 25 1 : tunables 0 0 0 : slabdata 11 11 0
> dma-kmalloc-8192 0 0 8208 3 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4112 7 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2064 15 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1040 15 4 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-512 0 0 528 15 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-256 0 0 272 15 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-128 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-64 0 0 80 51 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-32 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-16 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-192 0 0 208 19 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-96 0 0 112 36 1 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc-8192 12 12 8208 3 8 : tunables 0 0 0 : slabdata 4 4 0
> kmalloc-4096 285 294 4112 7 8 : tunables 0 0 0 : slabdata 42 42 0
> kmalloc-2048 547 555 2064 15 8 : tunables 0 0 0 : slabdata 37 37 0
> kmalloc-1024 3690 3690 1040 15 4 : tunables 0 0 0 : slabdata 246 246 0
> kmalloc-512 422 435 528 15 2 : tunables 0 0 0 : slabdata 29 29 0
> kmalloc-256 44 45 272 15 1 : tunables 0 0 0 : slabdata 3 3 0
> kmalloc-128 336 336 144 28 1 : tunables 0 0 0 : slabdata 12 12 0
> kmalloc-64 4486 4488 80 51 1 : tunables 0 0 0 : slabdata 88 88 0
> kmalloc-32 5354 5355 48 85 1 : tunables 0 0 0 : slabdata 63 63 0
> kmalloc-16 2351 5248 32 128 1 : tunables 0 0 0 : slabdata 41 41 0
> kmalloc-8 3566 3570 24 170 1 : tunables 0 0 0 : slabdata 21 21 0
> kmalloc-192 152 152 208 19 1 : tunables 0 0 0 : slabdata 8 8 0
> kmalloc-96 1038 1044 112 36 1 : tunables 0 0 0 : slabdata 29 29 0
> kmem_cache 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0

2011-04-25 21:37:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

2011/4/25 Bruno Prémont <[email protected]>:
>
> Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a
> few did decrease. I don't know which ones are RCU-affected and which are
> not.

It really sounds as if the tiny-rcu kthread somehow just stops
handling callbacks. The ones that keep increasing do seem to be all
rcu-free'd (but I didn't really check).
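
For reference, the free paths of two of the growing caches do end in
call_rcu(). Roughly (from-memory sketches of fs/file_table.c and
kernel/exit.c of this era, not verbatim quotes):

/* filp: closing a file defers the actual kmem_cache_free() to an RCU
 * callback, so if callbacks never run, every closed file lingers in
 * the filp slab. */
static void file_free_rcu(struct rcu_head *head)
{
	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
	kmem_cache_free(filp_cachep, f);
}

static void file_free(struct file *f)
{
	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}

/* task_struct: release_task() likewise hands the final put to RCU,
 * which also keeps the associated pid and signal_cache objects alive:
 *
 *	call_rcu(&p->rcu, delayed_put_task_struct);
 */
static void delayed_put_task_struct(struct rcu_head *rhp)
{
	put_task_struct(container_of(rhp, struct task_struct, rcu));
}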

The thing is shown as running:

root 6 0.0 0.0 0 0 ? R 22:14 0:00 \_ [rcu_kthread]

but nothing seems to happen and the CPU time hasn't increased at all.

I dunno. Makes no sense to me, but yeah, I'm definitely blaming
tiny-rcu. Paul, any ideas?

Linus

2011-04-25 21:49:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, Apr 25, 2011 at 02:30:02PM -0700, Linus Torvalds wrote:
> 2011/4/25 Bruno Prémont <[email protected]>:
> >
> > Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a
> > few did decrease. I don't know which ones are RCU-affected and which are
> > not.
>
> It really sounds as if the tiny-rcu kthread somehow just stops
> handling callbacks. The ones that keep increasing do seem to be all
> rcu-free'd (but I didn't really check).
>
> The thing is shown as running:
>
> root 6 0.0 0.0 0 0 ? R 22:14 0:00 \_ [rcu_kthread]
>
> but nothing seems to happen and the CPU time hasn't increased at all.
>
> I dunno. Makes no sense to me, but yeah, I'm definitely blaming
> tiny-rcu. Paul, any ideas?

So the only ways I know for something to be runnable but not run on
a uniprocessor are:

1. The CPU is continually busy with higher-priority work.
This doesn't make sense in this case because the system
is idle much of the time.

2. The system is hibernating. This doesn't make sense, otherwise
"ps" wouldn't run either.

Any other ideas on how the heck a process can get into this state?
(I have thus far been completely unable to reproduce it.)

The process in question has a loop in rcu_kthread() in kernel/rcutiny.c.
This loop contains a wait_event_interruptible() that waits for a global flag
to become non-zero.

It is awakened by invoke_rcu_kthread() in that same file, which
simply sets the flag to 1 and does a wake_up(), all with hardirqs
disabled.
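
For reference, the wait/wake pair described above looks roughly like this
(paraphrased from the 2.6.39-era kernel/rcutiny.c; simplified, not verbatim,
with the priority-boost hook omitted):

static DECLARE_WAIT_QUEUE_HEAD(rcu_kthread_wq);
static unsigned long have_rcu_kthread_work;

static int rcu_kthread(void *arg)
{
	unsigned long work;
	unsigned long flags;

	for (;;) {
		wait_event_interruptible(rcu_kthread_wq,
					 have_rcu_kthread_work != 0);
		local_irq_save(flags);
		work = have_rcu_kthread_work;
		have_rcu_kthread_work = 0;
		local_irq_restore(flags);
		if (work)
			rcu_process_callbacks(NULL);	/* run queued callbacks */
		schedule_timeout_interruptible(1);	/* briefly yield the CPU */
	}
	return 0;
}

/* Called with hardirqs disabled: flag the work and wake the kthread. */
static void invoke_rcu_kthread(void)
{
	unsigned long flags;

	local_irq_save(flags);
	have_rcu_kthread_work = 1;
	wake_up(&rcu_kthread_wq);
	local_irq_restore(flags);
}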

Hmmm... One "hail mary" patch below. What it does is make rcu_kthread
run at normal priority rather than at real-time priority. This is
not for inclusion -- it breaks RCU priority boosting. But well worth
trying.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
index 0c343b9..4551824 100644
--- a/kernel/rcutiny.c
+++ b/kernel/rcutiny.c
@@ -314,11 +314,15 @@ EXPORT_SYMBOL_GPL(rcu_barrier_sched);
*/
static int __init rcu_spawn_kthreads(void)
{
+#if 0
struct sched_param sp;
+#endif

rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
+#if 0
sp.sched_priority = RCU_BOOST_PRIO;
sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
+#endif
return 0;
}
early_initcall(rcu_spawn_kthreads);

2011-04-25 22:08:54

by Mike Frysinger

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

2011/4/25 Bruno Prémont:
> With TREE_PREEMPT_RCU the system was stable compiling for over 2 hours;
> after switching to TINY_RCU, filp count started increasing pretty early once
> compiling began.

since you can reproduce fairly easily, could you try some of the major
rc's to see if you could narrow things down? See if 2.6.39-rc[123]
all act the same while 2.6.38 works?
-mike

2011-04-26 06:19:08

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Mon, 25 Apr 2011 14:49:33 "Paul E. McKenney" wrote:
> On Mon, Apr 25, 2011 at 02:30:02PM -0700, Linus Torvalds wrote:
> > 2011/4/25 Bruno Prémont <[email protected]>:
> > >
> > > Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a few
> > > ones did decrease. Don't know which ones are RCU-affected and which ones are
> > > not.
> >
> > It really sounds as if the tiny-rcu kthread somehow just stops
> > handling callbacks. The ones that keep increasing do seem to be all
> > rcu-free'd (but I didn't really check).
> >
> > The thing is shown as running:
> >
> > root 6 0.0 0.0 0 0 ? R 22:14 0:00 \_
> > [rcu_kthread]
> >
> > but nothing seems to happen and the CPU time hasn't increased at all.
> >
> > I dunno. Makes no sense to me, but yeah, I'm definitely blaming
> > tiny-rcu. Paul, any ideas?
>
> So the only ways I know for something to be runnable but not run on
> a uniprocessor are:
>
> 1. The CPU is continually busy with higher-priority work.
> This doesn't make sense in this case because the system
> is idle much of the time.
>
> 2. The system is hibernating. This doesn't make sense, otherwise
> "ps" wouldn't run either.
>
> Any other ideas on how the heck a process can get into this state?
> (I have thus far been completely unable to reproduce it.)
>
> The process in question has a loop in rcu_kthread() in kernel/rcutiny.c.
> This loop contains a wait_event_interruptible() that waits for a global flag
> to become non-zero.
>
> It is awakened by invoke_rcu_kthread() in that same file, which
> simply sets the flag to 1 and does a wake_up(), all with hardirqs
> disabled.
>
> Hmmm... One "hail mary" patch below. What it does is make rcu_kthread
> run at normal priority rather than at real-time priority. This is
> not for inclusion -- it breaks RCU priority boosting. But well worth
> trying.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
> index 0c343b9..4551824 100644
> --- a/kernel/rcutiny.c
> +++ b/kernel/rcutiny.c
> @@ -314,11 +314,15 @@ EXPORT_SYMBOL_GPL(rcu_barrier_sched);
> */
> static int __init rcu_spawn_kthreads(void)
> {
> +#if 0
> struct sched_param sp;
> +#endif
>
> rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
> +#if 0
> sp.sched_priority = RCU_BOOST_PRIO;
> sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
> +#endif
> return 0;
> }
> early_initcall(rcu_spawn_kthreads);

I will give that patch a shot on Wednesday evening (European time) as I
won't have enough time in front of the affected box until then to do any
deeper testing. (Same for trying out the other -rc kernels as suggested
by Mike.)

Though I will use the few minutes I have this evening to try to fetch
kernel traces of running tasks with sysrq+t, which may eventually give
us a hint at where rcu_kthread is stuck/waiting.

Bruno

2011-04-26 11:28:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 08:19:04AM +0200, Bruno Prémont wrote:
> On Mon, 25 Apr 2011 14:49:33 "Paul E. McKenney" wrote:
> > On Mon, Apr 25, 2011 at 02:30:02PM -0700, Linus Torvalds wrote:
> > 2011/4/25 Bruno Prémont <[email protected]>:
> > > >
> > > > Between 1-slabinfo and 2-slabinfo some values increased (a lot) while a few
> > > > ones did decrease. Don't know which ones are RCU-affected and which ones are
> > > > not.
> > >
> > > It really sounds as if the tiny-rcu kthread somehow just stops
> > > handling callbacks. The ones that keep increasing do seem to be all
> > > rcu-free'd (but I didn't really check).
> > >
> > > The thing is shown as running:
> > >
> > > root 6 0.0 0.0 0 0 ? R 22:14 0:00 \_
> > > [rcu_kthread]
> > >
> > > but nothing seems to happen and the CPU time hasn't increased at all.
> > >
> > > I dunno. Makes no sense to me, but yeah, I'm definitely blaming
> > > tiny-rcu. Paul, any ideas?
> >
> > So the only ways I know for something to be runnable but not run on
> > a uniprocessor are:
> >
> > 1. The CPU is continually busy with higher-priority work.
> > This doesn't make sense in this case because the system
> > is idle much of the time.
> >
> > 2. The system is hibernating. This doesn't make sense, otherwise
> > "ps" wouldn't run either.
> >
> > Any other ideas on how the heck a process can get into this state?
> > (I have thus far been completely unable to reproduce it.)
> >
> > The process in question has a loop in rcu_kthread() in kernel/rcutiny.c.
> > This loop contains a wait_event_interruptible() that waits for a global flag
> > to become non-zero.
> >
> > It is awakened by invoke_rcu_kthread() in that same file, which
> > simply sets the flag to 1 and does a wake_up(), all with hardirqs
> > disabled.
> >
> > Hmmm... One "hail mary" patch below. What it does is make rcu_kthread
> > run at normal priority rather than at real-time priority. This is
> > not for inclusion -- it breaks RCU priority boosting. But well worth
> > trying.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcutiny.c b/kernel/rcutiny.c
> > index 0c343b9..4551824 100644
> > --- a/kernel/rcutiny.c
> > +++ b/kernel/rcutiny.c
> > @@ -314,11 +314,15 @@ EXPORT_SYMBOL_GPL(rcu_barrier_sched);
> > */
> > static int __init rcu_spawn_kthreads(void)
> > {
> > +#if 0
> > struct sched_param sp;
> > +#endif
> >
> > rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
> > +#if 0
> > sp.sched_priority = RCU_BOOST_PRIO;
> > sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
> > +#endif
> > return 0;
> > }
> > early_initcall(rcu_spawn_kthreads);
>
> I will give that patch a shot on Wednesday evening (European time) as I
> won't have enough time in front of the affected box until then to do any
> deeper testing. (Same for trying out the other -rc kernels as suggested
> by Mike.)

Thank you for both of these!!!

> Though I will use the few minutes I have this evening to try to fetch
> kernel traces of running tasks with sysrq+t, which may eventually give
> us a hint at where rcu_kthread is stuck/waiting.

This would be very helpful to me!

For my part, I will use some plane time today to stare at my code some
more and see what bugs I can find.

Linus, in the meantime, please feel free to revert 687d7a960 (rcu:
restrict TREE_RCU to SMP builds with !PREEMPT), which would allow anyone
not wanting to help chase this down to get on with their lives.

Thanx, Paul

2011-04-26 16:39:16

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, 26 April 2011 "Paul E. McKenney" <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 08:19:04AM +0200, Bruno Prémont wrote:
> > Though I will use the few minutes I have this evening to try to fetch
> > kernel traces of running tasks with sysrq+t, which may eventually give
> > us a hint at where rcu_kthread is stuck/waiting.
>
> This would be very helpful to me!

Here it comes:

rcu_kthread (when build processes are STOPped):
[ 836.050003] rcu_kthread R running 7324 6 2 0x00000000
[ 836.050003] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
[ 836.050003] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
[ 836.050003] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
[ 836.050003] Call Trace:
[ 836.050003] [<c10bf8a2>] ? check_object+0x92/0x210
[ 836.050003] [<c10be848>] ? init_object+0x38/0x70
[ 836.050003] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
[ 836.050003] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
[ 836.050003] [<c13c8ec7>] schedule_timeout+0x137/0x280
[ 836.050003] [<c10c02b8>] ? kmem_cache_free+0xe8/0x140
[ 836.050003] [<c103db60>] ? sys_gettid+0x20/0x20
[ 836.050003] [<c13c9064>] schedule_timeout_interruptible+0x14/0x20
[ 836.050003] [<c10736e0>] rcu_kthread+0xa0/0xc0
[ 836.050003] [<c104de00>] ? wake_up_bit+0x70/0x70
[ 836.050003] [<c1073640>] ? rcu_process_callbacks+0x60/0x60
[ 836.050003] [<c104d874>] kthread+0x74/0x80
[ 836.050003] [<c104d800>] ? flush_kthread_worker+0x90/0x90
[ 836.050003] [<c13caeb6>] kernel_thread_helper+0x6/0xd

a few minutes later when build processes have been killed:
[ 966.930008] rcu_kthread R running 7324 6 2 0x00000000
[ 966.930008] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
[ 966.930008] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
[ 966.930008] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
[ 966.930008] Call Trace:
[ 966.930008] [<c10bf8a2>] ? check_object+0x92/0x210
[ 966.930008] [<c10be848>] ? init_object+0x38/0x70
[ 966.930008] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
[ 966.930008] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
[ 966.930008] [<c13c8ec7>] schedule_timeout+0x137/0x280
[ 966.930008] [<c10c02b8>] ? kmem_cache_free+0xe8/0x140
[ 966.930008] [<c103db60>] ? sys_gettid+0x20/0x20
[ 966.930008] [<c13c9064>] schedule_timeout_interruptible+0x14/0x20
[ 966.930008] [<c10736e0>] rcu_kthread+0xa0/0xc0
[ 966.930008] [<c104de00>] ? wake_up_bit+0x70/0x70
[ 966.930008] [<c1073640>] ? rcu_process_callbacks+0x60/0x60
[ 966.930008] [<c104d874>] kthread+0x74/0x80
[ 966.930008] [<c104d800>] ? flush_kthread_worker+0x90/0x90
[ 966.930008] [<c13caeb6>] kernel_thread_helper+0x6/0xd

Attached (gzipped) the complete dmesg log (dmesg-t1 contains dmesg from boot
until after the first sysrq+t -- dmesg-t2 is the output of sysrq+t 2 minutes
later, after the build processes had been killed).
Just in case, I attached slabinfo as well.
Ten minutes later the rcu_kthread trace has not changed at all.

>
> For my part, I will use some plane time today to stare at my code some
> more and see what bugs I can find.

Possibly useful detail: it's somewhere during my compile that rcu_kthread
seems to stop doing its job, as after booting things look fine (no slabs
piling up). Some part of the whole emerge -> configure and/or emerge -> make -> gcc
process tree must be confusing the kernel, as it's only then that things
start piling up (I don't know if other kinds of work trigger it as well).

Bruno


Attachments:
dmesg-t1.gz (23.73 kB)
dmesg-t2.gz (11.26 kB)
slabinfo.gz (2.23 kB)

2011-04-26 17:09:33

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, 26 April 2011 Bruno Prémont <[email protected]> wrote:
> On Tue, 26 April 2011 "Paul E. McKenney" <[email protected]> wrote:
> > On Tue, Apr 26, 2011 at 08:19:04AM +0200, Bruno Prémont wrote:
> > > Though I will use the few minutes I have this evening to try to fetch
> > > kernel traces of running tasks with sysrq+t, which may eventually give
> > > us a hint at where rcu_kthread is stuck/waiting.
> >
> > This would be very helpful to me!
>
> Here it comes:
>
> rcu_kthread (when build processes are STOPped):
> [ 836.050003] rcu_kthread R running 7324 6 2 0x00000000
> [ 836.050003] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
> [ 836.050003] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
> [ 836.050003] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
> [ 836.050003] Call Trace:
> [ 836.050003] [<c10bf8a2>] ? check_object+0x92/0x210
> [ 836.050003] [<c10be848>] ? init_object+0x38/0x70
> [ 836.050003] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
> [ 836.050003] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
> [ 836.050003] [<c13c8ec7>] schedule_timeout+0x137/0x280
> [ 836.050003] [<c10c02b8>] ? kmem_cache_free+0xe8/0x140
> [ 836.050003] [<c103db60>] ? sys_gettid+0x20/0x20
> [ 836.050003] [<c13c9064>] schedule_timeout_interruptible+0x14/0x20
> [ 836.050003] [<c10736e0>] rcu_kthread+0xa0/0xc0
> [ 836.050003] [<c104de00>] ? wake_up_bit+0x70/0x70
> [ 836.050003] [<c1073640>] ? rcu_process_callbacks+0x60/0x60
> [ 836.050003] [<c104d874>] kthread+0x74/0x80
> [ 836.050003] [<c104d800>] ? flush_kthread_worker+0x90/0x90
> [ 836.050003] [<c13caeb6>] kernel_thread_helper+0x6/0xd
>
> a few minutes later when build processes have been killed:
> [ 966.930008] rcu_kthread R running 7324 6 2 0x00000000
> [ 966.930008] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
> [ 966.930008] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
> [ 966.930008] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
> [ 966.930008] Call Trace:
> [ 966.930008] [<c10bf8a2>] ? check_object+0x92/0x210
> [ 966.930008] [<c10be848>] ? init_object+0x38/0x70
> [ 966.930008] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
> [ 966.930008] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
> [ 966.930008] [<c13c8ec7>] schedule_timeout+0x137/0x280
> [ 966.930008] [<c10c02b8>] ? kmem_cache_free+0xe8/0x140
> [ 966.930008] [<c103db60>] ? sys_gettid+0x20/0x20
> [ 966.930008] [<c13c9064>] schedule_timeout_interruptible+0x14/0x20
> [ 966.930008] [<c10736e0>] rcu_kthread+0xa0/0xc0
> [ 966.930008] [<c104de00>] ? wake_up_bit+0x70/0x70
> [ 966.930008] [<c1073640>] ? rcu_process_callbacks+0x60/0x60
> [ 966.930008] [<c104d874>] kthread+0x74/0x80
> [ 966.930008] [<c104d800>] ? flush_kthread_worker+0x90/0x90
> [ 966.930008] [<c13caeb6>] kernel_thread_helper+0x6/0xd
>
> Attached (gzipped) the complete dmesg log (dmesg-t1 contains dmesg from boot
> until after the first sysrq+t -- dmesg-t2 is the output of sysrq+t 2 minutes
> later, after the build processes had been killed).
> Just in case, I attached slabinfo as well.
> Ten minutes later the rcu_kthread trace has not changed at all.

Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
context switches and exactly one non-voluntary one.

In addition when rcu_kthread has stopped doing its work
`swapoff $(swapdevice)` seems to block forever (at least normal shutdown
blocks on disabling swap device).
If I get to do it when I get back home I will manually try to swapoff
and take process traces with sysrq-t.

Bruno

2011-04-26 17:13:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 9:38 AM, Bruno Prémont
<[email protected]> wrote:
>
> Here it comes:
>
> rcu_kthread (when build processes are STOPped):
> [ 836.050003] rcu_kthread R running 7324 6 2 0x00000000
> [ 836.050003] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
> [ 836.050003] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
> [ 836.050003] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
> [ 836.050003] Call Trace:
> [ 836.050003] [<c10bf8a2>] ? check_object+0x92/0x210
> [ 836.050003] [<c10be848>] ? init_object+0x38/0x70
> [ 836.050003] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
> [ 836.050003] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
> [ 836.050003] [<c13c8ec7>] schedule_timeout+0x137/0x280

Hmm.

I'm adding Ingo and Peter to the cc, because this whole "rcu_kthread
is running, but never actually running" is starting to smell like a
scheduler issue.

Peter/Ingo: RCUTINY seems to be broken for Bruno. During any kind of
heavy workload, at some point it looks like rcu_kthread simply stops
making any progress. It's constantly in runnable state, but it doesn't
actually use any CPU time, and it's not processing the RCU callbacks,
so the RCU memory freeing isn't happening, and slabs just build up
until the machine dies.

And it really is RCUTINY, because the thing doesn't happen with the
regular tree-RCU.

This is without CONFIG_RCU_BOOST_PRIO, so we basically have

struct sched_param sp;

rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
sp.sched_priority = RCU_BOOST_PRIO;
sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);

where RCU_BOOST_PRIO is 1 for the non-boost case.

Is that so low that even the idle thread will take priority? It's a UP
config with PREEMPT_VOLUNTARY. So pretty much _all_ the stars are
aligned for odd scheduling behavior.

Other users of SCHED_FIFO tend to set the priority really high (eg
"MAX_RT_PRIO-1" is clearly the default one - softirq's, watchdog), but
"1" is not unheard of either (touchscreen/ucb1400_ts and
mmc/core/sdio_irq), and there are some other random choices out there.

Any ideas?

Linus

2011-04-26 17:24:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont
<[email protected]> wrote:
>
> Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> context switches and exactly one non-voluntary one.
>
> In addition when rcu_kthread has stopped doing its work
> `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> blocks on disabling swap device).
> If I get to do it when I get back home I will manually try to swapoff
> and take process traces with sysrq-t.

That "exactly one non-voluntary one" sounds like the smoking gun.

Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
kind of the point of SCHED_FIFO. Involuntary context switches happen
when some higher-priority SCHED_FIFO process becomes runnable (irq
handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
too), and maybe there is a bug in the runqueue handling for that case.
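
(For anyone wanting to watch those counters themselves, here is a small
hypothetical userspace helper -- not something posted in this thread --
that dumps the two ctxt_switches lines for a given pid:)

/* Print voluntary/nonvoluntary context-switch counters for a pid. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[128];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* Both counter lines end in "ctxt_switches". */
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "ctxt_switches"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}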

Ingo, do you have any tests for SCHED_FIFO scheduling? Particularly
with UP and voluntary preempt?

Linus

2011-04-26 18:50:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 10:12:39AM -0700, Linus Torvalds wrote:
> On Tue, Apr 26, 2011 at 9:38 AM, Bruno Prémont
> <[email protected]> wrote:
> >
> > Here it comes:
> >
> > rcu_kthread (when build processes are STOPped):
> > [ 836.050003] rcu_kthread R running 7324 6 2 0x00000000
> > [ 836.050003] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
> > [ 836.050003] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
> > [ 836.050003] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
> > [ 836.050003] Call Trace:
> > [ 836.050003] [<c10bf8a2>] ? check_object+0x92/0x210
> > [ 836.050003] [<c10be848>] ? init_object+0x38/0x70
> > [ 836.050003] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
> > [ 836.050003] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
> > [ 836.050003] [<c13c8ec7>] schedule_timeout+0x137/0x280
>
> Hmm.
>
> I'm adding Ingo and Peter to the cc, because this whole "rcu_kthread
> is running, but never actually running" is starting to smell like a
> scheduler issue.
>
> Peter/Ingo: RCUTINY seems to be broken for Bruno. During any kind of
> heavy workload, at some point it looks like rcu_kthread simply stops
> making any progress. It's constantly in runnable state, but it doesn't
> actually use any CPU time, and it's not processing the RCU callbacks,
> so the RCU memory freeing isn't happening, and slabs just build up
> until the machine dies.
>
> And it really is RCUTINY, because the thing doesn't happen with the
> regular tree-RCU.

The difference between TINY_RCU and TREE_RCU is that TREE_RCU still uses
softirq for the core RCU processing. TINY_RCU switched to a kthread
when I implemented RCU priority boosting. There is a similar change in
my -rcu tree that makes TREE_RCU use kthreads, and Sedat has been running
into a very similar problem with that change in place. Which is why I
do not yet push it to the -next tree.
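
(Sketched for contrast, with approximate names rather than the exact code,
the wake-up paths differ as follows:)

/* TREE_RCU, 2.6.39: core processing stays in softirq context. */
static void invoke_rcu_core_tree(void)
{
	raise_softirq(RCU_SOFTIRQ);	/* rcu_process_callbacks() runs from softirq */
}

/* TINY_RCU, 2.6.39: core processing moved to a SCHED_FIFO kthread, so
 * callbacks only run if the scheduler actually runs rcu_kthread. */
static void invoke_rcu_core_tiny(void)
{
	have_rcu_kthread_work = 1;
	wake_up(&rcu_kthread_wq);
}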

> This is without CONFIG_RCU_BOOST_PRIO, so we basically have
>
> struct sched_param sp;
>
> rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
> sp.sched_priority = RCU_BOOST_PRIO;
> sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
>
> where RCU_BOOST_PRIO is 1 for the non-boost case.

Good point! Bruno, Sedat, could you please set CONFIG_RCU_BOOST_PRIO to
(say) 50, and see if this still happens? (I bet that you do, but...)

> Is that so low that even the idle thread will take priority? It's a UP
> config with PREEMPT_VOLUNTARY. So pretty much _all_ the stars are
> aligned for odd scheduling behavior.
>
> Other users of SCHED_FIFO tend to set the priority really high (eg
> "MAX_RT_PRIO-1" is clearly the default one - softirq's, watchdog), but
> "1" is not unheard of either (touchscreen/ucb1400_ts and
> mmc/core/sdio_irq), and there are some other random choices out there.
>
> Any ideas?

I have found one bug so far in my code, but it only affects TREE_RCU
in my -rcu tree, and even then only if HOTPLUG_CPU is enabled. I am
testing a fix, but I expect Sedat's tests to still break.

I gave Sedat a patch that makes rcu_kthread() run at normal (non-realtime)
priority, and he did not see the failure. So running non-realtime at
least greatly reduces the probability of failure.

Thanx, Paul

2011-04-26 19:17:31

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 8:50 PM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, Apr 26, 2011 at 10:12:39AM -0700, Linus Torvalds wrote:
>> On Tue, Apr 26, 2011 at 9:38 AM, Bruno Prémont
>> <[email protected]> wrote:
>> >
>> > Here it comes:
>> >
>> > rcu_kthread (when build processes are STOPped):
>> > [  836.050003] rcu_kthread     R running   7324     6      2 0x00000000
>> > [  836.050003]  dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
>> > [  836.050003]  c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
>> > [  836.050003]  dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
>> > [  836.050003] Call Trace:
>> > [  836.050003]  [<c10bf8a2>] ? check_object+0x92/0x210
>> > [  836.050003]  [<c10be848>] ? init_object+0x38/0x70
>> > [  836.050003]  [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
>> > [  836.050003]  [<c103d9fd>] ? lock_timer_base+0x2d/0x70
>> > [  836.050003]  [<c13c8ec7>] schedule_timeout+0x137/0x280
>>
>> Hmm.
>>
>> I'm adding Ingo and Peter to the cc, because this whole "rcu_kthread
>> is running, but never actually running" is starting to smell like a
>> scheduler issue.
>>
>> Peter/Ingo: RCUTINY seems to be broken for Bruno. During any kind of
>> heavy workload, at some point it looks like rcu_kthread simply stops
>> making any progress. It's constantly in runnable state, but it doesn't
>> actually use any CPU time, and it's not processing the RCU callbacks,
>> so the RCU memory freeing isn't happening, and slabs just build up
>> until the machine dies.
>>
>> And it really is RCUTINY, because the thing doesn't happen with the
>> regular tree-RCU.
>
> The difference between TINY_RCU and TREE_RCU is that TREE_RCU still uses
> softirq for the core RCU processing.  TINY_RCU switched to a kthread
> when I implemented RCU priority boosting.  There is a similar change in
> my -rcu tree that makes TREE_RCU use kthreads, and Sedat has been running
> into a very similar problem with that change in place.  Which is why I
> do not yet push it to the -next tree.
>
>> This is without CONFIG_RCU_BOOST_PRIO, so we basically have
>>
>>         struct sched_param sp;
>>
>>         rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
>>         sp.sched_priority = RCU_BOOST_PRIO;
>>         sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
>>
>> where RCU_BOOST_PRIO is 1 for the non-boost case.
>
> Good point!  Bruno, Sedat, could you please set CONFIG_RCU_BOOST_PRIO to
> (say) 50, and see if this still happens?  (I bet that you do, but...)
>

What about the CONFIG_RCU_BOOST_DELAY setting?

Are those values OK?

$ egrep 'M486|M686|X86_UP|CONFIG_SMP|NR_CPUS|PREEMPT|_RCU|_HIGHMEM|PAE' .config
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_TRACE=y
CONFIG_RCU_FANOUT=32
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_TREE_RCU_TRACE=y
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=50
CONFIG_RCU_BOOST_DELAY=500
CONFIG_SMP=y
# CONFIG_M486 is not set
CONFIG_M686=y
CONFIG_NR_CPUS=32
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_CPU_STALL_VERBOSE=y
CONFIG_PREEMPT_TRACER=y

- Sedat -

>> Is that so low that even the idle thread will take priority? It's a UP
>> config with PREEMPT_VOLUNTARY. So pretty much _all_ the stars are
>> aligned for odd scheduling behavior.
>>
>> Other users of SCHED_FIFO tend to set the priority really high (eg
>> "MAX_RT_PRIO-1" is clearly the default one - softirq's, watchdog), but
>> "1" is not unheard of either (touchscreen/ucb1400_ts and
>> mmc/core/sdio_irq), and there are some other random choices out there.
>>
>> Any ideas?
>
> I have found one bug so far in my code, but it only affects TREE_RCU
> in my -rcu tree, and even then only if HOTPLUG_CPU is enabled.  I am
> testing a fix, but I expect Sedat's tests to still break.
>
> I gave Sedat a patch that makes rcu_kthread() run at normal (non-realtime)
> priority, and he did not see the failure.  So running non-realtime at
> least greatly reduces the probability of failure.
>
>                                                        Thanx, Paul

2011-04-26 22:28:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, 26 Apr 2011, Linus Torvalds wrote:

> On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont
> <[email protected]> wrote:
> >
> > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > context switches and exactly one non-voluntary one.
> >
> > In addition when rcu_kthread has stopped doing its work
> > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > blocks on disabling swap device).
> > If I get to do it when I get back home I will manually try to swapoff
> > and take process traces with sysrq-t.
>
> That "exactly one non-voluntary one" sounds like the smoking gun.
>
> Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
> kind of the point of SCHED_FIFO. Involuntary context switches happen
> when some higher-priority SCHED_FIFO process becomes runnable (irq
> handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
> too), and maybe there is a bug in the runqueue handling for that case.

The forced irq threading is only effective when you add the command
line parameter "threadirqs". I don't see any irq threads in the ps
outputs, so that's not the problem.

Though the whole ps output is weird. There is only one thread/process
which accumulated CPU time

collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14

All others show 0:00 CPU time - not only rcu_kthread.

Bruno, are you running on real hardware or in a virtual machine?

Can you please enable CONFIG_SCHED_DEBUG and provide the output of
/proc/sched_stat when the problem surfaces and a minute after the
first snapshot?

Also please apply the patch below and check, whether the printk shows
up in your dmesg.

Thanks,

tglx

---
kernel/sched_rt.c | 1 +
1 file changed, 1 insertion(+)

Index: linux-2.6-tip/kernel/sched_rt.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched_rt.c
+++ linux-2.6-tip/kernel/sched_rt.c
@@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str

if (rt_rq->rt_time > runtime) {
rt_rq->rt_throttled = 1;
+ printk_once(KERN_WARNING "sched: RT throttling activated\n");
if (rt_rq_throttled(rt_rq)) {
sched_rt_rq_dequeue(rt_rq);
return 1;

2011-04-27 06:15:08

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote:
> > >
> > > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > > context switches and exactly one non-voluntary one.
> > >
> > > In addition when rcu_kthread has stopped doing its work
> > > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > > blocks on disabling swap device).
> > > If I get to do it when I get back home I will manually try to swapoff
> > > and take process traces with sysrq-t.
> >
> > That "exactly one non-voluntary one" sounds like the smoking gun.
> >
> > Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
> > kind of the point of SCHED_FIFO. Involuntary context switches happen
> > when some higher-priority SCHED_FIFO process becomes runnable (irq
> > handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
> > too), and maybe there is a bug in the runqueue handling for that case.
>
> The forced irq threading is only effective when you add the command
> line parameter "threadirqs". I don't see any irq threads in the ps
> outputs, so that's not the problem.
>
> Though the whole ps output is weird. There is only one thread/process
> which accumulated CPU time
>
> collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14

Whole system does not have much uptime so it's quite expected that CPU
time remains low. collectd is the only daemon that has more work to do
(scan many files every 10s)
On the ps output with stopped build processes there should be some more
with accumulated CPU time... though looking at it only top and python
have accumulated anything.

Next time I can scan /proc/${PID}/ for more precise CPU times to see
how zero they are.

> All others show 0:00 CPU time - not only rcu_kthread.
>
> Bruno, are you running on real hardware or in a virtual machine?

It's real hardware (nforce420 chipset, aka the first nforce generation;
AMD Athlon 1800 CPU; 512MB of RAM, of which 32MB is taken by the IGP;
so something like 7-10 years old).

> Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> /proc/sched_stat when the problem surfaces and a minute after the
> first snapshot?
>
> Also please apply the patch below and check, whether the printk shows
> up in your dmesg.

Will include in my testing when back home this evening. (Will have to
offload kernel compilations to a quicker box otherwise my evening will
be much too short...)

Bruno


> Thanks,
>
> tglx
>
> ---
> kernel/sched_rt.c | 1 +
> 1 file changed, 1 insertion(+)
>
> Index: linux-2.6-tip/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/sched_rt.c
> +++ linux-2.6-tip/kernel/sched_rt.c
> @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
>
> if (rt_rq->rt_time > runtime) {
> rt_rq->rt_throttled = 1;
> + printk_once(KERN_WARNING "sched: RT throttling activated\n");
> if (rt_rq_throttled(rt_rq)) {
> sched_rt_rq_dequeue(rt_rq);
> return 1;

2011-04-27 10:28:54

by Catalin Marinas

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On 25 April 2011 17:31, Linus Torvalds <[email protected]> wrote:
> 2011/4/25 Bruno Prémont <[email protected]>:
>> kmemleak reports 86681 new leaks between shortly after boot and -2 state.
>> (and 2348 additional ones between -2 and -4).
>
> I wouldn't necessarily trust kmemleak with the whole RCU-freeing
> thing. In your slubinfo reports, the kmemleak data itself also tends
> to overwhelm everything else - none of it looks unreasonable per se.

Kmemleak reports that it couldn't find any pointers to those objects
when scanning the memory. In theory, it is safe with RCU, since objects
queued for freeing via RCU sit on a linked list and are thus still
referenced.
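
(A minimal illustration of that point, using the standard call_rcu()
pattern: while the object is queued, its embedded rcu_head sits on RCU's
callback list, so a memory scan still finds a pointer to it.)

#include <linux/slab.h>
#include <linux/rcupdate.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* linked into RCU's callback list while queued */
};

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_release(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);	/* f stays reachable until the callback runs */
}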

There are of course false positives, usually when pointers are stored
in some structures not scanned by kmemleak (e.g. some arrays allocated
with alloc_pages which are not explicitly tracked by kmemleak) but I
haven't seen any related to RCU (yet).

--
Catalin

2011-04-27 18:41:55

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 April 2011 Bruno Prémont wrote:
> On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > > On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote:
> > > >
> > > > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > > > context switches and exactly one non-voluntary one.
> > > >
> > > > In addition when rcu_kthread has stopped doing its work
> > > > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > > > blocks on disabling swap device).

Apparently it's not swapoff but `umount -a -t tmpfs` that's getting
stuck here. Manual swapoff worked.

The stuck umount:
[ 1714.960735] umount D 5a000040 5668 20331 20324 0x00000000
[ 1714.960735] c3c99e5c 00000086 dd407900 5a000040 dd25a1a8 dd407900 dd25a120 c3c99e0c
[ 1714.960735] c3c99e24 c10c1be2 c14d9f20 c3c99e5c c3c8c680 c3c8c680 000000bb c3c99e24
[ 1714.960735] c10c0b88 dd25a120 dd407900 ddfd4b40 c3c99e4c ddfc9d20 dd402380 5a000010
[ 1714.960735] Call Trace:
[ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735] [<c13cb37d>] schedule_timeout+0x16d/0x280
[ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735] [<c10c2122>] ? free_debug_processing+0x112/0x1f0
[ 1714.960735] [<c10a3791>] ? shmem_put_super+0x11/0x20
[ 1714.960735] [<c13cae9c>] wait_for_common+0x9c/0x150
[ 1714.960735] [<c102c890>] ? try_to_wake_up+0x170/0x170
[ 1714.960735] [<c13caff2>] wait_for_completion+0x12/0x20
[ 1714.960735] [<c1075ad7>] rcu_barrier_sched+0x47/0x50
[ 1714.960735] [<c104d3c0>] ? alloc_pid+0x370/0x370
[ 1714.960735] [<c10ce74a>] deactivate_locked_super+0x3a/0x60
[ 1714.960735] [<c10ce948>] deactivate_super+0x48/0x70
[ 1714.960735] [<c10e7427>] mntput_no_expire+0x87/0xe0
[ 1714.960735] [<c10e7800>] sys_umount+0x60/0x320
[ 1714.960735] [<c10b231a>] ? remove_vma+0x3a/0x50
[ 1714.960735] [<c10b3b22>] ? do_munmap+0x212/0x2f0
[ 1714.960735] [<c10e7ad9>] sys_oldumount+0x19/0x20
[ 1714.960735] [<c13cce10>] sysenter_do_call+0x12/0x26

which looks like a lock conflict with RCU:

[ 1714.960735] rcu_kthread R running 6924 6 2 0x00000000
[ 1714.960735] dd473f28 00000046 5a000240 dbd6ba7c dd407360 ddfaf840 dbd6b740 dd473ed8
[ 1714.960735] ddfaee00 dd407a20 5a000000 dd473f28 dd445040 dd445040 0000009c dd473f0c
[ 1714.960735] c10c1be2 c14d9f20 dbf7057c 0000005a 000000bb 000000bb dd473f0c c10c0b88
[ 1714.960735] Call Trace:
[ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735] [<c103fd8d>] ? lock_timer_base+0x2d/0x70
[ 1714.960735] [<c13cb347>] schedule_timeout+0x137/0x280
[ 1714.960735] [<c103fef0>] ? sys_gettid+0x20/0x20
[ 1714.960735] [<c13cb4e4>] schedule_timeout_interruptible+0x14/0x20
[ 1714.960735] [<c1075a70>] rcu_kthread+0xa0/0xc0
[ 1714.960735] [<c1050190>] ? wake_up_bit+0x70/0x70
[ 1714.960735] [<c10759d0>] ? rcu_process_callbacks+0x60/0x60
[ 1714.960735] [<c104fc04>] kthread+0x74/0x80
[ 1714.960735] [<c104fb90>] ? flush_kthread_worker+0x90/0x90
[ 1714.960735] [<c13cd336>] kernel_thread_helper+0x6/0xd

(I have rest of sysreq+t output available in case someone wants it)

> > > > If I get to do it when I get back home I will manually try to swapoff
> > > > and take process traces with sysrq-t.
> > >
> > > That "exactly one non-voluntary one" sounds like the smoking gun.

It's not the gun we're looking for, as it's already smoking long before
any RCU-managed slabs start piling up (e.g. already when I get to a
shell after the boot sequence).

Voluntary context switches stay constant from the time SLABs start piling up.
(which makes sense as it doesn't get CPU slices anymore)

> > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > /proc/sched_stat when the problem surfaces and a minute after the
> > first snapshot?

hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?

I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
/proc/sched_debug, which exists... (attached, taken about 7min and +1min
after SLABs started piling up), though build processes were SIGSTOPped
during the first minute.

printk wrote (in case its timestamp is useful, more below):
[ 518.480103] sched: RT throttling activated

If my choice was the wrong one, please tell me so I can generate the other
ones.

> > Also please apply the patch below and check, whether the printk shows
> > up in your dmesg.
>
> > Index: linux-2.6-tip/kernel/sched_rt.c
> > ===================================================================
> > --- linux-2.6-tip.orig/kernel/sched_rt.c
> > +++ linux-2.6-tip/kernel/sched_rt.c
> > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
> >
> > if (rt_rq->rt_time > runtime) {
> > rt_rq->rt_throttled = 1;
> > + printk_once(KERN_WARNING "sched: RT throttling activated\n");

This gun is triggering right before RCU-managed slabs start piling up as
visible under slabtop, so chances are it's at least related!

Bruno


> > if (rt_rq_throttled(rt_rq)) {
> > sched_rt_rq_dequeue(rt_rq);
> > return 1;


Attachments:
sched_debug-n (12.80 kB)
sched_debug-n+60 (12.81 kB)

2011-04-27 19:19:39

by Pádraig Brady

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On 27/04/11 19:41, Bruno Prémont wrote:
> On Wed, 27 April 2011 Bruno Prémont wrote:
>> On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
>>> On Tue, 26 Apr 2011, Linus Torvalds wrote:
>>>> On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote:
>>>>>
>>>>> Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
>>>>> context switches and exactly one non-voluntary one.
>>>>>
>>>>> In addition when rcu_kthread has stopped doing its work
>>>>> `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
>>>>> blocks on disabling swap device).
>
> Apparently it's not swapoff but `umount -a -t tmpfs` that's getting
> stuck here. Manual swapoff worked.

Anything to do with this?
http://thread.gmane.org/gmane.linux.kernel.mm/60953/

cheers,
Pádraig.

2011-04-27 19:34:48

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 April 2011 Pádraig Brady wrote:
> On 27/04/11 19:41, Bruno Prémont wrote:
> > On Wed, 27 April 2011 Bruno Prémont wrote:
> >> On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> >>> On Tue, 26 Apr 2011, Linus Torvalds wrote:
> >>>> On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont wrote:
> >>>>> Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> >>>>> context switches and exactly one non-voluntary one.
> >>>>>
> >>>>> In addition when rcu_kthread has stopped doing its work
> >>>>> `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> >>>>> blocks on disabling swap device).
> >
> > Apparently it's not swapoff but `umount -a -t tmpfs` that's getting
> > stuck here. Manual swapoff worked.
>
> Anything to do with this?
> http://thread.gmane.org/gmane.linux.kernel.mm/60953/

I don't think so; if it is, it is only loosely related.

From the trace you omitted to keep, it's visible that it gets hit by the
non-operating RCU kthread.
Maybe the existence of the RCU barrier in this trace has some relation to
the above thread, but I don't see it at first glance.

[ 1714.960735] umount D 5a000040 5668 20331 20324 0x00000000
[ 1714.960735] c3c99e5c 00000086 dd407900 5a000040 dd25a1a8 dd407900 dd25a120 c3c99e0c
[ 1714.960735] c3c99e24 c10c1be2 c14d9f20 c3c99e5c c3c8c680 c3c8c680 000000bb c3c99e24
[ 1714.960735] c10c0b88 dd25a120 dd407900 ddfd4b40 c3c99e4c ddfc9d20 dd402380 5a000010
[ 1714.960735] Call Trace:
[ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
[ 1714.960735] [<c13cb37d>] schedule_timeout+0x16d/0x280
[ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
[ 1714.960735] [<c10c2122>] ? free_debug_processing+0x112/0x1f0
[ 1714.960735] [<c10a3791>] ? shmem_put_super+0x11/0x20
[ 1714.960735] [<c13cae9c>] wait_for_common+0x9c/0x150
[ 1714.960735] [<c102c890>] ? try_to_wake_up+0x170/0x170
[ 1714.960735] [<c13caff2>] wait_for_completion+0x12/0x20
[ 1714.960735] [<c1075ad7>] rcu_barrier_sched+0x47/0x50
^^^^^^^^^^^^^^^^^
[ 1714.960735] [<c104d3c0>] ? alloc_pid+0x370/0x370
[ 1714.960735] [<c10ce74a>] deactivate_locked_super+0x3a/0x60
[ 1714.960735] [<c10ce948>] deactivate_super+0x48/0x70
[ 1714.960735] [<c10e7427>] mntput_no_expire+0x87/0xe0
[ 1714.960735] [<c10e7800>] sys_umount+0x60/0x320
[ 1714.960735] [<c10b231a>] ? remove_vma+0x3a/0x50
[ 1714.960735] [<c10b3b22>] ? do_munmap+0x212/0x2f0
[ 1714.960735] [<c10e7ad9>] sys_oldumount+0x19/0x20
[ 1714.960735] [<c13cce10>] sysenter_do_call+0x12/0x26

Bruno

2011-04-27 20:40:39

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 April 2011 Bruno Prémont wrote:
> On Wed, 27 April 2011 Bruno Prémont wrote:
> > On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > > Also please apply the patch below and check, whether the printk shows
> > > up in your dmesg.
> >
> > > Index: linux-2.6-tip/kernel/sched_rt.c
> > > ===================================================================
> > > --- linux-2.6-tip.orig/kernel/sched_rt.c
> > > +++ linux-2.6-tip/kernel/sched_rt.c
> > > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
> > >
> > > if (rt_rq->rt_time > runtime) {
> > > rt_rq->rt_throttled = 1;
> > > + printk_once(KERN_WARNING "sched: RT throttling activated\n");
>
> This gun is triggering right before RCU-managed slabs start piling up as
> visible under slabtop, so chances are it's at least related!

Letting the machine idle (except for running collectd and slabtop), the
scheduler suddenly decided to resume giving rcu_kthread CPU cycles (after
two hours or so, if I read my statistics graphs correctly!)

While looking at lkml during those 2 hours I stumbled across this thread
(whose patch doesn't help in my case), which looked possibly related:
http://thread.gmane.org/gmane.linux.kernel/1129614

Bruno

2011-04-27 21:55:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, Apr 27, 2011 at 12:28:37AM +0200, Thomas Gleixner wrote:
> On Tue, 26 Apr 2011, Linus Torvalds wrote:
>
> > On Tue, Apr 26, 2011 at 10:09 AM, Bruno Prémont
> > <[email protected]> wrote:
> > >
> > > Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > > context switches and exactly one non-voluntary one.
> > >
> > > In addition when rcu_kthread has stopped doing its work
> > > `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > > blocks on disabling swap device).
> > > If I get to do it when I get back home I will manually try to swapoff
> > > and take process traces with sysrq-t.
> >
> > That "exactly one non-voluntary one" sounds like the smoking gun.
> >
> > Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
> > kind of the point of SCHED_FIFO. Involuntary context switches happen
> > when some higher-priority SCHED_FIFO process becomes runnable (irq
> > handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
> > too), and maybe there is a bug in the runqueue handling for that case.
>
> The forced irq threading is only effective when you add the command
> line parameter "threadirqs". I don't see any irq threads in the ps
> outputs, so that's not the problem.
>
> Though the whole ps output is weird. There is only one thread/process
> which accumulated CPU time
>
> collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14

I believe that the above is the script that prints out the RCU debugfs
information periodically. Unless there is something else that begins
with "collectd" instead of just collectdebugfs.sh.

Thanx, Paul

> All others show 0:00 CPU time - not only rcu_kthread.
>
> Bruno, are you running on real hardware or in a virtual machine?
>
> Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> /proc/sched_stat when the problem surfaces and a minute after the
> first snapshot?
>
> Also please apply the patch below and check, whether the printk shows
> up in your dmesg.
>
> Thanks,
>
> tglx
>
> ---
> kernel/sched_rt.c | 1 +
> 1 file changed, 1 insertion(+)
>
> Index: linux-2.6-tip/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/sched_rt.c
> +++ linux-2.6-tip/kernel/sched_rt.c
> @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
>
> if (rt_rq->rt_time > runtime) {
> rt_rq->rt_throttled = 1;
> + printk_once(KERN_WARNING "sched: RT throttling activated\n");
> if (rt_rq_throttled(rt_rq)) {
> sched_rt_rq_dequeue(rt_rq);
> return 1;

2011-04-27 22:02:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Tue, Apr 26, 2011 at 09:17:28PM +0200, Sedat Dilek wrote:
> On Tue, Apr 26, 2011 at 8:50 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, Apr 26, 2011 at 10:12:39AM -0700, Linus Torvalds wrote:
> >> On Tue, Apr 26, 2011 at 9:38 AM, Bruno Prémont
> >> <[email protected]> wrote:
> >> >
> >> > Here it comes:
> >> >
> >> > rcu_kthread (when build processes are STOPped):
> >> > [ 836.050003] rcu_kthread R running 7324 6 2 0x00000000
> >> > [ 836.050003] dd473f28 00000046 5a000240 dd65207c dd407360 dd651d40 0000035c dd473ed8
> >> > [ 836.050003] c10bf8a2 c14d63d8 dd65207c dd473f28 dd445040 dd445040 dd473eec c10be848
> >> > [ 836.050003] dd651d40 dd407360 ddfdca00 dd473f14 c10bfde2 00000000 00000001 000007b6
> >> > [ 836.050003] Call Trace:
> >> > [ 836.050003] [<c10bf8a2>] ? check_object+0x92/0x210
> >> > [ 836.050003] [<c10be848>] ? init_object+0x38/0x70
> >> > [ 836.050003] [<c10bfde2>] ? free_debug_processing+0x112/0x1f0
> >> > [ 836.050003] [<c103d9fd>] ? lock_timer_base+0x2d/0x70
> >> > [ 836.050003] [<c13c8ec7>] schedule_timeout+0x137/0x280
> >>
> >> Hmm.
> >>
> >> I'm adding Ingo and Peter to the cc, because this whole "rcu_kthread
> >> is running, but never actually running" is starting to smell like a
> >> scheduler issue.
> >>
> >> Peter/Ingo: RCUTINY seems to be broken for Bruno. During any kind of
> >> heavy workload, at some point it looks like rcu_kthread simply stops
> >> making any progress. It's constantly in runnable state, but it doesn't
> >> actually use any CPU time, and it's not processing the RCU callbacks,
> >> so the RCU memory freeing isn't happening, and slabs just build up
> >> until the machine dies.
> >>
> >> And it really is RCUTINY, because the thing doesn't happen with the
> >> regular tree-RCU.
> >
> > The difference between TINY_RCU and TREE_RCU is that TREE_RCU still uses
> > softirq for the core RCU processing. TINY_RCU switched to a kthread
> > when I implemented RCU priority boosting. There is a similar change in
> > my -rcu tree that makes TREE_RCU use kthreads, and Sedat has been running
> > into a very similar problem with that change in place. Which is why I
> > do not yet push it to the -next tree.
> >
> >> This is without CONFIG_RCU_BOOST_PRIO, so we basically have
> >>
> >> struct sched_param sp;
> >>
> >> rcu_kthread_task = kthread_run(rcu_kthread, NULL, "rcu_kthread");
> >> sp.sched_priority = RCU_BOOST_PRIO;
> >> sched_setscheduler_nocheck(rcu_kthread_task, SCHED_FIFO, &sp);
> >>
> >> where RCU_BOOST_PRIO is 1 for the non-boost case.
> >
> > Good point! Bruno, Sedat, could you please set CONFIG_RCU_BOOST_PRIO to
> > (say) 50, and see if this still happens? (I bet that you do, but...)
> >
>
> What about the CONFIG_RCU_BOOST_DELAY setting?

CONFIG_RCU_BOOST_DELAY controls how long preemptible RCU lets a grace
period run before boosting the priority of any blocked RCU readers.

It is completely irrelevant if the rcu_kthread task isn't getting a
chance to run, though. This is because it is the rcu_kthread task
that does the boosting.
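
(Roughly, and with approximate names since this is a sketch of the
description above rather than the exact rcutiny code, the gating looks
like:)

static int rcu_boost(void)
{
	/* Nothing to boost if no readers block the grace period, or if
	 * the grace period is younger than the configured boost delay. */
	if (!rcu_preempt_blocked_readers() ||
	    time_before(jiffies, boost_time))
		return 0;
	/* ...otherwise lend RT priority to the oldest blocked reader.
	 * This only ever executes if rcu_kthread itself gets CPU time. */
	return 1;
}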

> Are those values OK?
>
> $ egrep 'M486|M686|X86_UP|CONFIG_SMP|NR_CPUS|PREEMPT|_RCU|_HIGHMEM|PAE' .config
> CONFIG_TREE_PREEMPT_RCU=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_RCU_TRACE=y
> CONFIG_RCU_FANOUT=32
> # CONFIG_RCU_FANOUT_EXACT is not set
> CONFIG_TREE_RCU_TRACE=y
> CONFIG_RCU_BOOST=y

I suggest CONFIG_RCU_BOOST=n to keep things simple for the moment, but
CONFIG_RCU_BOOST=y should be OK too.

> CONFIG_RCU_BOOST_PRIO=50
> CONFIG_RCU_BOOST_DELAY=500
> CONFIG_SMP=y
> # CONFIG_M486 is not set
> CONFIG_M686=y

I don't have an opinion on CONFIG_M486 vs. CONFIG_M686.

> CONFIG_NR_CPUS=32
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_HIGHMEM4G=y
> # CONFIG_HIGHMEM64G is not set
> CONFIG_HIGHMEM=y
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> CONFIG_DEBUG_PREEMPT=y

The above two could be left out, but shouldn't hurt.

> # CONFIG_SPARSE_RCU_POINTER is not set
> # CONFIG_DEBUG_HIGHMEM is not set
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=60
> CONFIG_RCU_CPU_STALL_VERBOSE=y
> CONFIG_PREEMPT_TRACER=y

So they look fine to me, the ones that I understand, anyway. ;-)

Thanx, Paul
>
> - Sedat -
>
> >> Is that so low that even the idle thread will take priority? It's a UP
> >> config with PREEMPT_VOLUNTARY. So pretty much _all_ the stars are
> >> aligned for odd scheduling behavior.
> >>
> >> Other users of SCHED_FIFO tend to set the priority really high (eg
> >> "MAX_RT_PRIO-1" is clearly the default one - softirq's, watchdog), but
> >> "1" is not unheard of either (touchscreen/ucb1400_ts and
> >> mmc/core/sdio_irq), and there are some other random choices out there.
> >>
> >> Any ideas?
> >
> > I have found one bug so far in my code, but it only affects TREE_RCU
> > in my -rcu tree, and even then only if HOTPLUG_CPU is enabled. I am
> > testing a fix, but I expect Sedat's tests to still break.
> >
> > I gave Sedat a patch that makes rcu_kthread() run at normal (non-realtime)
> > priority, and he did not see the failure. So running non-realtime at
> > least greatly reduces the probability of failure.
> >
> > Thanx, Paul

2011-04-27 22:05:44

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, Apr 27, 2011 at 09:34:31PM +0200, Bruno Prémont wrote:
> On Wed, 27 April 2011 Pádraig Brady wrote:
> > On 27/04/11 19:41, Bruno Prémont wrote:
> > > On Wed, 27 April 2011 Bruno Prémont wrote:
> > >> On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > >>> On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > >>>> On Tue, Apr 26, 2011 at 10:09 AM, Bruno Pr?mont wrote:
> > >>>>> Just in case, /proc/$(pidof rcu_kthread)/status shows ~20k voluntary
> > >>>>> context switches and exactly one non-voluntary one.
> > >>>>>
> > >>>>> In addition when rcu_kthread has stopped doing its work
> > >>>>> `swapoff $(swapdevice)` seems to block forever (at least normal shutdown
> > >>>>> blocks on disabling swap device).
> > >
> > > Apparently it's not swapoff but `umount -a -t tmpfs` that's getting
> > > stuck here. Manual swapoff worked.

Doesn't "umount" wait for an RCU grace period? If so, then your hang
is just a consequence of RCU grace periods hanging, which in turn appears
to be a consequence of rcu_kthread not being allowed to run.
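
(Sketch of why the umount path blocks, assuming the 2.6.39-era barrier
implementation; the real rcu_barrier_sched() posts a callback per CPU and
counts them down, but the idea is:)

static DECLARE_COMPLETION(rcu_barrier_completion);
static struct rcu_head rcu_barrier_head;

static void rcu_barrier_callback(struct rcu_head *unused)
{
	complete(&rcu_barrier_completion);	/* runs from callback processing */
}

void rcu_barrier_sched_sketch(void)
{
	call_rcu_sched(&rcu_barrier_head, rcu_barrier_callback);
	/* umount parks here: the completion never fires while rcu_kthread
	 * is starved and callbacks are never invoked. */
	wait_for_completion(&rcu_barrier_completion);
}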

> > Anything to do with this?
> > http://thread.gmane.org/gmane.linux.kernel.mm/60953/
>
> I don't think so, if it is, it is only loosely related.
>
> From the trace you omitted to keep, it's visible that it gets hit by the
> non-operating RCU kthread.

Yep, makes sense!

Thanx, Paul

> Maybe the existence of the RCU barrier in this trace has some relation to
> the above thread, but I don't see it at first glance.
>
> [ 1714.960735] umount D 5a000040 5668 20331 20324 0x00000000
> [ 1714.960735] c3c99e5c 00000086 dd407900 5a000040 dd25a1a8 dd407900 dd25a120 c3c99e0c
> [ 1714.960735] c3c99e24 c10c1be2 c14d9f20 c3c99e5c c3c8c680 c3c8c680 000000bb c3c99e24
> [ 1714.960735] c10c0b88 dd25a120 dd407900 ddfd4b40 c3c99e4c ddfc9d20 dd402380 5a000010
> [ 1714.960735] Call Trace:
> [ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
> [ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
> [ 1714.960735] [<c10c1be2>] ? check_object+0x92/0x210
> [ 1714.960735] [<c13cb37d>] schedule_timeout+0x16d/0x280
> [ 1714.960735] [<c10c0b88>] ? init_object+0x38/0x70
> [ 1714.960735] [<c10c2122>] ? free_debug_processing+0x112/0x1f0
> [ 1714.960735] [<c10a3791>] ? shmem_put_super+0x11/0x20
> [ 1714.960735] [<c13cae9c>] wait_for_common+0x9c/0x150
> [ 1714.960735] [<c102c890>] ? try_to_wake_up+0x170/0x170
> [ 1714.960735] [<c13caff2>] wait_for_completion+0x12/0x20
> [ 1714.960735] [<c1075ad7>] rcu_barrier_sched+0x47/0x50
> ^^^^^^^^^^^^^^^^^
> [ 1714.960735] [<c104d3c0>] ? alloc_pid+0x370/0x370
> [ 1714.960735] [<c10ce74a>] deactivate_locked_super+0x3a/0x60
> [ 1714.960735] [<c10ce948>] deactivate_super+0x48/0x70
> [ 1714.960735] [<c10e7427>] mntput_no_expire+0x87/0xe0
> [ 1714.960735] [<c10e7800>] sys_umount+0x60/0x320
> [ 1714.960735] [<c10b231a>] ? remove_vma+0x3a/0x50
> [ 1714.960735] [<c10b3b22>] ? do_munmap+0x212/0x2f0
> [ 1714.960735] [<c10e7ad9>] sys_oldumount+0x19/0x20
> [ 1714.960735] [<c13cce10>] sysenter_do_call+0x12/0x26
>
> Bruno

2011-04-27 22:06:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 Apr 2011, Bruno Prémont wrote:
> On Wed, 27 April 2011 Bruno Prémont wrote:
> Voluntary context switches stay constant from the time SLABs start piling up.
> (which makes sense as it doesn't get CPU slices anymore)
>
> > > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > > /proc/sched_stat when the problem surfaces and a minute after the
> > > first snapshot?
>
> hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?
>
> I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
> /proc/sched_debug, which exists... (attached, taken about 7min and +1min
> after SLABs started piling up), though build processes were SIGSTOPped
> during the first minute.

Oops. /proc/sched_debug is the right thing.

> printk wrote (in case its timestamp is useful, more below):
> [ 518.480103] sched: RT throttling activated

Ok. Aside of the fact that the CPU time accounting is completely hosed
this is pointing to the root cause of the problem.

kthread_rcu seems to run in circles for whatever reason and the RT
throttler catches it. After that things go down the drain completely,
even though it should get back on the CPU after that 50ms throttling break.
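
Those numbers come from the default RT bandwidth knobs; roughly, from
the kernel/sched.c of that era:

	/*
	 * Period over which RT task CPU usage is measured, in us
	 * (default: 1s), and the share of it RT tasks may consume
	 * (default: 0.95s). The remaining 50ms per period is the
	 * "throttling break" mentioned above.
	 */
	unsigned int sysctl_sched_rt_period = 1000000;
	int sysctl_sched_rt_runtime = 950000;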

Though we should not ignore the fact, that the RT throttler hit, but
none of the RT tasks actually accumulated runtime.

So there are a couple of questions:

- Why does the scheduler detect the 950 ms RT runtime, but does
not accumulate that runtime to any thread

- Why is the runtime accounting totally hosed

- Why does that not happen (at least not reproducible) with
TREE_RCU

I need some sleep now, but I will try to come up with sensible
debugging tomorrow unless Paul or someone else beats me to it.

Thanks,

tglx

2011-04-27 22:07:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, Apr 27, 2011 at 10:40:23PM +0200, Bruno Prémont wrote:
> On Wed, 27 April 2011 Bruno Pr?mont wrote:
> > On Wed, 27 April 2011 Bruno Pr?mont wrote:
> > > On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > > > Also please apply the patch below and check, whether the printk shows
> > > > up in your dmesg.
> > >
> > > > Index: linux-2.6-tip/kernel/sched_rt.c
> > > > ===================================================================
> > > > --- linux-2.6-tip.orig/kernel/sched_rt.c
> > > > +++ linux-2.6-tip/kernel/sched_rt.c
> > > > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
> > > >
> > > > if (rt_rq->rt_time > runtime) {
> > > > rt_rq->rt_throttled = 1;
> > > > + printk_once(KERN_WARNING "sched: RT throttling activated\n");
> >
> > This gun is triggering right before RCU-managed slabs start piling up as
> > visible under slabtop, so chances are it's at least related!
>
> While letting the machine idle (except for collectd and slabtop), the
> scheduler suddenly decided to resume giving rcu_kthread CPU cycles (after
> two hours or so, if I read my statistics graphs correctly!)

And this also returned the slab memory, right?

Two hours is quite some time...

Thanx, Paul

> While looking at lkml during the above 2 hours I stumbled across this (the
> patch of which doesn't help in my case) which looked possibly related.
> http://thread.gmane.org/gmane.linux.kernel/1129614
>
> Bruno

2011-04-27 22:27:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 12:06:11AM +0200, Thomas Gleixner wrote:
> On Wed, 27 Apr 2011, Bruno Prémont wrote:
> > On Wed, 27 April 2011 Bruno Prémont wrote:
> > Voluntary context switches stay constant from the time SLABs start piling up.
> > (which makes sense as it doesn't get CPU slices anymore)
> >
> > > > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > > > /proc/sched_stat when the problem surfaces and a minute after the
> > > > first snapshot?
> >
> > hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?
> >
> > I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
> > /proc/sched_debug which exists... (attached, taken about 7min and +1min
> > after SLABs started piling up), though build processes were SIGSTOPped
> > during first minute.
>
> Oops. /proc/sched_debug is the right thing.
>
> > printk wrote (in case its timestamp is useful, more below):
> > [ 518.480103] sched: RT throttling activated
>
> Ok. Aside of the fact that the CPU time accounting is completely hosed
> this is pointing to the root cause of the problem.
>
> kthread_rcu seems to run in circles for whatever reason and the RT
> throttler catches it. After that things go down the drain completely,
> even though it should get back on the CPU after that 50ms throttling break.

Ah. This could happen if there was a huge number of callbacks, in
which case blimit would be set very large and kthread_rcu could then
go CPU-bound. And this workload was generating large numbers of
callbacks due to filesystem operations, right?

So, perhaps I should kick kthread_rcu back to SCHED_NORMAL if blimit
has been set high. Or have some throttling of my own. I must confess
that throttling kthread_rcu for two hours seems a bit harsh. ;-)

If this was just throttling kthread_rcu for a few hundred milliseconds,
or even for a second or two, things would be just fine.

Left to myself, I will put together a patch that puts callback processing
down to SCHED_NORMAL in the case where there are huge numbers of
callbacks to be processed.
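
A userspace sketch of that fallback idea -- the threshold and queue
length are made-up values for illustration only:

	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		struct sched_param sp = { .sched_priority = 1 };
		long queued_callbacks = 2000000;	/* pretend: callback flood */
		const long flood_threshold = 10000;	/* assumed value */

		/* run as a real-time task by default (needs root) */
		if (sched_setscheduler(0, SCHED_FIFO, &sp))
			perror("sched_setscheduler(SCHED_FIFO)");

		/* under a callback flood, drop back to a normal policy so
		 * the task can no longer monopolize the CPU nor trip the
		 * RT throttler */
		if (queued_callbacks > flood_threshold) {
			sp.sched_priority = 0;
			if (sched_setscheduler(0, SCHED_OTHER, &sp))
				perror("sched_setscheduler(SCHED_OTHER)");
			else
				printf("fell back to SCHED_OTHER for the flood\n");
		}
		return 0;
	}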

> Though we should not ignore the fact, that the RT throttler hit, but
> none of the RT tasks actually accumulated runtime.
>
> So there are a couple of questions:
>
> - Why does the scheduler detect the 950 ms RT runtime, but does
> not accumulate that runtime to any thread
>
> - Why is the runtime accounting totally hosed
>
> - Why does that not happen (at least not reproducible) with
> TREE_RCU

This one I can answer -- in Linus's tree, TREE_RCU still uses softirq,
so there is no RCU kthread, so there is nothing to throttle other
than ksoftirqd itself.

Thanx, Paul

> I need some sleep now, but I will try to come up with sensible
> debugging tomorrow unless Paul or someone else beats me to it.
>
> Thanks,
>
> tglx

2011-04-27 22:33:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?



On Wed, 27 Apr 2011, Paul E. McKenney wrote:

> On Thu, Apr 28, 2011 at 12:06:11AM +0200, Thomas Gleixner wrote:
> > On Wed, 27 Apr 2011, Bruno Prémont wrote:
> > > On Wed, 27 April 2011 Bruno Prémont wrote:
> > > Voluntary context switches stay constant from the time SLABs start piling up.
> > > (which makes sense as it doesn't get CPU slices anymore)
> > >
> > > > > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > > > > /proc/sched_stat when the problem surfaces and a minute after the
> > > > > first snapshot?
> > >
> > > hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?
> > >
> > > I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
> > > /proc/sched_debug which exists... (attached, taken about 7min and +1min
> > > after SLABs started piling up), though build processes were SIGSTOPped
> > > during first minute.
> >
> > Oops. /proc/sched_debug is the right thing.
> >
> > > printk wrote (in case its timestamp is useful, more below):
> > > [ 518.480103] sched: RT throttling activated
> >
> > Ok. Aside of the fact that the CPU time accounting is completely hosed
> > this is pointing to the root cause of the problem.
> >
> > kthread_rcu seems to run in circles for whatever reason and the RT
> > throttler catches it. After that things go down the drain completely,
> > even though it should get back on the CPU after that 50ms throttling break.
>
> Ah. This could happen if there was a huge number of callbacks, in
> which case blimit would be set very large and kthread_rcu could then
> go CPU-bound. And this workload was generating large numbers of
> callbacks due to filesystem operations, right?
>
> So, perhaps I should kick kthread_rcu back to SCHED_NORMAL if blimit
> has been set high. Or have some throttling of my own. I must confess
> that throttling kthread_rcu for two hours seems a bit harsh. ;-)

That's not the intended thing. See below.

> If this was just throttling kthread_rcu for a few hundred milliseconds,
> or even for a second or two, things would be just fine.
>
> Left to myself, I will put together a patch that puts callback processing
> down to SCHED_NORMAL in the case where there are huge numbers of
> callbacks to be processed.

Well that's going to paper over the problem at hand possibly. I really
don't see why that thing would run for more than 950ms in a row even
if there is a large number of callbacks pending.

And then I don't have an explanation for the hosed CPU accounting and
why that thing does not get another 950ms RT time when the 50ms
throttling break is over.

Thanks,

tglx

2011-04-27 22:59:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 12:32:50AM +0200, Thomas Gleixner wrote:
> On Wed, 27 Apr 2011, Paul E. McKenney wrote:
> > On Thu, Apr 28, 2011 at 12:06:11AM +0200, Thomas Gleixner wrote:
> > > On Wed, 27 Apr 2011, Bruno Prémont wrote:
> > > > On Wed, 27 April 2011 Bruno Prémont wrote:
> > > > Voluntary context switches stay constant from the time SLABs start piling up.
> > > > (which makes sense as it doesn't get CPU slices anymore)
> > > >
> > > > > > Can you please enable CONFIG_SCHED_DEBUG and provide the output of
> > > > > > /proc/sched_stat when the problem surfaces and a minute after the
> > > > > > first snapshot?
> > > >
> > > > hm, did you mean CONFIG_SCHEDSTAT or /proc/sched_debug?
> > > >
> > > > I did use CONFIG_SCHED_DEBUG (and there is no /proc/sched_stat) so I took
> > > > /proc/sched_debug which exists... (attached, taken about 7min and +1min
> > > > after SLABs started piling up), though build processes were SIGSTOPped
> > > > during first minute.
> > >
> > > Oops. /proc/sched_debug is the right thing.
> > >
> > > > printk wrote (in case its timestamp is useful, more below):
> > > > [ 518.480103] sched: RT throttling activated
> > >
> > > Ok. Aside of the fact that the CPU time accounting is completely hosed
> > > this is pointing to the root cause of the problem.
> > >
> > > kthread_rcu seems to run in circles for whatever reason and the RT
> > > throttler catches it. After that things go down the drain completely,
> > > even though it should get back on the CPU after that 50ms throttling break.
> >
> > Ah. This could happen if there was a huge number of callbacks, in
> > which case blimit would be set very large and kthread_rcu could then
> > go CPU-bound. And this workload was generating large numbers of
> > callbacks due to filesystem operations, right?
> >
> > So, perhaps I should kick kthread_rcu back to SCHED_NORMAL if blimit
> > has been set high. Or have some throttling of my own. I must confess
> > that throttling kthread_rcu for two hours seems a bit harsh. ;-)
>
> That's not the intended thing. See below.
>
> > If this was just throttling kthread_rcu for a few hundred milliseconds,
> > or even for a second or two, things would be just fine.
> >
> > Left to myself, I will put together a patch that puts callback processing
> > down to SCHED_NORMAL in the case where there are huge numbers of
> > callbacks to be processed.
>
> Well that's going to paper over the problem at hand possibly. I really
> don't see why that thing would run for more than 950ms in a row even
> if there is a large number of callbacks pending.

True enough, it would probably take millions of callbacks to keep
rcu_do_batch() busy for 950 milliseconds. Possible, but hopefully
unlikely.

Hmmm... If this is happening, I should see it in the debug stuff that
Sedat sent me. And the biggest change I see in a 15-second interval
is 50,000 RCU callbacks, which is large, but should not be problematic.
Even if they all showed up at once, I would hope that they could be
invoked within a few hundred milliseconds.
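
Back-of-the-envelope, assuming on the order of 1-2us per callback
invocation (an assumption, not a measurement): 50,000 callbacks cost
roughly 50,000 x 2us = 100ms, nowhere near the 950ms budget.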

> And then I don't have an explanation for the hosed CPU accounting and
> why that thing does not get another 950ms RT time when the 50ms
> throttling break is over.

Would problems in the CPU accounting result in spurious throttles,
or are we talking different types of accounting here?

Thanx, Paul

2011-04-27 23:29:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, Apr 27, 2011 at 3:32 PM, Thomas Gleixner <[email protected]> wrote:
>
> Well that's going to paper over the problem at hand possibly. I really
> don't see why that thing would run for more than 950ms in a row even
> if there is a large number of callbacks pending.

Stop with this bogosity already, guys.

We _know_ it didn't run continuously for 950ms. That number is totally
made up. There's not enough work for it to run that long, but more
importantly, the thread has zero CPU time. There is _zero_ reason to
believe that it runs for long periods.

There is some scheduler bug, probably the rt_time hasn't been
initialized at all, or the runtime we compare against is zero, or the
calculations are just wrong.

The 950ms didn't happen. Stop harping on it. It almost certainly
simply doesn't exist.

Since that

if (rt_rq->rt_time > runtime) {
rt_rq->rt_throttled = 1;
+ printk_once(KERN_WARNING "sched: RT throttling activated\n");

test triggers, we know that either 'runtime' or 'rt_time' is just
bogus. Make the printk print out the values, and maybe that gives some
hints.
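
A minimal version of that change might look like this (a sketch; the
casts are just to keep printk's %llu happy for u64 fields):

	if (rt_rq->rt_time > runtime) {
		rt_rq->rt_throttled = 1;
		printk_once(KERN_WARNING "sched: RT throttling activated %llu > %llu\n",
			    (unsigned long long)rt_rq->rt_time,
			    (unsigned long long)runtime);
	}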

But in the meantime, I'd suggest looking for the places that
initialize or calculate those values, and just assume that some of
them are buggy.

> And then I don't have an explanation for the hosed CPU accounting and
> why that thing does not get another 950ms RT time when the 50ms
> throttling break is over.

Again, don't even bother talking about "another 950ms". It didn't
happen in the first place, there's no "another" there either.

Linus

2011-04-27 23:53:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, Apr 27, 2011 at 4:28 PM, Linus Torvalds
<[email protected]> wrote:
>
> We _know_ it didn't run continuously for 950ms. That number is totally
> made up. There's not enough work for it to run that long, but more
> importantly, the thread has zero CPU time. There is _zero_ reason to
> believe that it runs for long periods.

Hmm. But it might certainly have run for a _total_ of 950ms. Since
that's just under a second, we wouldn't see it in the "ps" output.

Where is rt_time cleared? I see that subtract in
do_sched_rt_period_timer(), but judging by the caller that is only
called for some timer overrun case (I didn't look at what the
definition of such an overrun is, though). Shouldn't rt_time be
cleared when the task goes to sleep voluntarily?

What am I missing?
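
(For reference, the decay he is asking about lives in
do_sched_rt_period_timer() and looks roughly like this in the
2.6.39-era code -- paraphrased, not verbatim:)

	/* rt_time only decays when this timer callback actually fires */
	rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
	if (rt_rq->rt_throttled && rt_rq->rt_time < runtime)
		rt_rq->rt_throttled = 0;	/* unthrottle */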

Linus

2011-04-28 06:10:18

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 Apr 2011 15:07:17 "Paul E. McKenney" wrote:
> On Wed, Apr 27, 2011 at 10:40:23PM +0200, Bruno Prémont wrote:
> > On Wed, 27 April 2011 Bruno Prémont wrote:
> > > On Wed, 27 April 2011 Bruno Prémont wrote:
> > > > On Wed, 27 Apr 2011 00:28:37 +0200 (CEST) Thomas Gleixner wrote:
> > > > > Also please apply the patch below and check, whether the printk shows
> > > > > up in your dmesg.
> > > >
> > > > > Index: linux-2.6-tip/kernel/sched_rt.c
> > > > > ===================================================================
> > > > > --- linux-2.6-tip.orig/kernel/sched_rt.c
> > > > > +++ linux-2.6-tip/kernel/sched_rt.c
> > > > > @@ -609,6 +609,7 @@ static int sched_rt_runtime_exceeded(str
> > > > >
> > > > > if (rt_rq->rt_time > runtime) {
> > > > > rt_rq->rt_throttled = 1;
> > > > > + printk_once(KERN_WARNING "sched: RT throttling activated\n");
> > >
> > > This gun is triggering right before RCU-managed slabs start piling up as
> > > visible under slabtop, so chances are it's at least related!
> >
> > While letting the machine idle (except for collectd and slabtop), the
> > scheduler suddenly decided to resume giving rcu_kthread CPU cycles (after
> > two hours or so, if I read my statistics graphs correctly!)
>
> And this also returned the slab memory, right?

Exactly!

> Two hours is quite some time...
>
> Thanx, Paul
>
> > While looking at lkml during the above 2 hours I stumbled across this (the
> > patch of which doesn't help in my case) which looked possibly related.
> > http://thread.gmane.org/gmane.linux.kernel/1129614
> >
> > Bruno

2011-04-28 06:22:32

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Wed, 27 Apr 2011 14:55:49 "Paul E. McKenney" wrote:
> On Wed, Apr 27, 2011 at 12:28:37AM +0200, Thomas Gleixner wrote:
> > On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > > Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
> > > kind of the point of SCHED_FIFO. Involuntary context switches happen
> > > when some higher-priority SCHED_FIFO process becomes runnable (irq
> > > handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
> > > too), and maybe there is a bug in the runqueue handling for that case.
> >
> > The forced irq threading is only effective when you add the command
> > line parameter "threadirqs". I don't see any irq threads in the ps
> > outputs, so that's not the problem.
> >
> > Though the whole ps output is weird. There is only one thread/process
> > which accumulated CPU time
> >
> > collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14
>
> I believe that the above is the script that prints out the RCU debugfs
> information periodically. Unless there is something else that begins
> with "collectd" instead of just collectdebugfs.sh.

No, collectd is a multi-threaded daemon that collects statistics of all
kinds, see http://www.collectd.org/ for details (on my machine it
collects CPU usage, memory usage [just the basics], disk statistics,
network statistics, load and a few more).

Bruno

2011-04-28 09:09:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

Bruno,

On Thu, 28 Apr 2011, Thomas Gleixner wrote:
> On Wed, 27 Apr 2011, Bruno Prémont wrote:
> I need some sleep now, but I will try to come up with sensible
> debugging tomorrow unless Paul or someone else beats me to it.

can you please add the patch below and provide the /proc/sched_debug
output when the problem shows up again?

Thanks,

tglx

---
kernel/sched.c | 3 ---
1 file changed, 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -642,9 +642,6 @@ static void update_rq_clock(struct rq *r
{
s64 delta;

- if (rq->skip_clock_update)
- return;
-
delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
rq->clock += delta;
update_rq_clock_task(rq, delta);

2011-04-28 09:17:51

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 11:09 AM, Thomas Gleixner <[email protected]> wrote:
> Bruno,
>
> On Thu, 28 Apr 2011, Thomas Gleixner wrote:
>> On Wed, 27 Apr 2011, Bruno Prémont wrote:
>> I need some sleep now, but I will try to come up with sensible
>> debugging tomorrow unless Paul or someone else beats me to it.
>
> can you please add the patch below and provide the /proc/sched_debug
> output when the problem shows up again?
>
> Thanks,
>
>        tglx
>
> ---
>  kernel/sched.c |    3 ---
>  1 file changed, 3 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -642,9 +642,6 @@ static void update_rq_clock(struct rq *r
>  {
>        s64 delta;
>
> -       if (rq->skip_clock_update)
> -               return;
> -
>        delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
>        rq->clock += delta;
>        update_rq_clock_task(rq, delta);

Referring to [1]?

- Sedat -

[1] http://lkml.org/lkml/2011/4/22/35

2011-04-28 09:40:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, Sedat Dilek wrote:
> On Thu, Apr 28, 2011 at 11:09 AM, Thomas Gleixner <[email protected]> wrote:
> > Bruno,
> >
> > On Thu, 28 Apr 2011, Thomas Gleixner wrote:
> >> On Wed, 27 Apr 2011, Bruno Prémont wrote:
> >> I need some sleep now, but I will try to come up with sensible
> >> debugging tomorrow unless Paul or someone else beats me to it.
> >
> > can you please add the patch below and provide the /proc/sched_debug
> > output when the problem shows up again?
> >
> > Thanks,
> >
> >        tglx
> >
> > ---
> >  kernel/sched.c |    3 ---
> >  1 file changed, 3 deletions(-)
> >
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -642,9 +642,6 @@ static void update_rq_clock(struct rq *r
> >  {
> >        s64 delta;
> >
> > -       if (rq->skip_clock_update)
> > -               return;
> > -
> >        delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> >        rq->clock += delta;
> >        update_rq_clock_task(rq, delta);
>
> Referring to [1]?
>
> - Sedat -
>
> [1] http://lkml.org/lkml/2011/4/22/35

Kinda, but I suspect there is more wrong with that optimization thing
for as-yet-unknown reasons.

Thanks,

tglx

2011-04-28 09:45:08

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

Hi,

not sure if my problem from linux-2.6-rcu.git#sedat.2011.04.23a is
related to the issue here.

Just FYI:
I am here on a Pentium-M (uniprocessor aka UP) and still unsure if I
have the correct (optimal?) kernel-configs set.

Paul gave me a script to collect RCU data and I enhanced it with
collecting SCHED data.

In the above mentioned GIT branch I applied these two extra commits
(0001 requested by Paul and 0002 proposed by Thomas):

patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch

Furthermore, I have added my kernel-config file, scripts, patches and
logs (also output of 'cat /proc/cpuinfo').

Hope this helps the experts to narrow down the problem.

Regards,
- Sedat -

P.S.: I adapted the patch from [1] against
linux-2.6-rcu.git#sedat.2011.04.23a, but did not help here.

[1] http://lkml.org/lkml/2011/4/22/35


Attachments:
from-dileks.tar.xz (92.38 kB)
from-dileks.tar.xz.sha256sum (85.00 B)
Download all attachments

2011-04-28 10:12:13

by Mike Galbraith

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 2011-04-28 at 11:40 +0200, Thomas Gleixner wrote:
> On Thu, 28 Apr 2011, Sedat Dilek wrote:
> > On Thu, Apr 28, 2011 at 11:09 AM, Thomas Gleixner <[email protected]> wrote:
> > > Bruno,
> > >
> > > On Thu, 28 Apr 2011, Thomas Gleixner wrote:
> > >> On Wed, 27 Apr 2011, Bruno Prémont wrote:
> > >> I need some sleep now, but I will try to come up with sensible
> > >> debugging tomorrow unless Paul or someone else beats me to it.
> > >
> > > can you please add the patch below and provide the /proc/sched_debug
> > > output when the problem shows up again?
> > >
> > > Thanks,
> > >
> > > tglx
> > >
> > > ---
> > > kernel/sched.c | 3 ---
> > > 1 file changed, 3 deletions(-)
> > >
> > > Index: linux-2.6/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/sched.c
> > > +++ linux-2.6/kernel/sched.c
> > > @@ -642,9 +642,6 @@ static void update_rq_clock(struct rq *r
> > > {
> > > s64 delta;
> > >
> > > - if (rq->skip_clock_update)
> > > - return;
> > > -
> > > delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> > > rq->clock += delta;
> > > update_rq_clock_task(rq, delta);
> >
> > Referring to [1]?
> >
> > - Sedat -
> >
> > [1] http://lkml.org/lkml/2011/4/22/35
>
> Kinda, but I suspect there is more wrong with that optimization thing
> for yet unknown reasons.

It's definitely getting in the way in the throttled-to-unthrottled RT
transition in the otherwise-idle case. Removing it to test is a good idea.

-Mike

2011-04-28 10:26:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 11:45:03AM +0200, Sedat Dilek wrote:
> Hi,
>
> not sure if my problem from linux-2.6-rcu.git#sedat.2011.04.23a is
> related to the issue here.
>
> Just FYI:
> I am here on a Pentium-M (uniprocessor aka UP) and still unsure if I
> have the correct (optimal?) kernel-configs set.
>
> Paul gave me a script to collect RCU data and I enhanced it with
> collecting SCHED data.
>
> In the above mentioned GIT branch I applied these two extra commits
> (0001 requested by Paul and 0002 proposed by Thomas):
>
> patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
>
> Furthermore, I have added my kernel-config file, scripts, patches and
> logs (also output of 'cat /proc/cpuinfo').
>
> Hope this helps the experts to narrow down the problem.

Yow!!!

Now this one might well be able to hit the 950 millisecond limit.
There are no fewer than 1,314,958 RCU callbacks queued up at the end of
the test. And RCU has indeed noticed this and cranked up the number
of callbacks to be handled by each invocation of rcu_do_batch() to
2,147,483,647. And only 15 seconds earlier, there were zero callbacks
queued and the rcu_do_batch() limit was at the default of 10 callbacks
per invocation.
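
That 2,147,483,647 is LONG_MAX on a 32-bit build; paraphrasing the
escalation logic (a sketch, not verbatim kernel code):

	/* once the callback queue grows past qhimark, effectively remove
	 * the per-invocation batch limit so rcu_do_batch() runs the whole
	 * backlog without yielding */
	if (rdp->qlen > qhimark)
		rdp->blimit = LONG_MAX;		/* 2,147,483,647 on 32-bit */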

Thanx, Paul

> Regards,
> - Sedat -
>
> P.S.: I adapted the patch from [1] against
> linux-2.6-rcu.git#sedat.2011.04.23a, but did not help here.
>
> [1] http://lkml.org/lkml/2011/4/22/35


2011-04-28 10:27:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 08:22:29AM +0200, Bruno Prémont wrote:
> On Wed, 27 Apr 2011 14:55:49 "Paul E. McKenney" wrote:
> > On Wed, Apr 27, 2011 at 12:28:37AM +0200, Thomas Gleixner wrote:
> > > On Tue, 26 Apr 2011, Linus Torvalds wrote:
> > > > Normally SCHED_FIFO runs until it voluntarily gives up the CPU. That's
> > > > kind of the point of SCHED_FIFO. Involuntary context switches happen
> > > > when some higher-priority SCHED_FIFO process becomes runnable (irq
> > > > handlers? You _do_ have CONFIG_IRQ_FORCED_THREADING=y in your config
> > > > too), and maybe there is a bug in the runqueue handling for that case.
> > >
> > > The forced irq threading is only effective when you add the command
> > > line parameter "threadirqs". I don't see any irq threads in the ps
> > > outputs, so that's not the problem.
> > >
> > > Though the whole ps output is weird. There is only one thread/process
> > > which accumulated CPU time
> > >
> > > collectd 1605 0.6 0.7 49924 3748 ? SNLsl 22:14 0:14
> >
> > I believe that the above is the script that prints out the RCU debugfs
> > information periodically. Unless there is something else that begins
> > with "collectd" instead of just collectdebugfs.sh.
>
> No, collectd is a multi-threaded daemon that collects statistics of all
> kinds, see http://www.collectd.org/ for details (on my machine it
> collects CPU usage, memory usage [just the basics], disk statistics,
> network statistics load and a few more)

OK, thank you for the info!

Thanx, Paul

2011-04-28 13:30:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 2011-04-28 at 03:26 -0700, Paul E. McKenney wrote:
> On Thu, Apr 28, 2011 at 11:45:03AM +0200, Sedat Dilek wrote:
> > Hi,
> >
> > not sure if my problem from linux-2.6-rcu.git#sedat.2011.04.23a is
> > related to the issue here.
> >
> > Just FYI:
> > I am here on a Pentium-M (uniprocessor aka UP) and still unsure if I
> > have the correct (optimal?) kernel-configs set.
> >
> > Paul gave me a script to collect RCU data and I enhanced it with
> > collecting SCHED data.
> >
> > In the above mentioned GIT branch I applied these two extra commits
> > (0001 requested by Paul and 0002 proposed by Thomas):
> >
> > patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> > patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
> >
> > Furthermore, I have added my kernel-config file, scripts, patches and
> > logs (also output of 'cat /proc/cpuinfo').
> >
> > Hope this helps the experts to narrow down the problem.
>
> Yow!!!
>
> Now this one might well be able to hit the 950 millisecond limit.
> There are no fewer than 1,314,958 RCU callbacks queued up at the end of
> the test. And RCU has indeed noticed this and cranked up the number
> of callbacks to be handled by each invocation of rcu_do_batch() to
> 2,147,483,647. And only 15 seconds earlier, there were zero callbacks
> queued and the rcu_do_batch() limit was at the default of 10 callbacks
> per invocation.

Yeah, yow. Once the RT throttle hit, it stuck.

.clock : 1386824.201768
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.132427
.rt_runtime : 950.000000
rcuc0 7 0.034118 10857 98 0.034118 1472.309646 0.000000 /
FF 1 1 R R 0 [rcuc0]
.clock : 1402450.997994
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.132427
.rt_runtime : 950.000000
rcuc0 7 0.034118 10857 98 0.034118 1472.309646 0.000000 /
FF 1 1 R R 0 [rcuc0]

...

.clock : 2707432.862374
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.132427
.rt_runtime : 950.000000
rcuc0 7 0.034118 10857 98 0.034118 1472.309646 0.000000 /
FF 1 1 R R 0 [rcuc0]
.clock : 2722572.958381
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.132427
.rt_runtime : 950.000000
rcuc0 7 0.034118 10857 98 0.034118 1472.309646 0.000000 /
FF 1 1 R R 0 [rcuc0]

2011-04-28 15:28:28

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 3:30 PM, Mike Galbraith <[email protected]> wrote:
> On Thu, 2011-04-28 at 03:26 -0700, Paul E. McKenney wrote:
>> On Thu, Apr 28, 2011 at 11:45:03AM +0200, Sedat Dilek wrote:
>> > Hi,
>> >
>> > not sure if my problem from linux-2.6-rcu.git#sedat.2011.04.23a is
>> > related to the issue here.
>> >
>> > Just FYI:
>> > I am here on a Pentium-M (uniprocessor aka UP) and still unsure if I
>> > have the correct (optimal?) kernel-configs set.
>> >
>> > Paul gave me a script to collect RCU data and I enhanced it with
>> > collecting SCHED data.
>> >
>> > In the above mentioned GIT branch I applied these two extra commits
>> > (0001 requested by Paul and 0002 proposed by Thomas):
>> >
>> > patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
>> > patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
>> >
>> > Furthermore, I have added my kernel-config file, scripts, patches and
>> > logs (also output of 'cat /proc/cpuinfo').
>> >
>> > Hope this helps the experts to narrow down the problem.
>>
>> Yow!!!
>>
>> Now this one might well be able to hit the 950 millisecond limit.
>> There are no fewer than 1,314,958 RCU callbacks queued up at the end of
>> the test.  And RCU has indeed noticed this and cranked up the number
>> of callbacks to be handled by each invocation of rcu_do_batch() to
>> 2,147,483,647.  And only 15 seconds earlier, there were zero callbacks
>> queued and the rcu_do_batch() limit was at the default of 10 callbacks
>> per invocation.
>
> Yeah, yow.  Once the RT throttle hit, it stuck.
>
>  .clock                         : 1386824.201768
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>  .rt_time                       : 950.132427
>  .rt_runtime                    : 950.000000
>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
> FF    1      1 R    R 0 [rcuc0]
>  .clock                         : 1402450.997994
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>  .rt_time                       : 950.132427
>  .rt_runtime                    : 950.000000
>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
> FF    1      1 R    R 0 [rcuc0]
>
> ...
>
>  .clock                         : 2707432.862374
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>  .rt_time                       : 950.132427
>  .rt_runtime                    : 950.000000
>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
> FF    1      1 R    R 0 [rcuc0]
>  .clock                         : 2722572.958381
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>  .rt_time                       : 950.132427
>  .rt_runtime                    : 950.000000
>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
> FF    1      1 R    R 0 [rcuc0]
>
>

Hi,

OK, I tried with the patch proposed by Thomas (0003):

patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
patches/0003-sched-Remove-skip_clock_update-check.patch

From the very beginning it looked as if the system was "stable" due to:

.rt_nr_running : 0
.rt_throttled : 0

This changed when I started a simple tar-job to save my kernel
build-dir to an external USB-hdd.
From...

.rt_nr_running : 1
.rt_throttled : 1

...To:

.rt_nr_running : 2
.rt_throttled : 1

Unfortunately, reducing all activities to a minimum load did not
change the last known RT throttling state.

Just noticed rt_time exceeding the value of 950 for the first time here:

.rt_nr_running : 1
.rt_throttled : 1
.rt_time : 950.005460

Full data attached as tarball.

- Sedat -

P.S.: Excerpt from
collectdebugfs-v2_2.6.39-rc3-rcutree-sedat.2011.04.23a+.log (0:0 ->
1:1 -> 2:1)

--
rt_rq[0]:
.rt_nr_running : 0
.rt_throttled : 0
.rt_time : 888.893877
.rt_runtime : 950.000000

runnable tasks:
           task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R            cat  2652    115108.993460         1   120    115108.993460         1.147986         0.000000 /
--
rt_rq[0]:
.rt_nr_running : 1
.rt_throttled : 1
.rt_time : 950.005460
.rt_runtime : 950.000000

runnable tasks:
           task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
          rcuc0     7         0.000000     56869    98         0.000000       981.385605         0.000000 /
--
rt_rq[0]:
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.005460
.rt_runtime : 950.000000

runnable tasks:
           task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
          rcuc0     7         0.000000     56869    98         0.000000       981.385605         0.000000 /
--


Attachments:
from-dileks-2.tar.xz (70.83 kB)
from-dileks-2.tar.xz.sha256sum (87.00 B)
Download all attachments

2011-04-28 15:44:29

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 5:28 PM, Sedat Dilek <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 3:30 PM, Mike Galbraith <[email protected]> wrote:
>> On Thu, 2011-04-28 at 03:26 -0700, Paul E. McKenney wrote:
>>> On Thu, Apr 28, 2011 at 11:45:03AM +0200, Sedat Dilek wrote:
>>> > Hi,
>>> >
>>> > not sure if my problem from linux-2.6-rcu.git#sedat.2011.04.23a is
>>> > related to the issue here.
>>> >
>>> > Just FYI:
>>> > I am here on a Pentium-M (uniprocessor aka UP) and still unsure if I
>>> > have the correct (optimal?) kernel-configs set.
>>> >
>>> > Paul gave me a script to collect RCU data and I enhanced it with
>>> > collecting SCHED data.
>>> >
>>> > In the above mentioned GIT branch I applied these two extra commits
>>> > (0001 requested by Paul and 0002 proposed by Thomas):
>>> >
>>> > patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
>>> > patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
>>> >
>>> > Furthermore, I have added my kernel-config file, scripts, patches and
>>> > logs (also output of 'cat /proc/cpuinfo').
>>> >
>>> > Hope this helps the experts to narrow down the problem.
>>>
>>> Yow!!!
>>>
>>> Now this one might well be able to hit the 950 millisecond limit.
>>> There are no fewer than 1,314,958 RCU callbacks queued up at the end of
>>> the test.  And RCU has indeed noticed this and cranked up the number
>>> of callbacks to be handled by each invocation of rcu_do_batch() to
>>> 2,147,483,647.  And only 15 seconds earlier, there were zero callbacks
>>> queued and the rcu_do_batch() limit was at the default of 10 callbacks
>>> per invocation.
>>
>> Yeah, yow.  Once the RT throttle hit, it stuck.
>>
>>  .clock                         : 1386824.201768
>>  .rt_nr_running                 : 2
>>  .rt_throttled                  : 1
>>  .rt_time                       : 950.132427
>>  .rt_runtime                    : 950.000000
>>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
>> FF    1      1 R    R 0 [rcuc0]
>>  .clock                         : 1402450.997994
>>  .rt_nr_running                 : 2
>>  .rt_throttled                  : 1
>>  .rt_time                       : 950.132427
>>  .rt_runtime                    : 950.000000
>>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
>> FF    1      1 R    R 0 [rcuc0]
>>
>> ...
>>
>>  .clock                         : 2707432.862374
>>  .rt_nr_running                 : 2
>>  .rt_throttled                  : 1
>>  .rt_time                       : 950.132427
>>  .rt_runtime                    : 950.000000
>>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
>> FF    1      1 R    R 0 [rcuc0]
>>  .clock                         : 2722572.958381
>>  .rt_nr_running                 : 2
>>  .rt_throttled                  : 1
>>  .rt_time                       : 950.132427
>>  .rt_runtime                    : 950.000000
>>           rcuc0     7         0.034118     10857    98         0.034118      1472.309646         0.000000 /
>> FF    1      1 R    R 0 [rcuc0]
>>
>>
>
> Hi,
>
> OK, I tried with the patch proposed by Thomas (0003):
>
> patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
> patches/0003-sched-Remove-skip_clock_update-check.patch
>
> From the very beginning it looked as if the system was "stable" due to:
>
>  .rt_nr_running                 : 0
>  .rt_throttled                  : 0
>
> This changed when I started a simple tar-job to save my kernel
> build-dir to an external USB-hdd.
> From...
>
>  .rt_nr_running                 : 1
>  .rt_throttled                  : 1
>
> ...To:
>
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>
> Unfortunately, reducing all activities to a minimum load did not
> change the last known RT throttling state.
>
> Just noticed rt_time exceeding the value of 950 for the first time here:
>
>  .rt_nr_running                 : 1
>  .rt_throttled                  : 1
>  .rt_time                       : 950.005460
>
> Full data attached as tarball.
>
> - Sedat -
>
> P.S.: Excerpt from
> collectdebugfs-v2_2.6.39-rc3-rcutree-sedat.2011.04.23a+.log (0:0 ->
> 1:1 -> 2:1)
>
> --
> rt_rq[0]:
>  .rt_nr_running                 : 0
>  .rt_throttled                  : 0
>  .rt_time                       : 888.893877
>  .rt_runtime                    : 950.000000
>
> runnable tasks:
>            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
> ----------------------------------------------------------------------------------------------------------
> R            cat  2652    115108.993460         1   120    115108.993460         1.147986         0.000000 /
> --
> rt_rq[0]:
>  .rt_nr_running                 : 1
>  .rt_throttled                  : 1
>  .rt_time                       : 950.005460
>  .rt_runtime                    : 950.000000
>
> runnable tasks:
>            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
> ----------------------------------------------------------------------------------------------------------
>           rcuc0     7         0.000000     56869    98         0.000000       981.385605         0.000000 /
> --
> rt_rq[0]:
>  .rt_nr_running                 : 2
>  .rt_throttled                  : 1
>  .rt_time                       : 950.005460
>  .rt_runtime                    : 950.000000
>
> runnable tasks:
>            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
> ----------------------------------------------------------------------------------------------------------
>           rcuc0     7         0.000000     56869    98         0.000000       981.385605         0.000000 /
> --
>

As an addendum:

First call trace is seen after:

[ 651.616057] sched: RT throttling activated
[ 711.616033] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)

- Sedat -

2011-04-28 15:49:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 8:28 AM, Sedat Dilek <[email protected]> wrote:
>
> From the very beginning it looked as if the system was "stable" due to:

Actually, look here, right from the beginning that log is showing
total breakage:

Thu Apr 28 16:49:51 CEST 2011
.rt_time : 233.923773
Thu Apr 28 16:50:06 CEST 2011
.rt_time : 259.446506
Thu Apr 28 16:50:22 CEST 2011
.rt_time : 273.110840
Thu Apr 28 16:50:37 CEST 2011
.rt_time : 282.713537
Thu Apr 28 16:50:52 CEST 2011
.rt_time : 288.136013
Thu Apr 28 16:51:07 CEST 2011
.rt_time : 293.057088
..
Thu Apr 28 16:58:29 CEST 2011
.rt_time : 888.893877
Thu Apr 28 16:58:44 CEST 2011
.rt_time : 950.005460

iow, rt_time just constantly grows. You have that "sleep 15" between
every log entry, so rt_time growing by 10-100 ms every 15 seconds
obviously does mean that it's using real CPU time, but it's still well
in the "much less than 1% CPU" range. So the rcu thread is clearly
doing work, but equally clearly it should NOT be throttled.

But since it is constantly growing, at some point it _will_ hit that
magical "950ms total time used", and then it gets throttled. For no
good reason.

It shouldn't have been throttled in the first place, and then the
other bug - that it isn't apparently ever unthrottled - just makes it
not work at all.

So that whole throttling is totally broken.
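
A toy simulation of exactly this failure mode -- the per-sample growth
rate is an assumption loosely based on the log above:

	#include <stdio.h>

	int main(void)
	{
		double rt_time = 0.0;		/* ms of RT runtime accounted */
		const double rt_runtime = 950.0;/* budget: 950ms per 1000ms period */
		const double per_sample = 50.0;	/* assumed accumulation per 15s sample */
		int sample;

		/* the decay timer never fires, so rt_time only ever grows */
		for (sample = 1; rt_time <= rt_runtime; sample++)
			rt_time += per_sample;

		printf("throttled (forever) at sample %d: rt_time=%.3fms > %.0fms\n",
		       sample - 1, rt_time, rt_runtime);
		return 0;
	}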

Linus

2011-04-28 18:50:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, Sedat Dilek wrote:
> On Thu, Apr 28, 2011 at 3:30 PM, Mike Galbraith <[email protected]> wrote:
> rt_rq[0]:
> .rt_nr_running : 0
> .rt_throttled : 0

> .rt_time : 888.893877

> .rt_time : 950.005460

So rt_time is constantly accumulated, but never decreased. The
decrease happens in the timer callback. Looks like the timer is not
running for whatever reason.

Can you add the following patch as well ?

Thanks,

tglx

--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -172,7 +172,7 @@ static enum hrtimer_restart sched_rt_per
idle = do_sched_rt_period_timer(rt_b, overrun);
}

- return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+ return HRTIMER_RESTART;
}

static
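
For context, the return values are part of the stock hrtimer API, not
of the debug patch itself:

	/* hrtimer callbacks return one of (see <linux/hrtimer.h>):
	 *   HRTIMER_NORESTART - the timer is not restarted
	 *   HRTIMER_RESTART   - the timer must be restarted
	 * The hack above re-arms the period timer unconditionally, to test
	 * whether it is wrongly being allowed to go idle. */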

2011-04-28 19:23:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 2011-04-28 at 17:28 +0200, Sedat Dilek wrote:

> OK, I tried with the patch proposed by Thomas (0003):

(thanks)

> patches/0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> patches/0002-sched-Add-warning-when-RT-throttling-is-activated.patch
> patches/0003-sched-Remove-skip_clock_update-check.patch
>
> From the very beginning it looked as if the system was "stable" due to:
>
> .rt_nr_running : 0
> .rt_throttled : 0
>
> This changed when I started a simple tar-job to save my kernel
> build-dir to an external USB-hdd.
> From...
>
> .rt_nr_running : 1
> .rt_throttled : 1
>
> ...To:
>
> .rt_nr_running : 2
> .rt_throttled : 1
>
> Unfortunately, reducing all activities to a minimum load did not
> change the last known RT throttling state.
>
> Just noticed rt_time exceeding the value of 950 for the first time here:

That would happen even if we did forced eviction.

> ----------------------------------------------------------------------------------------------------------
> R            cat  2652    115108.993460         1   120    115108.993460         1.147986         0.000000 /
> --
> rt_rq[0]:
> .rt_nr_running : 1
> .rt_throttled : 1
> .rt_time : 950.005460
> .rt_runtime : 950.000000
> ----------------------------------------------------------------------------------------------------------
>           rcuc0     7         0.000000     56869    98         0.000000       981.385605         0.000000 /
> --
> rt_rq[0]:
> .rt_nr_running : 2
> .rt_throttled : 1
> .rt_time : 950.005460
> .rt_runtime : 950.000000

Still getting stuck. That eliminates the clock update optimization, but
that seemed unlikely anyway. (I'll build a UP kernel and poke at it.)

-Mike

2011-04-28 20:23:18

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 April 2011 Thomas Gleixner <[email protected]> wrote:
> On Thu, 28 Apr 2011, Sedat Dilek wrote:
> > On Thu, Apr 28, 2011 at 3:30 PM, Mike Galbraith <[email protected]> wrote:
> > rt_rq[0]:
> > .rt_nr_running : 0
> > .rt_throttled : 0
>
> > .rt_time : 888.893877
>
> > .rt_time : 950.005460
>
> So rt_time is constantly accumulated, but never decreased. The
> decrease happens in the timer callback. Looks like the timer is not
> running for whatever reason.
>
> Can you add the following patch as well ?
>
> Thanks,
>
> tglx
>
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -172,7 +172,7 @@ static enum hrtimer_restart sched_rt_per
> idle = do_sched_rt_period_timer(rt_b, overrun);
> }
>
> - return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> + return HRTIMER_RESTART;

This doesn't help here, be it applied on top of the others (full diff
attached) or applied alone (with the throttling printk).

Could it be that NO_HZ=y has some importance in this matter?


Extended throttling printk (Linus asked what exact values were looking
like):
[ 401.000119] sched: RT throttling activated 950012539 > 950000000


Equivalent to what Sedat sees (/proc/sched_debug):
rt_rq[0]:
.rt_nr_running : 2
.rt_throttled : 1
.rt_time : 950.012539
.rt_runtime : 950.000000


/proc/$(pidof rcu_kthread)/sched captured at regular intervals:
Thu Apr 28 21:33:41 CEST 2011
rcu_kthread (6, #threads: 1)
---------------------------------------------------------
se.exec_start : 0.000000
se.vruntime : 0.000703
se.sum_exec_runtime : 903.067982
nr_switches : 23752
nr_voluntary_switches : 23751
nr_involuntary_switches : 1
se.load.weight : 1024
policy : 1
prio : 98
clock-delta : 912
Thu Apr 28 21:34:11 CEST 2011
rcu_kthread (6, #threads: 1)
---------------------------------------------------------
se.exec_start : 0.000000
se.vruntime : 0.000703
se.sum_exec_runtime : 974.899495
nr_switches : 25721
nr_voluntary_switches : 25720
nr_involuntary_switches : 1
se.load.weight : 1024
policy : 1
prio : 98
clock-delta : 1098
Thu Apr 28 21:34:41 CEST 2011
rcu_kthread (6, #threads: 1)
---------------------------------------------------------
se.exec_start : 0.000000
se.vruntime : 0.000703
se.sum_exec_runtime : 974.899495
nr_switches : 25721
nr_voluntary_switches : 25720
nr_involuntary_switches : 1
se.load.weight : 1024
policy : 1
prio : 98
clock-delta : 1126
Thu Apr 28 21:35:11 CEST 2011
rcu_kthread (6, #threads: 1)



> }
>
> static


Attachments:
(No filename) (3.62 kB)
sched_rt.diff (2.00 kB)
Download all attachments

2011-04-28 20:29:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, Bruno Prémont wrote:
> On Thu, 28 April 2011 Thomas Gleixner <[email protected]> wrote:
> > - return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > + return HRTIMER_RESTART;
>
> This doesn't help here.
> Be it applied on top of the others, full diff attached
> or applied alone (with throttling printk).
>
> Could it be that NO_HZ=y has some importance in this matter?

Might be. Can you try with nohz=off on the kernel command line?

Can you please provide the output of /proc/timer_list?

Thanks,

tglx

2011-04-28 20:41:11

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 8:49 PM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 28 Apr 2011, Sedat Dilek wrote:
>> On Thu, Apr 28, 2011 at 3:30 PM, Mike Galbraith <[email protected]> wrote:
>> rt_rq[0]:
>>   .rt_nr_running                 : 0
>>   .rt_throttled                  : 0
>
>>   .rt_time                       : 888.893877
>
>>   .rt_time                       : 950.005460
>
> So rt_time is constantly accumulated, but never decreased. The
> decrease happens in the timer callback. Looks like the timer is not
> running for whatever reason.
>
> Can you add the following patch as well ?
>
> Thanks,
>
>        tglx
>
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -172,7 +172,7 @@ static enum hrtimer_restart sched_rt_per
>                idle = do_sched_rt_period_timer(rt_b, overrun);
>        }
>
> -       return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> +       return HRTIMER_RESTART;
>  }
>
>  static
>

See tarball.

- Sedat -


Attachments:
from-dileks-3.tar.xz (20.25 kB)
from-dileks-3.tar.xz.sha256sum (87.00 B)
Download all attachments

2011-04-28 20:44:59

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 April 2011 Thomas Gleixner wrote:
> On Thu, 28 Apr 2011, Bruno Prémont wrote:
> > On Thu, 28 April 2011 Thomas Gleixner wrote:
> > > - return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > > + return HRTIMER_RESTART;
> >
> > This doesn't help here.
> > Be it applied on top of the others, full diff attached
> > or applied alone (with throttling printk).
> >
> > Could it be that NO_HZ=y has some importance in this matter?
>
> Might be. Can you try with nohz=off on the kernel command line ?

Doesn't make any visible difference (tested with "applied alone" kernel
as of above).

> Can you please provide the output of /proc/timer_list ?

See below,
Bruno



Timer List Version: v0.6
HRTIMER_MAX_CLOCK_BASES: 3
now at 1150126155286 nsecs

cpu: 0
clock 0:
.base: c1559360
.index: 0
.resolution: 1 nsecs
.get_time: ktime_get_real
.offset: 1304021489280954699 nsecs
active timers:
#0: def_rt_bandwidth, sched_rt_period_timer, S:01, enqueue_task_rt, swapper/1
# expires at 1304028703000000000-1304028703000000000 nsecs [in 1304027552873844714 to 1304027552873844714 nsecs]
clock 1:
.base: c155938c
.index: 1
.resolution: 1 nsecs
.get_time: ktime_get
.offset: 0 nsecs
active timers:
#0: tick_cpu_sched, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/0
# expires at 1150130000000-1150130000000 nsecs [in 3844714 to 3844714 nsecs]
#1: <dd612844>, it_real_fn, S:01, hrtimer_start, ntpd/1623
# expires at 1150443573670-1150443573670 nsecs [in 317418384 to 317418384 nsecs]
#2: <dd443ad4>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, init/1
# expires at 1150450113736-1150455113735 nsecs [in 323958450 to 328958449 nsecs]
#3: <db6bbad4>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, slabtop/1817
# expires at 1152632990798-1152635990795 nsecs [in 2506835512 to 2509835509 nsecs]
#4: watchdog_hrtimer, watchdog_timer_fn, S:01, hrtimer_start, watchdog/0/7
# expires at 1152742107906-1152742107906 nsecs [in 2615952620 to 2615952620 nsecs]
#5: <dce4be54>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1647
# expires at 1159748146627-1159748196627 nsecs [in 9621991341 to 9622041341 nsecs]
#6: <daf75e54>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1644
# expires at 1159748971801-1159749021801 nsecs [in 9622816515 to 9622866515 nsecs]
#7: <dce49e54>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1646
# expires at 1159749646863-1159749696863 nsecs [in 9623491577 to 9623541577 nsecs]
#8: <daf77e54>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1645
# expires at 1159750273989-1159750323989 nsecs [in 9624118703 to 9624168703 nsecs]
#9: <dbd51e54>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1643
# expires at 1159751170319-1159751220319 nsecs [in 9625015033 to 9625065033 nsecs]
#10: <db687f44>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, collectd/1641
# expires at 1159884463552-1159884513552 nsecs [in 9758308266 to 9758358266 nsecs]
#11: <db6bdb6c>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rpcbind/1699
# expires at 1164510072442-1164540072440 nsecs [in 14383917156 to 14413917154 nsecs]
#12: <dccbbb6c>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, syslog-ng/1599
# expires at 1859759077032-1859859077032 nsecs [in 709632921746 to 709732921746 nsecs]
#13: <dce2bb6c>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, dhcpcd/1557
# expires at 86432406451906-86432506451906 nsecs [in 85282280296620 to 85282380296620 nsecs]
#14: <dccbdad4>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gpm/1659
# expires at 86440042646716-86440142646716 nsecs [in 85289916491430 to 85290016491430 nsecs]
clock 2:
.base: c15593b8
.index: 7
.resolution: 1 nsecs
.get_time: ktime_get_boottime
.offset: 0 nsecs
active timers:
.expires_next : 1150130000000 nsecs
.hres_active : 1
.nr_events : 62851
.nr_retries : 1232
.nr_hangs : 0
.max_hang_time : 0 nsecs
.nohz_mode : 2
.idle_tick : 1150120000000 nsecs
.tick_stopped : 0
.idle_jiffies : 85011
.idle_calls : 59192
.idle_sleeps : 23733
.idle_entrytime : 1150123805083 nsecs
.idle_waketime : 1150123805083 nsecs
.idle_exittime : 1150123876750 nsecs
.idle_sleeptime : 861310470458 nsecs
.iowait_sleeptime: 72683738430 nsecs
.last_jiffies : 85011
.next_jiffies : 85017
.idle_expires : 1150170000000 nsecs
jiffies: 85012


Tick Device: mode: 1
Broadcast device
Clock Event Device: pit
max_delta_ns: 27461866
min_delta_ns: 12571
mult: 5124677
shift: 32
mode: 3
next_event: 9223372036854775807 nsecs
set_next_event: pit_next_event
set_mode: init_pit_timer
event_handler: tick_handle_oneshot_broadcast
retries: 0
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000


Tick Device: mode: 1
Per CPU device: 0
Clock Event Device: lapic
max_delta_ns: 128554655331
min_delta_ns: 1000
mult: 71746698
shift: 32
mode: 3
next_event: 1150130000000 nsecs
set_next_event: lapic_next_event
set_mode: lapic_timer_setup
event_handler: hrtimer_interrupt
retries: 1

2011-04-28 21:04:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, Bruno Prémont wrote:
> Timer List Version: v0.6
> HRTIMER_MAX_CLOCK_BASES: 3
> now at 1150126155286 nsecs
>
> cpu: 0
> clock 0:
> .base: c1559360
> .index: 0
> .resolution: 1 nsecs
> .get_time: ktime_get_real
> .offset: 1304021489280954699 nsecs
> active timers:
> #0: def_rt_bandwidth, sched_rt_period_timer, S:01, enqueue_task_rt, swapper/1
> # expires at 1304028703000000000-1304028703000000000 nsecs [in 1304027552873844714 to 1304027552873844714 nsecs]

Ok, that expiry time is obviously bogus as it does not account for the offset:

So in reality it's: expires in: 6063592890015ns

Which is still completely wrong. The timer should expire at most a
second from now. But it's going to expire 6063.592890015 seconds from
now, which pretty much explains why things got going again after about
two hours.

But the really interesting question is: why the heck is that timer on
CLOCK_REALTIME???? It is initialized for CLOCK_MONOTONIC.

/me suspects hrtimer changes to be the real culprit.

John ????

Thanks,

tglx



2011-04-28 21:51:29

by john stultz

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> On Thu, 28 Apr 2011, Bruno Prémont wrote:
> > Timer List Version: v0.6
> > HRTIMER_MAX_CLOCK_BASES: 3
> > now at 1150126155286 nsecs
> >
> > cpu: 0
> > clock 0:
> > .base: c1559360
> > .index: 0
> > .resolution: 1 nsecs
> > .get_time: ktime_get_real
> > .offset: 1304021489280954699 nsecs
> > active timers:
> > #0: def_rt_bandwidth, sched_rt_period_timer, S:01, enqueue_task_rt, swapper/1
> > # expires at 1304028703000000000-1304028703000000000 nsecs [in 1304027552873844714 to 1304027552873844714 nsecs]
>
> Ok, that expiry time is obviously bogus, as it does not account for the offset:
>
> So in reality it expires in: 6063592890015ns
>
> Which is still completely wrong. The timer should expire at most a
> second from now, but it's going to expire 6063.592890015 seconds from
> now, which pretty much explains why stuff got going again after 2hrs.
>
> But the really interesting question is why the heck that timer is on
> CLOCK_REALTIME ???? It is initialized for CLOCK_MONOTONIC.
>
> /me suspects hrtimer changes to be the real culprit.

I'm not seeing anything right off, but it does smell like
e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
would crop up.

Bruno, could you try checking out e06383db9ec, confirming it still
occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
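
A minimal sketch of that check, assuming a linux-2.6 git tree (build,
boot and re-run the workload after each checkout):

  $ git checkout e06383db9ec    # suspect commit: confirm the stall still occurs
  $ git checkout e06383db9ec^1  # its parent: see whether the stall goes away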

I'll keep digging in the meantime.

thanks
-john





2011-04-28 22:02:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, john stultz wrote:
> On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> > /me suspects hrtimer changes to be the real culprit.
>
> I'm not seeing anything right off, but it does smell like
> e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
> would crop up.
>
> Bruno, could you try checking out e06383db9ec, confirming it still
> occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
>
> I'll keep digging in the meantime.

I found the bug already. The problem is that sched_init() calls
init_rt_bandwidth(), which calls hrtimer_init() _BEFORE_
hrtimers_init() is called.

That went unnoticed so far because the CLOCK id to hrtimer base
conversion was hardcoded. Now we use a table which is set up in
hrtimers_init(), so the bandwidth hrtimer ends up on CLOCK_REALTIME:
at that point the table still sits all-zero in the bss, and index 0 is
the REALTIME base.

The patch below fixes this by initializing the table statically rather
than at runtime. Though that whole ordering wants to be revisited.

Thanks,

tglx

--- linux-2.6.orig/kernel/hrtimer.c
+++ linux-2.6/kernel/hrtimer.c
@@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
        }
 };
 
-static int hrtimer_clock_to_base_table[MAX_CLOCKS];
+static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
+       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
+       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
+       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
+};
 
 static inline int hrtimer_clockid_to_base(clockid_t clock_id)
 {
@@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
 
 void __init hrtimers_init(void)
 {
-       hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
-       hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
-       hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
-
        hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
                          (void *)(long)smp_processor_id());
        register_cpu_notifier(&hrtimers_nb);
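
To illustrate the failure mode outside the kernel, here is a small
standalone C sketch (the names are simplified stand-ins, not the actual
kernel code): a bss-resident table reads as all zeroes until some init
function fills it in, so an early lookup of CLOCK_MONOTONIC silently
yields entry 0, the REALTIME base, whereas a statically initialized
table is correct from the first instruction.

#include <stdio.h>

enum { BASE_REALTIME, BASE_MONOTONIC, BASE_BOOTTIME };
enum { CLK_REALTIME, CLK_MONOTONIC, CLK_BOOTTIME, NR_CLOCKS };

/* Zero-filled, as if in .bss: every lookup yields 0 == BASE_REALTIME
 * until an init function runs. */
static int clock_to_base_runtime[NR_CLOCKS];

/* Statically initialized (the fix): valid before any init code runs. */
static const int clock_to_base_static[NR_CLOCKS] = {
        [CLK_REALTIME]  = BASE_REALTIME,
        [CLK_MONOTONIC] = BASE_MONOTONIC,
        [CLK_BOOTTIME]  = BASE_BOOTTIME,
};

int main(void)
{
        /* An early caller (think init_rt_bandwidth()) asking for MONOTONIC: */
        printf("runtime table: MONOTONIC -> %d (0 == REALTIME, wrong)\n",
               clock_to_base_runtime[CLK_MONOTONIC]);
        printf("static table:  MONOTONIC -> %d (1 == MONOTONIC, ok)\n",
               clock_to_base_static[CLK_MONOTONIC]);
        return 0;
}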

2011-04-28 23:07:19

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, Apr 29, 2011 at 12:02 AM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 28 Apr 2011, john stultz wrote:
>> On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
>> > /me suspects hrtimer changes to be the real culprit.
>>
>> I'm not seeing anything right off, but it does smell like
>> e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
>> would crop up.
>>
>> Bruno, could you try checking out e06383db9ec, confirming it still
>> occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
>>
>> I'll keep digging in the meantime.
>
> I found the bug already. The problem is that sched_init() calls
> init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> hrtimers_init() is called.
>
> That was unnoticed so far as the CLOCK id to hrtimer base conversion
> was hardcoded. Now we use a table which is set up at hrtimers_init(),
> so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> in the bss.
>
> The patch below fixes this, by providing the table statically rather
> than runtime initialized. Though that whole ordering wants to be
> revisited.
>
> Thanks,
>
>        tglx
>
> --- linux-2.6.orig/kernel/hrtimer.c
> +++ linux-2.6/kernel/hrtimer.c
> @@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
>        }
>  };
>
> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> +};
>
>  static inline int hrtimer_clockid_to_base(clockid_t clock_id)
>  {
> @@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
>
>  void __init hrtimers_init(void)
>  {
> -       hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
> -       hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
> -       hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
> -
>        hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
>                          (void *)(long)smp_processor_id());
>        register_cpu_notifier(&hrtimers_nb);
>
>
>

Looks good so far, no stalls or call-traces.

Really stressing it: 20+ open tabs in Firefox with a flash movie
running in one of them, a tar job, an IRC client, etc.
I will run some more tests, collect data and send it later.

- Sedat -

P.S.: The patchset is against linux-2.6-rcu.git#sedat.2011.04.23a [1],
where 0003 is from [2].

[1] http://git.us.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=shortlog;h=refs/heads/sedat.2011.04.23a
[2] https://patchwork.kernel.org/patch/739782/

$ l ../RCU-HOORAY/
total 40
drwxr-xr-x  2 sd sd  4096 29. Apr 01:02 .
drwxr-xr-x 35 sd sd 20480 29. Apr 01:01 ..
-rw-r--r--  1 sd sd   726 29. Apr 01:01 0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
-rw-r--r--  1 sd sd   735 29. Apr 01:01 0002-sched-Add-warning-when-RT-throttling-is-activated.patch
-rw-r--r--  1 sd sd  2376 29. Apr 01:01 0003-2.6.39-rc4-Kernel-leaking-memory-during-FS-scanning-.patch

2011-04-28 23:35:50

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, Apr 29, 2011 at 1:06 AM, Sedat Dilek <[email protected]> wrote:
> On Fri, Apr 29, 2011 at 12:02 AM, Thomas Gleixner <[email protected]> wrote:
>> On Thu, 28 Apr 2011, john stultz wrote:
>>> On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
>>> > /me suspects hrtimer changes to be the real culprit.
>>>
>>> I'm not seeing anything right off, but it does smell like
>>> e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
>>> would crop up.
>>>
>>> Bruno, could you try checking out e06383db9ec, confirming it still
>>> occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
>>>
>>> I'll keep digging in the meantime.
>>
>> I found the bug already. The problem is that sched_init() calls
>> init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
>> hrtimers_init() is called.
>>
>> That was unnoticed so far as the CLOCK id to hrtimer base conversion
>> was hardcoded. Now we use a table which is set up at hrtimers_init(),
>> so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
>> in the bss.
>>
>> The patch below fixes this, by providing the table statically rather
>> than runtime initialized. Though that whole ordering wants to be
>> revisited.
>>
>> Thanks,
>>
>>        tglx
>>
>> --- linux-2.6.orig/kernel/hrtimer.c
>> +++ linux-2.6/kernel/hrtimer.c
>> @@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
>>        }
>>  };
>>
>> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
>> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
>> +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
>> +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
>> +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
>> +};
>>
>>  static inline int hrtimer_clockid_to_base(clockid_t clock_id)
>>  {
>> @@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
>>
>>  void __init hrtimers_init(void)
>>  {
>> -       hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
>> -       hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
>> -       hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
>> -
>>        hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
>>                          (void *)(long)smp_processor_id());
>>        register_cpu_notifier(&hrtimers_nb);
>>
>>
>>
>
> Looks good so far, no stalls or call-traces.
>
> Really stressing it: 20+ open tabs in Firefox with a flash movie
> running in one of them, a tar job, an IRC client, etc.
> I will run some more tests and collect data and send them later.
>
> - Sedat -
>
> P.S.: Patchset against linux-2.6-rcu.git#sedat.2011.04.23a where 0003
> is from [2]
>
> [1] http://git.us.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=shortlog;h=refs/heads/sedat.2011.04.23a
> [2] https://patchwork.kernel.org/patch/739782/
>
> $ l ../RCU-HOORAY/
> total 40
> drwxr-xr-x  2 sd sd  4096 29. Apr 01:02 .
> drwxr-xr-x 35 sd sd 20480 29. Apr 01:01 ..
> -rw-r--r--  1 sd sd   726 29. Apr 01:01 0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> -rw-r--r--  1 sd sd   735 29. Apr 01:01 0002-sched-Add-warning-when-RT-throttling-is-activated.patch
> -rw-r--r--  1 sd sd  2376 29. Apr 01:01 0003-2.6.39-rc4-Kernel-leaking-memory-during-FS-scanning-.patch
>

As promised, here is the tarball (at the end of the log I did some XZ compression).

Wow!
$ uptime
01:35:17 up 45 min, 3 users, load average: 0.45, 0.57, 1.27

Thanks to all the people involved in helping to kill that bug (Come on Paul, smile!).

- Sedat -


Attachments:
from-dileks-4.tar.xz (107.99 kB)
from-dileks-4.tar.xz.sha256sum (87.00 B)

2011-04-29 00:43:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, Apr 29, 2011 at 01:35:44AM +0200, Sedat Dilek wrote:
> On Fri, Apr 29, 2011 at 1:06 AM, Sedat Dilek <[email protected]> wrote:
> > On Fri, Apr 29, 2011 at 12:02 AM, Thomas Gleixner <[email protected]> wrote:
> >> On Thu, 28 Apr 2011, john stultz wrote:
> >>> On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> >>> > /me suspects hrtimer changes to be the real culprit.
> >>>
> >>> I'm not seeing anything right off, but it does smell like
> >>> e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
> >>> would crop up.
> >>>
> >>> Bruno, could you try checking out e06383db9ec, confirming it still
> >>> occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
> >>>
> >>> I'll keep digging in the meantime.
> >>
> >> I found the bug already. The problem is that sched_init() calls
> >> init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> >> hrtimers_init() is called.
> >>
> >> That was unnoticed so far as the CLOCK id to hrtimer base conversion
> >> was hardcoded. Now we use a table which is set up at hrtimers_init(),
> >> so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> >> in the bss.
> >>
> >> The patch below fixes this, by providing the table statically rather
> >> than runtime initialized. Though that whole ordering wants to be
> >> revisited.
> >>
> >> Thanks,
> >>
> >>        tglx
> >>
> >> --- linux-2.6.orig/kernel/hrtimer.c
> >> +++ linux-2.6/kernel/hrtimer.c
> >> @@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
> >>        }
> >>  };
> >>
> >> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> >> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> >> +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> >> +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> >> +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> >> +};
> >>
> >>  static inline int hrtimer_clockid_to_base(clockid_t clock_id)
> >>  {
> >> @@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
> >>
> >>  void __init hrtimers_init(void)
> >>  {
> >> -       hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
> >> -       hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
> >> -       hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
> >> -
> >>        hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
> >>                          (void *)(long)smp_processor_id());
> >>        register_cpu_notifier(&hrtimers_nb);
> >>
> >>
> >>
> >
> > Looks good so far, no stalls or call-traces.
> >
> > Really stressing it: 20+ open tabs in Firefox with a flash movie
> > running in one of them, a tar job, an IRC client, etc.
> > I will run some more tests and collect data and send them later.
> >
> > - Sedat -
> >
> > P.S.: Patchset against linux-2.6-rcu.git#sedat.2011.04.23a where 0003
> > is from [2]
> >
> > [1] http://git.us.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=shortlog;h=refs/heads/sedat.2011.04.23a
> > [2] https://patchwork.kernel.org/patch/739782/
> >
> > $ l ../RCU-HOORAY/
> > total 40
> > drwxr-xr-x  2 sd sd  4096 29. Apr 01:02 .
> > drwxr-xr-x 35 sd sd 20480 29. Apr 01:01 ..
> > -rw-r--r--  1 sd sd   726 29. Apr 01:01 0001-Revert-rcu-restrict-TREE_RCU-to-SMP-builds-with-PREE.patch
> > -rw-r--r--  1 sd sd   735 29. Apr 01:01 0002-sched-Add-warning-when-RT-throttling-is-activated.patch
> > -rw-r--r--  1 sd sd  2376 29. Apr 01:01 0003-2.6.39-rc4-Kernel-leaking-memory-during-FS-scanning-.patch
> >
>
> As promised the tarball (at the end of the log I made some XZ compressing).
>
> Wow!
> $ uptime
> 01:35:17 up 45 min, 3 users, load average: 0.45, 0.57, 1.27
>
> Thanks to all involved people helping to kill that bug (Come on Paul, smile!).

Woo-hoo!!!!

Many thanks to Thomas for tracking this down -- it is fair to say that
I never would have thought to look at timer initialization! ;-)

Thanx, Paul

2011-04-29 07:55:50

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, Apr 29, 2011 at 12:02 AM, Thomas Gleixner <[email protected]> wrote:
> On Thu, 28 Apr 2011, john stultz wrote:
>> On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
>> > /me suspects hrtimer changes to be the real culprit.
>>
>> I'm not seeing anything right off, but it does smell like
>> e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
>> would crop up.
>>
>> Bruno, could you try checking out e06383db9ec, confirming it still
>> occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
>>
>> I'll keep digging in the meantime.
>
> I found the bug already. The problem is that sched_init() calls
> init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> hrtimers_init() is called.
>
> That was unnoticed so far as the CLOCK id to hrtimer base conversion
> was hardcoded. Now we use a table which is set up at hrtimers_init(),
> so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> in the bss.
>
> The patch below fixes this, by providing the table statically rather
> than runtime initialized. Though that whole ordering wants to be
> revisited.
>
> Thanks,
>
>        tglx
>
> --- linux-2.6.orig/kernel/hrtimer.c
> +++ linux-2.6/kernel/hrtimer.c
> @@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
>        }
>  };
>
> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> +};
>
>  static inline int hrtimer_clockid_to_base(clockid_t clock_id)
>  {
> @@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
>
>  void __init hrtimers_init(void)
>  {
> -       hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
> -       hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
> -       hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
> -
>        hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
>                          (void *)(long)smp_processor_id());
>        register_cpu_notifier(&hrtimers_nb);
>

Will you send this as a separate patch?

Please also feel free to add:

Tested-by: Sedat Dilek <[email protected]>

If you like, also add a Reported-by, as the issue is not new; I first
reported it here [1].

- Sedat -

[1] http://lkml.org/lkml/2011/3/25/97

2011-04-29 09:34:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, 28 Apr 2011, Paul E. McKenney wrote:
> On Fri, Apr 29, 2011 at 01:35:44AM +0200, Sedat Dilek wrote:
> > 01:35:17 up 45 min, 3 users, load average: 0.45, 0.57, 1.27
> >
> > Thanks to all involved people helping to kill that bug (Come on Paul, smile!).
>
> Woo-hoo!!!!
>
> Many thanks to Thomas for tracking this down -- it is fair to say that
> I never would have thought to look at timer initialization! ;-)

Many thanks to the reporters who provided all the information and
tested all the random debug patches we threw at them!

tglx

2011-04-29 18:10:14

by Mike Frysinger

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Thu, Apr 28, 2011 at 18:02, Thomas Gleixner wrote:
> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> +};

this would let us constify the array too
-mike
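
A sketch of that constified variant, with the same initializers as in
the patch above, just made read-only:

static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
        [CLOCK_REALTIME]  = HRTIMER_BASE_REALTIME,
        [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
        [CLOCK_BOOTTIME]  = HRTIMER_BASE_BOOTTIME,
};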

2011-04-29 18:27:01

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, 29 Apr 2011, Mike Frysinger wrote:

> On Thu, Apr 28, 2011 at 18:02, Thomas Gleixner wrote:
> > -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> > +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> > +       [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> > +       [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> > +       [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> > +};
>
> this would let us constify the array too

Indeed.

2011-04-29 19:31:17

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, 29 April 2011 Thomas Gleixner wrote:
> On Thu, 28 Apr 2011, john stultz wrote:
> > On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> > > /me suspects hrtimer changes to be the real culprit.
> >
> > I'm not seeing anything right off, but it does smell like
> > e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
> > would crop up.
> >
> > Bruno, could you try checking out e06383db9ec, confirming it still
> > occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
> >
> > I'll keep digging in the meantime.
>
> I found the bug already. The problem is that sched_init() calls
> init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> hrtimers_init() is called.
>
> That was unnoticed so far as the CLOCK id to hrtimer base conversion
> was hardcoded. Now we use a table which is set up at hrtimers_init(),
> so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> in the bss.
>
> The patch below fixes this, by providing the table statically rather
> than runtime initialized. Though that whole ordering wants to be
> revisited.

Works here as well (applied alone): /proc/$(pidof rcu_kthread)/sched shows
total runtime continuing to increase beyond 950, and slubs continue being
released!
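
One way to watch that counter, assuming this kernel exposes the usual
se.sum_exec_runtime field in /proc/<pid>/sched and that a single
rcu_kthread is running:

  $ grep se.sum_exec_runtime /proc/$(pidof rcu_kthread)/sched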

Thanks,
Bruno

> Thanks,
>
> tglx
>
> --- linux-2.6.orig/kernel/hrtimer.c
> +++ linux-2.6/kernel/hrtimer.c
> @@ -81,7 +81,11 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base,
> }
> };
>
> -static int hrtimer_clock_to_base_table[MAX_CLOCKS];
> +static int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> + [CLOCK_REALTIME] = HRTIMER_BASE_REALTIME,
> + [CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC,
> + [CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME,
> +};
>
> static inline int hrtimer_clockid_to_base(clockid_t clock_id)
> {
> @@ -1722,10 +1726,6 @@ static struct notifier_block __cpuinitda
>
> void __init hrtimers_init(void)
> {
> - hrtimer_clock_to_base_table[CLOCK_REALTIME] = HRTIMER_BASE_REALTIME;
> - hrtimer_clock_to_base_table[CLOCK_MONOTONIC] = HRTIMER_BASE_MONOTONIC;
> - hrtimer_clock_to_base_table[CLOCK_BOOTTIME] = HRTIMER_BASE_BOOTTIME;
> -
> hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
> (void *)(long)smp_processor_id());
> register_cpu_notifier(&hrtimers_nb);

2011-04-29 20:11:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, 29 Apr 2011, Bruno Prémont wrote:
> On Fri, 29 April 2011 Thomas Gleixner wrote:
> > On Thu, 28 Apr 2011, john stultz wrote:
> > > On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> > > > /me suspects hrtimer changes to be the real culprit.
> > >
> > > I'm not seeing anything right off, but it does smell like
> > > e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
> > > would crop up.
> > >
> > > Bruno, could you try checking out e06383db9ec, confirming it still
> > > occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
> > >
> > > I'll keep digging in the meantime.
> >
> > I found the bug already. The problem is that sched_init() calls
> > init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> > hrtimers_init() is called.
> >
> > That was unnoticed so far as the CLOCK id to hrtimer base conversion
> > was hardcoded. Now we use a table which is set up at hrtimers_init(),
> > so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> > in the bss.
> >
> > The patch below fixes this, by providing the table statically rather
> > than runtime initialized. Though that whole ordering wants to be
> > revisited.
>
> Works here as well (applied alone), /proc/$(pidof rcu_kthread)/sched shows
> total runtime continuing to increase beyond 950 and slubs continue being
> released!

Does the CPU time show up in top/ps as well now?

Thanks,

tglx

2011-04-29 20:15:07

by Bruno Prémont

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, 29 April 2011 Thomas Gleixner wrote:
> On Fri, 29 Apr 2011, Bruno Prémont wrote:
> > On Fri, 29 April 2011 Thomas Gleixner wrote:
> > > On Thu, 28 Apr 2011, john stultz wrote:
> > > > On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
> > > > > /me suspects hrtimer changes to be the real culprit.
> > > >
> > > > I'm not seeing anything right off, but it does smell like
> > > > e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
> > > > would crop up.
> > > >
> > > > Bruno, could you try checking out e06383db9ec, confirming it still
> > > > occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
> > > >
> > > > I'll keep digging in the meantime.
> > >
> > > I found the bug already. The problem is that sched_init() calls
> > > init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
> > > hrtimers_init() is called.
> > >
> > > That was unnoticed so far as the CLOCK id to hrtimer base conversion
> > > was hardcoded. Now we use a table which is set up at hrtimers_init(),
> > > so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
> > > in the bss.
> > >
> > > The patch below fixes this, by providing the table statically rather
> > > than runtime initialized. Though that whole ordering wants to be
> > > revisited.
> >
> > Works here as well (applied alone), /proc/$(pidof rcu_kthread)/sched shows
> > total runtime continuing to increase beyond 950 and slubs continue being
> > released!
>
> Does the CPU time show up in top/ps as well now?

Yes, it does (currently at 0:09 in ps for 9336.075 in
/proc/$(pidof rcu_kthread)/sched)

Thanks,
Bruno

2011-04-30 09:14:19

by Sedat Dilek

[permalink] [raw]
Subject: Re: 2.6.39-rc4+: Kernel leaking memory during FS scanning, regression?

On Fri, Apr 29, 2011 at 10:14 PM, Bruno Prémont
<[email protected]> wrote:
> On Fri, 29 April 2011 Thomas Gleixner wrote:
>> On Fri, 29 Apr 2011, Bruno Prémont wrote:
>> > On Fri, 29 April 2011 Thomas Gleixner wrote:
>> > > On Thu, 28 Apr 2011, john stultz wrote:
>> > > > On Thu, 2011-04-28 at 23:04 +0200, Thomas Gleixner wrote:
>> > > > > /me suspects hrtimer changes to be the real culprit.
>> > > >
>> > > > I'm not seeing anything right off, but it does smell like
>> > > > e06383db9ec591696a06654257474b85bac1f8cb would be where such an issue
>> > > > would crop up.
>> > > >
>> > > > Bruno, could you try checking out e06383db9ec, confirming it still
>> > > > occurs (and then maybe seeing if it goes away at e06383db9ec^1)?
>> > > >
>> > > > I'll keep digging in the meantime.
>> > >
>> > > I found the bug already. The problem is that sched_init() calls
>> > > init_rt_bandwidth() which calls hrtimer_init() _BEFORE_
>> > > hrtimers_init() is called.
>> > >
>> > > That was unnoticed so far as the CLOCK id to hrtimer base conversion
>> > > was hardcoded. Now we use a table which is set up at hrtimers_init(),
>> > > so the bandwidth hrtimer ends up on CLOCK_REALTIME because the table is
>> > > in the bss.
>> > >
>> > > The patch below fixes this, by providing the table statically rather
>> > > than runtime initialized. Though that whole ordering wants to be
>> > > revisited.
>> >
>> > Works here as well (applied alone), /proc/$(pidof rcu_kthread)/sched shows
>> > total runtime continuing to increase beyond 950 and slubs continue being
>> > released!
>>
>> Does the CPU time show up in top/ps as well now?
>
> Yes, it does (currently at 0:09 in ps for 9336.075 in
> /proc/$(pidof rcu_kthread)/sched)
>
> Thanks,
> Bruno
>

Just FYI: The patch is now in mainline (2.6.39-rc5-git3) [1][2].

- Sedat -

[1] http://git.us.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ce31332d3c77532d6ea97ddcb475a2b02dd358b4
[2] http://www.kernel.org/diff/diffview.cgi?file=/pub/linux/kernel/v2.6/snapshots/patch-2.6.39-rc5-git3.bz2