2005-11-22 13:00:04

by Jan Kasprzak

Subject: 2.6.14 kswapd eating too much CPU

Hello,

I have noticed that on my system kswapd eats too much CPU time every two
hours or so. This started when I upgraded this server to 2.6.14.2 (was 2.6.13.2
before), and added another 4 GB of memory (for a total of 8 GB). The server
runs mainly an FTP load (proftpd w/ sendfile()), with some other apps (Apache,
Qmail, etc.). It is a dual Opteron (Tyan S2882 board) with 8 ATA disks
on a 3ware 7506 controller with SW RAID volumes (RAID-1 and RAID-5),
with XFS and ext3 filesystems.

Here is the top(1) output when the kswapd problem occurs:

top - 13:45:34 up 8 days, 15:49, 2 users, load average: 3.37, 3.49, 3.44
Tasks: 325 total, 4 running, 321 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5% us, 75.6% sy, 0.8% ni, 20.3% id, 1.3% wa, 0.2% hi, 1.3% si
Mem: 8174528k total, 8151076k used, 23452k free, 49960k buffers
Swap: 14651256k total, 640k used, 14650616k free, 7562572k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16 root 15 0 0 0 0 R 96.4 0.0 1542:27 kswapd0
15 root 15 0 0 0 0 D 52.0 0.0 833:48.37 kswapd1
1056 rsyncd 35 19 11196 2608 928 R 3.0 0.0 0:42.29 rsync
2082 named 18 0 72816 17m 1888 S 0.7 0.2 47:47.55 named
18484 ftp 20 5 10028 1556 788 S 0.7 0.0 0:00.02 proftpd
26 root 10 -5 0 0 0 S 0.3 0.0 156:47.10 md5_raid5

It can be seen that one kswapd uses almost 100% of a CPU, and the other one
about 50%. The MRTG graph of CPU usage can be seen at

http://www.linux.cz/stats/mrtg-rrd/cpu.html

(Orange is the system+irq+softirq time, and red is non-idle time, i.e.
user+sys+nice+irq+softirq, without idle and without iowait.) The kernel
and memory upgrade was done on Saturday, November 12th.

The peaks of the CPU system time happen every two hours or so. The peak
from ~5am is wider, because at that time various cron daily jobs are run
(updatedb, and so on).

Any clues on what is going on here? Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <


2005-11-23 00:35:35

by Andrew Morton

Subject: Re: 2.6.14 kswapd eating too much CPU

Jan Kasprzak <[email protected]> wrote:
>
> I have noticed that on my system kswapd eats too much CPU time every two
> hours or so. This started when I upgraded this server to 2.6.14.2 (was 2.6.13.2
> before), and added another 4 GB of memory (for a total of 8 GB).

Next time it happens, please gather some memory info (while it's happening):

cat /proc/meminfo
cat /proc/vmstat
cat /proc/slabinfo
dmesg -c > /dev/null
echo m > /proc/sysrq-trigger
dmesg
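
For example, a small wrapper along these lines (an untested sketch; adjust
the output directory to taste) would capture all of it in one timestamped
snapshot:

#!/bin/sh
# Untested sketch: snapshot VM state while kswapd is spinning.
out=/tmp/kswapd-$(date +%s)
mkdir -p "$out"
cat /proc/meminfo  > "$out/meminfo"
cat /proc/vmstat   > "$out/vmstat"
cat /proc/slabinfo > "$out/slabinfo"
dmesg -c > /dev/null           # clear the ring buffer first
echo m > /proc/sysrq-trigger   # sysrq-m: dump memory info
dmesg > "$out/sysrq-m"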

Thanks.

2005-11-23 01:01:32

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Andrew Morton wrote:
: Jan Kasprzak <[email protected]> wrote:
: >
: > I have noticed that on my system kswapd eats too much CPU time every two
: > hours or so. This started when I upgraded this server to 2.6.14.2 (was 2.6.13.2
: > before), and added another 4 GB of memory (for a total of 8 GB).
:
: Next time it happens, please gather some memory info (while it's happening):
:
: cat /proc/meminfo

# cat /proc/meminfo
MemTotal: 8174528 kB
MemFree: 33312 kB
Buffers: 33472 kB
Cached: 7565188 kB
SwapCached: 0 kB
Active: 2457564 kB
Inactive: 5469168 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 8174528 kB
LowFree: 33312 kB
SwapTotal: 14651256 kB
SwapFree: 14650616 kB
Dirty: 67136 kB
Writeback: 0 kB
Mapped: 348848 kB
Slab: 159900 kB
CommitLimit: 18738520 kB
Committed_AS: 746976 kB
PageTables: 15596 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 9500 kB
VmallocChunk: 34359728803 kB

: cat /proc/vmstat

# cat /proc/vmstat
nr_dirty 15117
nr_writeback 0
nr_unstable 0
nr_page_table_pages 3975
nr_mapped 88352
nr_slab 39997
pgpgin 4793492568
pgpgout 470966429
pswpin 0
pswpout 160
pgalloc_high 0
pgalloc_normal 2450703504
pgalloc_dma 982887145
pgfree 3433597403
pgactivate 368526738
pgdeactivate 334812475
pgfault 1595083992
pgmajfault 64619
pgrefill_high 0
pgrefill_normal 533924428
pgrefill_dma 1995932
pgsteal_high 0
pgsteal_normal 923247535
pgsteal_dma 0
pgscan_kswapd_high 0
pgscan_kswapd_normal 940269891
pgscan_kswapd_dma 0
pgscan_direct_high 0
pgscan_direct_normal 13837131
pgscan_direct_dma 0
pginodesteal 11216563
slabs_scanned 160160350534400
kswapd_steal 909876526
kswapd_inodesteal 305039060
pageoutrun 30139677
allocstall 4067783
pgrotated 1198
nr_bounce 0

: cat /proc/slabinfo

# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfsd4_delegations 0 0 656 6 1 : tunables 54 27 8 : slabdata 0 0 0
nfsd4_stateids 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
nfsd4_files 0 0 72 53 1 : tunables 120 60 8 : slabdata 0 0 0
nfsd4_stateowners 0 0 424 9 1 : tunables 54 27 8 : slabdata 0 0 0
raid5/md5 256 260 1424 5 2 : tunables 24 12 8 : slabdata 52 52 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8 : slabdata 4 4 0
rpc_tasks 8 10 384 10 1 : tunables 54 27 8 : slabdata 1 1 0
rpc_inode_cache 8 8 832 4 1 : tunables 54 27 8 : slabdata 2 2 0
fib6_nodes 33 118 64 59 1 : tunables 120 60 8 : slabdata 2 2 0
ip6_dst_cache 26 36 320 12 1 : tunables 54 27 8 : slabdata 3 3 0
ndisc_cache 3 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
RAWv6 4 4 896 4 1 : tunables 54 27 8 : slabdata 1 1 0
UDPv6 1 4 896 4 1 : tunables 54 27 8 : slabdata 1 1 0
tw_sock_TCPv6 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCPv6 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
TCPv6 13 15 1536 5 2 : tunables 24 12 8 : slabdata 3 3 0
UNIX 460 654 640 6 1 : tunables 54 27 8 : slabdata 109 109 0
tcp_bind_bucket 233 2800 32 112 1 : tunables 120 60 8 : slabdata 25 25 0
inet_peer_cache 32 177 64 59 1 : tunables 120 60 8 : slabdata 3 3 0
ip_fib_alias 20 118 64 59 1 : tunables 120 60 8 : slabdata 2 2 0
ip_fib_hash 17 118 64 59 1 : tunables 120 60 8 : slabdata 2 2 0
ip_dst_cache 985 1776 320 12 1 : tunables 54 27 8 : slabdata 148 148 0
arp_cache 8 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
RAW 3 11 704 11 2 : tunables 54 27 8 : slabdata 1 1 0
UDP 35 40 768 5 1 : tunables 54 27 8 : slabdata 8 8 0
tw_sock_TCP 174 640 192 20 1 : tunables 120 60 8 : slabdata 32 32 0
request_sock_TCP 48 60 128 30 1 : tunables 120 60 8 : slabdata 2 2 0
TCP 333 525 1408 5 2 : tunables 24 12 8 : slabdata 105 105 0
dm_tio 0 0 24 144 1 : tunables 120 60 8 : slabdata 0 0 0
dm_io 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
scsi_cmd_cache 74 168 512 7 1 : tunables 54 27 8 : slabdata 19 24 30
cfq_ioc_pool 0 0 96 40 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_pool 0 0 160 24 1 : tunables 120 60 8 : slabdata 0 0 0
crq_pool 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
deadline_drq 0 0 96 40 1 : tunables 120 60 8 : slabdata 0 0 0
as_arq 159 408 112 34 1 : tunables 120 60 8 : slabdata 11 12 5
xfs_acl 0 0 304 13 1 : tunables 54 27 8 : slabdata 0 0 0
xfs_chashlist 4965 8512 32 112 1 : tunables 120 60 8 : slabdata 76 76 0
xfs_ili 213 260 192 20 1 : tunables 120 60 8 : slabdata 13 13 0
xfs_ifork 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
xfs_efi_item 0 0 352 11 1 : tunables 54 27 8 : slabdata 0 0 0
xfs_efd_item 0 0 360 11 1 : tunables 54 27 8 : slabdata 0 0 0
xfs_buf_item 74 84 184 21 1 : tunables 120 60 8 : slabdata 4 4 0
xfs_dabuf 52 288 24 144 1 : tunables 120 60 8 : slabdata 2 2 0
xfs_da_state 6 16 488 8 1 : tunables 54 27 8 : slabdata 2 2 0
xfs_trans 6 45 872 9 2 : tunables 54 27 8 : slabdata 3 5 0
xfs_inode 31498 31577 528 7 1 : tunables 54 27 8 : slabdata 4511 4511 0
xfs_btree_cur 13 40 192 20 1 : tunables 120 60 8 : slabdata 2 2 0
xfs_bmap_free_item 2 144 24 144 1 : tunables 120 60 8 : slabdata 1 1 0
xfs_buf 88 99 408 9 1 : tunables 54 27 8 : slabdata 11 11 0
xfs_ioend 32 54 144 27 1 : tunables 120 60 8 : slabdata 2 2 0
xfs_vnode 31498 31572 648 6 1 : tunables 54 27 8 : slabdata 5262 5262 0
nfs_write_data 36 36 832 9 2 : tunables 54 27 8 : slabdata 4 4 0
nfs_read_data 32 35 768 5 1 : tunables 54 27 8 : slabdata 7 7 0
nfs_inode_cache 184 184 992 4 1 : tunables 54 27 8 : slabdata 46 46 0
nfs_page 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
isofs_inode_cache 10 12 648 6 1 : tunables 54 27 8 : slabdata 2 2 0
journal_handle 81 288 24 144 1 : tunables 120 60 8 : slabdata 2 2 0
journal_head 681 1160 96 40 1 : tunables 120 60 8 : slabdata 29 29 384
revoke_table 6 202 16 202 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 3542 3805 808 5 1 : tunables 54 27 8 : slabdata 761 761 0
ext3_xattr 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
dnotify_cache 1 92 40 92 1 : tunables 120 60 8 : slabdata 1 1 0
dquot 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_pwq 0 0 72 53 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_event_cache 0 0 40 92 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_watch_cache 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
kioctx 0 0 320 12 1 : tunables 54 27 8 : slabdata 0 0 0
kiocb 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
fasync_cache 0 0 24 144 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 818 825 808 5 1 : tunables 54 27 8 : slabdata 165 165 0
posix_timers_cache 0 0 168 23 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 47 118 64 59 1 : tunables 120 60 8 : slabdata 2 2 0
sgpool-128 32 32 4096 1 1 : tunables 24 12 8 : slabdata 32 32 0
sgpool-64 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0
sgpool-32 65 120 1024 4 1 : tunables 54 27 8 : slabdata 22 30 3
sgpool-16 36 40 512 8 1 : tunables 54 27 8 : slabdata 5 5 0
sgpool-8 109 195 256 15 1 : tunables 120 60 8 : slabdata 13 13 1
blkdev_ioc 231 469 56 67 1 : tunables 120 60 8 : slabdata 7 7 0
blkdev_queue 80 88 712 11 2 : tunables 54 27 8 : slabdata 8 8 0
blkdev_requests 283 375 264 15 1 : tunables 54 27 8 : slabdata 22 25 216
biovec-(256) 260 260 4096 1 1 : tunables 24 12 8 : slabdata 260 260 0
biovec-128 264 264 2048 2 1 : tunables 24 12 8 : slabdata 132 132 0
biovec-64 304 380 1024 4 1 : tunables 54 27 8 : slabdata 81 95 3
biovec-16 285 300 256 15 1 : tunables 120 60 8 : slabdata 20 20 0
biovec-4 284 354 64 59 1 : tunables 120 60 8 : slabdata 6 6 0
biovec-1 841 4040 16 202 1 : tunables 120 60 8 : slabdata 20 20 480
bio 870 2430 128 30 1 : tunables 120 60 8 : slabdata 79 81 480
file_lock_cache 38 48 160 24 1 : tunables 120 60 8 : slabdata 2 2 0
sock_inode_cache 832 1250 704 5 1 : tunables 54 27 8 : slabdata 250 250 0
skbuff_fclone_cache 4963 7224 448 8 1 : tunables 54 27 8 : slabdata 903 903 162
skbuff_head_cache 542 675 256 15 1 : tunables 120 60 8 : slabdata 45 45 0
acpi_operand 1028 1060 72 53 1 : tunables 120 60 8 : slabdata 20 20 0
acpi_parse_ext 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
acpi_parse 0 0 40 92 1 : tunables 120 60 8 : slabdata 0 0 0
acpi_state 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
proc_inode_cache 910 954 632 6 1 : tunables 54 27 8 : slabdata 159 159 20
sigqueue 10 46 168 23 1 : tunables 120 60 8 : slabdata 2 2 0
radix_tree_node 70767 79870 536 7 1 : tunables 54 27 8 : slabdata 11410 11410 0
bdev_cache 55 56 832 4 1 : tunables 54 27 8 : slabdata 14 14 0
sysfs_dir_cache 3346 3445 72 53 1 : tunables 120 60 8 : slabdata 65 65 0
mnt_cache 39 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0
inode_cache 1164 1224 600 6 1 : tunables 54 27 8 : slabdata 204 204 0
dentry_cache 44291 48569 224 17 1 : tunables 120 60 8 : slabdata 2857 2857 51
filp 3106 5955 256 15 1 : tunables 120 60 8 : slabdata 397 397 48
names_cache 6 6 4096 1 1 : tunables 24 12 8 : slabdata 6 6 0
idr_layer_cache 90 98 528 7 1 : tunables 54 27 8 : slabdata 14 14 0
buffer_head 218819 233948 88 44 1 : tunables 120 60 8 : slabdata 5317 5317 0
mm_struct 272 483 1152 7 2 : tunables 24 12 8 : slabdata 69 69 3
vm_area_struct 17201 26670 184 21 1 : tunables 120 60 8 : slabdata 1270 1270 480
fs_cache 267 708 64 59 1 : tunables 120 60 8 : slabdata 12 12 0
files_cache 290 440 896 4 1 : tunables 54 27 8 : slabdata 110 110 0
signal_cache 343 546 640 6 1 : tunables 54 27 8 : slabdata 91 91 0
sighand_cache 316 423 2112 3 2 : tunables 24 12 8 : slabdata 141 141 3
task_struct 333 492 1728 4 2 : tunables 24 12 8 : slabdata 123 123 3
anon_vma 1474 3168 24 144 1 : tunables 120 60 8 : slabdata 22 22 12
shared_policy_node 0 0 56 67 1 : tunables 120 60 8 : slabdata 0 0 0
numa_policy 36 404 16 202 1 : tunables 120 60 8 : slabdata 2 2 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-32768 20 20 32768 1 8 : tunables 8 4 0 : slabdata 20 20 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 1 1 16384 1 4 : tunables 8 4 0 : slabdata 1 1 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 36 37 8192 1 2 : tunables 8 4 0 : slabdata 36 37 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 353 357 4096 1 1 : tunables 24 12 8 : slabdata 353 357 15
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 978 988 2048 2 1 : tunables 24 12 8 : slabdata 492 494 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 5016 5800 1024 4 1 : tunables 54 27 8 : slabdata 1450 1450 189
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 720 776 512 8 1 : tunables 54 27 8 : slabdata 97 97 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 88 135 256 15 1 : tunables 120 60 8 : slabdata 9 9 0
size-192(DMA) 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
size-192 1786 2600 192 20 1 : tunables 120 60 8 : slabdata 130 130 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 18281 33158 64 59 1 : tunables 120 60 8 : slabdata 562 562 0
size-32(DMA) 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 5450 9180 128 30 1 : tunables 120 60 8 : slabdata 306 306 1
size-32 950 1232 32 112 1 : tunables 120 60 8 : slabdata 11 11 0
kmem_cache 156 156 640 6 1 : tunables 54 27 8 : slabdata 26 26 0

: dmesg -c > /dev/null
: echo m > /proc/sysrq-trigger
: dmesg

I don't have sysrq compiled in, sorry. I will recompile/boot
a new kernel maybe tomorrow.

I have found that the two-hour peaks in system time come from a cron job
which does full-text indexing of the archive of the mailing list I run -
so it accesses many small files (mailing list posts) and updates the large
"inverted index" database. However - this time I also have kswapd0 at
95% CPU, and the next ~50% is proftpd on the other CPU, in the
"R" state, which cannot be killed even with SIGKILL:

top - 01:56:18 up 9 days, 4:00, 2 users, load average: 4.91, 4.46, 5.71
Tasks: 310 total, 12 running, 298 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.8% us, 73.8% sy, 2.2% ni, 1.5% id, 17.1% wa, 0.2% hi, 1.5% si
Mem: 8174528k total, 8144104k used, 30424k free, 33552k buffers
Swap: 14651256k total, 640k used, 14650616k free, 7570548k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16 root 15 0 0 0 0 R 91.8 0.0 1628:43 kswapd0
25345 ftp 22 5 12448 3988 872 R 49.4 0.0 14:30.79 proftpd
21680 apache 17 0 28448 9220 2260 R 2.7 0.1 0:01.39 ezarchive.fcgi
1449 rsyncd 35 19 19836 11m 948 S 2.3 0.1 9:35.26 rsync
24152 rsyncd 34 19 12764 4568 936 R 1.7 0.1 0:31.54 rsync
30811 rsyncd 35 19 10424 1664 896 S 1.7 0.0 0:13.97 rsync

Hope this helps,

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-23 01:17:30

by Nick Piggin

Subject: Re: 2.6.14 kswapd eating too much CPU

Jan Kasprzak wrote:
> Andrew Morton wrote:
> : Jan Kasprzak <[email protected]> wrote:
> : >
> : > I have noticed that on my system kswapd eats too much CPU time every two
> : > hours or so. This started when I upgraded this server to 2.6.14.2 (was 2.6.13.2
> : > before), and added another 4 GB of memory (for a total of 8 GB).
> :
> : Next time it happens, please gather some memory info (while it's happening):
> :
> : cat /proc/meminfo
>
[snip]

> Hope this helps,
>

Can't see anything yet. Sysrq-M would be good. cat /proc/zoneinfo gets you
most of the way there though.

A couple of samples would be handy, especially from /proc/vmstat.

cat /proc/vmstat > vmstat.out ; sleep 10 ; cat /proc/vmstat >> vmstat.out

The same for /proc/zoneinfo would be a good idea as well.
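
If it helps, the deltas are usually more telling than the raw counters - a
rough awk sketch (untested; assumes the two samples concatenated in
vmstat.out as above) prints just the counters that changed:

# Rough sketch: first sample fills a[], second sample prints deltas.
awk '!($1 in a) { a[$1] = $2; next }
     { d = $2 - a[$1]; if (d != 0) print $1, d }' vmstat.out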

Also - when you say "too much cpu time", what does this mean? Does
performance noticeably drop compared with 2.6.13 performing the same cron
job? Because kswapd is supposed to unburden other processes from page
reclaim work, so if it is working *properly*, then the more CPU it uses
the better.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


2005-11-23 05:14:08

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Nick Piggin wrote:
: Can't see anything yet. Sysrq-M would be good. cat /proc/zoneinfo gets you
: most of the way there though.
:
: A couple of samples would be handy, especially from /proc/vmstat.
:
: cat /proc/vmstat > vmstat.out ; sleep 10 ; cat /proc/vmstat >> vmstat.out
:
: The same for /proc/zoneinfo would be a good idea as well.

I will send it tomorrow - I will try to boot 2.6.15-rc2
to see if the problem is still there.

: Also - when you say "too much cpu time", what does this mean? Does
: performance noticeably drop compared with 2.6.13 performing the same cron
: job? Because kswapd is supposed to unburden other processes from page
: reclaim work, so if it is working *properly*, then the more CPU it uses
: the better.

As you can see from the graphs at
http://www.linux.cz/stats/mrtg-rrd/cpu.html
- the total CPU usage with the new kernel and more memory is significantly
higher. I have not measured the time of the cron job itself, just the total
system utilization.

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-23 13:14:27

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Jan Kasprzak wrote:
: Nick Piggin wrote:
: : Can't see anything yet. Sysrq-M would be good. cat /proc/zoneinfo gets you
: : most of the way there though.
: :
: : A couple of samples would be handy, especially from /proc/vmstat.
: :
: : cat /proc/vmstat > vmstat.out ; sleep 10 ; cat /proc/vmstat >> vmstat.out
: :
: : The same for /proc/zoneinfo would be a good idea as well.
:
: I will send it tomorrow - I will try to boot 2.6.15-rc2
: to see if the problem is still there.
:
I am at 2.6.15-rc2 now, the problem is still there.
Currently according to top(1), kswapd1 eats >98% CPU for 50 minutes now
and counting. The last cron job which apparently triggers this was started
at 13:15, and now it is 14:07; the cron job itself finished after a few
minutes, and no processes of the user who runs it remain in the system,
but kswapd still eats CPU.

At the end of this mail you can find the top(1) output,
two dumps of /proc/vmstat and /proc/zoneinfo separated by a 10-second
sleep, and the sysrq-m output, which was taken at about the same time
as the vmstat and zoneinfo outputs.

What else could be unusual about my system? I run an 8-way RAID-5 volume,
but it is OK according to /proc/mdstat. I run each memory bus at a
different speed - CPU0 has 400MHz memory modules, CPU1 has 333MHz
modules. This is a supported configuration according to the Tyan docs.
Anything else?

-Yenya

=========================== top:
top - 14:04:33 up 7:40, 5 users, load average: 5.05, 3.32, 2.93
Tasks: 418 total, 2 running, 416 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5% us, 52.1% sy, 0.7% ni, 40.4% id, 3.8% wa, 0.0% hi, 2.5% si
Mem: 8174368k total, 8148188k used, 26180k free, 43956k buffers
Swap: 14651256k total, 656k used, 14650600k free, 7677580k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17 root 25 0 0 0 0 R 98.8 0.0 67:58.25 kswapd1
28 root 10 -5 0 0 0 S 1.0 0.0 4:59.17 md5_raid5
11421 ftp 20 5 10036 1504 764 S 1.0 0.0 0:00.05 proftpd
18 root 15 0 0 0 0 S 0.7 0.0 23:39.88 kswapd0


============================== vmstat:
nr_dirty 1096
nr_writeback 0
nr_unstable 0
nr_page_table_pages 7897
nr_mapped 54381
nr_slab 31427
pgpgin 168267503
pgpgout 16779856
pswpin 0
pswpout 164
pgalloc_high 0
pgalloc_normal 68192735
pgalloc_dma 78742215
pgfree 146943386
pgactivate 11885097
pgdeactivate 10486693
pgfault 94787199
pgmajfault 3619
pgrefill_high 0
pgrefill_normal 7394476
pgrefill_dma 7988611
pgsteal_high 0
pgsteal_normal 13516901
pgsteal_dma 18774322
pgscan_kswapd_high 0
pgscan_kswapd_normal 12889635
pgscan_kswapd_dma 18124029
pgscan_direct_high 0
pgscan_direct_normal 1419165
pgscan_direct_dma 1397649
pginodesteal 1788228
slabs_scanned 8055609881856
kswapd_steal 29616872
kswapd_inodesteal 7587662
pageoutrun 997021
allocstall 168227
pgrotated 900
nr_bounce 0
============================== zoneinfo:
Node 1, zone Normal
pages free 1809
min 1423
low 1778
high 2134
active 251683
inactive 753741
scanned 66 (a: 0 i: 0)
spanned 1048576
present 1013647
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 66
low: 0
high: 384
batch: 64
cpu: 0 pcp: 1
count: 127
low: 0
high: 128
batch: 32
numa_hit: 2155
numa_miss: 26392398
numa_foreign: 150
interleave_hit: 2070
local_node: 0
other_node: 26394553
cpu: 1 pcp: 0
count: 68
low: 0
high: 384
batch: 64
cpu: 1 pcp: 1
count: 111
low: 0
high: 128
batch: 32
numa_hit: 41672333
numa_miss: 1482
numa_foreign: 5904214
interleave_hit: 5717
local_node: 41673815
other_node: 0
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 1048576
Node 0, zone DMA
pages free 2855
min 3
low 3
high 4
active 0
inactive 0
scanned 0 (a: 12 i: 12)
spanned 4096
present 2761
protection: (0, 3944, 3944, 3944)
pagesets
cpu: 1 pcp: 0
count: 10
low: 0
high: 12
batch: 2
cpu: 1 pcp: 1
count: 0
low: 0
high: 4
batch: 1
numa_hit: 5286426
numa_miss: 0
numa_foreign: 0
interleave_hit: 3821307
local_node: 0
other_node: 5286426
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 0
Node 0, zone DMA32
pages free 2912
min 1418
low 1772
high 2127
active 33310
inactive 946725
scanned 0 (a: 27 i: 0)
spanned 1044480
present 1009704
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 51
low: 0
high: 384
batch: 64
cpu: 0 pcp: 1
count: 20
low: 0
high: 128
batch: 32
numa_hit: 47176253
numa_miss: 150
numa_foreign: 26392398
interleave_hit: 1788
local_node: 47176403
other_node: 0
cpu: 1 pcp: 0
count: 345
low: 0
high: 384
batch: 64
cpu: 1 pcp: 1
count: 62
low: 0
high: 128
batch: 32
numa_hit: 5081
numa_miss: 5904214
numa_foreign: 1482
interleave_hit: 4507
local_node: 0
other_node: 5909295
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 4096
============================== sleep 10:
============================== vmstat:
nr_dirty 922
nr_writeback 0
nr_unstable 0
nr_page_table_pages 7873
nr_mapped 54376
nr_slab 31318
pgpgin 168411599
pgpgout 16784255
pswpin 0
pswpout 164
pgalloc_high 0
pgalloc_normal 68198690
pgalloc_dma 78829202
pgfree 147035725
pgactivate 11887559
pgdeactivate 10489812
pgfault 94831072
pgmajfault 3621
pgrefill_high 0
pgrefill_normal 7396819
pgrefill_dma 7997162
pgsteal_high 0
pgsteal_normal 13519173
pgsteal_dma 18807281
pgscan_kswapd_high 0
pgscan_kswapd_normal 12889635
pgscan_kswapd_dma 18155676
pgscan_direct_high 0
pgscan_direct_normal 1421508
pgscan_direct_dma 1399992
pginodesteal 1788228
slabs_scanned 8065934875904
kswapd_steal 29647559
kswapd_inodesteal 7587889
pageoutrun 998045
allocstall 168298
pgrotated 900
nr_bounce 0
============================== zoneinfo:
Node 1, zone Normal
pages free 1765
min 1423
low 1778
high 2134
active 250416
inactive 754970
scanned 198 (a: 0 i: 0)
spanned 1048576
present 1013647
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 2
low: 0
high: 384
batch: 64
cpu: 0 pcp: 1
count: 127
low: 0
high: 128
batch: 32
numa_hit: 2155
numa_miss: 26395832
numa_foreign: 151
interleave_hit: 2070
local_node: 0
other_node: 26397987
cpu: 1 pcp: 0
count: 248
low: 0
high: 384
batch: 64
cpu: 1 pcp: 1
count: 84
low: 0
high: 128
batch: 32
numa_hit: 41674836
numa_miss: 1482
numa_foreign: 5908193
interleave_hit: 5718
local_node: 41676318
other_node: 0
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 1048576
Node 0, zone DMA
pages free 2854
min 3
low 3
high 4
active 0
inactive 0
scanned 0 (a: 17 i: 17)
spanned 4096
present 2761
protection: (0, 3944, 3944, 3944)
pagesets
cpu: 1 pcp: 0
count: 11
low: 0
high: 12
batch: 2
cpu: 1 pcp: 1
count: 0
low: 0
high: 4
batch: 1
numa_hit: 5288040
numa_miss: 0
numa_foreign: 0
interleave_hit: 3822015
local_node: 0
other_node: 5288040
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 0
Node 0, zone DMA32
pages free 2143
min 1418
low 1772
high 2127
active 33907
inactive 946983
scanned 0 (a: 0 i: 0)
spanned 1044480
present 1009704
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 146
low: 0
high: 384
batch: 64
cpu: 0 pcp: 1
count: 54
low: 0
high: 128
batch: 32
numa_hit: 47239978
numa_miss: 151
numa_foreign: 26395832
interleave_hit: 1790
local_node: 47240129
other_node: 0
cpu: 1 pcp: 0
count: 366
low: 0
high: 384
batch: 64
cpu: 1 pcp: 1
count: 31
low: 0
high: 128
batch: 32
numa_hit: 5082
numa_miss: 5908193
numa_foreign: 1482
interleave_hit: 4508
local_node: 0
other_node: 5913275
all_unreclaimable: 0
prev_priority: 12
temp_priority: 12
start_pfn: 4096

======================== dmesg after sysrq-m ============================
SysRq : Show Memory
Mem-info:
Node 1 DMA per-cpu: empty
Node 1 DMA32 per-cpu: empty
Node 1 Normal per-cpu:
cpu 0 hot: low 0, high 384, batch 64 used:59
cpu 0 cold: low 0, high 128, batch 32 used:24
cpu 1 hot: low 0, high 384, batch 64 used:84
cpu 1 cold: low 0, high 128, batch 32 used:93
Node 1 HighMem per-cpu: empty
Node 0 DMA per-cpu:
cpu 0 hot: low 0, high 12, batch 2 used:1
cpu 0 cold: low 0, high 4, batch 1 used:0
cpu 1 hot: low 0, high 12, batch 2 used:10
cpu 1 cold: low 0, high 4, batch 1 used:0
Node 0 DMA32 per-cpu:
cpu 0 hot: low 0, high 384, batch 64 used:334
cpu 0 cold: low 0, high 128, batch 32 used:31
cpu 1 hot: low 0, high 384, batch 64 used:66
cpu 1 cold: low 0, high 128, batch 32 used:31
Node 0 Normal per-cpu: empty
Node 0 HighMem per-cpu: empty
Free pages: 25780kB (0kB HighMem)
Active:283000 inactive:1704247 dirty:1177 writeback:1 unstable:0 free:6445 slab:31088 mapped:54478 pagetables:7852
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 3959 3959
Node 1 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 3959 3959
Node 1 Normal free:7320kB min:5692kB low:7112kB high:8536kB active:996400kB inactive:3025596kB present:4054588kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA free:11412kB min:12kB low:12kB high:16kB active:0kB inactive:0kB present:11044kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 3944 3944 3944
Node 0 DMA32 free:7048kB min:5672kB low:7088kB high:8508kB active:135612kB inactive:3791400kB present:4038816kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 DMA: empty
Node 1 DMA32: empty
Node 1 Normal: 10*4kB 28*8kB 13*16kB 4*32kB 1*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 7320kB
Node 1 HighMem: empty
Node 0 DMA: 19*4kB 19*8kB 19*16kB 18*32kB 17*64kB 12*128kB 6*256kB 6*512kB 3*1024kB 0*2048kB 0*4096kB = 11412kB
Node 0 DMA32: 0*4kB 1*8kB 14*16kB 1*32kB 2*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 1*4096kB = 7048kB
Node 0 Normal: empty
Node 0 HighMem: empty
Swap cache: add 164, delete 164, find 0/0, race 0+0
Free swap = 14650600kB
Total swap = 14651256kB
Free swap: 14650600kB
2097152 pages of RAM
53560 reserved pages
278668 pages shared
0 pages swap cached

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-23 19:03:00

by Andrew Morton

Subject: Re: 2.6.14 kswapd eating too much CPU

Jan Kasprzak <[email protected]> wrote:
>
> I am at 2.6.15-rc2 now, the problem is still there.
> Currently according to top(1), kswapd1 eats >98% CPU for 50 minutes now
> and counting.

When it's doing this, could you do sysrq-p a few times? The output of that
should tell us where the CPU is executing.

2005-11-23 20:25:05

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Andrew Morton wrote:
: Jan Kasprzak <[email protected]> wrote:
: >
: > I am at 2.6.15-rc2 now, the problem is still there.
: > Currently according to top(1), kswapd1 eats >98% CPU for 50 minutes now
: > and counting.
:
: When it's doing this, could you do sysrq-p a few times? The output of that
: should tell us where the CPU is executing.

Hmm, it does not show anything but the header. Should I enable
something special in the kernel?

# dmesg -c >/dev/null; echo -n p >/proc/sysrq-trigger ; sleep 5; dmesg
SysRq : Show Regs
#

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-23 20:34:05

by Jesper Juhl

Subject: Re: 2.6.14 kswapd eating too much CPU

On 11/23/05, Jan Kasprzak <[email protected]> wrote:
> Andrew Morton wrote:
> : Jan Kasprzak <[email protected]> wrote:
> : >
> : > I am at 2.6.15-rc2 now, the problem is still there.
> : > Currently according to top(1), kswapd1 eats >98% CPU for 50 minutes now
> : > and counting.
> :
> : When it's doing this, could you do sysrq-p a few times? The output of that
> : should tell us where the CPU is executing.
>
> Hmm, it does not show anything but the header. Should I enable
> something special in the kernel?
>

CONFIG_MAGIC_SYSRQ=y
(it's in 'Kernel hacking')


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2005-11-23 20:35:57

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Jesper Juhl wrote:
: On 11/23/05, Jan Kasprzak <[email protected]> wrote:
: >
: > Hmm, it does not show anything but the header. Should I enable
: > something special in the kernel?
: >
:
: CONFIG_MAGIC_SYSRQ=y
: (it's in 'Kernel hacking')
:
I of course have this:

# zgrep -i SYSRQ /proc/config.gz
CONFIG_MAGIC_SYSRQ=y
#

otherwise I would not have the /proc/sysrq-trigger file at all, and I would not
be able to post the sysrq-m outputs like I did earlier today.

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-23 20:37:23

by Andrew Morton

Subject: Re: 2.6.14 kswapd eating too much CPU

Jan Kasprzak <[email protected]> wrote:
>
> Andrew Morton wrote:
> : Jan Kasprzak <[email protected]> wrote:
> : >
> : > I am at 2.6.15-rc2 now, the problem is still there.
> : > Currently according to top(1), kswapd1 eats >98% CPU for 50 minutes now
> : > and counting.
> :
> : When it's doing this, could you do sysrq-p a few times? The output of that
> : should tell us where the CPU is executing.
>
> Hmm, it does not show anything but the header. Should I enable
> something special in the kernel?

Try `dmesg -n 7' first.

> # dmesg -c >/dev/null; echo -n p >/proc/sysrq-trigger ; sleep 5; dmesg
> SysRq : Show Regs
> #

You won't get anything useful from sysrq-p via /proc/sysrq-trigger - it'll
just show the backtrace of the process `echo'. It has to be via the
keyboard.

If there's no keyboard, do `echo t > /proc/sysrq-trigger' to get an
all-task backtrace, then locate the trace for kswapd in the resulting
output.
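
Something like this (a rough sketch - the -A count is a guess at how many
lines each trace spans) should pull just the kswapd entries out of the flood:

dmesg -n 7                      # let KERN_DEBUG output through first
echo t > /proc/sysrq-trigger    # all-task backtrace
dmesg | grep -A8 '^kswapd'      # kswapd task lines plus a few lines of trace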

Thanks.

2005-11-24 08:31:56

by Jan Kasprzak

Subject: Re: 2.6.14 kswapd eating too much CPU

Andrew Morton wrote:
: > # dmesg -c >/dev/null; echo -n p >/proc/sysrq-trigger ; sleep 5; dmesg
: > SysRq : Show Regs
: > #
:
: You won't get anything useful from sysrq-p via /proc/sysrq-trigger - it'll
: just show the backtrace of the process `echo'. It has to be via the
: keyboard.

OK, at the end of this mail I am attaching a few instances of
( top -n 1 -b -p 17,18 ; dmesg -c ) after sysrq-p entered via the
serial console. However, I now have kswapd0 at 99% CPU and kswapd1 at
about 50%, and sysrq-p is apparently handled by CPU1, so I get
traces of kswapd1 only. How can I switch the serial interrupt to CPU0?

: If there's no keyboard, do `echo t > /proc/sysrq-trigger' to get an
: all-task backtrace, then locate the trace for kswapd in the resulting
: output.

I had to increase CONFIG_LOG_BUF_SHIFT first, because the sysrq-t
output was too long, but nevertheless - for the running processes there
are no call traces, just a note that the process is in the "R" state. Here
are a few excerpts of the sysrq-t output:

-----
kswapd1 S ffff810100000000 0 17 1 18 11 (L-TLB)
ffff8101047bfea8 0000000000000046 ffff8101047bfe08 0000000000000292
0000000000000292 ffffffff80160589 ffff8101ae564240 000000008012cf35
0000000000000076 ffff8101ae564240
Call Trace:<ffffffff80160589>{balance_pgdat+297} <ffffffff80160936>{kswapd+278}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
<ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
kswapd0 R running task 0 18 1 33 17 (L-TLB)
-----
kswapd1 S ffff810100000000 0 17 1 18 11 (L-TLB)
ffff8101047bfea8 0000000000000046 ffff8101047bfe08 0000000000000292
0000000000000292 ffffffff80160589 ffff8101ae564240 000000008012cf35
0000000000000076 ffff8101ae564240
Call Trace:<ffffffff80160589>{balance_pgdat+297} <ffffffff80160936>{kswapd+278}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
<ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
kswapd0 R running task 0 18 1 33 17 (L-TLB)
-----

And here are the sysrq-p outputs:

top - 09:17:11 up 10:09, 2 users, load average: 13.56, 7.14, 3.45
Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2% us, 20.4% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
Mem: 8173472k total, 5769752k used, 2403720k free, 35436k buffers
Swap: 14651256k total, 656k used, 14650600k free, 5388176k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18 root 15 0 0 0 0 R 91.9 0.0 118:24.00 kswapd0
17 root 15 0 0 0 0 R 49.8 0.0 35:17.26 kswapd1

SysRq : Show Regs
CPU 1:
Modules linked in: binfmt_misc floppy
Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
RIP: 0010:[<ffffffff8019266b>] <ffffffff8019266b>{shrink_icache_memory+107}
RSP: 0000:ffff8101047bfd78 EFLAGS: 00000246
RAX: 00000000ffffffff RBX: 00000000000000c8 RCX: ffff8101046f6140
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffffffff804683e0
RBP: 0000000000200200 R08: ffff8101047be000 R09: 0000000000000003
R10: 0000000000000000 R11: ffffffff8028caf0 R12: ffffffff8012c533
R13: ffff8101047bfd18 R14: 0000000000000003 R15: 0000000000000296
FS: 00002aaaaac3ed20(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaacc1000 CR3: 00000001eeef9000 CR4: 00000000000006e0

Call Trace:<ffffffff80192814>{shrink_icache_memory+532} <ffffffff8015f20b>{shrink_slab+219}
<ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
<ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}

top - 09:17:28 up 10:09, 2 users, load average: 17.42, 8.31, 3.89
Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2% us, 20.4% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
Mem: 8173472k total, 5980464k used, 2193008k free, 36076k buffers
Swap: 14651256k total, 656k used, 14650600k free, 5593984k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18 root 15 0 0 0 0 R 96.4 0.0 118:39.98 kswapd0
17 root 15 0 0 0 0 R 60.3 0.0 35:26.37 kswapd1

SysRq : Show Regs
CPU 1:
Modules linked in: binfmt_misc floppy
Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
RIP: 0010:[<ffffffff803e8ab3>] <ffffffff803e8ab3>{_spin_lock+3}
RSP: 0000:ffff8101047bfd40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffff803e8951 RCX: ffff8101046f6140
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffffffff80468400
RBP: 0000000000000296 R08: ffff8101047be000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8101047bfd88
R13: ffff8101047bfcc8 R14: ffff8101046f6140 R15: ffffffff804683e8
FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000056f448 CR3: 000000003f0ee000 CR4: 00000000000006e0

Call Trace:<ffffffff801925dc>{dispose_list+220} <ffffffff8019260b>{shrink_icache_memory+11}
<ffffffff801927dc>{shrink_icache_memory+476} <ffffffff8015f20b>{shrink_slab+219}
<ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
<ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}

top - 09:17:46 up 10:10, 2 users, load average: 15.03, 8.36, 4.00
Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2% us, 20.5% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
Mem: 8173472k total, 6117468k used, 2056004k free, 36460k buffers
Swap: 14651256k total, 656k used, 14650600k free, 5731640k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18 root 15 0 0 0 0 R 86.2 0.0 118:56.67 kswapd0
17 root 15 0 0 0 0 R 60.1 0.0 35:36.46 kswapd1

SysRq : Show Regs
CPU 1:
Modules linked in: binfmt_misc floppy
Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
RIP: 0010:[<ffffffff803e760e>] <ffffffff803e760e>{thread_return+94}
RSP: 0000:ffff8101047bfc08 EFLAGS: 00000246
RAX: ffff810103841860 RBX: 000000000050fb40 RCX: 0000000000000018
RDX: ffff810058446e40 RSI: ffff8101046f6140 RDI: 0000000000000000
RBP: 0000000000000246 R08: ffff8101047be000 R09: 0000000000000004
R10: 000000000050fb40 R11: 00000000ffffffff R12: ffffffff802645a8
R13: 0000000000000296 R14: ffff8101de939700 R15: 0000000000000008
FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaaaaf0b8 CR3: 000000003f0ee000 CR4: 00000000000006e0

Call Trace:<ffffffff8012c4c0>{__wake_up_common+64} <ffffffff803e88f8>{__down+152}
<ffffffff8012c470>{default_wake_function+0} <ffffffff803e86e8>{__down_failed+53}
<ffffffff80156440>{mempool_free_slab+0} <ffffffff80192cf6>{.text.lock.inode+5}
<ffffffff8015f20b>{shrink_slab+219} <ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff8010e7d6>{child_rip+8}
<ffffffff8011ac50>{flat_send_IPI_mask+0} <ffffffff80160820>{kswapd+0}
<ffffffff8010e7ce>{child_rip+0}
top - 09:18:32 up 10:10, 2 users, load average: 10.25, 8.02, 4.09
Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2% us, 20.5% sy, 3.0% ni, 65.8% id, 6.6% wa, 0.1% hi, 1.8% si
Mem: 8173472k total, 6259616k used, 1913856k free, 37192k buffers
Swap: 14651256k total, 656k used, 14650600k free, 5860380k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18 root 15 0 0 0 0 R 96.1 0.0 119:38.63 kswapd0
17 root 15 0 0 0 0 R 56.0 0.0 36:03.12 kswapd1

SysRq : Show Regs
CPU 1:
Modules linked in: binfmt_misc floppy
Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
RIP: 0010:[<ffffffff80192827>] <ffffffff80192827>{shrink_icache_memory+551}
RSP: 0000:ffff8101047bfd78 EFLAGS: 00000256
RAX: 0000000051eb851f RBX: 0000000000000000 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
RBP: 0000000000200200 R08: ffff8101047be000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80156440 R12: 0000000000100100
R13: ffffffff8012c470 R14: ffff8101046f6140 R15: 0000000000000001
FS: 00002aaaaac3ed20(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaae70000 CR3: 00000000f8625000 CR4: 00000000000006e0

Call Trace:<ffffffff80192814>{shrink_icache_memory+532} <ffffffff8015f1fe>{shrink_slab+206}
<ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
<ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}

top - 09:18:48 up 10:11, 2 users, load average: 9.96, 8.08, 4.17
Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.2% us, 20.6% sy, 3.0% ni, 65.8% id, 6.6% wa, 0.1% hi, 1.8% si
Mem: 8173472k total, 6361680k used, 1811792k free, 37592k buffers
Swap: 14651256k total, 656k used, 14650600k free, 5962320k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18 root 15 0 0 0 0 R 90.2 0.0 119:53.39 kswapd0
17 root 15 0 0 0 0 R 70.2 0.0 36:12.62 kswapd1

SysRq : Show Regs
CPU 1:
Modules linked in: binfmt_misc floppy
Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
RIP: 0010:[<ffffffff80156e54>] <ffffffff80156e54>{__mod_page_state+36}
RSP: 0000:ffff8101047bfdc8 EFLAGS: 00000292
RAX: ffff810103844e00 RBX: ffffffff801925dc RCX: ffff8101046f6140
RDX: ffffffff8066dca0 RSI: 0000000000000080 RDI: 00000000000000f8
RBP: 0000000000000000 R08: ffff8101047be000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff80156440 R12: 0000000000000000
R13: ffffffff80156440 R14: ffffffff803e86e8 R15: ffffffff804683e0
FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000004472f0 CR3: 000000003f0ee000 CR4: 00000000000006e0

Call Trace:<ffffffff8015f22f>{shrink_slab+255} <ffffffff801606c9>{balance_pgdat+617}
<ffffffff80160957>{kswapd+311} <ffffffff801472a0>{autoremove_wake_function+0}
<ffffffff801472a0>{autoremove_wake_function+0} <ffffffff8010e7d6>{child_rip+8}
<ffffffff8011ac50>{flat_send_IPI_mask+0} <ffffffff80160820>{kswapd+0}
<ffffffff8010e7ce>{child_rip+0}


-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-27 14:22:09

by Marcelo Tosatti

Subject: Re: 2.6.14 kswapd eating too much CPU

On Thu, Nov 24, 2005 at 09:31:41AM +0100, Jan Kasprzak wrote:
> Andrew Morton wrote:
> : > # dmesg -c >/dev/null; echo -n p >/proc/sysrq-trigger ; sleep 5; dmesg
> : > SysRq : Show Regs
> : > #
> :
> : You won't get anything useful from sysrq-p via /proc/sysrq-trigger - it'll
> : just show the backtrace of the process `echo'. It has to be via the
> : keyboard.
>
> OK, at the end of this mail I am attaching a few instances of
> ( top -n 1 -b -p 17,18 ; dmesg -c ) after sysrq-p entered via the
> serial console. However, I now have kswapd0 at 99% CPU and kswapd1 at
> about 50%, and sysrq-p is apparently handled by CPU1, so I get
> traces of kswapd1 only. How can I switch the serial interrupt to CPU0?
>
> : If there's no keyboard, do `echo t > /proc/sysrq-trigger' to get an
> : all-task backtrace, then locate the trace for kswapd in the resulting
> : output.
>
> I had to increase CONFIG_LOG_BUF_SHIFT first, because the sysrq-t
> output was too long, but nevertheless - for the running processes there
> are no call traces, just a note that the process is in the "R" state. Here
> are a few excerpts of the sysrq-t output:
>
> -----
> kswapd1 S ffff810100000000 0 17 1 18 11 (L-TLB)
> ffff8101047bfea8 0000000000000046 ffff8101047bfe08 0000000000000292
> 0000000000000292 ffffffff80160589 ffff8101ae564240 000000008012cf35
> 0000000000000076 ffff8101ae564240
> Call Trace:<ffffffff80160589>{balance_pgdat+297} <ffffffff80160936>{kswapd+278}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
> <ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
> kswapd0 R running task 0 18 1 33 17 (L-TLB)
> -----
> kswapd1 S ffff810100000000 0 17 1 18 11 (L-TLB)
> ffff8101047bfea8 0000000000000046 ffff8101047bfe08 0000000000000292
> 0000000000000292 ffffffff80160589 ffff8101ae564240 000000008012cf35
> 0000000000000076 ffff8101ae564240
> Call Trace:<ffffffff80160589>{balance_pgdat+297} <ffffffff80160936>{kswapd+278}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
> <ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
> kswapd0 R running task 0 18 1 33 17 (L-TLB)
> -----

Out of curiosity, what is the size of the inode cache from /proc/slabinfo
at this moment?

Because even though the traces show kswapd trying to reclaim i-cache, the percentage
of i-cache is really small:

inode_cache 1164 1224 600 6 1 : tunables 54 27 8 : slabdata 204 204 0
dentry_cache 44291 48569 224 17 1 : tunables 120 60 8 : slabdata 2857 2857 51


Maybe you should also try profile/oprofile during the kswapd peaks?

Just my two cents.
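
A minimal session would look roughly like this (just a sketch, assuming
oprofile is set up and a vmlinux with symbols is lying around - the
/boot/vmlinux path is a guess):

# Rough oprofile sketch: profile the kernel while kswapd is spinning.
opcontrol --vmlinux=/boot/vmlinux-$(uname -r)
opcontrol --start
sleep 60                                      # while the peak is happening
opcontrol --stop
opreport -l /boot/vmlinux-$(uname -r) | head -30   # top kernel symbols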

>
> And here are the sysrq-p outputs:
>
> top - 09:17:11 up 10:09, 2 users, load average: 13.56, 7.14, 3.45
> Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.2% us, 20.4% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
> Mem: 8173472k total, 5769752k used, 2403720k free, 35436k buffers
> Swap: 14651256k total, 656k used, 14650600k free, 5388176k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18 root 15 0 0 0 0 R 91.9 0.0 118:24.00 kswapd0
> 17 root 15 0 0 0 0 R 49.8 0.0 35:17.26 kswapd1
>
> SysRq : Show Regs
> CPU 1:
> Modules linked in: binfmt_misc floppy
> Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
> RIP: 0010:[<ffffffff8019266b>] <ffffffff8019266b>{shrink_icache_memory+107}
> RSP: 0000:ffff8101047bfd78 EFLAGS: 00000246
> RAX: 00000000ffffffff RBX: 00000000000000c8 RCX: ffff8101046f6140
> RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffffffff804683e0
> RBP: 0000000000200200 R08: ffff8101047be000 R09: 0000000000000003
> R10: 0000000000000000 R11: ffffffff8028caf0 R12: ffffffff8012c533
> R13: ffff8101047bfd18 R14: 0000000000000003 R15: 0000000000000296
> FS: 00002aaaaac3ed20(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002aaaaacc1000 CR3: 00000001eeef9000 CR4: 00000000000006e0
>
> Call Trace:<ffffffff80192814>{shrink_icache_memory+532} <ffffffff8015f20b>{shrink_slab+219}
> <ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
> <ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
>
> top - 09:17:28 up 10:09, 2 users, load average: 17.42, 8.31, 3.89
> Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.2% us, 20.4% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
> Mem: 8173472k total, 5980464k used, 2193008k free, 36076k buffers
> Swap: 14651256k total, 656k used, 14650600k free, 5593984k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18 root 15 0 0 0 0 R 96.4 0.0 118:39.98 kswapd0
> 17 root 15 0 0 0 0 R 60.3 0.0 35:26.37 kswapd1
>
> SysRq : Show Regs
> CPU 1:
> Modules linked in: binfmt_misc floppy
> Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
> RIP: 0010:[<ffffffff803e8ab3>] <ffffffff803e8ab3>{_spin_lock+3}
> RSP: 0000:ffff8101047bfd40 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: ffffffff803e8951 RCX: ffff8101046f6140
> RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffffffff80468400
> RBP: 0000000000000296 R08: ffff8101047be000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8101047bfd88
> R13: ffff8101047bfcc8 R14: ffff8101046f6140 R15: ffffffff804683e8
> FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 000000000056f448 CR3: 000000003f0ee000 CR4: 00000000000006e0
>
> Call Trace:<ffffffff801925dc>{dispose_list+220} <ffffffff8019260b>{shrink_icache_memory+11}
> <ffffffff801927dc>{shrink_icache_memory+476} <ffffffff8015f20b>{shrink_slab+219}
> <ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
> <ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
>
> top - 09:17:46 up 10:10, 2 users, load average: 15.03, 8.36, 4.00
> Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.2% us, 20.5% sy, 3.0% ni, 65.9% id, 6.6% wa, 0.1% hi, 1.8% si
> Mem: 8173472k total, 6117468k used, 2056004k free, 36460k buffers
> Swap: 14651256k total, 656k used, 14650600k free, 5731640k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18 root 15 0 0 0 0 R 86.2 0.0 118:56.67 kswapd0
> 17 root 15 0 0 0 0 R 60.1 0.0 35:36.46 kswapd1
>
> SysRq : Show Regs
> CPU 1:
> Modules linked in: binfmt_misc floppy
> Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
> RIP: 0010:[<ffffffff803e760e>] <ffffffff803e760e>{thread_return+94}
> RSP: 0000:ffff8101047bfc08 EFLAGS: 00000246
> RAX: ffff810103841860 RBX: 000000000050fb40 RCX: 0000000000000018
> RDX: ffff810058446e40 RSI: ffff8101046f6140 RDI: 0000000000000000
> RBP: 0000000000000246 R08: ffff8101047be000 R09: 0000000000000004
> R10: 000000000050fb40 R11: 00000000ffffffff R12: ffffffff802645a8
> R13: 0000000000000296 R14: ffff8101de939700 R15: 0000000000000008
> FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002aaaaaaaf0b8 CR3: 000000003f0ee000 CR4: 00000000000006e0
>
> Call Trace:<ffffffff8012c4c0>{__wake_up_common+64} <ffffffff803e88f8>{__down+152}
> <ffffffff8012c470>{default_wake_function+0} <ffffffff803e86e8>{__down_failed+53}
> <ffffffff80156440>{mempool_free_slab+0} <ffffffff80192cf6>{.text.lock.inode+5}
> <ffffffff8015f20b>{shrink_slab+219} <ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff8010e7d6>{child_rip+8}
> <ffffffff8011ac50>{flat_send_IPI_mask+0} <ffffffff80160820>{kswapd+0}
> <ffffffff8010e7ce>{child_rip+0}
> top - 09:18:32 up 10:10, 2 users, load average: 10.25, 8.02, 4.09
> Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.2% us, 20.5% sy, 3.0% ni, 65.8% id, 6.6% wa, 0.1% hi, 1.8% si
> Mem: 8173472k total, 6259616k used, 1913856k free, 37192k buffers
> Swap: 14651256k total, 656k used, 14650600k free, 5860380k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18 root 15 0 0 0 0 R 96.1 0.0 119:38.63 kswapd0
> 17 root 15 0 0 0 0 R 56.0 0.0 36:03.12 kswapd1
>
> SysRq : Show Regs
> CPU 1:
> Modules linked in: binfmt_misc floppy
> Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
> RIP: 0010:[<ffffffff80192827>] <ffffffff80192827>{shrink_icache_memory+551}
> RSP: 0000:ffff8101047bfd78 EFLAGS: 00000256
> RAX: 0000000051eb851f RBX: 0000000000000000 RCX: 0000000000000018
> RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
> RBP: 0000000000200200 R08: ffff8101047be000 R09: 0000000000000000
> R10: 0000000000000000 R11: ffffffff80156440 R12: 0000000000100100
> R13: ffffffff8012c470 R14: ffff8101046f6140 R15: 0000000000000001
> FS: 00002aaaaac3ed20(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002aaaaae70000 CR3: 00000000f8625000 CR4: 00000000000006e0
>
> Call Trace:<ffffffff80192814>{shrink_icache_memory+532} <ffffffff8015f1fe>{shrink_slab+206}
> <ffffffff801606c9>{balance_pgdat+617} <ffffffff80160957>{kswapd+311}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff8010e7d6>{child_rip+8} <ffffffff8011ac50>{flat_send_IPI_mask+0}
> <ffffffff80160820>{kswapd+0} <ffffffff8010e7ce>{child_rip+0}
>
> top - 09:18:48 up 10:11, 2 users, load average: 9.96, 8.08, 4.17
> Tasks: 2 total, 2 running, 0 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.2% us, 20.6% sy, 3.0% ni, 65.8% id, 6.6% wa, 0.1% hi, 1.8% si
> Mem: 8173472k total, 6361680k used, 1811792k free, 37592k buffers
> Swap: 14651256k total, 656k used, 14650600k free, 5962320k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18 root 15 0 0 0 0 R 90.2 0.0 119:53.39 kswapd0
> 17 root 15 0 0 0 0 R 70.2 0.0 36:12.62 kswapd1
>
> SysRq : Show Regs
> CPU 1:
> Modules linked in: binfmt_misc floppy
> Pid: 17, comm: kswapd1 Not tainted 2.6.15-rc2 #2
> RIP: 0010:[<ffffffff80156e54>] <ffffffff80156e54>{__mod_page_state+36}
> RSP: 0000:ffff8101047bfdc8 EFLAGS: 00000292
> RAX: ffff810103844e00 RBX: ffffffff801925dc RCX: ffff8101046f6140
> RDX: ffffffff8066dca0 RSI: 0000000000000080 RDI: 00000000000000f8
> RBP: 0000000000000000 R08: ffff8101047be000 R09: 0000000000000000
> R10: 0000000000000000 R11: ffffffff80156440 R12: 0000000000000000
> R13: ffffffff80156440 R14: ffffffff803e86e8 R15: ffffffff804683e0
> FS: 00002aaaaaab79e0(0000) GS:ffffffff80624880(0000) knlGS:00000000555565a0
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000000004472f0 CR3: 000000003f0ee000 CR4: 00000000000006e0
>
> Call Trace:<ffffffff8015f22f>{shrink_slab+255} <ffffffff801606c9>{balance_pgdat+617}
> <ffffffff80160957>{kswapd+311} <ffffffff801472a0>{autoremove_wake_function+0}
> <ffffffff801472a0>{autoremove_wake_function+0} <ffffffff8010e7d6>{child_rip+8}
> <ffffffff8011ac50>{flat_send_IPI_mask+0} <ffffffff80160820>{kswapd+0}
> <ffffffff8010e7ce>{child_rip+0}
>
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
> | GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
> | http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> > Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> > for implementing software. --Linus Torvalds <

2005-11-27 20:39:46

by Jan Kasprzak

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

Marcelo Tosatti wrote:
: Out of curiosity, what is the size of the inode cache from /proc/slabinfo
: at this moment?
:
: Because even though the traces show kswapd trying to reclaim i-cache, the percentage
: of i-cache is really small:
:
: inode_cache 1164 1224 600 6 1 : tunables 54 27 8 : slabdata 204 204 0
: dentry_cache 44291 48569 224 17 1 : tunables 120 60 8 : slabdata 2857 2857 51
:
The size of the icache is similar to that shown above:

# egrep '^(inode|dentry)_cache' /proc/slabinfo

inode_cache 1189 1326 600 6 1 : tunables 54 27 8 : slabdata 221 221 0
dentry_cache 41845 45509 224 17 1 : tunables 120 60 8 : slabdata 2677 2677 0

inode_cache 1212 1326 600 6 1 : tunables 54 27 8 : slabdata 221 221 0
dentry_cache 42662 48892 224 17 1 : tunables 120 60 8 : slabdata 2876 2876 288

However, the traces I have sent are traces of kswapd1, which at that
time was eating around 50% of CPU, while kswapd0 was at >95%. I have not
managed to get the trace of kswapd0 yet.

I have tried to bind the serial IRQ to CPU0 to get the trace of
kswapd0 (echo 1 >/proc/irq/4/smp_affinity). After sysrq-p I get the dump
of registers on CPU0, but the strange thing is that I get the stack trace
of kacpid instead of kswapd0 (kacpid is not even visible in top(1) output,
and it has a total of 0 seconds of CPU time consumed since boot, while kswapd0
is first in top(1) eating >95% of CPU). Why kacpid? When I bind the serial IRQ
to CPU1, I get the trace of PID 0 (swapper).

The task that probably triggers this problem is a cron job
doing full-text indexing of a mailing list archive, so it accesses lots
of small files, and then recreates the inverted index, which is one big
file. So inode cache shrinking or something similar may be the problem there.
However, the cron job does an incremental reindexing only, so I think it
reads fewer than 100 files per run.
:
: Maybe you should also try profile/oprofile during the kswapd peaks?
:
Do you have any details on it? I can of course RTFdocs of oprofile,
but is there something specific I should try to catch?

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-27 21:25:10

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

On Sun, Nov 27, 2005 at 09:39:24PM +0100, Jan Kasprzak wrote:
> Marcelo Tosatti wrote:
> : Out of curiosity, what is the size of the inode cache from /proc/slabinfo
> : at this moment?
> :
> : Because even though the traces show kswapd trying to reclaim i-cache, the percentage
> : of i-cache is really small:
> :
> : inode_cache 1164 1224 600 6 1 : tunables 54 27 8 : slabdata 204 204 0
> : dentry_cache 44291 48569 224 17 1 : tunables 120 60 8 : slabdata 2857 2857 51
> :
> The size of the icache is similar to that shown above:
>
> # egrep '^(inode|dentry)_cache' /proc/slabinfo
>
> inode_cache 1189 1326 600 6 1 : tunables 54 27 8 : slabdata 221 221 0
> dentry_cache 41845 45509 224 17 1 : tunables 120 60 8 : slabdata 2677 2677 0
>
> inode_cache 1212 1326 600 6 1 : tunables 54 27 8 : slabdata 221 221 0
> dentry_cache 42662 48892 224 17 1 : tunables 120 60 8 : slabdata 2876 2876 288
>
> However, the traces I have sent are traces of kswapd1, which at that
> time was eating around 50% of CPU, while kswapd0 was at >95%. I have not
> managed to get the trace of kswapd0 yet.
>
> I have tried to bind the serial IRQ to CPU0 to get the trace of
> kswapd0 (echo 1 >/proc/irq/4/smp_affinity). After sysrq-p I get the dump
> of registers on CPU0, but the strange thing is that I get the stack trace
> of kacpid instead of kswapd0 (kacpid is not even visible in top(1) output,
> and it has a total of 0 seconds of CPU time consumed since boot, while kswapd0
> is first in top(1) eating >95% of CPU). Why kacpid? When I bind the serial IRQ
> to CPU1, I get the trace of PID 0 (swapper).

Have no idea, sorry.

> The task that probably triggers this problem is a cron job
> doing full-text indexing of a mailing list archive, so it accesses lots
> of small files, and then recreates the inverted index, which is one big
> file. So inode cache shrinking or something similar may be the problem there.
> However, the cron job does an incremental reindexing only, so I think it
> reads fewer than 100 files per run.
> :
> : Maybe you should also try profile/oprofile during the kswapd peaks?
> :
> Do you have any details on it? I can of course RTFdocs of oprofile,
> but is there something specific I should try to catch?

Try to isolate the profile trace to the period in which kswapd goes mad.

That is, reset the profiling buffer when you notice kswapd going crazy,
let it profile for a few minutes (no idea how long the load takes),
and stop before kswapd is back to idleness (preferably while it's still
burning the CPU).

I think you can try the old profiler first (CONFIG_PROFILING=y, boot with
profile=2, then read the results with readprofile(1)).
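
For example, a plausible sequence (the exact commands are just a
suggestion, assuming the kernel was booted with profile=2 as above;
per the readprofile(1) man page, -r resets the sample buffer and -m
points it at the System.map):

# readprofile -r
  ... wait for a kswapd peak and let it accumulate for a few minutes ...
# readprofile -m /boot/System.map | sort -nr | head -20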

If that does not give meaningful data then try oprofile.

2005-11-27 21:42:26

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

Hi Jan,

> The task that probably triggers this problem is a cron job
> doing full-text indexing of a mailing list archive, so it accesses lots
> of small files, and then recreates the inverted index, which is one big
> file. So inode cache shrinking or something similar may be the problem there.
> However, the cron job does an incremental reindexing only, so I think it
> reads fewer than 100 files per run.
> :
> : Maybe you should also try profile/oprofile during the kswapd peaks?
> :
> Do you have any details on it? I can of course RTFdocs of oprofile,
> but is there something specific I should try to catch?

It does seem to scan SLABs intensively:

pgscan_kswapd_high 0
pgscan_kswapd_normal 940269891
pgscan_kswapd_dma 0
pgscan_direct_high 0
pgscan_direct_normal 13837131
pgscan_direct_dma 0
pginodesteal 11216563
slabs_scanned 160160350534400
kswapd_steal 909876526
kswapd_inodesteal 305039060
pageoutrun 30139677
allocstall 4067783

If you take the amount of scanned slabs and divide by the sum of
direct/kswapd pagescans:

160160350534400 / (940269891+13837131) = 167864

Which means that for each page scanned, about 168000 slab entries are
scanned. Does not look very good.

Other than the profiling, can you please also try Bharata's
shrinkable slab cache statistics patch?

http://lkml.org/lkml/2005/10/26/1


2005-11-27 23:21:27

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

Marcelo Tosatti <[email protected]> wrote:
>
> It does seem to scan SLABs intensively:
>

It might be worth trying the below. Mainly for the debugging check.


From: Andrea Arcangeli <[email protected]>

Greg Edwards found a deadlock in the icache shrinker.

I believe the major bug is that the VM is currently potentially setting nr
= LONG_MAX before shrinking the icache (and the icache shrinker never
returns -1, which means the API doesn't currently require shrinkers to
return -1 when they're finished).

The below is the most obviously safe way I could address this problem
(still untested).

This is not necessarily the way we want to fix it in mainline, but it at
least shows what I believe to be the major culprit in the code (i.e. nr
growing insane ;).

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/vmscan.c | 18 +++++++++++++++---
1 files changed, 15 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~shrinker-nr-=-long_max-means-deadlock-for-icache mm/vmscan.c
--- devel/mm/vmscan.c~shrinker-nr-=-long_max-means-deadlock-for-icache 2005-11-19 02:53:18.000000000 -0800
+++ devel-akpm/mm/vmscan.c 2005-11-19 03:00:25.000000000 -0800
@@ -201,13 +201,25 @@ static int shrink_slab(unsigned long sca
         list_for_each_entry(shrinker, &shrinker_list, list) {
                 unsigned long long delta;
                 unsigned long total_scan;
+                unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);

                 delta = (4 * scanned) / shrinker->seeks;
-                delta *= (*shrinker->shrinker)(0, gfp_mask);
+                delta *= max_pass;
                 do_div(delta, lru_pages + 1);
                 shrinker->nr += delta;
-                if (shrinker->nr < 0)
-                        shrinker->nr = LONG_MAX; /* It wrapped! */
+                if (shrinker->nr < 0) {
+                        printk(KERN_ERR "%s: nr=%ld\n",
+                                __FUNCTION__, shrinker->nr);
+                        shrinker->nr = max_pass;
+                }
+
+                /*
+                 * Avoid risking looping forever due to too large nr value:
+                 * never try to free more than twice the estimated number of
+                 * freeable entries.
+                 */
+                if (shrinker->nr > max_pass * 2)
+                        shrinker->nr = max_pass * 2;

                 total_scan = shrinker->nr;
                 shrinker->nr = 0;
_
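
To see in isolation what the clamp prevents, here is a minimal userspace
sketch (illustrative only: the values are made up, and this is not the
kernel code itself):

#include <stdio.h>

int main(void)
{
        long nr = 0;            /* per-shrinker scan debt, as in shrinker->nr */
        long max_pass = 1338;   /* e.g. the current number of cached inodes */
        long delta = 261765504; /* a huge increment from one scan pass */

        nr += delta;
        if (nr < 0)             /* wrapped: the old code set LONG_MAX here */
                nr = max_pass;
        if (nr > max_pass * 2)  /* the new clamp from the patch above */
                nr = max_pass * 2;

        printf("will scan at most %ld entries\n", nr);  /* prints 2676 */
        return 0;
}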

2005-11-28 13:17:10

by Jan Kasprzak

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

Andrew Morton wrote:
: Marcelo Tosatti <[email protected]> wrote:
: >
: > It does seem to scan SLABs intensively:
: >
: It might be worth trying the below. Mainly for the debugging check.
:
I have compiled a new kernel - 2.6.15-rc2 with the patch you
recommended and with the slab statistics patch Marcelo mentioned.
I have added oprofile support, but apart from that it is the same
kernel. It seems that the kswapd system time peaks have disappeared,
or at least they are much lower - kswapd0 has eaten ~3 minutes of CPU time
in 11 hours of uptime (in one of my previous mails I found that it used
to be 117 minutes after ~10 hours of uptime). On my MRTG graphs
at http://www.linux.cz/stats/mrtg-rrd/cpu.html some _small_ peaks
can be seen at 15 minutes after every odd-numbered hour. I have booted
this kernel around 2am local time.

I have no unusual error messages in dmesg output, so it must
be this part of the patch that makes the difference:

: +                /*
: +                 * Avoid risking looping forever due to too large nr value:
: +                 * never try to free more than twice the estimated number of
: +                 * freeable entries.
: +                 */
: +                if (shrinker->nr > max_pass * 2)
: +                        shrinker->nr = max_pass * 2;

The shrinker statistics displayed in /proc/slabinfo are
# egrep '^(inode|dentry)_cache' /proc/slabinfo
inode_cache 1338 1380 600 6 1 : tunables 54 27 8 : slabdata 230 230 0 : shrinker stat 261765504 16831100
dentry_cache 40195 49130 224 17 1 : tunables 120 60 8 : slabdata 2890 2890 204 : shrinker stat 57946368 28877600

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-11-28 13:46:43

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU


On Mon, Nov 28, 2005 at 02:16:48PM +0100, Jan Kasprzak wrote:
> Andrew Morton wrote:
> : Marcelo Tosatti <[email protected]> wrote:
> : >
> : > It does seem to scan SLABs intensively:
> : >
> : It might be worth trying the below. Mainly for the debugging check.
> :
> I have compiled a new kernel - 2.6.15-rc2 with the patch you
> recommended and with the slab statistics patch Marcelo mentioned.
> I have added oprofile support, but apart from that it is the same
> kernel. It seems that the kswapd system time peaks have disappeared,
> or at least they are much lower - kswapd0 has eaten ~3 minutes of CPU time
> in 11 hours of uptime (in one of my previous mails I found that it used
> to be 117 minutes after ~10 hours of uptime). On my MRTG graphs
> at http://www.linux.cz/stats/mrtg-rrd/cpu.html some _small_ peaks
> can be seen at 15 minutes after every odd-numbered hour. I have booted
> this kernel around 2am local time.
>
> I have no unusual error messages in dmesg output, so it must
> be this part of the patch that makes the difference:
>
> : +                /*
> : +                 * Avoid risking looping forever due to too large nr value:
> : +                 * never try to free more than twice the estimated number of
> : +                 * freeable entries.
> : +                 */
> : +                if (shrinker->nr > max_pass * 2)
> : +                        shrinker->nr = max_pass * 2;

Yep, great.

>
> The shrinker statistics displayed in /proc/slabinfo are
> # egrep '^(inode|dentry)_cache' /proc/slabinfo
> inode_cache 1338 1380 600 6 1 : tunables 54 27 8 : slabdata 230 230 0 : shrinker stat 261765504 16831100
> dentry_cache 40195 49130 224 17 1 : tunables 120 60 8 : slabdata 2890 2890 204 : shrinker stat 57946368 28877600

Interesting, the success/attempt reclaim ratio for the dentry cache is
about 1/2 (pretty good), while the inode cache ratio is 1/15 (not so good).

I wonder why prune_icache() does not move inodes with positive i_count
to the inode_inuse list, letting iput() take care of moving them back to
the unused list once the count reaches zero.

        inode = list_entry(inode_unused.prev, struct inode, i_list);
        if (inode->i_state || atomic_read(&inode->i_count)) {
                list_move(&inode->i_list, &inode_unused);
                continue;
        }

Couldn't it be
        list_move(&inode->i_list, &inode_inuse);

?
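
For reference, the loop this snippet sits in looks roughly like this
(a sketch abridged from 2.6.14-era fs/inode.c; locking and the actual
disposal of freeable inodes are elided):

        for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
                struct inode *inode;

                if (list_empty(&inode_unused))
                        break;

                inode = list_entry(inode_unused.prev, struct inode, i_list);

                if (inode->i_state || atomic_read(&inode->i_count)) {
                        /*
                         * Busy inode: today it is rotated back to the head
                         * of inode_unused, so later passes keep rescanning
                         * it.  The change suggested above would move it to
                         * inode_inuse instead, taking it off the reclaim
                         * path until iput() drops the last reference.
                         */
                        list_move(&inode->i_list, &inode_unused);
                        continue;
                }
                /* ... otherwise move the inode to the freeable list ... */
        }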

2005-12-06 00:10:47

by Jan Kasprzak

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

Marcelo Tosatti wrote:
: I wonder why prune_icache() does not move inodes with positive i_count
: to the inode_inuse list, letting iput() take care of moving them back to
: the unused list once the count reaches zero.
:
:         inode = list_entry(inode_unused.prev, struct inode, i_list);
:         if (inode->i_state || atomic_read(&inode->i_count)) {
:                 list_move(&inode->i_list, &inode_unused);
:                 continue;
:         }
:
: Couldn't it be
:         list_move(&inode->i_list, &inode_inuse);
:
: ?

Hmm, this code is indeed strange. Why does it move the inode
to the inode_unused list, when the inode has in fact been _found_ while
scanning the inode_unused list? And how can an inode with positive
->i_count end up on the inode_unused list?

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
> Specs are a basis for _talking_about_ things. But they are _not_ a basis <
> for implementing software. --Linus Torvalds <

2005-12-06 18:14:36

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: 2.6.14 kswapd eating too much CPU

On Tue, Dec 06, 2005 at 01:10:06AM +0100, Jan Kasprzak wrote:
> Marcelo Tosatti wrote:
> : I wonder why prune_icache() does not move inodes with positive i_count
> : to the inode_inuse list, letting iput() take care of moving them back to
> : the unused list once the count reaches zero.
> :
> :         inode = list_entry(inode_unused.prev, struct inode, i_list);
> :         if (inode->i_state || atomic_read(&inode->i_count)) {
> :                 list_move(&inode->i_list, &inode_unused);
> :                 continue;
> :         }
> :
> : Couldn't it be
> :         list_move(&inode->i_list, &inode_inuse);
> :
> : ?
>
> Hmm, this code is indeed strange. Why does it move the inode
> to the inode_unused list, when the inode has in fact been _found_ while
> scanning the inode_unused list?

It just moves the inode to the head of the list, for later scanning.

> And how can an inode with positive ->i_count end up on the
> inode_unused list?

Such inodes only end up in the unused list during superblock
shutdown, so they should not actually be a problem (my bad).