2007-09-17 22:24:17

by Nix

Subject: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

Back in early 2006 I reported persistent hangs on the NFS server,
whereby all of a sudden about ten minutes after boot my primary NFS
server would cease responding to NFS requests until it was rebooted.
<http://www.ussg.iu.edu/hypermail/linux/kernel/0601.3/1631.html>

That time, the problem vanished when I switched to NFS-over-TCP:
<http://www.ussg.iu.edu/hypermail/linux/kernel/0601.3/1773.html>

Well, I just rebooted --- post-glibc-upgrade from 2.5 to 2.6.1, no
kernel upgrade or anything, so this bug has been latent during at
least my last three weeks of uptime. And it's back. (I've been
seeing strange long pauses doing things like ls, and I suspect
they are related: see below.)

/proc/sys/nfsd/exports on the freezing server:

# Version 1.1
# Path Client(Flags) # IPs
/usr/packages.bin/non-free hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=90a98d8a:a8be4806:aea3a4e1:fe3437a0)
/home/.loki.wkstn.nix esperi.srvr.nix(rw,root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/home/.loki.wkstn.nix hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/usr/lib/X11/fonts hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xplanet hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xemacs hades.wkstn.nix(rw,no_root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/packages hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=2a35f82a:cca144df:a1123587:23527f53)

I turned on ALL-class nfsd debugging and here's what I see as it freezes
up:

Sep 17 22:57:00 loki warning: kernel: nfsd_dispatch: vers 3 proc 1
Sep 17 22:57:00 loki warning: kernel: nfsd: GETATTR(3) 36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
Sep 17 22:57:00 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:44 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:44 loki warning: kernel: nfsd: ACCESS(3) 36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:44 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:45 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:57:52 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:52 loki warning: kernel: nfsd: ACCESS(3) 36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:52 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:52 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3) 36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still trying


From then on, *every* fh_verify() request fails the same way, and
obviously if you can't verify any file handles you can't do much with NFS.

Looking back in the log I see intermittent malloc failures starting
almost as soon as I've booted (allowing a couple of minutes for me to
turn debugging on):

Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
[...]
Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3) 28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc failure!

A while later we start seeing runs of malloc failures, which I think
correlated with the unexplained pauses in NFS response:

Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 20480 bytes at 4096
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 28672 bytes at 53248
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 86016
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 98304
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 110592
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 176128
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 241664
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 53248 bytes at 307200
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 397312
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 12288 bytes at 413696
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 42652 bytes at 430080
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:34:02 loki warning: kernel: found domain hades.wkstn.nix
Sep 17 22:34:02 loki warning: kernel: found fsidtype 7
Sep 17 22:34:02 loki warning: kernel: found fsid length 24
Sep 17 22:34:02 loki warning: kernel: Path seems to be </usr/lib/X11/fonts>
Sep 17 22:34:02 loki warning: kernel: Found the path /usr/lib/X11/fonts
Sep 17 22:34:02 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:34:02 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 42652 bytes at 430080
Sep 17 22:34:02 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:34:02 loki warning: kernel: nfsd_dispatch: vers 3 proc 6

Fifteen minutes after that the whole thing seized up with malloc
failures as the only response.

We've obviously got some sort of resource leak (although why it's
suddenly appeared I have no idea). There's buckets of memory, though.
The system's not swapping or under especially high memory pressure
(indeed as I've just rebooted it and hardly started anything but the
usual daemon swarm and LVM/MD stuff, it's under rather *low* pressure by
its standards). (Dumps of /proc/meminfo, /proc/zoneinfo and
/proc/slabinfo below.)


I note that this error can be caused by both ENOMEM and EAGAIN, and that
there are numerous points in fh_verify() whose failure can lead to this
error. I'm just instrumenting them now so I can tell which is failing.
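
(The instrumentation is nothing clever: just a dprintk at each failure
exit, along the lines of the sketch below, so the log names the call
that produced the error. Names are approximate, from my reading of
2.6.22's fs/nfsd/nfsfh.c, not quoted source.)

	exp = exp_find(rqstp->rq_client, fh->fh_fsid_type,
		       datap, &rqstp->rq_chandle);
	if (IS_ERR(exp)) {
		/* which of the many failure exits did we take, and why? */
		dprintk("nfsd: fh_verify: exp_find() error %ld\n",
			PTR_ERR(exp));
		error = nfserrno(PTR_ERR(exp));
		goto out;
	}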



MemTotal: 321172 kB
MemFree: 6152 kB
Buffers: 2540 kB
Cached: 106840 kB
SwapCached: 0 kB
Active: 277004 kB
Inactive: 4692 kB
SwapTotal: 1851368 kB
SwapFree: 1851188 kB
Dirty: 160 kB
Writeback: 0 kB
AnonPages: 172340 kB
Mapped: 72692 kB
Slab: 12764 kB
SReclaimable: 4852 kB
SUnreclaim: 7912 kB
PageTables: 2616 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 2011952 kB
Committed_AS: 492284 kB
VmallocTotal: 712684 kB
VmallocUsed: 2648 kB
VmallocChunk: 710008 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB

Node 0, zone      DMA
  pages free     337
        min      28
        low      35
        high     42
        scanned  0 (a: 23 i: 24)
        spanned  4096
        present  4064
    nr_free_pages 337
    nr_inactive  0
    nr_active    2811
    nr_anon_pages 1788
    nr_mapped    885
    nr_file_pages 1023
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 47
    nr_slab_unreclaimable 32
    nr_page_table_pages 11
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 421
        protection: (0, 301)
  pagesets
    cpu: 0 pcp: 0
              count: 0
              high:  0
              batch: 1
    cpu: 0 pcp: 1
              count: 0
              high:  0
              batch: 1
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     1227
        min      541
        low      676
        high     811
        scanned  0 (a: 0 i: 0)
        spanned  77804
        present  77197
    nr_free_pages 1227
    nr_inactive  5698
    nr_active    61811
    nr_anon_pages 40341
    nr_mapped    19329
    nr_file_pages 27174
    nr_dirty     13
    nr_writeback 0
    nr_slab_reclaimable 1049
    nr_slab_unreclaimable 2047
    nr_page_table_pages 698
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 4904
        protection: (0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 66
              high:  90
              batch: 15
    cpu: 0 pcp: 1
              count: 5
              high:  30
              batch: 7
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jbd_4k 0 0 4096 1 1 : tunables 24 12 0 : slabdata 0 0 0
jbd_1k 0 0 1024 4 1 : tunables 54 27 0 : slabdata 0 0 0
jbd_2k 0 0 2048 2 1 : tunables 24 12 0 : slabdata 0 0 0
raid5-md2 512 517 344 11 1 : tunables 54 27 0 : slabdata 47 47 0
raid5-md1 512 517 344 11 1 : tunables 54 27 0 : slabdata 47 47 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 0 : slabdata 4 4 0
rpc_tasks 8 24 160 24 1 : tunables 120 60 0 : slabdata 1 1 0
rpc_inode_cache 20 20 384 10 1 : tunables 54 27 0 : slabdata 2 2 0
bridge_fdb_cache 8 59 64 59 1 : tunables 120 60 0 : slabdata 1 1 0
UNIX 91 100 384 10 1 : tunables 54 27 0 : slabdata 10 10 0
dm-snapshot-in 128 134 56 67 1 : tunables 120 60 0 : slabdata 2 2 0
dm-snapshot-ex 0 0 16 203 1 : tunables 120 60 0 : slabdata 0 0 0
dm-crypt_io 0 0 36 101 1 : tunables 120 60 0 : slabdata 0 0 0
dm_tio 4366 4669 16 203 1 : tunables 120 60 0 : slabdata 23 23 0
dm_io 4366 4563 20 169 1 : tunables 120 60 0 : slabdata 27 27 0
uhci_urb_priv 4 127 28 127 1 : tunables 120 60 0 : slabdata 1 1 0
scsi_cmd_cache 39 39 288 13 1 : tunables 54 27 0 : slabdata 3 3 0
cfq_io_context 250 288 80 48 1 : tunables 120 60 0 : slabdata 6 6 0
cfq_queue 247 264 88 44 1 : tunables 120 60 0 : slabdata 6 6 0
mqueue_inode_cache 1 8 480 8 1 : tunables 54 27 0 : slabdata 1 1 0
fuse_request 0 0 360 11 1 : tunables 54 27 0 : slabdata 0 0 0
fuse_inode 0 0 352 11 1 : tunables 54 27 0 : slabdata 0 0 0
nfs_write_data 36 36 448 9 1 : tunables 54 27 0 : slabdata 4 4 0
nfs_read_data 32 36 416 9 1 : tunables 54 27 0 : slabdata 4 4 0
nfs_inode_cache 15 21 560 7 1 : tunables 54 27 0 : slabdata 3 3 0
nfs_page 0 0 64 59 1 : tunables 120 60 0 : slabdata 0 0 0
isofs_inode_cache 0 0 324 12 1 : tunables 54 27 0 : slabdata 0 0 0
hugetlbfs_inode_cache 1 13 296 13 1 : tunables 54 27 0 : slabdata 1 1 0
ext2_inode_cache 5 9 424 9 1 : tunables 54 27 0 : slabdata 1 1 0
ext2_xattr 0 0 44 84 1 : tunables 120 60 0 : slabdata 0 0 0
journal_handle 1 169 20 169 1 : tunables 120 60 0 : slabdata 1 1 0
journal_head 98 360 52 72 1 : tunables 120 60 0 : slabdata 5 5 0
revoke_table 20 254 12 254 1 : tunables 120 60 0 : slabdata 1 1 0
revoke_record 0 0 16 203 1 : tunables 120 60 0 : slabdata 0 0 0
ext3_inode_cache 2392 3429 440 9 1 : tunables 54 27 0 : slabdata 381 381 0
ext3_xattr 0 0 44 84 1 : tunables 120 60 0 : slabdata 0 0 0
reiser_inode_cache 172 180 384 10 1 : tunables 54 27 0 : slabdata 18 18 0
configfs_dir_cache 0 0 48 78 1 : tunables 120 60 0 : slabdata 0 0 0
dnotify_cache 1 169 20 169 1 : tunables 120 60 0 : slabdata 1 1 0
dquot 17 30 128 30 1 : tunables 120 60 0 : slabdata 1 1 0
inotify_event_cache 0 0 28 127 1 : tunables 120 60 0 : slabdata 0 0 0
inotify_watch_cache 4 92 40 92 1 : tunables 120 60 0 : slabdata 1 1 0
kioctx 1 24 160 24 1 : tunables 120 60 0 : slabdata 1 1 0
kiocb 0 0 160 24 1 : tunables 120 60 0 : slabdata 0 0 0
fasync_cache 3 203 16 203 1 : tunables 120 60 0 : slabdata 1 1 0
shmem_inode_cache 950 950 396 10 1 : tunables 54 27 0 : slabdata 95 95 0
posix_timers_cache 0 0 100 39 1 : tunables 120 60 0 : slabdata 0 0 0
uid_cache 15 59 64 59 1 : tunables 120 60 0 : slabdata 1 1 0
UDP-Lite 0 0 480 8 1 : tunables 54 27 0 : slabdata 0 0 0
tcp_bind_bucket 37 203 16 203 1 : tunables 120 60 0 : slabdata 1 1 0
inet_peer_cache 1 59 64 59 1 : tunables 120 60 0 : slabdata 1 1 0
ip_fib_alias 12 113 32 113 1 : tunables 120 60 0 : slabdata 1 1 0
ip_fib_hash 11 113 32 113 1 : tunables 120 60 0 : slabdata 1 1 0
ip_dst_cache 31 90 256 15 1 : tunables 120 60 0 : slabdata 6 6 0
arp_cache 5 30 128 30 1 : tunables 120 60 0 : slabdata 1 1 0
RAW 2 9 448 9 1 : tunables 54 27 0 : slabdata 1 1 0
UDP 29 32 480 8 1 : tunables 54 27 0 : slabdata 4 4 0
tw_sock_TCP 0 0 96 40 1 : tunables 120 60 0 : slabdata 0 0 0
request_sock_TCP 0 0 64 59 1 : tunables 120 60 0 : slabdata 0 0 0
TCP 45 49 1056 7 2 : tunables 24 12 0 : slabdata 7 7 0
eventpoll_pwq 2 101 36 101 1 : tunables 120 60 0 : slabdata 1 1 0
eventpoll_epi 2 40 96 40 1 : tunables 120 60 0 : slabdata 1 1 0
sgpool-128 2 2 2048 2 1 : tunables 24 12 0 : slabdata 1 1 0
sgpool-64 2 4 1024 4 1 : tunables 54 27 0 : slabdata 1 1 0
sgpool-32 2 8 512 8 1 : tunables 54 27 0 : slabdata 1 1 0
sgpool-16 2 15 256 15 1 : tunables 120 60 0 : slabdata 1 1 0
sgpool-8 30 30 128 30 1 : tunables 120 60 0 : slabdata 1 1 0
scsi_io_context 0 0 104 37 1 : tunables 120 60 0 : slabdata 0 0 0
blkdev_ioc 96 113 32 113 1 : tunables 120 60 0 : slabdata 1 1 0
blkdev_queue 26 27 892 9 2 : tunables 54 27 0 : slabdata 3 3 0
blkdev_requests 142 242 176 22 1 : tunables 120 60 0 : slabdata 11 11 0
biovec-256 274 274 3072 2 2 : tunables 24 12 0 : slabdata 137 137 0
biovec-128 274 275 1536 5 2 : tunables 24 12 0 : slabdata 55 55 0
biovec-64 274 275 768 5 1 : tunables 54 27 0 : slabdata 55 55 0
biovec-16 274 280 192 20 1 : tunables 120 60 0 : slabdata 14 14 0
biovec-4 274 295 64 59 1 : tunables 120 60 0 : slabdata 5 5 0
biovec-1 321 609 16 203 1 : tunables 120 60 0 : slabdata 3 3 0
bio 319 354 64 59 1 : tunables 120 60 0 : slabdata 6 6 0
sock_inode_cache 178 187 352 11 1 : tunables 54 27 0 : slabdata 17 17 0
skbuff_fclone_cache 9 12 320 12 1 : tunables 54 27 0 : slabdata 1 1 0
skbuff_head_cache 75 120 160 24 1 : tunables 120 60 0 : slabdata 5 5 0
file_lock_cache 4 42 92 42 1 : tunables 120 60 0 : slabdata 1 1 0
proc_inode_cache 168 168 312 12 1 : tunables 54 27 0 : slabdata 14 14 0
sigqueue 8 27 144 27 1 : tunables 120 60 0 : slabdata 1 1 0
radix_tree_node 1734 3250 288 13 1 : tunables 54 27 0 : slabdata 250 250 0
bdev_cache 44 45 416 9 1 : tunables 54 27 0 : slabdata 5 5 0
sysfs_dir_cache 3254 3276 48 78 1 : tunables 120 60 0 : slabdata 42 42 0
mnt_cache 67 90 128 30 1 : tunables 120 60 0 : slabdata 3 3 0
inode_cache 801 871 296 13 1 : tunables 54 27 0 : slabdata 67 67 0
dentry 4131 6300 128 30 1 : tunables 120 60 0 : slabdata 210 210 0
filp 1332 1392 160 24 1 : tunables 120 60 0 : slabdata 58 58 0
names_cache 4 4 4096 1 1 : tunables 24 12 0 : slabdata 4 4 0
idr_layer_cache 152 174 136 29 1 : tunables 120 60 0 : slabdata 6 6 0
buffer_head 4991 27144 52 72 1 : tunables 120 60 0 : slabdata 377 377 0
mm_struct 126 126 416 9 1 : tunables 54 27 0 : slabdata 14 14 0
vm_area_struct 12759 13570 84 46 1 : tunables 120 60 0 : slabdata 295 295 0
fs_cache 108 113 32 113 1 : tunables 120 60 0 : slabdata 1 1 0
files_cache 93 100 192 20 1 : tunables 120 60 0 : slabdata 5 5 0
signal_cache 135 135 416 9 1 : tunables 54 27 0 : slabdata 15 15 0
sighand_cache 129 129 1312 3 1 : tunables 24 12 0 : slabdata 43 43 0
task_struct 135 135 1280 3 1 : tunables 24 12 0 : slabdata 45 45 0
anon_vma 1179 1356 8 339 1 : tunables 120 60 0 : slabdata 4 4 0
pid 150 202 36 101 1 : tunables 120 60 0 : slabdata 2 2 0
size-4194304(DMA) 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
size-4194304 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
size-2097152(DMA) 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-2097152 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
size-1048576(DMA) 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-1048576 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
size-524288(DMA) 0 0 524288 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-524288 0 0 524288 1 128 : tunables 1 1 0 : slabdata 0 0 0
size-262144(DMA) 0 0 262144 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-262144 0 0 262144 1 64 : tunables 1 1 0 : slabdata 0 0 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 1 1 65536 1 16 : tunables 8 4 0 : slabdata 1 1 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-32768 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 5 5 8192 1 2 : tunables 8 4 0 : slabdata 5 5 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 0 : slabdata 0 0 0
size-4096 96 96 4096 1 1 : tunables 24 12 0 : slabdata 96 96 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 0 : slabdata 0 0 0
size-2048 126 134 2048 2 1 : tunables 24 12 0 : slabdata 67 67 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 0 : slabdata 0 0 0
size-1024 128 128 1024 4 1 : tunables 54 27 0 : slabdata 32 32 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27 0 : slabdata 0 0 0
size-512 416 416 512 8 1 : tunables 54 27 0 : slabdata 52 52 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 0 : slabdata 0 0 0
size-256 446 465 256 15 1 : tunables 120 60 0 : slabdata 31 31 0
size-192(DMA) 0 0 192 20 1 : tunables 120 60 0 : slabdata 0 0 0
size-192 133 140 192 20 1 : tunables 120 60 0 : slabdata 7 7 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 0 : slabdata 0 0 0
size-128 707 1140 128 30 1 : tunables 120 60 0 : slabdata 38 38 0
size-96(DMA) 0 0 96 40 1 : tunables 120 60 0 : slabdata 0 0 0
size-96 517 800 96 40 1 : tunables 120 60 0 : slabdata 20 20 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 0 : slabdata 0 0 0
size-32(DMA) 0 0 32 113 1 : tunables 120 60 0 : slabdata 0 0 0
size-64 3140 3776 64 59 1 : tunables 120 60 0 : slabdata 64 64 0
size-32 2339 2825 32 113 1 : tunables 120 60 0 : slabdata 25 25 0
kmem_cache 143 160 96 40 1 : tunables 120 60 0 : slabdata 4 4 0


2007-09-17 22:36:17

by J. Bruce Fields

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
> Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
> Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3) 36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
> Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
> Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc failure!
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still trying
>
>
> From then on, *every* fh_verify() request fails the same way, and
> obviously if you can't verify any file handles you can't do much with NFS.
>
> Looking back in the log I see intermittent malloc failures starting
> almost as soon as I've booted (allowing a couple of minutes for me to
> turn debugging on):
>
> Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
> [...]
> Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
> Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3) 28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
> Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
> Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc failure!
>
> A while later we start seeing runs of malloc failures, which I think
> correlated with the unexplained pauses in NFS response:

Actually, they're nothing to do with malloc failures--the message
printed here is misleading, and isn't even an error; it gets printed
whenever an upcall to mountd is made. The problem is almost certainly
in kernel<->mountd communication--the kernel depends on mountd to
answer questions about exported filesystems as part of the fh_verify
code.
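
(For what it's worth, the message comes from nfsd_dispatch(), which
prints it for *any* dropped request. Roughly the following, quoting
2.6.22 from memory, so treat names and details as approximate:)

	/* fs/nfsd/nfssvc.c:nfsd_dispatch(), approximately: any
	 * nfserr_dropit returned by the procedure -- including "try
	 * again later, an upcall to mountd is pending" -- takes this
	 * branch, malloc or no malloc. */
	nfserr = proc->pc_func(rqstp, rqstp->rq_argp, rqstp->rq_resp);
	if (nfserr == nfserr_dropit) {
		dprintk("nfsd: Dropping request due to malloc failure!\n");
		nfsd_cache_update(rqstp, RC_NOCACHE, NULL);
		return 0;
	}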

It's just a shot in the dark, but you might try the latest nfs-utils
(pull from git://linux-nfs.org/nfs-utils if you're already on the most
recent version your distro will give you). Or just apply the
following--which did fix a problem whose symptoms varied depending on
libc behavior.

If that doesn't work, I'd try

strace -s0 `pidof rpc.mountd`

and also look at the contents of /proc/net/rpc/nfsd.fh/content.

--b.

commit dd087896285da9e160e13ee9f7d75381b67895e3
Author: J. Bruce Fields <[email protected]>
Date: Thu Jul 26 16:30:46 2007 -0400

Use __fpurge to ensure single-line writes to cache files

On a recent Debian/Sid machine, I saw libc retrying stdio writes that
returned write errors. The result is that if an export downcall returns
an error (which it can in normal operation, since it currently
(incorrectly) returns -ENOENT on any negative downcall), then subsequent
downcalls will write multiple lines (including the original line that
received the error).

The result is that the server fails to respond to any rpc call that
refers to an unexported mount point (such as a readdir of a directory
containing such a mountpoint), so client commands hang.

I don't know whether this libc behavior is correct or expected, but it
seems safest to add the __fpurge() (suggested by Neil) to ensure data is
thrown away.

Signed-off-by: "J. Bruce Fields" <[email protected]>
Signed-off-by: Neil Brown <[email protected]>

diff --git a/support/nfs/cacheio.c b/support/nfs/cacheio.c
index a76915b..9d271cd 100644
--- a/support/nfs/cacheio.c
+++ b/support/nfs/cacheio.c
@@ -17,6 +17,7 @@
 
 #include <nfslib.h>
 #include <stdio.h>
+#include <stdio_ext.h>
 #include <ctype.h>
 #include <unistd.h>
 #include <sys/types.h>
@@ -111,7 +112,18 @@ void qword_printint(FILE *f, int num)
 
 int qword_eol(FILE *f)
 {
+	int err;
+
 	fprintf(f,"\n");
+	err = fflush(f);
+	/*
+	 * We must send one line (and one line only) in a single write
+	 * call. In case of a write error, libc may accumulate the
+	 * unwritten data and try to write it again later, resulting in a
+	 * multi-line write. So we must explicitly ask it to throw away
+	 * any such cached data:
+	 */
+	__fpurge(f);
 	return fflush(f);
 }

2007-09-17 23:54:25

by Nix

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On 17 Sep 2007, J. Bruce Fields stated:
> On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
>> A while later we start seeing runs of malloc failures, which I think
>> correlated with the unexplained pauses in NFS response:
>
> Actually, they're nothing to do with malloc failures--the message
> printed here is misleading, and isn't even an error; it gets printed
> whenever an upcall to mountd is made.

Indeed, with more debugging, all the failures I see come from the call
to exp_find(), which is digging out exports...

> The problem is almost certainly a
> problem with kernel<->mountd communication--the kernel depends on mountd
> to answer questions about exported filesystems as part of the fh_verify
> code.

Ah! I keep forgetting that mountd isn't just used at mount time: damned
misleading names, grumble.

Restarting mountd clears the problem up temporarily, so you are
definitely right.

> commit dd087896285da9e160e13ee9f7d75381b67895e3
> Author: J. Bruce Fields <[email protected]>
> Date: Thu Jul 26 16:30:46 2007 -0400

Aha! I'm on 3b55934b9baefecee17aefc3ea139e261a4b03b8, over a month older.

> On a recent Debian/Sid machine, I saw libc retrying stdio writes that
> returned write errors.

Debian Sid recently upgraded to glibc 2.6.x, as did I... earlier
versions of glibc will have had this behaviour too, but it may have been
less frequent.

> I don't know whether this libc behavior is correct or expected, but it
> seems safest to add the __fpurge() (suggested by Neil) to ensure data is
> thrown away.

It is expected, judging from my reading of the
code. stdio-common/vfprintf.c emits single chars using the outchar()
macro, and strings using the outstring() macro, using functions in libio
to do the job. The string output routine then calls _IO_file_xsputn(),
which, tracing through libio's jump tables and symbol aliases, ends up
calling _IO_new_file_xsputn() in libio/fileops.c. (I've only just
started to understand libio. It's basically undocumented as far as I can
tell, but it's deeply nifty. Think of stdio, only made entirely out of
hookable components. :) )

(Actual writing then thunks down through _IO_new_do_write() and
new_do_write() in the same file, which finally calls __write(). If
there's any kind of error this returns EOF after some opaque messing
about with a _cur_column value, which is as far as I can tell never
used!)

The code which calls new_do_write() looks like this:

,----[ libio/fileops.c:_IO_new_file_xsputn() ]
|   if (do_write)
|     {
|       count = new_do_write (f, s, do_write);
|       to_do -= count;
|       if (count < do_write)
|         return n - to_do;
|     }
`----

This code handles partial writes followed by errors by returning a
suitable nonzero value, and immediate errors by returning -1.

In either case the buffer will have been filled as much as possible by
that point, and will still be filled when (vf)printf() is next called.


This behaviour is, IIRC, mandated by the C Standard: I can find no
reference in the Standard to streams being flushed on error, only
on fclose(), fflush(), or program termination.
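
(If you want to watch it happen, here's a quick standalone demo I
knocked up, using glibc's __fpending()/__fpurge() from <stdio_ext.h>.
This is my own sketch, not nfs-utils code: on a libc that keeps
unwritten data around for retry, the pending count stays nonzero after
the failed fflush(); __fpurge() empties the buffer either way.)

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdio_ext.h>		/* __fpending(), __fpurge(): glibc-specific */
#include <string.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];

	signal(SIGPIPE, SIG_IGN);	/* we want EPIPE, not a fatal signal */
	if (pipe(pipefd) < 0)
		return 1;
	close(pipefd[0]);		/* no reader, so flushes will fail */

	FILE *f = fdopen(pipefd[1], "w");
	if (!f)
		return 1;
	fprintf(f, "a line that will never be written\n");
	printf("pending before flush: %zu\n", __fpending(f));

	if (fflush(f) == EOF)
		printf("flush failed (%s); pending now: %zu\n",
		       strerror(errno), __fpending(f));

	__fpurge(f);			/* what the nfs-utils fix adds */
	printf("pending after __fpurge: %zu\n", __fpending(f));
	return 0;
}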


I'm upgrading now: thank you!

--
`Some people don't think performance issues are "real bugs", and I think
such people shouldn't be allowed to program.' --- Linus Torvalds

2007-09-18 01:13:21

by J. Bruce Fields

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On Tue, Sep 18, 2007 at 12:54:07AM +0100, Nix wrote:
> The code which calls new_do_write() looks like this:
>
> ,----[ libio/fileops.c:_IO_new_file_xsputn() ]
> |   if (do_write)
> |     {
> |       count = new_do_write (f, s, do_write);
> |       to_do -= count;
> |       if (count < do_write)
> |         return n - to_do;
> |     }
> `----
>
> This code handles partial writes followed by errors by returning a
> suitable nonzero value, and immediate errors by returning -1.
>
> In either case the buffer will have been filled as much as possible by
> that point, and will still be filled when (vf)printf() is next called.

OK, I'm a little lost at this point (what's n? What's to_do?), but I'll
take your word for it.

I'd be kinda curious when exactly the behavior changed and why.

Also I suppose we should check which version of nfs-utils that fix is in
and make sure distributions are getting the fixed nfs-utils before they
get the new libc, or we're going to see this bug a lot....

> This behaviour is, IIRC, mandated by the C Standard: I can find no
> reference in the Standard to streams being flushed on error, only
> on fclose(), fflush(), or program termination.

OK!

Let me know if the problem's fixed.

--b.

2007-09-18 06:19:05

by Nix

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On 18 Sep 2007, J. Bruce Fields stated:

> On Tue, Sep 18, 2007 at 12:54:07AM +0100, Nix wrote:
>> The code which calls new_do_write() looks like this:
>>
>> ,----[ libio/fileops.c:_IO_new_file_xsputn() ]
>> |   if (do_write)
>> |     {
>> |       count = new_do_write (f, s, do_write);
>> |       to_do -= count;
>> |       if (count < do_write)
>> |         return n - to_do;
>> |     }
>> `----
>>
>> This code handles partial writes followed by errors by returning a
>> suitable nonzero value, and immediate errors by returning -1.
>>
>> In either case the buffer will have been filled as much as possible by
>> that point, and will still be filled when (vf)printf() is next called.
>
> OK, I'm a little lost at this point (what's n? What's to_do?), but I'll
> take your word for it.

n is the total amount to write; to_do is the amount still unwritten.
The rest has been written, or at least buffered.

> I'd be kinda curious when exactly the behavior changed and why.

Same here. I'm sort of surprised that it *did* change. The last change
to anything in that function was in 2005, and the fragment shown is
ten years old.

I suspect some other change made it easier to see this pre-existing
behaviour. I wonder if something in glibc used to call __fpurge() for
you?

> Also I suppose we should check which version of nfs-utils that fix is in
> and make sure distributions are getting the fixed nfs-utils before they
> get the new libc, or we're going to see this bug a lot....

And since it looks like a kernel bug, idiots like me are going to keep
on bugging the l-k list with what is actually a non-kernel bug.

> Let me know if the problem's fixed.

Well, the machine's still running after more than six hours, where before
it would freeze solid in half an hour or less.

So I'd say it's fixed :)

--
`Some people don't think performance issues are "real bugs", and I think
such people shouldn't be allowed to program.' --- Linus Torvalds

2007-09-21 18:46:47

by Nix

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On 18 Sep 2007, J. Bruce Fields told this:
> Also I suppose we should check which version of nfs-utils that fix is in
> and make sure distributions are getting the fixed nfs-utils before they
> get the new libc, or we're going to see this bug a lot....

Further info. This behaviour, although it is allowed by POSIX, is not
*specified*, as such, by it (ah! ambiguity! life would be so dull
without you!), and is not replicated by any other C library on the face
of the planet.

Further, it's not even the behaviour of glibc, as such.

It's the behaviour of glibc 2.6.x as patched with this Debian-specific
patch:

,----[ any/submitted-fileops-and-signals.diff ]
| Index: glibc-2.6/libio/fileops.c
| ===================================================================
| --- glibc-2.6.orig/libio/fileops.c
| +++ glibc-2.6/libio/fileops.c
| @@ -517,10 +517,19 @@
|    fp->_cur_column = INTUSE(_IO_adjust_column) (fp->_cur_column - 1, data,
|                                                 count) + 1;
|    _IO_setg (fp, fp->_IO_buf_base, fp->_IO_buf_base, fp->_IO_buf_base);
| -  fp->_IO_write_base = fp->_IO_write_ptr = fp->_IO_buf_base;
| -  fp->_IO_write_end = (fp->_mode <= 0
| -                       && (fp->_flags & (_IO_LINE_BUF+_IO_UNBUFFERED))
| -                       ? fp->_IO_buf_base : fp->_IO_buf_end);
| +  if (__builtin_expect (count == to_do, 1))
| +    {
| +      fp->_IO_write_base = fp->_IO_write_ptr = fp->_IO_buf_base;
| +      fp->_IO_write_end = (fp->_mode <= 0
| +                           && (fp->_flags & (_IO_LINE_BUF+_IO_UNBUFFERED))
| +                           ? fp->_IO_buf_base : fp->_IO_buf_end);
| +    }
| +  else if (count > 0 && fp->_IO_write_base == data)
| +    {
| +      memmove (data, data+count,
| +               fp->_IO_write_ptr - fp->_IO_write_base - count);
| +      fp->_IO_write_ptr -= count;
| +    }
|    return count;
|  }
| 
`----
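
(Reading the patch: on a short write it now memmove()s the unwritten
tail down to the start of the buffer and adjusts _IO_write_ptr, rather
than unconditionally resetting the write pointers; so data that failed
to hit the kernel survives in the buffer and is retried in front of
whatever gets written next. That is exactly the multi-line downcall
behaviour Bruce's commit message describes.)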

This patch also breaks at least bash and zsh as well as nfs-utils (see
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=429021>) and I sort of
suspect it's not long for this world as a result.

So this is much less nasty than we feared (well, and more nasty if you
happen to be a Debian unstable or testing user).

--
`Some people don't think performance issues are "real bugs", and I think
such people shouldn't be allowed to program.' --- Linus Torvalds

2007-09-21 21:13:52

by J. Bruce Fields

Subject: Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

On Fri, Sep 21, 2007 at 07:46:32PM +0100, Nix wrote:
> On 18 Sep 2007, J. Bruce Fields told this:
> > Also I suppose we should check which version of nfs-utils that fix is in
> > and make sure distributions are getting the fixed nfs-utils before they
> > get the new libc, or we're going to see this bug a lot....
>
> Further info. This behaviour, although it is allowed by POSIX, is not
> *specified*, as such, by it (ah! ambiguity! life would be so dull
> without you!), and is not replicated by any other C library on the face
> of the planet.
>
> Further, it's not even the behaviour of glibc, as such.
>
> It's the behaviour of glibc 2.6.x as patched with this Debian-specific
> patch:

Good grief. Well, I'd still count it as a bug in nfs-utils, then, but
nfs-utils's share of the blame is much less than we'd previously
thought....

> This patch also breaks at least bash and zsh as well as nfs-utils (see
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=429021>) and I sort of
> suspect it's not long for this world as a result.
>
> So this is much less nasty than we feared (well, and more nasty if you
> happen to be a Debian unstable or testing user).

OK! Sounds like good news.

--b.