2005-01-20 03:24:10

by Andreas Gruenbacher

Subject: [ea-in-inode 0/5] Further fixes

Hello,

here is a set of fixes for ext3 in-inode attributes:

patches/ea-xattr-nolock.diff
No lock needed when freeing inode

The effect of taking the additional lock is very minor, but
it's still unnecessary.

patches/ea-xattr-update-sb.diff
Set the EXT3_FEATURE_COMPAT_EXT_ATTR for in-inode xattrs

The EXT_ATTR filesystem feature flag was set for xattr blocks, but not
for in-inode attributes (a rough sketch of this flag update follows the
patch list below).

patches/ea-xattr-doc.diff
Documentation fix

patches/ea-xattr-no-extra_isize.diff
Fix i_extra_isize check

Filesystem corruption fix.

patches/ea-xattr-reserved-inodes.diff
Disallow in-inode attributes for reserved inodes

Filesystem corruption fix.
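
For context, a sketch of what the flag update on first xattr use can look
like is below. This is reconstructed from memory of the 2.6-era
fs/ext3/xattr.c (it assumes the usual ext3 headers, ext3_fs.h and
ext3_jbd.h), so names and locking details may not match the actual patch:

/*
 * Sketch only: set the EXT_ATTR compat flag the first time an extended
 * attribute is stored, journaling the superblock buffer in the process.
 */
static void ext3_xattr_update_super_block(handle_t *handle,
					  struct super_block *sb)
{
	if (EXT3_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_EXT_ATTR))
		return;		/* flag already set, nothing to do */

	lock_super(sb);
	if (ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh) == 0) {
		EXT3_SET_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_EXT_ATTR);
		sb->s_dirt = 1;
		ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
	}
	unlock_super(sb);
}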


Regards,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH


2005-01-21 22:59:27

by Stephen C. Tweedie

Subject: Re: [ea-in-inode 0/5] Further fixes

Hi Andreas,

On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:

> here is a set of fixes for ext3 in-inode attributes:

Obvious first question --- have these diffs survived the same
torture-by-tridgell that the previous batch suffered?

Cheers,
Stephen

2005-01-21 23:51:43

by Andreas Gruenbacher

Subject: Re: [ea-in-inode 0/5] Further fixes

On Fri, 2005-01-21 at 23:58, Stephen C. Tweedie wrote:
> Hi Andreas,
>
> On Thu, 2005-01-20 at 02:01, Andreas Gruenbacher wrote:
>
> > here is a set of fixes for ext3 in-inode attributes:
>
> Obvious first question --- have these diffs survived the same
> torture-by-tridgell that the previous batch suffered?

No. The fixes are a lot less intrusive than the full xattr rework
though. I obviously ran tests; this included dbench.

Tridge, can you beat the code some more?

Andrew has the five fixes in 2.6.11-rc1-mm2.

-- Andreas.

2005-01-23 13:25:31

by Andrew Tridgell

Subject: Re: [ea-in-inode 0/5] Further fixes

Andreas,

> Tridge, can you beat the code some more?
> Andrew has the five fixes in 2.6.11-rc1-mm2.

Sorry for the delay. I've started testing 2.6.11-rc1-mm2 tonight. No
problems so far.

Cheers, Tridge

2005-01-23 22:12:40

by Andrew Tridgell

Subject: Re: [ea-in-inode 0/5] Further fixes

Andreas,

> Tridge, can you beat the code some more?
>
> Andrew has the five fixes in 2.6.11-rc1-mm2.

It seemed to pass dbench runs OK, but then I started simultaneously
running dbench and nbench on two different disks (I have a new test
machine with more disks available). I am getting failures like this:

Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1
Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1.


Jan 23 13:19:43 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdd1
Jan 23 13:19:43 dev4-003 kernel: Aborting journal on device sdd1.
Jan 23 13:19:43 dev4-003 kernel: EXT3-fs error (device sdd1) in start_transaction: Readonly filesystem


The first failure was on the disk I am using for nbench (sdc). The
second was during a later run on the disk I am using for dbench
(sdd). I rebooted between the runs. It's interesting that it's failing
at exactly the same offset both times. Is there anything magic about
offset 1036?

The new test machine is a 4 way PIII, with 4G ram, and 4 36G SCSI
disks. The test machine I have been using previously was a 2 way
(+hyperthreaded) Xeon.

I'll see if I can reproduce the problem with just dbench or just
nbench.

Cheers, Tridge

2005-01-23 22:58:26

by Andreas Gruenbacher

Subject: Re: [ea-in-inode 0/5] Further fixes

Hello,

On Sunday 23 January 2005 23:09, Andrew Tridgell wrote:
> Andreas,
>
> > Tridge, can you beat the code some more?
> >
> > Andrew has the five fixes in 2.6.11-rc1-mm2.
>
> It seemed to pass dbench runs OK, but then I started simultaneously
> running dbench and nbench on two different disks (I have a new test
> machine with more disks available). I am getting failures like this:
>
> Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1
> Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1.

Are you using data journaling on that filesystem? Does this test pass with the
patches backed out? With an external journal?

Thanks,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

2005-01-23 23:42:57

by Andreas Gruenbacher

Subject: Re: [ea-in-inode 0/5] Further fixes

On Sunday 23 January 2005 23:58, Andreas Gruenbacher wrote:
> > Jan 23 06:54:38 dev4-003 kernel: journal_bmap: journal block not found at offset 1036 on sdc1
> > Jan 23 06:54:38 dev4-003 kernel: Aborting journal on device sdc1.
>
> Are you using data journaling on that filesystem? Does this test pass with
> the patches backed out? With an external journal?

There are 12 direct blocks plus 1024 blocks reachable through the indirect
block on a filesystem with 4k blocksize, so block 1036 should be the first
double-indirect block. It may be that something is messing up the
double-indirect link or one of its fields. Interesting. Could you maybe try
this patch as well?

Index: linux-2.6.11-rc1-mm2/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-rc1-mm2.orig/fs/ext3/inode.c
+++ linux-2.6.11-rc1-mm2/fs/ext3/inode.c
@@ -2653,7 +2653,7 @@ static int ext3_do_update_inode(handle_t
} else for (block = 0; block < EXT3_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];

- if (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE)
+ if (ei->i_extra_isize)
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);

BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
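
As a quick sanity check of the arithmetic above, here is a standalone
illustration (plain userspace C, not kernel code) of where offset 1036
comes from:

#include <stdio.h>

int main(void)
{
	unsigned int block_size = 4096;               /* 4k filesystem blocks */
	unsigned int ptrs_per_block = block_size / 4; /* 32-bit block pointers */
	unsigned int direct = 12;                     /* EXT3_NDIR_BLOCKS */

	/* Logical blocks 0..11 are direct and 12..1035 go through the
	 * single indirect block, so the first double-indirect block is: */
	printf("first double-indirect logical block = %u\n",
	       direct + ptrs_per_block);              /* prints 1036 */
	return 0;
}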

Thanks,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

2005-01-24 11:28:42

by Andrew Tridgell

Subject: Re: [ea-in-inode 0/5] Further fixes

Andreas,

I'm starting to think the bug I saw is a hardware error. I just got this
while trying to reproduce it tonight:

Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3
channel 0 id 1 lun 0

I'll see if I can get this confirmed tomorrow.

Cheers, Tridge

2005-01-24 11:42:40

by Christoph Hellwig

Subject: Re: [ea-in-inode 0/5] Further fixes

On Mon, Jan 24, 2005 at 10:24:55PM +1100, Andrew Tridgell wrote:
> Andreas,
>
> I'm starting to think the bug I saw is a hardware error. I just got this
> while trying to reproduce it tonight:
>
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : abort failed
> Jan 24 02:43:32 dev4-003 kernel: qlogicfc0 : firmware status is 4000 4
> Jan 24 02:43:32 dev4-003 kernel: scsi: Device offlined - not ready after error recovery: host 3
> channel 0 id 1 lun 0
>
> I'll see if I can get this confirmed tomorrow.

Don't ever use the qlogicfc driver; use qla2xxx instead. It's only still
in the kernel tree because davem's a lazy bastard :)

2005-01-24 14:11:24

by Andreas Gruenbacher

Subject: Re: [ea-in-inode 0/5] Further fixes

On Mon, 2005-01-24 at 12:24, Andrew Tridgell wrote:
> Andreas,
>
> I'm starting to think the bug I saw is a hardware error. [...]

The patch in my previous message (Mon, 24 Jan 2005 00:32:16 +0100) still
makes sense, if only for cleanliness: without it, the i_extra_isize
field of reserved inodes is zeroed even though we're not allowing
in-inode attributes there. It's probably better to leave the garbage in
these inodes untouched; with this change we could, in theory, also
disable in-inode attributes for arbitrary inodes.
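
To illustrate why a bogus i_extra_isize is dangerous, here is a small
standalone sketch (plain userspace C, not kernel code; the 128-byte
constant is EXT3_GOOD_OLD_INODE_SIZE, everything else is assumed for the
example) of how the in-inode xattr space depends on that field:

#include <stdio.h>

/* In-inode xattrs live between the end of the "extra" inode fields and
 * the end of the on-disk inode. A garbage i_extra_isize shifts or
 * destroys that area, which is why reserved inodes with uninitialized
 * fields must not be trusted here. */
static unsigned int in_inode_xattr_space(unsigned int inode_size,
					 unsigned int i_extra_isize)
{
	const unsigned int good_old = 128;	/* EXT3_GOOD_OLD_INODE_SIZE */

	if (inode_size <= good_old)		/* old 128-byte inodes: no room */
		return 0;
	if (i_extra_isize > inode_size - good_old)
		return 0;			/* garbage value, nothing usable */
	return inode_size - good_old - i_extra_isize;
}

int main(void)
{
	printf("256-byte inode, i_extra_isize=4: %u bytes after the extra fields\n",
	       in_inode_xattr_space(256, 4));
	return 0;
}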

Cheers,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX GMBH

2005-01-25 03:22:41

by Andrew Tridgell

Subject: memory leak in 2.6.11-rc2

I've fixed up the problems I had with raid, and am now testing the
recent xattr changes with dbench and nbench.

The problem I've hit now is a severe memory leak. I have applied the
patch from Linus for the leak in free_pipe_info(), and still I'm
leaking memory at the rate of about 100Mbyte/minute.

I've tested both 2.6.11-rc2 and 2.6.11-rc1-mm2, both with
the pipe leak fix. The setup is:

- 4 way PIII with 4G ram
- qla2200 adapter with ibm fastt200 disk array
- running dbench -x and nbench on separate disks, in a loop

The oom killer kicks in after about 30 minutes. Naturally the oom
killer decided to kill my sshd, which was running vmstat :-)

Killing the dbench and nbench processes does not recover the memory.
Here is what it looks like after I kill all processes except an sshd,
after about 15 minutes of running:

total used free shared buffers cached
Mem: 3734304 2471608 1262696 0 26820 921440
-/+ buffers/cache: 1523348 2210956
Swap: 4194272 0 4194272

so we've lost nearly 1.5G in that time.

here is /proc/meminfo:

dev4-003:~# cat /proc/meminfo
MemTotal: 3734304 kB
MemFree: 1266944 kB
Buffers: 27000 kB
Cached: 921464 kB
SwapCached: 0 kB
Active: 292524 kB
Inactive: 2108928 kB
HighTotal: 2850752 kB
HighFree: 472768 kB
LowTotal: 883552 kB
LowFree: 794176 kB
SwapTotal: 4194272 kB
SwapFree: 4194272 kB
Dirty: 4 kB
Writeback: 0 kB
Mapped: 6872 kB
Slab: 54236 kB
CommitLimit: 6061424 kB
Committed_AS: 41660 kB
PageTables: 344 kB
VmallocTotal: 114680 kB
VmallocUsed: 2204 kB
VmallocChunk: 112472 kB

and here is /proc/slabinfo

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ip_fib_alias 9 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 9 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0
fib6_nodes 7 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0
ip6_dst_cache 7 36 224 18 1 : tunables 120 60 8 : slabdata 2 2 0
ndisc_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0
rawv6_sock 6 6 672 6 1 : tunables 54 27 8 : slabdata 1 1 0
udpv6_sock 0 0 640 6 1 : tunables 54 27 8 : slabdata 0 0 0
tcpv6_sock 4 6 1184 6 2 : tunables 24 12 8 : slabdata 1 1 0
unix_sock 2 9 416 9 1 : tunables 54 27 8 : slabdata 1 1 0
ip_mrt_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
tcp_tw_bucket 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
tcp_bind_bucket 2 226 16 226 1 : tunables 120 60 8 : slabdata 1 1 0
tcp_open_request 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
inet_peer_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
secpath_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
ip_dst_cache 6 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0
arp_cache 1 25 160 25 1 : tunables 120 60 8 : slabdata 1 1 0
raw_sock 5 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
udp_sock 3 8 512 8 1 : tunables 54 27 8 : slabdata 1 1 0
tcp_sock 1 7 1056 7 2 : tunables 24 12 8 : slabdata 1 1 0
flow_cache 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
qla2xxx_srbs 256 310 128 31 1 : tunables 120 60 8 : slabdata 10 10 0
scsi_cmd_cache 22 22 352 11 1 : tunables 54 27 8 : slabdata 2 2 0
cfq_ioc_pool 0 0 24 156 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_pool 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0
crq_pool 0 0 56 70 1 : tunables 120 60 8 : slabdata 0 0 0
deadline_drq 0 0 52 75 1 : tunables 120 60 8 : slabdata 0 0 0
as_arq 61 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
mqueue_inode_cache 1 7 512 7 1 : tunables 54 27 8 : slabdata 1 1 0
devfsd_event 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0
ext2_inode_cache 0 0 456 8 1 : tunables 54 27 8 : slabdata 0 0 0
ext2_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 2 185 20 185 1 : tunables 120 60 8 : slabdata 1 1 0
journal_head 3316 155844 48 81 1 : tunables 120 60 8 : slabdata 1924 1924 0
revoke_table 14 290 12 290 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0
ext3_inode_cache 3626 3745 520 7 1 : tunables 54 27 8 : slabdata 535 535 0
ext3_xattr 0 0 48 81 1 : tunables 120 60 8 : slabdata 0 0 0
dnotify_cache 0 0 20 185 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_pwq 0 0 36 107 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0
kioctx 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
kiocb 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
fasync_cache 0 0 16 226 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 8 18 416 9 1 : tunables 54 27 8 : slabdata 2 2 0
posix_timers_cache 0 0 104 38 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 1 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0
sgpool-128 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0
sgpool-64 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0
sgpool-32 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0
sgpool-16 32 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0
sgpool-8 50 62 128 31 1 : tunables 120 60 8 : slabdata 2 2 0
blkdev_ioc 40 270 28 135 1 : tunables 120 60 8 : slabdata 2 2 0
blkdev_queue 37 50 380 10 1 : tunables 54 27 8 : slabdata 5 5 0
blkdev_requests 70 81 148 27 1 : tunables 120 60 8 : slabdata 3 3 0
biovec-(256) 256 256 3072 2 2 : tunables 24 12 8 : slabdata 128 128 0
biovec-128 256 260 1536 5 2 : tunables 24 12 8 : slabdata 52 52 0
biovec-64 256 260 768 5 1 : tunables 54 27 8 : slabdata 52 52 0
biovec-16 260 260 192 20 1 : tunables 120 60 8 : slabdata 13 13 0
biovec-4 256 305 64 61 1 : tunables 120 60 8 : slabdata 5 5 0
biovec-1 322 452 16 226 1 : tunables 120 60 8 : slabdata 2 2 0
bio 303 328 96 41 1 : tunables 120 60 8 : slabdata 8 8 0
file_lock_cache 0 0 92 43 1 : tunables 120 60 8 : slabdata 0 0 0
sock_inode_cache 26 60 384 10 1 : tunables 54 27 8 : slabdata 6 6 0
skbuff_head_cache 137 475 160 25 1 : tunables 120 60 8 : slabdata 19 19 0
sock 4 11 352 11 1 : tunables 54 27 8 : slabdata 1 1 0
key_jar 90 123 96 41 1 : tunables 120 60 8 : slabdata 3 3 0
proc_inode_cache 265 336 336 12 1 : tunables 54 27 8 : slabdata 28 28 0
sigqueue 1 27 148 27 1 : tunables 120 60 8 : slabdata 1 1 0
radix_tree_node 8235 8750 276 14 1 : tunables 54 27 8 : slabdata 625 625 0
bdev_cache 14 18 448 9 1 : tunables 54 27 8 : slabdata 2 2 0
sysfs_dir_cache 2837 2996 36 107 1 : tunables 120 60 8 : slabdata 28 28 0
mnt_cache 21 41 96 41 1 : tunables 120 60 8 : slabdata 1 1 0
inode_cache 946 948 320 12 1 : tunables 54 27 8 : slabdata 79 79 0
dentry_cache 10903 11032 140 28 1 : tunables 120 60 8 : slabdata 394 394 0
filp 300 750 160 25 1 : tunables 120 60 8 : slabdata 30 30 0
names_cache 3 3 4096 1 1 : tunables 24 12 8 : slabdata 3 3 0
idr_layer_cache 80 87 136 29 1 : tunables 120 60 8 : slabdata 3 3 0
buffer_head 593099 649500 52 75 1 : tunables 120 60 8 : slabdata 8660 8660 0
mm_struct 54 54 608 6 1 : tunables 54 27 8 : slabdata 9 9 0
vm_area_struct 543 1260 88 45 1 : tunables 120 60 8 : slabdata 28 28 0
fs_cache 82 122 64 61 1 : tunables 120 60 8 : slabdata 2 2 0
files_cache 50 63 416 9 1 : tunables 54 27 8 : slabdata 7 7 0
signal_cache 68 150 256 15 1 : tunables 120 60 8 : slabdata 10 10 0
sighand_cache 76 78 1312 3 1 : tunables 24 12 8 : slabdata 26 26 0
task_struct 78 81 1312 3 1 : tunables 24 12 8 : slabdata 27 27 0
anon_vma 273 870 12 290 1 : tunables 120 60 8 : slabdata 3 3 0
pgd 24 24 4096 1 1 : tunables 24 12 8 : slabdata 24 24 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-32768 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 5 5 16384 1 4 : tunables 8 4 0 : slabdata 5 5 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 75 75 8192 1 2 : tunables 8 4 0 : slabdata 75 75 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 29 29 4096 1 1 : tunables 24 12 8 : slabdata 29 29 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 248 258 2048 2 1 : tunables 24 12 8 : slabdata 129 129 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 132 132 1024 4 1 : tunables 54 27 8 : slabdata 33 33 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 403 432 512 8 1 : tunables 54 27 8 : slabdata 54 54 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 255 255 256 15 1 : tunables 120 60 8 : slabdata 17 17 0
size-128(DMA) 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 2403 2542 128 31 1 : tunables 120 60 8 : slabdata 82 82 0
size-64(DMA) 0 0 64 61 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 670 793 64 61 1 : tunables 120 60 8 : slabdata 13 13 0
size-32(DMA) 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0
size-32 1415 3213 32 119 1 : tunables 120 60 8 : slabdata 27 27 0
kmem_cache 175 175 160 25 1 : tunables 120 60 8 : slabdata 7 7 0

2005-01-25 03:28:56

by Randy.Dunlap

Subject: Re: memory leak in 2.6.11-rc2

Andrew Tridgell wrote:
> I've fixed up the problems I had with raid, and am now testing the
> recent xattr changes with dbench and nbench.
>
> The problem I've hit now is a severe memory leak. I have applied the
> patch from Linus for the leak in free_pipe_info(), and still I'm
> leaking memory at the rate of about 100Mbyte/minute.
>
> I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> the pipe leak fix. The setup is:
>
> - 4 way PIII with 4G ram
> - qla2200 adapter with ibm fastt200 disk array
> - running dbench -x and nbench on separate disks, in a loop
>
> The oom killer kicks in after about 30 minutes. Naturally the oom
> killer decided to kill my sshd, which was running vmstat :-)

Do you have today's memleak patch applied? (cut-n-paste below).

--
~Randy


----
--- 1.40/fs/pipe.c 2005-01-15 12:01:16 -08:00
+++ edited/fs/pipe.c 2005-01-24 14:35:09 -08:00
@@ -630,13 +630,13 @@
struct pipe_inode_info *info = inode->i_pipe;

inode->i_pipe = NULL;
- if (info->tmp_page)
- __free_page(info->tmp_page);
for (i = 0; i < PIPE_BUFFERS; i++) {
struct pipe_buffer *buf = info->bufs + i;
if (buf->ops)
buf->ops->release(info, buf);
}
+ if (info->tmp_page)
+ __free_page(info->tmp_page);
kfree(info);
}

-

2005-01-25 03:34:05

by Andrew Tridgell

Subject: Re: memory leak in 2.6.11-rc2

Randy,

> I have applied the patch from Linus for the leak in
> free_pipe_info()
...
> Do you have today's memleak patch applied? (cut-n-paste below).

yes :-)

I'm trying the little leak tracking patch from Alexander Nyberg now.

Cheers, Tridge

2005-01-25 03:46:01

by Dave Jones

Subject: Re: memory leak in 2.6.11-rc2

On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
> The problem I've hit now is a severe memory leak. I have applied the
> patch from Linus for the leak in free_pipe_info(), and still I'm
> leaking memory at the rate of about 100Mbyte/minute.
> I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> the pipe leak fix. The setup is:

That's a little more extreme than what I'm seeing, so it may be
something else, but my firewall box needs rebooting every
few days. It leaks around 50MB a day for some reason.
Given it's not got a lot of ram, after 4-5 days or so, it's
completely exhausted its swap too.

It's currently on a 2.6.10-ac kernel, so it's entirely possible that
we're not looking at the same issue, though it could be something
that's been there for a while if your workload makes it appear
quicker than a firewall/ipsec gateway would.
Do you see the same leaks with an earlier kernel?

post OOM (when there was about 2K free after named got oom-killed)
this is what slabinfo looked like..

dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0
vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0
size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0
size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0
avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0
size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0

Dave

2005-01-25 04:54:23

by Andrew Tridgell

Subject: Re: memory leak in 2.6.11-rc2

> I'm trying the little leak tracking patch from Alexander Nyberg now.

Here are the results (only backtraces with more than 10k counts
included). The leak was at 1G of memory at the time I ran this, so it's
safe to say 10k page allocations ain't enough to explain it :-)

I also attach a hacked version of the pgown sort program that sorts
the output by count, and isn't O(n^2). It took 10 minutes to run the
old version :-)
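
The idea is just to qsort() the records by count instead of doing a
quadratic scan; a minimal standalone illustration (not the attached
pgown-fast.c, and with made-up sample data):

#include <stdio.h>
#include <stdlib.h>

struct rec {
	unsigned long count;
	const char *trace;
};

static int by_count_desc(const void *a, const void *b)
{
	const struct rec *ra = a, *rb = b;

	if (ra->count != rb->count)
		return ra->count < rb->count ? 1 : -1;
	return 0;
}

int main(void)
{
	struct rec recs[] = {		/* made-up sample records */
		{ 19610,  "generic_file_write_nolock path" },
		{ 175485, "generic_file_aio_write path" },
		{ 67641,  "ext3_xattr_user_get path" },
	};
	size_t i, n = sizeof(recs) / sizeof(recs[0]);

	qsort(recs, n, sizeof(recs[0]), by_count_desc);	/* O(n log n) */
	for (i = 0; i < n; i++)
		printf("%lu times: %s\n", recs[i].count, recs[i].trace);
	return 0;
}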

I'm guessing the leak is in the new xattr code given that is what
dbench and nbench were beating on. Andreas, can you look at the
following and see if you can spot anything?

This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
machine had leaked 1G of ram in 10 minutes, and was idle (only thing
running was sshd).

Cheers, Tridge


175485 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0

141512 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc015f6de] fcntl_setlk64+286
[0xc01295a8] autoremove_wake_function+0
[0xc014b8d5] vfs_write+157

67641 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169

52758 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc014dc69] __getblk+29
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc018d368] ext3_xattr_user_get+108
[0xc0132e70] generic_file_aio_write+100
[0xc017e586] ext3_file_write+38
[0xc014b7f5] do_sync_write+169
[0xc0120c0b] kill_proc_info+47

19610 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc011b7cb] __do_softirq+95
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0

16874 times:
Page allocated via order 0
[0xc0132258] generic_file_buffered_write+280
[0xc011b6a9] current_fs_time+77
[0xc0132a1e] __generic_file_aio_write_nolock+642
[0xc0132c3a] generic_file_aio_write_nolock+54
[0xc0132def] generic_file_write_nolock+151
[0xc01295a8] autoremove_wake_function+0
[0xc01295a8] autoremove_wake_function+0
[0xc0152d4a] blkdev_file_write+38


Attachments:
pgown-fast.c (2.37 kB)

2005-01-25 06:06:57

by Andrew Morton

Subject: Re: memory leak in 2.6.11-rc2

Andrew Tridgell <[email protected]> wrote:
>
> > I'm trying the little leak tracking patch from Alexander Nyberg now.
>
> Here are the results (only backtraces with more than 10k counts
> included). The leak was at 1G of memory at the time I ran this, so it's
> safe to say 10k page allocations ain't enough to explain it :-)
>
> I also attach a hacked version of the pgown sort program that sorts
> the output by count, and isn't O(n^2). It took 10 minutes to run the
> old version :-)
>
> I'm guessing the leak is in the new xattr code given that is what
> dbench and nbench were beating on. Andreas, can you look at the
> following and see if you can spot anything?
>
> This was on 2.6.11-rc2 with the pipe leak patch from Linus. The
> machine had leaked 1G of ram in 10 minutes, and was idle (only thing
> running was sshd).
>
> 175485 times:
> Page allocated via order 0
> [0xc0132258] generic_file_buffered_write+280
> [0xc011b6a9] current_fs_time+77
> [0xc0132a1e] __generic_file_aio_write_nolock+642
> [0xc0132e70] generic_file_aio_write+100
> [0xc017e586] ext3_file_write+38
> [0xc014b7f5] do_sync_write+169
> [0xc015f6de] fcntl_setlk64+286
> [0xc01295a8] autoremove_wake_function+0

It would be pretty strange for plain old pagecache pages to leak in this
manner. A few things come to mind.

- The above trace is indistinguishable from the normal situation of
having a lot of pagecache floating about. IOW: we don't know if the
above pages have really leaked or not.

- It's sometimes possible for ext3 pages to remain alive (on the page
LRU) after a truncate, but with no other references to them. These pages
are trivially reclaimable. So even if you've deleted the files which the
benchmark created, there _could_ be pages left over. Although it would
be unusual.

So what you should do before generating the leak tool output is to put
heavy memory pressure on the machine to try to get it to free up as much of
that pagecache as possible. bzero(malloc(lots)) will do it - create a real
swapstorm, then do swapoff to kill remaining swapcache as well.
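
Something along these lines is enough (a minimal sketch; the ~3 GiB figure
is an assumption for a 4G box, adjust to taste):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t chunk = 64UL << 20;	/* 64 MiB per allocation */
	size_t total = 3UL << 30;	/* ~3 GiB of pressure */
	size_t done;

	for (done = 0; done < total; done += chunk) {
		void *p = malloc(chunk);
		if (!p)			/* ran out of address space/memory */
			break;
		memset(p, 0, chunk);	/* touch every page, force reclaim */
	}
	printf("dirtied %zu MiB, exiting to give it all back\n", done >> 20);
	return 0;
}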

If we still see a lot of pages with the above trace then something really
broke. Where does one get the appropriate dbench version? How are you
mkfsing the filesystem? Was mke2fs patched? How is the benchmark being
invoked?

Thanks.

2005-01-25 11:43:07

by Andrew Tridgell

Subject: Re: memory leak in 2.6.11-rc2

Andrew,

> So what you should do before generating the leak tool output is to put
> heavy memory pressure on the machine to try to get it to free up as much of
> that pagecache as possible. bzero(malloc(lots)) will do it - create a real
> swapstorm, then do swapoff to kill remaining swapcache as well.

As you saw when you logged into the machine earlier tonight, when you
suspend the dbench processes and run a memory filler the memory is
reclaimed.

I still think it's a bug though, as the oom killer is being triggered
when it shouldn't be. I have 4G of ram in this machine, and I'm only
running a couple of hundred processes that should be using maybe 500M
in total, so the fact that the oom killer kicks in suggests that memory
isn't being reclaimed under normal memory pressure. Certainly a ps
shows no process using more than a few MB.

The oom killer report is below. This is with 2.6.11-rc2, with the pipe
leak fix, and the pgown monitoring patch. It was running one nbench of
size 50 and one dbench of size 40 at the time.

If this isn't a leak, then it would also be good to fix /usr/bin/free
so the -/+ buffers line becomes meaningful again. With the machine
completely idle (just a sshd running) I see:

Mem: 3701184 3483188 217996 0 130092 1889440
-/+ buffers/cache: 1463656 2237528

which looks like a leak. This persists even after the disks are
unmounted. After filling/freeing memory I see:

Mem: 3701184 28176 3673008 0 520 3764
-/+ buffers/cache: 23892 3677292

so it can recover it, but under normal usage it doesn't before the oom
killer kicks in.

Cheers, Tridge


Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18910 (smbd).
Jan 25 00:49:14 dev4-003 kernel: oom-killer: gfp_mask=0xd0
Jan 25 00:49:14 dev4-003 kernel: DMA per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 2, high 6, batch 1
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 2, batch 1
Jan 25 00:49:14 dev4-003 kernel: Normal per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: HighMem per-cpu:
Jan 25 00:49:14 dev4-003 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 25 00:49:14 dev4-003 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 25 00:49:14 dev4-003 kernel:
Jan 25 00:49:14 dev4-003 kernel: Free pages: 5884kB (640kB HighMem)
Jan 25 00:49:14 dev4-003 kernel: Active:84764 inactive:806173 dirty:338229 writeback:36 unstable:0 free:1471 slab:23814 mapped:31610 pagetables:1616
Jan 25 00:49:14 dev4-003 kernel: DMA free:284kB min:68kB low:84kB high:100kB active:440kB inactive:10048kB present:16384kB pages_scanned:140 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: Normal free:4960kB min:3756kB low:4692kB high:5632kB active:72072kB inactive:632096kB present:901120kB pages_scanned:58999 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: HighMem free:640kB min:512kB low:640kB high:768kB active:266948kB inactive:2581960kB present:2850752kB pages_scanned:0 all_unreclaimable? no
Jan 25 00:49:14 dev4-003 kernel: protections[]: 0 0 0
Jan 25 00:49:14 dev4-003 kernel: DMA: 53*4kB 3*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 284kB
Jan 25 00:49:14 dev4-003 kernel: Normal: 400*4kB 16*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 0*1024kB 1*2048kB 0*4096kB = 4960kB
Jan 25 00:49:14 dev4-003 kernel: HighMem: 6*4kB 11*8kB 1*16kB 0*32kB 0*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB
Jan 25 00:49:14 dev4-003 kernel: Swap cache: add 1584899, delete 1585025, find 294/351, race 0+0
Jan 25 00:49:14 dev4-003 kernel: Out of Memory: Killed process 18914 (smbd).

2005-01-25 12:11:49

by Nick Piggin

Subject: Re: memory leak in 2.6.11-rc2




---

linux-2.6-npiggin/mm/vmscan.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~oom-helper mm/vmscan.c
--- linux-2.6/mm/vmscan.c~oom-helper 2005-01-25 23:04:28.000000000 +1100
+++ linux-2.6-npiggin/mm/vmscan.c 2005-01-25 23:05:06.000000000 +1100
@@ -914,12 +914,12 @@ int try_to_free_pages(struct zone **zone
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}
- if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) {
+ total_scanned += sc.nr_scanned;
+ total_reclaimed += sc.nr_reclaimed;
+ if (total_reclaimed >= SWAP_CLUSTER_MAX) {
ret = 1;
goto out;
}
- total_scanned += sc.nr_scanned;
- total_reclaimed += sc.nr_reclaimed;

/*
* Try to write back as many pages as we just scanned. This

_


Attachments:
less-oom.patch (800.00 B)

2005-01-25 12:52:14

by Andrea Arcangeli

Subject: Re: memory leak in 2.6.11-rc2

On Mon, Jan 24, 2005 at 10:45:47PM -0500, Dave Jones wrote:
> On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
> > The problem I've hit now is a severe memory leak. I have applied the
> > patch from Linus for the leak in free_pipe_info(), and still I'm
> > leaking memory at the rate of about 100Mbyte/minute.
> > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
> > the pipe leak fix. The setup is:
>
> That's a little more extreme than what I'm seeing, so it may be
> something else, but my firewall box needs rebooting every
> few days. It leaks around 50MB a day for some reason.
> Given it's not got a lot of ram, after 4-5 days or so, it's
> completely exhausted its swap too.
>
> It's currently on a 2.6.10-ac kernel, so it's entirely possible that
> we're not looking at the same issue, though it could be something
> that's been there for a while if your workload makes it appear
> quicker than a firewall/ipsec gateway would.
> Do you see the same leaks with an earlier kernel?
>
> post OOM (when there was about 2K free after named got oom-killed)
> this is what slabinfo looked like..
>
> dentry_cache 1502 3775 160 25 1 : tunables 120 60 0 : slabdata 151 151 0
> vm_area_struct 1599 2021 84 47 1 : tunables 120 60 0 : slabdata 43 43 0
> size-128 3431 6262 128 31 1 : tunables 120 60 0 : slabdata 202 202 0
> size-64 4352 4575 64 61 1 : tunables 120 60 0 : slabdata 75 75 0
> avtab_node 7073 7140 32 119 1 : tunables 120 60 0 : slabdata 60 60 0
> size-32 7256 7616 32 119 1 : tunables 120 60 0 : slabdata 64 64 0

What is avtab_node? There's no such thing in my kernel. But the above
can be ok. Can you show meminfo too after the oom kill?

Just another datapoint: my firewall runs a kernel based on 2.6.11-rc1-bk8 with
all the needed oom fixes, and I've had no problems on it yet. I ran it oom
and this is what I get after the oom:

athlon:/home/andrea # free
total used free shared buffers cached
Mem: 511136 50852 460284 0 572 15764
-/+ buffers/cache: 34516 476620
Swap: 1052248 0 1052248
athlon:/home/andrea #

The above is sane: 34M is very reasonable for what's loaded there
(there's the X server running, named too, and various other non-standard
daemons; one even has a virtual size of >100m so it's not a tiny thing),
so I'm quite sure I'm not hitting a memleak, at least not on the
firewall. No ipsec on it btw, and it's a pure IDE box without anything
special, just quite a few nics and USB usermode running all the time.

athlon:/home/andrea # uptime
1:34pm up 2 days 12:08, 1 user, load average: 0.98, 1.13, 0.54
athlon:/home/andrea # iptables -L -v |grep -A2 FORWARD
Chain FORWARD (policy ACCEPT 65 packets, 9264 bytes)
pkts bytes target prot opt in out source destination
3690K 2321M block all -- any any anywhere anywhere
athlon:/home/andrea #

So if there's a memleak in rc1-bk8, it's probably not in the core of the
kernel, but in some driver or things like ipsec. Either that or it broke
after 2.6.11-rc1-bk8. The kernel I'm running is quite heavily patched
too, but I'm not aware of any memleak fix in the additional patches.

Anyway I'll try again in a few days to verify it goes back down again to
exactly 34M of anonymous/random and 15M of cache.

No apparent problem on my desktop system either, it's running the same
kernel with different config.

If somebody could fix the kernel CVS I could have a look at the
interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.

2005-01-25 13:31:18

by Andreas Gruenbacher

Subject: Re: memory leak in 2.6.11-rc2

On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> If somebody could fix the kernel CVS I could have a look at the
> interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.

What's not okay?

--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX GMBH

2005-01-25 13:56:03

by Andrea Arcangeli

Subject: Re: memory leak in 2.6.11-rc2

On Tue, Jan 25, 2005 at 02:31:03PM +0100, Andreas Gruenbacher wrote:
> On Tue, 2005-01-25 at 13:51, Andrea Arcangeli wrote:
> > If somebody could fix the kernel CVS I could have a look at the
> > interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.
>
> What's not okay?

I already prepared a separate, detailed bug report. I'm reproducing one
last time before sending it, after a "su - nobody" to be sure it's not a
local problem in my environment that might have changed, but I'm 99%
sure it'll reproduce just fine in a fresh account too.

Are you using cvsps at all?