2002-10-21 20:39:30

by Martin J. Bligh

Subject: ZONE_NORMAL exhaustion (dcache slab)

My big NUMA box went OOM over the weekend and started killing things
for no good reason (2.5.43-mm2). Probably running some background
updatedb for locate thing, not doing any real work.

meminfo:

MemTotal: 16077728 kB
MemFree: 14950708 kB
MemShared: 0 kB
Buffers: 492 kB
Cached: 384976 kB
SwapCached: 0 kB
Active: 372608 kB
Inactive: 13380 kB
HighTotal: 15335424 kB
HighFree: 14949000 kB
LowTotal: 742304 kB
LowFree: 1708 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 2248 kB
Slab: 724744 kB
Reserved: 570464 kB
Committed_AS: 1100 kB
PageTables: 140 kB
ReverseMaps: 1518

Big things out of slabinfo:

ext2_inode_cache 554556 554598 416 61622 61622 1 : 120 60
dentry_cache 2791320 2791320 160 116305 116305 1 : 248 124

By my reckoning, that's over 450MB of dentry cache that's refusing to shrink
under pressure. ext2_inode_cache ain't exactly anorexic either. Hmmm .....
Any good ways to debug this?

M.




2002-10-21 21:07:31

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> My big NUMA box went OOM over the weekend and started killing things
> for no good reason (2.5.43-mm2). Probably running some background
> updatedb for locate thing, not doing any real work.
>
> meminfo:
>

Looks like a plain dentry leak to me. Very weird.

Did the machine recover and run normally?

Was it possible to force the dcache to shrink? (a cat /dev/hda1
would do that nicely)

Is it reproducible?

2002-10-21 21:15:18

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

>> My big NUMA box went OOM over the weekend and started killing things
>> for no good reason (2.5.43-mm2). Probably running some background
>> updatedb for locate thing, not doing any real work.
>>
>> meminfo:
>>
>
> Looks like a plain dentry leak to me. Very weird.
>
> Did the machine recover and run normally?

Nope, kept OOMing and killing everything.

> Was it possible to force the dcache to shrink? (a cat /dev/hda1
> would do that nicely)

Well, I didn't try that, but even looking at man pages got oom killed,
so I guess not ... were you looking at the cat /dev/hda1 to fill pagecache
or something? I have 16GB of highmem (pretty much all unused) so
presumably that'd fill the highmem first (pagecache?)

> Is it reproducible?

Will try again. Presumably "find /" should do it? ;-)

M.

2002-10-21 21:27:42

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> >> My big NUMA box went OOM over the weekend and started killing things
> >> for no good reason (2.5.43-mm2). Probably running some background
> >> updatedb for locate thing, not doing any real work.
> >>
> >> meminfo:
> >>
> >
> > Looks like a plain dentry leak to me. Very weird.
> >
> > Did the machine recover and run normally?
>
> Nope, kept OOMing and killing everything .

Something broke.

> > Was it possible to force the dcache to shrink? (a cat /dev/hda1
> > would do that nicely)
>
> Well, I didn't try that, but even looking at man pages got oom killed,
> so I guess not ... were you looking at the cat /dev/hda1 to fill pagecache
> or something? I have 16GB of highmem (pretty much all unused) so
> presumably that'd fill the highmem first (pagecache?)

Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
selectively put pressure on the normal zone (and DMA zone, of course).

> > Is it reproducible?
>
> Will try again. Presumably "find /" should do it? ;-)

You must have a lot of files.

Actually, I expect a `find /' will only stat directories,
whereas an `ls -lR /' will stat plain files as well. Same
thing for dcache, but the ls will push the icache harder.

I don't know if updatedb stats regular files. Presumably not.

2002-10-21 21:33:14

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

>> Nope, kept OOMing and killing everything .
>
> Something broke.

Even I worked that out ;-)

> Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
> selectively put pressure on the normal zone (and DMA zone, of course).

Ah, I recall that now. That's fundamentally screwed.

>> Will try again. Presumably "find /" should do it? ;-)
>
> You must have a lot of files.

Nothing too ridiculous. Will try find on a small subset repeatedly and see if
it keeps growing first - maybe that'll show a leak.

Thanks,

M.

2002-10-21 21:43:39

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> >> Nope, kept OOMing and killing everything .
> >
> > Something broke.
>
> Even I worked that out ;-)

Well I'm feeling especially helpful today.

> > Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
> > selectively put pressure on the normal zone (and DMA zone, of course).
>
> Ah, I recall that now. That's fundamentally screwed.

When filesystems want to access metadata, they will typically read
a block into a buffer_head and access the memory directly.

mnm:/usr/src/25> grep -rI b_data fs | wc -l
844

That's a lot of kmaps that would need adding.
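
Illustratively (a sketch only, modelled loosely on ext2-style metadata
access rather than copied from any real filesystem):

    struct buffer_head *bh;
    struct ext2_group_desc *desc;

    bh = sb_bread(sb, block);               /* read one metadata block */
    if (!bh)
            return -EIO;
    /* b_data is a plain kernel pointer into the buffer's page; this
     * only works because blockdev pagecache is guaranteed to sit in
     * the direct-mapped region.  If that page could be highmem, every
     * such access would need a kmap(bh->b_page)/kunmap() pair. */
    desc = (struct ext2_group_desc *)bh->b_data;
    printk("free blocks: %u\n", le16_to_cpu(desc->bg_free_blocks_count));
    brelse(bh);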

So we constrain blockdev->bd_inode->i_mapping->gfp_mask so that
the blockdev's pagecache memory is always in the direct-addressed
region.
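
A minimal sketch of that constraint (field names as quoted above; where
exactly it gets applied is an assumption, roughly where the block
device inode is set up):

    /* Keep a blockdev's pagecache out of highmem so bh->b_data stays
     * directly addressable.  GFP_USER does not include __GFP_HIGHMEM,
     * so these pages always come from ZONE_NORMAL (or ZONE_DMA). */
    bdev->bd_inode->i_mapping->gfp_mask = GFP_USER;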

It would be possible to fix on a per-fs basis - teach a filesystem
to kmap bh->b_page appropriately and then set __GFP_HIGHMEM in the
blockdev's gfp_mask.

But it doesn't seem to cause a lot of trouble in practice.

2002-10-21 22:24:17

by Rik van Riel

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

On Mon, 21 Oct 2002, Martin J. Bligh wrote:

> > Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
> > selectively put pressure on the normal zone (and DMA zone, of course).
>
> Ah, I recall that now. That's fundamentally screwed.

It's not too bad since the data can be reclaimed easily.

The problem in your case is that the dentry and inode cache
didn't get reclaimed. Maybe there is a leak so they can't get
reclaimed at all or maybe they just don't get reclaimed fast
enough.

I'm looking into the "can't be reclaimed fast enough" problem
right now. First on 2.4-rmap, but if it works I'll forward-port
the thing to 2.5 soon (before Linus returns from holidays).

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: [email protected]

2002-10-21 22:47:24

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

Rik van Riel wrote:
>
> On Mon, 21 Oct 2002, Martin J. Bligh wrote:
>
> > > Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
> > > selectively put pressure on the normal zone (and DMA zone, of course).
> >
> > Ah, I recall that now. That's fundamentally screwed.
>
> It's not too bad since the data can be reclaimed easily.
>
> The problem in your case is that the dentry and inode cache
> didn't get reclaimed. Maybe there is a leak so they can't get
> reclaimed at all or maybe they just don't get reclaimed fast
> enough.
>

He had 3 million dentries and only 100k pages on the LRU,
so we should have been reclaiming 60 dentries per scanned
page.

Conceivably the multiply in shrink_slab() overflowed, where
we calculate local variable `delta'. But doubtful.
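
To put rough numbers on that (the per-call `scanned' figure below is
purely illustrative; the dentry count is from the slabinfo above): with
~2.8 million dentries, a seeks weight of 2 and even a couple of
thousand pages scanned per call, scanned * seeks * entries comes to
around 2000 * 2 * 2,800,000 ~= 1.1e10, well past the ~4.3e9 a 32-bit
unsigned long can hold on a box like this - hence the 64-bit delta in
the patch below.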

First, we need to make it happen again, then see if this (quick
hack) fixes it up.


--- 25/mm/vmscan.c~shrink_slab-overflow Mon Oct 21 15:40:57 2002
+++ 25-akpm/mm/vmscan.c Mon Oct 21 15:51:28 2002
@@ -147,14 +147,15 @@ static int shrink_slab(int scanned, uns
list_for_each(lh, &shrinker_list) {
struct shrinker *shrinker;
int entries;
- unsigned long delta;
+ long long delta;

shrinker = list_entry(lh, struct shrinker, list);
entries = (*shrinker->shrinker)(0, gfp_mask);
if (!entries)
continue;
- delta = scanned * shrinker->seeks * entries;
- shrinker->nr += delta / (pages + 1);
+ delta = scanned * shrinker->seeks;
+ delta *= entries;
+ shrinker->nr += do_div(delta, pages + 1);
if (shrinker->nr > SHRINK_BATCH) {
int nr = shrinker->nr;


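For reference, do_div(n, base) divides its 64-bit first argument in
place and evaluates to the remainder, not the quotient, so the usual
idiom looks more like this (a sketch using the same names as the hack
above, not necessarily what ends up getting merged):

    delta = scanned * shrinker->seeks;
    delta *= entries;
    do_div(delta, pages + 1);       /* delta now holds the quotient */
    shrinker->nr += delta;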

2002-10-22 00:30:21

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

>> On Mon, 21 Oct 2002, Martin J. Bligh wrote:
>>
>> > > Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
>> > > selectively put pressure on the normal zone (and DMA zone, of course).
>> >
>> > Ah, I recall that now. That's fundamentally screwed.
>>
>> It's not too bad since the data can be reclaimed easily.
>>
>> The problem in your case is that the dentry and inode cache
>> didn't get reclaimed. Maybe there is a leak so they can't get
>> reclaimed at all or maybe they just don't get reclaimed fast
>> enough.

OK, well "find / | xargs ls -l" results in:

dentry_cache 1125216 1125216 160 46884 46884 1 : 248 124

repeating it gives

dentry_cache 969475 1140960 160 47538 47540 1 : 248 124

Which is only a third of what I eventually ended up with over the weekend,
so presumably that means you're correct and there is a leak.

Hmmm .... but why did it shrink ... I didn't expect mem pressure just
doing a find ....

MemTotal: 16077728 kB
MemFree: 15070304 kB
MemShared: 0 kB
Buffers: 92400 kB
Cached: 266052 kB
SwapCached: 0 kB
Active: 351896 kB
Inactive: 9080 kB
HighTotal: 15335424 kB
HighFree: 15066160 kB
LowTotal: 742304 kB
LowFree: 4144 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 32624 kB
Writeback: 0 kB
Mapped: 4956 kB
Slab: 630216 kB
Reserved: 570464 kB
Committed_AS: 6476 kB
PageTables: 236 kB
ReverseMaps: 3562

Pretty much all in slab ...

ext2_inode_cache 921200 938547 416 104283 104283 1 : 120 60
dentry_cache 1068133 1131096 160 47129 47129 1 : 248 124

So it looks as though it's actually ext2_inode cache that's first against the wall.
For comparison, over the weekend I ended up with:

ext2_inode_cache 554556 554598 416 61622 61622 1 : 120 60
dentry_cache 2791320 2791320 160 116305 116305 1 : 248 124

did a cat of /dev/sda2 > /dev/null ..... after that:

larry:~# egrep '(dentry|inode)' /proc/slabinfo
isofs_inode_cache 0 0 320 0 0 1 : 120 60
ext2_inode_cache 667345 809181 416 89909 89909 1 : 120 60
shmem_inode_cache 3 9 416 1 1 1 : 120 60
sock_inode_cache 16 22 352 2 2 1 : 120 60
proc_inode_cache 12 12 320 1 1 1 : 120 60
inode_cache 385 396 320 33 33 1 : 120 60
dentry_cache 1068289 1131096 160 47129 47129 1 : 248 124

larry:~# cat /proc/meminfo
MemTotal: 16077728 kB
MemFree: 15068684 kB
MemShared: 0 kB
Buffers: 165552 kB
Cached: 266052 kB
SwapCached: 0 kB
Active: 266620 kB
Inactive: 167524 kB
HighTotal: 15335424 kB
HighFree: 15066160 kB
LowTotal: 742304 kB
LowFree: 2524 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 8 kB
Writeback: 0 kB
Mapped: 4956 kB
Slab: 558684 kB
Reserved: 570464 kB
Committed_AS: 6476 kB
PageTables: 236 kB
ReverseMaps: 3563

So it doesn't seem to shrink under mem pressure, but I can't reproduce
the OOM at the moment either ;-(

M.

2002-10-22 03:33:44

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> >> On Mon, 21 Oct 2002, Martin J. Bligh wrote:
> >>
> >> > > Blockdevices only use ZONE_NORMAL for their pagecache. That cat will
> >> > > selectively put pressure on the normal zone (and DMA zone, of course).
> >> >
> >> > Ah, I recall that now. That's fundamentally screwed.
> >>
> >> It's not too bad since the data can be reclaimed easily.
> >>
> >> The problem in your case is that the dentry and inode cache
> >> didn't get reclaimed. Maybe there is a leak so they can't get
> >> reclaimed at all or maybe they just don't get reclaimed fast
> >> enough.
>
> OK, well "find / | xargs ls -l" results in:
>
> dentry_cache 1125216 1125216 160 46884 46884 1 : 248 124
>
> repeating it gives
>
> dentry_cache 969475 1140960 160 47538 47540 1 : 248 124
>
> Which is only a third of what I eventually ended up with over the weekend,
> so presumably that means you're correct and there is a leak.

I cannot make it happen here, either. 2.5.43-mm2 or current devel
stuff. Heisenbug; maybe something broke dcache-rcu? Or the math
overflow (unlikely).

> Hmmm .... but why did it shrink ... I didn't expect mem pressure just
> doing a find ....

Maybe because the ext2 inode cache didn't shrink as it should have.

The dentry/inode caches are pretty much FIFO with this sort of test,
and you're showing the traditional worst-case FIFO replacement behaviour.

> ...
> ext2_inode_cache 921200 938547 416 104283 104283 1 : 120 60
> dentry_cache 1068133 1131096 160 47129 47129 1 : 248 124
>
> So it looks as though it's actually ext2_inode cache that's first against the wall.

Well that's to be expected. Each ext2 directory inode has highmem
pagecache attached to it, which pins the inode. There's no highmem
eviction pressure so your normal zone gets stuffed full of inodes.

There's a fix for this in Andrea's tree, although that's perhaps a
bit heavy on inode_lock for 2.5 purposes. It's a matter of running
invalidate_inode_pages() against the inodes as they come off the
unused_list. I haven't got around to it yet.
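
The idea, very roughly (a sketch only, not Andrea's actual patch; the
hook point, the locking and the exact invalidate_inode_pages()
signature - it took an inode in 2.4 and a mapping later on - are all
assumptions here):

    /* As prune_icache() walks inode_unused, drop each unused inode's
     * clean pagecache before deciding whether it can be freed, so
     * attached highmem pagecache no longer pins it. */
    list_for_each_safe(tmp, next, &inode_unused) {
            struct inode *inode = list_entry(tmp, struct inode, i_list);

            invalidate_inode_pages(inode->i_mapping);
            /* ...then the existing "can we free it?" checks run as
             * before, and the now-pageless directory inodes go too. */
    }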

> For comparison, over the weekend I ended up with:
>
> ext2_inode_cache 554556 554598 416 61622 61622 1 : 120 60
> dentry_cache 2791320 2791320 160 116305 116305 1 : 248 124
>
> did a cat of /dev/sda2 > /dev/null ..... after that:
>
> larry:~# egrep '(dentry|inode)' /proc/slabinfo
> isofs_inode_cache 0 0 320 0 0 1 : 120 60
> ext2_inode_cache 667345 809181 416 89909 89909 1 : 120 60
> shmem_inode_cache 3 9 416 1 1 1 : 120 60
> sock_inode_cache 16 22 352 2 2 1 : 120 60
> proc_inode_cache 12 12 320 1 1 1 : 120 60
> inode_cache 385 396 320 33 33 1 : 120 60
> dentry_cache 1068289 1131096 160 47129 47129 1 : 248 124

OK, so there's reasonable dentry shrinkage there, and the inodes
for regular files which have no attached pagecache were reaped.
But all the directory inodes are sitting there pinned.

2002-10-22 03:49:54

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

> I cannot make it happen here, either. 2.5.43-mm2 or current devel
> stuff. Heisenbug; maybe something broke dcache-rcu? Or the math
> overflow (unlikely).

Dipankar is going to give me some debug code once he's slept for
a while ... that should help see if dcache-rcu went wacko.

>> So it looks as though it's actually ext2_inode cache that's first against the wall.
>
> Well that's to be expected. Each ext2 directory inode has highmem
> pagecache attached to it, which pins the inode. There's no highmem
> eviction pressure so your normal zone gets stuffed full of inodes.
>
> There's a fix for this in Andrea's tree, although that's perhaps a
> bit heavy on inode_lock for 2.5 purposes. It's a matter of running
> invalidate_inode_pages() against the inodes as they come off the
> unused_list. I haven't got around to it yet.

Thanks; no urgent problem (though we did seem to have a customer hitting
a very similar situation very easily in 2.4 ... we'll see if Andrea's
tree fixes that, then I'll try to reproduce their problem on current 2.5).

>> larry:~# egrep '(dentry|inode)' /proc/slabinfo
>> isofs_inode_cache 0 0 320 0 0 1 : 120 60
>> ext2_inode_cache 667345 809181 416 89909 89909 1 : 120 60
>> shmem_inode_cache 3 9 416 1 1 1 : 120 60
>> sock_inode_cache 16 22 352 2 2 1 : 120 60
>> proc_inode_cache 12 12 320 1 1 1 : 120 60
>> inode_cache 385 396 320 33 33 1 : 120 60
>> dentry_cache 1068289 1131096 160 47129 47129 1 : 248 124
>
> OK, so there's reasonable dentry shrinkage there, and the inodes
> for regular files which have no attached pagecache were reaped.
> But all the directory inodes are sitting there pinned.

OK, this all makes a lot of sense ... apart from one thing:
from looking at meminfo:

HighTotal: 15335424 kB
HighFree: 15066160 kB

Even if every in-use highmem page is pagecache, that's only 67316 pages
(HighTotal minus HighFree, in 4K pages) by my reckoning (is pagecache
broken out separately in meminfo? both Buffers and Cached seem too
large). If I only have 67316 pages of pagecache, how can I have 667345
inodes with attached pagecache pages?
Or am I just missing something obvious and fundamental?

2002-10-22 04:14:34

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> > I cannot make it happen here, either. 2.5.43-mm2 or current devel
> > stuff. Heisenbug; maybe something broke dcache-rcu? Or the math
> > overflow (unlikely).
>
> Dipankar is going to give me some debug code once he's slept for
> a while ... that should help see if dcache-rcu went wacko.

Well if it doesn't happen again...

> >> So it looks as though it's actually ext2_inode cache that's first against the wall.
> >
> > Well that's to be expected. Each ext2 directory inode has highmem
> > pagecache attached to it, which pins the inode. There's no highmem
> > eviction pressure so your normal zone gets stuffed full of inodes.
> >
> > There's a fix for this in Andrea's tree, although that's perhaps a
> > bit heavy on inode_lock for 2.5 purposes. It's a matter of running
> > invalidate_inode_pages() against the inodes as they come off the
> > unused_list. I haven't got around to it yet.
>
> Thanks; no urgent problem (though we did seem to have a customer hitting
> a very similar situation very easily in 2.4 ... we'll see if Andrea's
> fixes that, then I'll try to reproduce their problem on current 2.5).

Oh it's reproducible OK. Just run

make-teeny-files 7 7

against a few filesystems and watch the fun

http://www.zip.com.au/~akpm/linux/patches/stuff/make-teeny-files.c

> >> larry:~# egrep '(dentry|inode)' /proc/slabinfo
> >> isofs_inode_cache 0 0 320 0 0 1 : 120 60
> >> ext2_inode_cache 667345 809181 416 89909 89909 1 : 120 60
> >> shmem_inode_cache 3 9 416 1 1 1 : 120 60
> >> sock_inode_cache 16 22 352 2 2 1 : 120 60
> >> proc_inode_cache 12 12 320 1 1 1 : 120 60
> >> inode_cache 385 396 320 33 33 1 : 120 60
> >> dentry_cache 1068289 1131096 160 47129 47129 1 : 248 124
> >
> > OK, so there's reasonable dentry shrinkage there, and the inodes
> > for regular files which have no attached pagecache were reaped.
> > But all the directory inodes are sitting there pinned.
>
> OK, this all makes a lot of sense ... apart from one thing:
> from looking at meminfo:
>
> HighTotal: 15335424 kB
> HighFree: 15066160 kB
>
> Even if every in-use highmem page is pagecache, that's only 67316 pages
> (HighTotal minus HighFree, in 4K pages) by my reckoning (is pagecache
> broken out separately in meminfo? both Buffers and Cached seem too
> large). If I only have 67316 pages of pagecache, how can I have 667345
> inodes with attached pagecache pages?
> Or am I just missing something obvious and fundamental?

Maybe you didn't cat /dev/sda2 for long enough?

You should end up with very little dcache and tons of icache.
Here's what I get:

ext2_inode_cache: 420248KB 420256KB 99.99
buffer_head: 40422KB 41648KB 97.5
dentry_cache: 667KB 10211KB 6.54
biovec-BIO_MAX_PAGES: 768KB 780KB 98.46

Massive internal fragmentation of the dcache there. But it takes
a long time.

Generally, I feel that the proportional-shrink on slab is applying
too much pressure when there's not much slab and too little when
there's a lot. If you have 400 megs of inodes I don't really think
they are likely to be used again soon.

Perhaps we need to multiply the slab cache scanning pressure by the
slab occupancy. That's simple to do.

2002-10-22 05:45:54

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

> Oh it's reproducible OK. Just run
>
> make-teeny-files 7 7

Excellent - thanks for that ... will try it.

> Maybe you didn't cat /dev/sda2 for long enough?

Well, it's a multi-gigabyte partition. IIRC, I just ran it until
it died with "input/output error" ... which I assumed at the time
was the end of the partition, but it should be able to find that
without error, so maybe it just ran out of ZONE_NORMAL ;-)

> Perhaps we need to multiply the slab cache scanning pressure by the
> slab occupancy. That's simple to do.

That'd make a lot of sense (to me, at least). I presume you mean
occupancy on a per-slab basis, not global.

M.

2002-10-22 06:15:03

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

"Martin J. Bligh" wrote:
>
> > Oh it's reproduceable OK. Just run
> >
> > make-teeny-files 7 7
>
> Excellent - thanks for that ... will try it.

When it goes stupid, you can then run and kill some big memory-hog
to force reclaim of lots of highmem pages. Once you've done that,
you can watch the inode cache fall away as the inodes which used
to have pagecache become reclaimable.
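
Any trivial hog will do; something along these lines (illustrative only
- the sizes are arbitrary, and on a 32-bit userland one process tops
out around 3GB, so run a few of them if need be):

    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            size_t chunk = 64UL << 20;      /* 64MB at a time */
            int i;

            for (i = 0; i < 48; i++) {      /* ~3GB total */
                    char *p = malloc(chunk);
                    if (!p)
                            break;
                    memset(p, 1, chunk);    /* actually touch the pages */
            }
            return 0;                       /* exiting frees the lot */
    }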

> > Maybe you didn't cat /dev/sda2 for long enough?
>
> Well, it's a multi-gigabyte partition. IIRC, I just ran it until
> it died with "input/output error" ... which I assumed at the time
> was the end of the partition, but it should be able to find that
> without error, so maybe it just ran out of ZONE_NORMAL ;-)

Oh. Well it should have just hit eof. Maybe you have a dud
sector and it terminated early.

> > Perhaps we need to multiply the slab cache scanning pressure by the
> > slab occupancy. That's simple to do.
>
> That'd make a lot of sense (to me, at least). I presume you mean
> occupancy on a per-slab basis, not global.

It's already performing slab cache scanning proportional to
the size of the slab. Multiplied by the rate of page scanning.

But I'm thinking that this linear pressure isn't right
at either end of the scale, so it needs to become nonlinear - even
less pressure when there's little slab, and more pressure when
there's a lot. So multiply the slab scanning ratio by
amount_of_slab/amount_of_normal_zone. Maybe.
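
In shrink_slab() terms, something like this (a sketch of the idea only;
slab_pages and normal_zone_pages are assumed bookkeeping, not existing
variables):

    /* today's linear pressure */
    delta = scanned * shrinker->seeks * entries / (pages + 1);

    /* scale it by how much of the normal zone the slab already
     * occupies: a small slab is left nearly alone, a huge one
     * gets hammered */
    delta = delta * slab_pages / normal_zone_pages;
    shrinker->nr += delta;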

2002-10-22 16:10:08

by Dipankar Sarma

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

On Mon, Oct 21, 2002 at 09:34:56PM +0000, Andrew Morton wrote:
> "Martin J. Bligh" wrote:
> >
> > >> My big NUMA box went OOM over the weekend and started killing things
> > >> for no good reason (2.5.43-mm2). Probably running some background
> > >> updatedb for locate thing, not doing any real work.
> > >>
> > >> meminfo:
> > >>
> > >
> > > Looks like a plain dentry leak to me. Very weird.
> > >
> > > Did the machine recover and run normally?
> >
> > Nope, kept OOMing and killing everything .
>
> Something broke.

Yes, in RCU - ever since the interrupt counters were changed to use
thread_info->preempt_count :)

RCU has a check for idle CPUs (an idle CPU signifies a quiescent
state, and hence holds no references to RCU-protected data), and that
check was broken -

--- kernel/rcupdate.c Sat Oct 19 09:31:07 2002
+++ /tmp/rcupdate.c.mod Tue Oct 22 21:20:07 2002
@@ -192,7 +192,8 @@
void rcu_check_callbacks(int cpu, int user)
{
if (user ||
- (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
+ (idle_cpu(cpu) && !in_softirq() &&
+ hardirq_count() <= (1 << HARDIRQ_SHIFT)))
RCU_qsctr(cpu)++;
tasklet_schedule(&RCU_tasklet(cpu));
}

Martin's machine, with a large number of idle CPUs, was not
completing any RCU grace period because RCU was forever waiting
for the idle CPUs to context switch. Had the idle check worked, this
would not have happened. With no RCU grace periods completing, the
dentries were getting "freed" (the dentry stats show that) but never
getting returned to slab. This would not show up on systems that are
generally busy, as context switches would then happen on all CPUs and
the per-CPU quiescent state counter would get incremented at each
context switch.
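
The arithmetic behind that check, for reference (definitions as in the
hardirq headers of this era, shown for illustration):

    /* each irq_enter() adds one HARDIRQ_OFFSET to the preempt count */
    #define HARDIRQ_OFFSET  (1UL << HARDIRQ_SHIFT)
    #define hardirq_count() (preempt_count() & HARDIRQ_MASK)

    /* rcu_check_callbacks() runs from the timer tick, i.e. inside one
     * hardirq, so on an idle CPU hardirq_count() is already
     * HARDIRQ_OFFSET and the old "<= 1" test could never be true.
     * Comparing against (1 << HARDIRQ_SHIFT) restores the intended
     * meaning: "no interrupt nested on top of the tick itself". */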

I have included here a patch that adds statistics to RCU which are
very helpful in detecting problems. That patch also includes the
idle CPU check fix. The stats are in /proc/rcu. I still need to
check if seq_file access of a /proc file is serialized or not.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


diff -ruN linux-2.5.44-base/fs/proc/proc_misc.c linux-2.5.44-rcu/fs/proc/proc_misc.c
--- linux-2.5.44-base/fs/proc/proc_misc.c Sat Oct 19 09:31:14 2002
+++ linux-2.5.44-rcu/fs/proc/proc_misc.c Tue Oct 22 18:47:00 2002
@@ -253,6 +253,18 @@
.release = seq_release,
};

+extern struct seq_operations rcu_op;
+static int rcu_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &rcu_op);
+}
+static struct file_operations proc_rcu_operations = {
+ .open = rcu_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
extern struct seq_operations vmstat_op;
static int vmstat_open(struct inode *inode, struct file *file)
{
@@ -631,6 +643,7 @@
if (entry)
entry->proc_fops = &proc_kmsg_operations;
create_seq_entry("cpuinfo", 0, &proc_cpuinfo_operations);
+ create_seq_entry("rcu", 0, &proc_rcu_operations);
create_seq_entry("partitions", 0, &proc_partitions_operations);
#if !defined(CONFIG_ARCH_S390)
create_seq_entry("interrupts", 0, &proc_interrupts_operations);
diff -ruN linux-2.5.44-base/include/linux/rcupdate.h linux-2.5.44-rcu/include/linux/rcupdate.h
--- linux-2.5.44-base/include/linux/rcupdate.h Sat Oct 19 09:31:18 2002
+++ linux-2.5.44-rcu/include/linux/rcupdate.h Tue Oct 22 18:47:00 2002
@@ -68,6 +68,7 @@
long maxbatch; /* Max requested batch number. */
unsigned long rcu_cpu_mask; /* CPUs that need to switch in order */
/* for current batch to proceed. */
+ long nr_batches;
};

/* Is batch a before batch b ? */
@@ -94,6 +95,10 @@
long batch; /* Batch # for current RCU batch */
struct list_head nxtlist;
struct list_head curlist;
+ long nr_newreqs;
+ long nr_curreqs;
+ long nr_pendreqs;
+ long nr_rcupdates;
} ____cacheline_aligned_in_smp;

extern struct rcu_data rcu_data[NR_CPUS];
@@ -104,6 +109,10 @@
#define RCU_batch(cpu) (rcu_data[(cpu)].batch)
#define RCU_nxtlist(cpu) (rcu_data[(cpu)].nxtlist)
#define RCU_curlist(cpu) (rcu_data[(cpu)].curlist)
+#define RCU_nr_newreqs(cpu) (rcu_data[(cpu)].nr_newreqs)
+#define RCU_nr_curreqs(cpu) (rcu_data[(cpu)].nr_curreqs)
+#define RCU_nr_pendreqs(cpu) (rcu_data[(cpu)].nr_pendreqs)
+#define RCU_nr_rcupdates(cpu) (rcu_data[(cpu)].nr_rcupdates)

#define RCU_QSCTR_INVALID 0

diff -ruN linux-2.5.44-base/kernel/rcupdate.c linux-2.5.44-rcu/kernel/rcupdate.c
--- linux-2.5.44-base/kernel/rcupdate.c Sat Oct 19 09:31:07 2002
+++ linux-2.5.44-rcu/kernel/rcupdate.c Tue Oct 22 18:54:04 2002
@@ -41,6 +41,7 @@
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/percpu.h>
+#include <linux/seq_file.h>
#include <linux/rcupdate.h>

/* Definition for rcupdate control block. */
@@ -74,6 +75,7 @@
local_irq_save(flags);
cpu = smp_processor_id();
list_add_tail(&head->list, &RCU_nxtlist(cpu));
+ RCU_nr_newreqs(cpu)++;
local_irq_restore(flags);
}

@@ -81,7 +83,7 @@
* Invoke the completed RCU callbacks. They are expected to be in
* a per-cpu list.
*/
-static void rcu_do_batch(struct list_head *list)
+static void rcu_do_batch(int cpu, struct list_head *list)
{
struct list_head *entry;
struct rcu_head *head;
@@ -91,7 +93,9 @@
list_del(entry);
head = list_entry(entry, struct rcu_head, list);
head->func(head->arg);
+ RCU_nr_rcupdates(cpu)++;
}
+ RCU_nr_pendreqs(cpu) = 0;
}

/*
@@ -99,7 +103,7 @@
* active batch and the batch to be registered has not already occurred.
* Caller must hold the rcu_ctrlblk lock.
*/
-static void rcu_start_batch(long newbatch)
+static void rcu_start_batch(int cpu, long newbatch)
{
if (rcu_batch_before(rcu_ctrlblk.maxbatch, newbatch)) {
rcu_ctrlblk.maxbatch = newbatch;
@@ -109,6 +113,8 @@
return;
}
rcu_ctrlblk.rcu_cpu_mask = cpu_online_map;
+ RCU_nr_pendreqs(cpu) = RCU_nr_curreqs(cpu);
+ RCU_nr_curreqs(cpu) = 0;
}

/*
@@ -149,7 +155,8 @@
return;
}
rcu_ctrlblk.curbatch++;
- rcu_start_batch(rcu_ctrlblk.maxbatch);
+ rcu_ctrlblk.nr_batches++;
+ rcu_start_batch(cpu, rcu_ctrlblk.maxbatch);
spin_unlock(&rcu_ctrlblk.mutex);
}

@@ -172,6 +179,8 @@
if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
INIT_LIST_HEAD(&RCU_nxtlist(cpu));
+ RCU_nr_curreqs(cpu) = RCU_nr_newreqs(cpu);
+ RCU_nr_newreqs(cpu) = 0;
local_irq_enable();

/*
@@ -179,20 +188,21 @@
*/
spin_lock(&rcu_ctrlblk.mutex);
RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
- rcu_start_batch(RCU_batch(cpu));
+ rcu_start_batch(cpu, RCU_batch(cpu));
spin_unlock(&rcu_ctrlblk.mutex);
} else {
local_irq_enable();
}
rcu_check_quiescent_state();
if (!list_empty(&list))
- rcu_do_batch(&list);
+ rcu_do_batch(cpu, &list);
}

void rcu_check_callbacks(int cpu, int user)
{
if (user ||
- (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
+ (idle_cpu(cpu) && !in_softirq() &&
+ hardirq_count() <= (1 << HARDIRQ_SHIFT)))
RCU_qsctr(cpu)++;
tasklet_schedule(&RCU_tasklet(cpu));
}
@@ -240,3 +250,60 @@

EXPORT_SYMBOL(call_rcu);
EXPORT_SYMBOL(synchronize_kernel);
+
+#ifdef CONFIG_PROC_FS
+
+static void *rcu_start(struct seq_file *m, loff_t *pos)
+{
+ static int cpu;
+ cpu = *pos;
+ return *pos < NR_CPUS ? &cpu : NULL;
+}
+
+static void *rcu_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ ++*pos;
+ return rcu_start(m, pos);
+
+}
+
+static void rcu_stop(struct seq_file *m, void *v)
+{
+}
+
+static int show_rcu(struct seq_file *m, void *v)
+{
+ int cpu = *(int *)v;
+
+ if (!cpu_online(cpu))
+ return 0;
+ if (cpu == 0) {
+ seq_printf(m, "RCU Current Batch : %ld\n",
+ rcu_ctrlblk.curbatch);
+ seq_printf(m, "RCU Max Batch : %ld\n",
+ rcu_ctrlblk.maxbatch);
+ seq_printf(m, "RCU Global Cpumask : 0x%lx\n",
+ rcu_ctrlblk.rcu_cpu_mask);
+ seq_printf(m, "RCU Total Batches : %ld\n\n\n",
+ rcu_ctrlblk.nr_batches);
+ }
+ seq_printf(m, "CPU : %d\n", cpu);
+ seq_printf(m, "RCU qsctr : %ld\n", RCU_qsctr(cpu));
+ seq_printf(m, "RCU last qsctr : %ld\n", RCU_last_qsctr(cpu));
+ seq_printf(m, "RCU batch : %ld\n", RCU_batch(cpu));
+ seq_printf(m, "RCU new requests : %ld\n", RCU_nr_newreqs(cpu));
+ seq_printf(m, "RCU current requests : %ld\n", RCU_nr_curreqs(cpu));
+ seq_printf(m, "RCU pending requests : %ld\n", RCU_nr_pendreqs(cpu));
+ seq_printf(m, "RCU updated : %ld\n\n\n", RCU_nr_rcupdates(cpu));
+ return 0;
+}
+
+struct seq_operations rcu_op = {
+ .start = rcu_start,
+ .next = rcu_next,
+ .stop = rcu_stop,
+ .show = show_rcu,
+};
+
+#endif
+

2002-10-22 16:10:47

by Martin J. Bligh

[permalink] [raw]
Subject: Re: ZONE_NORMAL exhaustion (dcache slab)


>> > Maybe you didn't cat /dev/sda2 for long enough?
>>
>> Well, it's a multi-gigabyte partition. IIRC, I just ran it until
>> it died with "input/output error" ... which I assumed at the time
>> was the end of the partition, but it should be able to find that
>> without error, so maybe it just ran out of ZONE_NORMAL ;-)
>
> Oh. Well it should have just hit eof. Maybe you have a dud
> sector and it terminated early.

OK, I catted an 18GB disk completely. The beast still didn't shrink.

larry:~# cat /proc/meminfo
MemTotal: 16078192 kB
MemFree: 15043280 kB
MemShared: 0 kB
Buffers: 79152 kB
Cached: 287248 kB
SwapCached: 0 kB
Active: 263056 kB
Inactive: 105136 kB
HighTotal: 15335424 kB
HighFree: 15039616 kB
LowTotal: 742768 kB
LowFree: 3664 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 3736 kB
Slab: 641352 kB
Reserved: 570000 kB
Committed_AS: 2400 kB
PageTables: 180 kB
ReverseMaps: 2236

ext2_inode_cache 476254 541125 416 60125 60125 1 : 120
dentry_cache 2336272 2336280 160 97345 97345 1 : 248 124

Note that dentry cache seems to have grown overnight ....

I guess I'll add some debug code to the slab cache shrinkers
and try to see what it's doing.

M.

2002-10-22 16:27:53

by Rik van Riel

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

On Mon, 21 Oct 2002, Andrew Morton wrote:

> He had 3 million dentries and only 100k pages on the LRU,
> so we should have been reclaiming 60 dentries per scanned
> page.
>
> Conceivably the multiply in shrink_slab() overflowed, where
> we calculate local variable `delta'. But doubtful.

What if there were no pages left to scan for shrink_caches ?

Could it be possible that for some strange reason the machine
ended up scanning 0 slab objects ?

60 * 0 is still 0, after all ;)

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-10-22 16:59:34

by Andrew Morton

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

Rik van Riel wrote:
>
> On Mon, 21 Oct 2002, Andrew Morton wrote:
>
> > He had 3 million dentries and only 100k pages on the LRU,
> > so we should have been reclaiming 60 dentries per scanned
> > page.
> >
> > Conceivably the multiply in shrink_slab() overflowed, where
> > we calculate local variable `delta'. But doubtful.
>
> What if there were no pages left to scan for shrink_caches ?

Historically, this caused an ints-off lockup, but I think we've
fixed them all now ;)

> Could it be possible that for some strange reason the machine
> ended up scanning 0 slab objects ?
>
> 60 * 0 is still 0, after all ;)
>

More by good luck than by good judgement, if there are zero inactive
pages in a zone we come out of shrink_caches with max_scan equal
to SWAP_CLUSTER_MAX*2. So if all of a zone's pages are out in
pagetables/skbuffs/whatever we'll put a lot of pressure on slab.

Which is good. But it'll do that even if the offending zone cannot
contain any slab, which is not so good, but not very serious in
practice. Search for "FIXME"...

2002-10-24 11:34:31

by Ed Tomlinson

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

Hi,

I just experienced this problem on UP with 513M of memory. About 400M
was locked in dentries. The system was very unresponsive - I suspect it
was spending gobs of time scanning unfreeable dentries. This was with
-mm3, up about 24 hours.

The inode caches looked sane. Just the dentries were out of whack.

Ed

2002-10-24 14:24:40

by Martin J. Bligh

Subject: Re: ZONE_NORMAL exhaustion (dcache slab)

> I just experienced this problem on UP with 513M of memory. About 400M
> was locked in dentries. The system was very unresponsive - I suspect it
> was spending gobs of time scanning unfreeable dentries. This was with
> -mm3, up about 24 hours.
>
> The inode caches looked sane. Just the dentries were out of whack.

I think you want this:

+read-barrier-depends.patch
RCU fix

Which is only in mm4 I believe. Wanna retest? mm4 is the first 44-mmX
that works for me ... seems to have quite a few bugfixes ;-)

M.