I've been noticing this for a while (probably since at least 2.6.11 or
so, but I haven't been paying close attention), and I haven't had the
time to get some proof that this was the cause, and to write it up
until now.
I have a T40 laptop (Pentium M processor) with 2 gigs of memory, and
from time to time, after the system has been up for a while, the
dentry cache grows huge, as does the ext3_inode_cache:
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry_cache 434515 514112 136 29 1 : tunables 120 60 0 : slabdata 17728 17728 0
ext3_inode_cache 587635 589992 464 8 1 : tunables 54 27 0 : slabdata 73748 73749 0
Leading to an impending shortage in low memory:
LowFree: 9268 kB
... and if I don't take corrective measures, very shortly thereafter
the system will become unresponsive and will end up thrashing itself
to death, with symptoms that are identical to a case of 2.4 lowmem
exhaustion --- except this is on a 2.6.13 kernel, where all of these
problems were supposed to be solved.
It turns out I can head off the system lockup by requesting the
formation of hugepages, which will immediately cause a dramatic
reduction of memory usage in both high- and low-memory as various
caches are flushed:
echo 100 > /proc/sys/vm/nr_hugepages
echo 0 > /proc/sys/vm/nr_hugepages
The question is why the kernel isn't able to figure out how to
release dentry cache entries automatically when it starts thrashing due
to a lack of low memory? Clearly it can, since requesting hugepages
does shrink the dentry cache:
dentry_cache 20097 20097 136 29 1 : tunables 120 60 0 : slabdata 693 693 0
ext3_inode_cache 17782 17784 464 8 1 : tunables 54 27 0 : slabdata 2223 2223 0
LowFree: 835916 kB
Has anyone else seen this, or have some ideas about how to fix it?
Thanks, regards,
- Ted
Hi Ted,
On Sun, Sep 11, 2005 at 06:57:09AM -0400, Theodore Ts'o wrote:
>
> I have a T40 laptop (Pentium M processor) with 2 gigs of memory, and
> from time to time, after the system has been up for a while, the
> dentry cache grows huge, as does the ext3_inode_cache:
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> dentry_cache 434515 514112 136 29 1 : tunables 120 60 0 : slabdata 17728 17728 0
> ext3_inode_cache 587635 589992 464 8 1 : tunables 54 27 0 : slabdata 73748 73749 0
>
> Leading to an impending shortage in low memory:
>
> LowFree: 9268 kB
Do you have the /proc/sys/fs/dentry-state output when such lowmem
shortage happens ?
>
> It turns out I can head off the system lockup by requesting the
> formation of hugepages, which will immediately cause a dramatic
> reduction of memory usage in both high- and low- memory as various
> caches and flushed:
>
> echo 100 > /proc/sys/vm/nr_hugepages
> echo 0 > /proc/sys/vm/nr_hugepages
>
> The question is why isn't the kernel able to figure out how to do
> release dentry cache entries automatically when it starts thrashing due
> to a lack of low memory? Clearly it can, since requesting hugepages
> does shrink the dentry cache:
This is a problem that Bharata has been investigating at the moment.
But he hasn't seen anything that can't be cured by a small memory
pressure - IOW, dentries do get freed under memory pressure. So
your case might be very useful. Bharata is maintaining an instrumentation
patch to collect more information and an alternative dentry aging patch
(using rbtree). Perhaps you could try with those.
Thanks
Dipankar
On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> Do you have the /proc/sys/fs/dentry-state output when such lowmem
> shortage happens ?
Not yet, but the situation occurs on my laptop about 2 or 3 times
(when I'm not travelling and so it doesn't get rebooted). So
reproducing it isn't utterly trivial, but it does happen often
enough that it should be possible to get the necessary data.
> This is a problem that Bharata has been investigating at the moment.
> But he hasn't seen anything that can't be cured by a small memory
> pressure - IOW, dentries do get freed under memory pressure. So
> your case might be very useful. Bharata is maintaing an instrumentation
> patch to collect more information and an alternative dentry aging patch
> (using rbtree). Perhaps you could try with those.
Send it to me, and I'd be happy to try either the instrumentation
patch or the dentry aging patch.
Thanks, regards,
- Ted
--Theodore Ts'o <[email protected]> wrote (on Sunday, September 11, 2005 23:16:36 -0400):
> On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
>> Do you have the /proc/sys/fs/dentry-state output when such lowmem
>> shortage happens ?
>
> Not yet, but the situation occurs on my laptop about 2 or 3 times
> (when I'm not travelling and so it doesn't get rebooted). So
> reproducing it isn't utterly trivial, but it's does happen often
> enough that it should be possible to get the necessary data.
>
>> This is a problem that Bharata has been investigating at the moment.
>> But he hasn't seen anything that can't be cured by a small memory
>> pressure - IOW, dentries do get freed under memory pressure. So
>> your case might be very useful. Bharata is maintaing an instrumentation
>> patch to collect more information and an alternative dentry aging patch
>> (using rbtree). Perhaps you could try with those.
>
> Send it to me, and I'd be happy to try either the instrumentation
> patch or the dentry aging patch.
Another thing that might be helpful is to shove a printk in prune_dcache
so we can see when it's getting called, and how successful it is, if the
more sophisticated stuff doesn't help ;-)
M.
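For concreteness, a minimal sketch of the kind of instrumentation Martin is
suggesting, written against the approximate shape of the 2.6-era
prune_dcache(); the structure is from memory and the printk is purely
illustrative, not a tested patch:

/*
 * Sketch only: rough 2.6-era prune_dcache() with requested/freed
 * accounting and a rate-limited printk bolted on.  Locking and list
 * handling are approximate, not a drop-in change for 2.6.13.
 */
static void prune_dcache(int count)
{
	int nr_requested = count;
	int nr_freed = 0;

	spin_lock(&dcache_lock);
	for (; count; count--) {
		struct dentry *dentry;

		if (list_empty(&dentry_unused))
			break;
		dentry = list_entry(dentry_unused.prev, struct dentry, d_lru);
		list_del_init(&dentry->d_lru);
		dentry_stat.nr_unused--;

		spin_lock(&dentry->d_lock);
		if (atomic_read(&dentry->d_count)) {
			/* lazily left on the LRU with a live reference: skip */
			spin_unlock(&dentry->d_lock);
			continue;
		}
		if (dentry->d_flags & DCACHE_REFERENCED) {
			/* recently used: give it another trip around the LRU */
			dentry->d_flags &= ~DCACHE_REFERENCED;
			list_add(&dentry->d_lru, &dentry_unused);
			dentry_stat.nr_unused++;
			spin_unlock(&dentry->d_lock);
			continue;
		}
		prune_one_dentry(dentry);	/* called with both locks held */
		nr_freed++;
	}
	spin_unlock(&dcache_lock);

	if (printk_ratelimit())
		printk(KERN_DEBUG "prune_dcache: requested %d, freed %d\n",
		       nr_requested, nr_freed);
}

The nr_requested/nr_freed pair is essentially what Bharata's stats patch
(quoted further down the thread) records in lru_dentry_stat.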
On Sun, Sep 11, 2005 at 11:16:30PM -0700, Martin J. Bligh wrote:
>
>
> --Theodore Ts'o <[email protected]> wrote (on Sunday, September 11, 2005 23:16:36 -0400):
>
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> >> Do you have the /proc/sys/fs/dentry-state output when such lowmem
> >> shortage happens ?
> >
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted). So
> > reproducing it isn't utterly trivial, but it's does happen often
> > enough that it should be possible to get the necessary data.
> >
> >> This is a problem that Bharata has been investigating at the moment.
> >> But he hasn't seen anything that can't be cured by a small memory
> >> pressure - IOW, dentries do get freed under memory pressure. So
> >> your case might be very useful. Bharata is maintaing an instrumentation
> >> patch to collect more information and an alternative dentry aging patch
> >> (using rbtree). Perhaps you could try with those.
> >
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
>
> Other thing that might be helpful is to shove a printk in prune_dcache
> so we can see when it's getting called, and how successful it is, if the
> more sophisticated stuff doesn't help ;-)
>
I have incorporated this in the dcache stats patch I have. I will
post it tomorrow after adding some more instrumentation data
(number of inuse and free dentries in lru list) and after a bit of
cleanup and testing.
Regards,
Bharata.
On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > shortage happens ?
>
> Not yet, but the situation occurs on my laptop about 2 or 3 times
> (when I'm not travelling and so it doesn't get rebooted). So
> reproducing it isn't utterly trivial, but it's does happen often
> enough that it should be possible to get the necessary data.
>
> > This is a problem that Bharata has been investigating at the moment.
> > But he hasn't seen anything that can't be cured by a small memory
> > pressure - IOW, dentries do get freed under memory pressure. So
> > your case might be very useful. Bharata is maintaing an instrumentation
> > patch to collect more information and an alternative dentry aging patch
> > (using rbtree). Perhaps you could try with those.
>
> Send it to me, and I'd be happy to try either the instrumentation
> patch or the dentry aging patch.
>
Ted,
I am sending two patches here.
First is dentry_stats patch which collects some dcache statistics
and puts it into /proc/meminfo. This patch provides information
about how dentries are distributed in dcache slab pages, how many
free and in use dentries are present in dentry_unused lru list and
how prune_dcache() performs with respect to freeing the requested
number of dentries.
Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
to improve this dcache fragmentation problem.
These patches apply on 2.6.13-rc7 and 2.6.13 cleanly.
Could you please apply the dcache_stats patch and check if the problem
can be reproduced? When it happens, could you please capture
/proc/meminfo, /proc/sys/fs/dentry-state and /proc/slabinfo?
It would be nice if you could also try the rbtree patch to check if
it improves the situation. The rbtree patch applies on top of the stats
patch.
Regards,
Bharata.
On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
>
> Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> to improve this dcache fragmentation problem.
FYI, in the past I've tried this patch to reduce dcache fragmentation on
an Altix (16k pages, 62 dentries to a slab page) under heavy
fileserver workloads and it had no measurable effect. It appeared
that there was almost always at least one active dentry on each page
in the slab. The story may very well be different on 4k page
machines, however.
Typically, fragmentation was bad enough that reclaim removed ~90% of
the working set of dentries to free about 1% of the memory in the
dentry slab. We had to get down to freeing > 95% of the dentry cache
before fragmentation started to reduce and the system stopped trying to
reclaim the dcache which we then spent the next 10 minutes
repopulating......
We also tried separating out directory dentries into a separate slab
so that (potentially) longer lived dentries were clustered together
rather than sparsely distributed around the slab cache. Once again,
it had no measurable effect on the level of fragmentation (with or
without the rbtree patch).
FWIW, the inode cache was showing very similar levels of fragmentation
under reclaim as well.
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
On Tuesday 13 September 2005 23:59, David Chinner wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > to improve this dcache fragmentation problem.
>
> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> an Altix (16k pages, 62 dentries to a slab page) under heavy
> fileserver workloads and it had no measurable effect. It appeared
> that there was almost always at least one active dentry on each page
> in the slab. The story may very well be different on 4k page
> machines, however.
I always thought dentry freeing would work much better if it
was turned upside down.
Instead of starting from the high level dcache lists it could
be driven by slab: on memory pressure slab tries to return pages with unused
cache objects. In that case it should check whether only a small
number of pinned objects are left on the page set, and if so use a new
callback to the higher-level user (=dcache) to ask it to free the
objects.
The slab datastructures are not completely suited for this right now,
but it could be done by using one more of the list_heads in struct page
for slab backing pages.
It would probably not be very LRU, but a simple hack would be slowly
increasing dcache generations: each dentry use updates the generation,
and the first slab memory-freeing pass only frees objects with older
generations.
Using slowly increasing generations has the advantage over timestamps
that you can avoid dirtying cache lines in the common case where the
generation doesn't change on access (= no additional cache line
bouncing), and it would easily allow tuning the aging rate under stress
by changing the length of a generation.
-Andi
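A hypothetical sketch of the two pieces Andi describes: the generation
stamp and the slab-to-dcache callback. None of these names (d_gen,
dcache_gen, dcache_reclaim_object) exist in the kernel; they are invented
purely to illustrate the idea:

/* All names below are invented for illustration -- no such slab
 * callback or dentry field exists in 2.6. */

static unsigned long dcache_gen;	/* bumped slowly, e.g. once a second */

/* Called on every dentry use.  Only writes (and so only dirties the
 * cache line) when the generation has actually moved on. */
static inline void dentry_touch_gen(struct dentry *dentry)
{
	if (dentry->d_gen != dcache_gen)	/* d_gen: hypothetical field */
		dentry->d_gen = dcache_gen;
}

/* Callback the slab core would invoke, with dcache_lock held, for each
 * object on a mostly-unpinned page it wants to hand back to the page
 * allocator.  Returns nonzero if the object was freed. */
static int dcache_reclaim_object(void *object)
{
	struct dentry *dentry = object;

	if (atomic_read(&dentry->d_count))
		return 0;			/* pinned: the page can't go anyway */
	if (dentry->d_gen == dcache_gen)
		return 0;			/* current generation: too recently used */
	/* old generation and unreferenced: safe to prune */
	if (!list_empty(&dentry->d_lru)) {
		list_del_init(&dentry->d_lru);
		dentry_stat.nr_unused--;
	}
	spin_lock(&dentry->d_lock);
	prune_one_dentry(dentry);
	return 1;
}

As the rest of the thread points out, the hard part is not this callback
but the locking around it and the pages that stay pinned by a single
in-use dentry.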
Andi Kleen wrote:
>The slab datastructures are not completely suited for this right now,
>but it could be done by using one more of the list_heads in struct page
>for slab backing pages.
>
>
>
I agree, I even started prototyping something a year ago, but ran out of
time.
One tricky point is directory dentries: as far as I can see, they are
pinned and unfreeable if a (freeable) directory entry is in the cache.
--
Manfred
Andi Kleen <[email protected]> wrote:
>
> On Tuesday 13 September 2005 23:59, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to improve this dcache fragmentation problem.
> >
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab. The story may very well be different on 4k page
> > machines, however.
>
> I always thought dentry freeing would work much better if it
> was turned upside down.
>
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused
> cache objects. In that case it should check if there are only
> a small number of pinned objects on the page set left, and if
> yes use a new callback to the higher level user (=dcache) and ask them
> to free the object.
Considered doing that with buffer_heads a few years ago. It's impossible
unless you have a global lock, which bh's don't have. dentries _do_ have a
global lock, and we'd be tied to having it for ever more.
The shrinking code would have to be able to deal with a dentry which is going
through destruction by other call paths, so dcache_lock coverage would have
to be extended considerably - it would have to cover the kmem_cache_free(),
for example. Or we put some i_am_alive flag into the dentry.
> The slab datastructures are not completely suited for this right now,
> but it could be done by using one more of the list_heads in struct page
> for slab backing pages.
Yes, some help would be needed in the slab code.
There's only one list_head in struct page and slab is already using it.
Manfred Spraul <[email protected]> wrote:
>
> One tricky point are directory dentries: As far as I see, they are
> pinned and unfreeable if a (freeable) directory entry is in the cache.
>
Well. That's the whole problem.
I don't think it's been demonstrated that Ted's problem was caused by
internal fragmentation, btw. Ted, could you run slabtop, see what the
dcache occupancy is? Monitor it as you start to manually apply pressure?
If the occupancy falls to 10% and not many slab pages are freed up yet then
yup, it's internal fragmentation.
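(For scale, from the slabinfo dump at the top of the thread: dentry_cache
shows 434515 active objects out of 514112 allocated, i.e. roughly 85%
occupancy before any pressure is applied; the telling number is what that
ratio falls to while the cache is actually being shrunk.)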
I've found that internal fragmentation due to pinned directory dentries can
be very high if you're running silly benchmarks which create some
regular-shaped directory tree which can easily create pathological
patterns. For real-world things with irregular creation and access
patterns and irregular directory sizes the fragmentation isn't as easy to
demonstrate.
Another approach would be to do an aging round on a directory's children
when an unfreeable dentry is encountered on the LRU. Something like that.
If internal fragmentation is indeed the problem.
On Wed, Sep 14, 2005 at 02:43:13AM -0700, Andrew Morton wrote:
> Manfred Spraul <[email protected]> wrote:
> >
> > One tricky point are directory dentries: As far as I see, they are
> > pinned and unfreeable if a (freeable) directory entry is in the cache.
> >
> I don't think it's been demonstrated that Ted's problem was caused by
> internal fragementation, btw. Ted, could you run slabtop, see what the
> dcache occupancy is? Monitor it as you start to manually apply pressure?
> If the occupancy falls to 10% and not many slab pages are freed up yet then
> yup, it's internal fragmentation.
>
> I've found that internal fragmentation due to pinned directory dentries can
> be very high if you're running silly benchmarks which create some
> regular-shaped directory tree which can easily create pathological
> patterns. For real-world things with irregular creation and access
> patterns and irregular directory sizes the fragmentation isn't as easy to
> demonstrate.
>
> Another approach would be to do an aging round on a directory's children
> when an unfreeable dentry is encountered on the LRU. Something like that.
> If internal fragmentation is indeed the problem.
One other point to look at is whether fragmentation is due to pinned
dentries or not. We can get that information only from dcache itself.
That is what we need to ascertain first using the instrumentation
patch. Solving the problem of large # of pinned dentries and large # of LRU
free dentries will likely require different approaches. Even the
LRU dentries are sometimes pinned due to the lazy-lru stuff that
we did for lock-free dcache. Let us get some accurate dentry
stats first from the instrumentation patch.
Thanks
Dipankar
>> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
>> > to improve this dcache fragmentation problem.
>>
>> FYI, in the past I've tried this patch to reduce dcache fragmentation on
>> an Altix (16k pages, 62 dentries to a slab page) under heavy
>> fileserver workloads and it had no measurable effect. It appeared
>> that there was almost always at least one active dentry on each page
>> in the slab. The story may very well be different on 4k page
>> machines, however.
>
> I always thought dentry freeing would work much better if it
> was turned upside down.
>
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused
> cache objects. In that case it should check if there are only
> a small number of pinned objects on the page set left, and if
> yes use a new callback to the higher level user (=dcache) and ask them
> to free the object.
>
> The slab datastructures are not completely suited for this right now,
> but it could be done by using one more of the list_heads in struct page
> for slab backing pages.
>
> It would probably not be very LRU but a simple hack of having slowly
> increasing dcache generations. Each dentry use updates the generation.
> First slab memory freeing pass only frees objects with older generations.
If they're freeable, we should easily be able to move them, and therefore
compact a fragmented slab. That way we can preserve the LRU'ness of it.
Stage 1: free the oldest entries. Stage 2: compact the slab into whole
pages. Stage 3: free whole pages back to the page allocator.
> Using slowly increasing generations has the advantage of timestamps
> that you can avoid dirtying cache lines in the common case when
> the generation doesn't change on access (= no additional cache line bouncing)
> and it would easily allow to tune the aging rate under stress by changing the
> length of the generation.
LRU algorithm may need general tweaking like this anyway ... strict LRU
is expensive to keep.
M.
On Wed, Sep 14, 2005 at 06:57:56AM -0700, Martin J. Bligh wrote:
> >> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> >> > to improve this dcache fragmentation problem.
> >>
> >> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> >> an Altix (16k pages, 62 dentries to a slab page) under heavy
> >> fileserver workloads and it had no measurable effect. It appeared
> >> that there was almost always at least one active dentry on each page
> >> in the slab. The story may very well be different on 4k page
> >> machines, however.
> >
> > I always thought dentry freeing would work much better if it
> > was turned upside down.
> >
> > Instead of starting from the high level dcache lists it could
> > be driven by slab: on memory pressure slab tries to return pages with unused
> > cache objects. In that case it should check if there are only
> > a small number of pinned objects on the page set left, and if
> > yes use a new callback to the higher level user (=dcache) and ask them
> > to free the object.
> >
> > The slab datastructures are not completely suited for this right now,
> > but it could be done by using one more of the list_heads in struct page
> > for slab backing pages.
> >
> > It would probably not be very LRU but a simple hack of having slowly
> > increasing dcache generations. Each dentry use updates the generation.
> > First slab memory freeing pass only frees objects with older generations.
>
> If they're freeable, we should easily be able to move them, and therefore
> compact a fragmented slab. That way we can preserve the LRU'ness of it.
> Stage 1: free the oldest entries. Stage 2: compact the slab into whole
> pages. Stage 3: free whole pages back to teh page allocator.
>
> > Using slowly increasing generations has the advantage of timestamps
> > that you can avoid dirtying cache lines in the common case when
> > the generation doesn't change on access (= no additional cache line bouncing)
> > and it would easily allow to tune the aging rate under stress by changing the
> > length of the generation.
>
> LRU algorithm may need general tweaking like this anyway ... strict LRU
> is expensive to keep.
Based on what I remember, I'd contend it isn't really LRU today, so I
wouldn't try and stick to something that we aren't doing. :)
On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> >
> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > to improve this dcache fragmentation problem.
>
> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> an Altix (16k pages, 62 dentries to a slab page) under heavy
> fileserver workloads and it had no measurable effect. It appeared
> that there was almost always at least one active dentry on each page
> in the slab. The story may very well be different on 4k page
> machines, however.
>
> Typically, fragmentation was bad enough that reclaim removed ~90% of
> the working set of dentries to free about 1% of the memory in the
> dentry slab. We had to get down to freeing > 95% of the dentry cache
> before fragmentation started to reduce and the system stopped trying to
> reclaim the dcache which we then spent the next 10 minutes
> repopulating......
>
> We also tried separating out directory dentries into a separate slab
> so that (potentially) longer lived dentries were clustered together
> rather than sparsely distributed around the slab cache. Once again,
> it had no measurable effect on the level of fragmentation (with or
> without the rbtree patch).
I'm not surprised... With 62 dentries per page, the likelihood of
success is very small, and in fact performance could degrade since we
are holding the dcache lock more often and doing less useful work.
It has been over a year and my memory is hazy, but I think I did see
about a 10% improvement on my workload (some sort of SFS simulation
with millions of files being randomly accessed) on an x86 machine, but
CPU utilization also went way up, which I think was due to the dcache lock.
Whatever happened to the vfs_cache_pressure band-aid/sledgehammer?
Is it not considered an option?
On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > shortage happens ?
> >
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted). So
> > reproducing it isn't utterly trivial, but it's does happen often
> > enough that it should be possible to get the necessary data.
> >
> > > This is a problem that Bharata has been investigating at the moment.
> > > But he hasn't seen anything that can't be cured by a small memory
> > > pressure - IOW, dentries do get freed under memory pressure. So
> > > your case might be very useful. Bharata is maintaing an instrumentation
> > > patch to collect more information and an alternative dentry aging patch
> > > (using rbtree). Perhaps you could try with those.
> >
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
> >
>
> Ted,
>
> I am sending two patches here.
>
> First is dentry_stats patch which collects some dcache statistics
> and puts it into /proc/meminfo. This patch provides information
> about how dentries are distributed in dcache slab pages, how many
> free and in use dentries are present in dentry_unused lru list and
> how prune_dcache() performs with respect to freeing the requested
> number of dentries.
Hi Bharata,
+void get_dstat_info(void)
+{
+ struct dentry *dentry;
+
+ lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
+ lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(dentry, &dentry_unused, d_lru) {
+ if (atomic_read(&dentry->d_count))
+ lru_dentry_stat.nr_inuse++;
Dentries on dentry_unused list with d_count positive? Is that possible
at all? As far as my limited understanding goes, only dentries with zero
count can be part of the dentry_unused list.
+ if (dentry->d_flags & DCACHE_REFERENCED)
+ lru_dentry_stat.nr_ref++;
+ }
@@ -393,6 +430,9 @@ static inline void prune_one_dentry(stru
static void prune_dcache(int count)
{
+ int nr_requested = count;
+ int nr_freed = 0;
+
spin_lock(&dcache_lock);
for (; count ; count--) {
struct dentry *dentry;
@@ -427,8 +467,13 @@ static void prune_dcache(int count)
continue;
}
prune_one_dentry(dentry);
+ nr_freed++;
}
spin_unlock(&dcache_lock);
+ spin_lock(&prune_dcache_lock);
+ lru_dentry_stat.dprune_req = nr_requested;
+ lru_dentry_stat.dprune_freed = nr_freed;
Don't you mean "+=" ?
+ spin_unlock(&prune_dcache_lock);
On Wed, Sep 14, 2005 at 06:34:04PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> >
> > Ted,
> >
> > I am sending two patches here.
> >
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
>
> Hi Bharata,
>
> +void get_dstat_info(void)
> +{
> + struct dentry *dentry;
> +
> + lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
> + lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
> +
> + spin_lock(&dcache_lock);
> + list_for_each_entry(dentry, &dentry_unused, d_lru) {
> + if (atomic_read(&dentry->d_count))
> + lru_dentry_stat.nr_inuse++;
>
> Dentries on dentry_unused list with d_count positive? Is that possible
> at all? As far as my limited understanding goes, only dentries with zero
> count can be part of the dentry_unused list.
That changed with the lock-free dcache implementation during
2.5. If we strictly updated the lru list, we would have to acquire
the dcache_lock in __d_lookup() on a successful lookup. So we
did lazy-lru: leave the dentries with non-zero refcounts on the list
and clean them up later when we acquire dcache_lock for other
purposes.
Thanks
Dipankar
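A minimal sketch of the lazy-LRU behaviour Dipankar describes,
approximating the 2.6 __d_lookup() hit path (not the literal code):

/*
 * Sketch only, approximating the lock-free lookup fast path.  The point
 * is what it does NOT do: it never touches dentry_unused, because that
 * would require taking the global dcache_lock on every lookup.
 */
static struct dentry *lookup_hit(struct dentry *dentry)
{
	spin_lock(&dentry->d_lock);
	if (!d_unhashed(dentry)) {
		atomic_inc(&dentry->d_count);
		/* Note: no list_del(&dentry->d_lru) here.  If an earlier
		 * dput() put this dentry on dentry_unused, it stays there
		 * with d_count > 0 until prune_dcache() or some later pass
		 * holding dcache_lock cleans it up. */
		spin_unlock(&dentry->d_lock);
		return dentry;
	}
	spin_unlock(&dentry->d_lock);
	return NULL;
}

That is why Bharata's stats patch counts "inuse" dentries while walking
dentry_unused: they are expected there, not a bug.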
On Wed, Sep 14, 2005 at 11:48:52AM -0400, Sonny Rao wrote:
> On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > >
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to improve this dcache fragmentation problem.
> >
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab. The story may very well be different on 4k page
> > machines, however.
....
> I'm not surprised... With 62 dentrys per page, the likelyhood of
> success is very small, and in fact performance could degrade since we
> are holding the dcache lock more often and doing less useful work.
>
> It has been over a year and my memory is hazy, but I think I did see
> about a 10% improvement on my workload (some sort of SFS simulation
> with millions of files being randomly accessed) on an x86 machine but CPU
> utilization also went way up which I think was the dcache lock.
Hmmm - can't say that I've had the same experience. I did not notice
any decrease in fragmentation or increase in CPU usage...
FWIW, SFS is just one workload that produces fragmentation. Any
load that mixes or switches repeatedly between filesystem traversals
and producing memory pressure via the page cache tends to result in
fragmentation of the inode and dentry slabs...
> Whatever happened to the vfs_cache_pressue band-aid/sledgehammer ?
> Is it not considered an option ?
All that did was increase the fragmentation levels. Instead of
seeing a 4-5:1 free/used ratio in the dcache, it would push out to
10-15:1 if vfs_cache_pressure was used to prefer reclaiming dentries
over page cache pages. Going the other way and preferring reclaim of
page cache pages did nothing to change the level of fragmentation.
Reclaim still freed most of the dentries in the working set but it
took a little longer to do it.
Right now our only solution to prevent fragmentation on reclaim is
to throw more memory at the machine to prevent reclaim from
happening as the workload changes.
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
On Thu, Sep 15, 2005 at 08:02:22AM +1000, David Chinner wrote:
> On Wed, Sep 14, 2005 at 11:48:52AM -0400, Sonny Rao wrote:
> > On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> > > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > >
> > > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > > to improve this dcache fragmentation problem.
> > >
> > > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > > fileserver workloads and it had no measurable effect. It appeared
> > > that there was almost always at least one active dentry on each page
> > > in the slab. The story may very well be different on 4k page
> > > machines, however.
>
> ....
>
> > I'm not surprised... With 62 dentrys per page, the likelyhood of
> > success is very small, and in fact performance could degrade since we
> > are holding the dcache lock more often and doing less useful work.
> >
> > It has been over a year and my memory is hazy, but I think I did see
> > about a 10% improvement on my workload (some sort of SFS simulation
> > with millions of files being randomly accessed) on an x86 machine but CPU
> > utilization also went way up which I think was the dcache lock.
>
> Hmmm - can't say that I've had the same experience. I did not notice
> any decrease in fragmentation or increase in CPU usage...
Well, this was on an x86 machine with 8 cores but relatively poor
scalability and horrific memory latencies ... i.e. it tends to
exaggerate the effects of bad locks compared to what I would see on a
more scalable POWER machine. We actually ran SFS on a 4-way POWER-5
machine with the patch and didn't see any real change in throughput,
and fragmentation was a little better. I can go dig out the data if
someone is really interested.
In your case with 62 dentry objects per page (which is only going to
get much worse as we bump up base page sizes), I think the chances of
success of this approach or anything similar are horrible because we
aren't really solving any of the fundamental issues.
For me, the patch was mostly an experiment to see if the "blunderbuss"
effect (to quote mjb) could be controlled any better than we do
today. Mostly, it didn't seem worth it to me -- especially since we
wanted the global dcache lock to go away.
> FWIW, SFS is just one workload that produces fragmentation. Any
> load that mixes or switches repeatedly between filesystem traversals
> to producing memory pressure via the page cache tends to result in
> fragmentation of the inode and dentry slabs...
Yep, and that's more or less how I "simulated" SFS, just had tons of
small files. I wasn't trying to really simulate the networking part
or op mixture etc -- just the slab fragmentation as a "worst-case".
> > Whatever happened to the vfs_cache_pressue band-aid/sledgehammer ?
> > Is it not considered an option ?
>
> All that did was increase the fragmentation levels. Instead of
> seeing a 4-5:1 free/used ratio in the dcache, it would push out to
> 10-15:1 if vfs_cache_pressue was used to prefer reclaiming dentries
> over page cache pages. Going the other way and prefering reclaim of
> page cache pages did nothing to change the level of fragmentation.
> Reclaim still freed most of the dentries in the working set but it
> took a little longer to do it.
Yes, but on systems with smaller pages it does seem to have some
positive effect. I don't really know how well this has been
quantified.
> Right now our only solution to prevent fragmentation on reclaim is
> to throw more memory at the machine to prevent reclaim from
> happening as the workload changes.
That is unfortunate, but interesting, because I didn't know whether this
was a "real problem" -- some have contended it isn't. I know SPEC SFS is a
somewhat questionable workload (really, what isn't though?), so the
evidence gathered from that didn't seem to convince many people.
What kind of (real) workload are you seeing this on?
Thanks,
Sonny
On Wed, Sep 14, 2005 at 02:43:13AM -0700, Andrew Morton wrote:
> Manfred Spraul <[email protected]> wrote:
> >
> > One tricky point are directory dentries: As far as I see, they are
> > pinned and unfreeable if a (freeable) directory entry is in the cache.
> >
>
> Well. That's the whole problem.
>
> I don't think it's been demonstrated that Ted's problem was caused by
> internal fragementation, btw. Ted, could you run slabtop, see what the
> dcache occupancy is? Monitor it as you start to manually apply pressure?
> If the occupancy falls to 10% and not many slab pages are freed up yet then
> yup, it's internal fragmentation.
The next time I can get my machine into that state, sure, I'll try it.
I used to be able to reproduce it using normal laptop usage patterns
(Lotus notes running under wine, kernel builds, apt-get upgrades,
openoffice, firefox, etc.) about twice a week with 2.6.13-rc5, but
with 2.6.13 it happened only once or twice, and since then I haven't been
able to trigger it. (Predictably, not after I posted about it on
LKML. :-/)
I've been trying a few things in the hopes of deliberately triggering
it, but so far, no luck. Maybe I should go back to 2.6.13-rc5 and see
if I have an easier time of reproducing the failure case.
- Ted
On Wed, Sep 14, 2005 at 11:01:15AM +0200, Andi Kleen wrote:
> On Tuesday 13 September 2005 23:59, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to improve this dcache fragmentation problem.
> >
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab. The story may very well be different on 4k page
> > machines, however.
>
> I always thought dentry freeing would work much better if it
> was turned upside down.
>
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused
> cache objects. In that case it should check if there are only
> a small number of pinned objects on the page set left, and if
> yes use a new callback to the higher level user (=dcache) and ask them
> to free the object.
If you add a slab free object callback, then you have the beginnings
of a more flexible solution to memory reclaim from the slabs.
For example, you can easily implement a reclaim-not-allocate method
for new slab allocations for when there is no memory available or the
size of the slab has passed some configurable high-water mark...
Right now there is no way to control the size of a slab cache. Part
of the reason for the fragmentation I have seen is the massive
changes in size of the caches due to the OS making wrong decisions
about memory reclaim when small changes in the workload occur. We
currently have no way to provide hints to help the OS make the right
decision for a given workload....
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
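A hypothetical sketch of the reclaim-not-allocate idea David describes,
as it might look inside the slab allocator itself; the reclaim_object
callback, num_objects and high_watermark fields are all invented for
illustration and exist nowhere in 2.6:

/* Hypothetical, inside mm/slab.c: before growing a cache past its
 * configured cap, try to reclaim from the cache's own objects via a
 * per-cache callback instead of allocating new pages. */
static void cache_check_watermark(kmem_cache_t *cachep)
{
	if (!cachep->reclaim_object)		/* invented callback */
		return;
	if (cachep->num_objects <= cachep->high_watermark)
		return;
	/* ask the owner (dcache, inode cache, ...) to give objects back */
	cachep->reclaim_object(cachep, cachep->num_objects -
					cachep->high_watermark);
}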
On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > shortage happens ?
> >
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted). So
> > reproducing it isn't utterly trivial, but it's does happen often
> > enough that it should be possible to get the necessary data.
> >
> > > This is a problem that Bharata has been investigating at the moment.
> > > But he hasn't seen anything that can't be cured by a small memory
> > > pressure - IOW, dentries do get freed under memory pressure. So
> > > your case might be very useful. Bharata is maintaing an instrumentation
> > > patch to collect more information and an alternative dentry aging patch
> > > (using rbtree). Perhaps you could try with those.
> >
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
> >
>
> Ted,
>
> I am sending two patches here.
>
> First is dentry_stats patch which collects some dcache statistics
> and puts it into /proc/meminfo. This patch provides information
> about how dentries are distributed in dcache slab pages, how many
> free and in use dentries are present in dentry_unused lru list and
> how prune_dcache() performs with respect to freeing the requested
> number of dentries.
Bharata,
Ideally one should move the "nr_requested/nr_freed" counters from your
stats patch into "struct shrinker" (or somewhere else more appropriate
in which per-shrinkable-cache stats are maintained), and use the
"mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
IMO something along that line should be merged into mainline to walk
away from the "what the fuck is going on" state of things.
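Roughly, the shape Marcelo is suggesting might look like this; struct
shrinker here is the private definition from mm/vmscan.c of that era
(from memory), and the added fields and helper are illustrative, not
mainline code:

/* mm/vmscan.c, 2.6.13-era struct shrinker with illustrative counters
 * bolted on -- nr_requested/nr_freed and shrinker_account() are not in
 * mainline. */
struct shrinker {
	shrinker_t		shrinker;
	struct list_head	list;
	int			seeks;		/* seeks to recreate an object */
	long			nr;		/* objects pending deletion */
	unsigned long		nr_requested;	/* objects shrink_slab() asked for */
	unsigned long		nr_freed;	/* objects actually freed */
};

/* Called from shrink_slab() for every batch handed to s->shrinker(),
 * next to the existing mod_page_state(slabs_scanned, this_scan), with
 * freed computed from the before/after object counts. */
static void shrinker_account(struct shrinker *s, int scanned, int freed)
{
	s->nr_requested += scanned;
	if (freed > 0)
		s->nr_freed += freed;
}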
On Wed, Sep 14, 2005 at 06:40:40PM -0400, Sonny Rao wrote:
> On Thu, Sep 15, 2005 at 08:02:22AM +1000, David Chinner wrote:
> > Right now our only solution to prevent fragmentation on reclaim is
> > to throw more memory at the machine to prevent reclaim from
> > happening as the workload changes.
>
> That is unfortunate, but interesting because I didn't know if this was
> not a "real-problem" as some have contended. I know SPEC SFS is a
> somewhat questionable workload (really, what isn't though?), so the
> evidence gathered from that didn't seem to convince many people.
>
> What kind of (real) workload are you seeing this on?
Nothing special. Here's an example from a local Altix build
server (8p, 12GiB RAM):
linvfs_icache 3376574 3891360 672 24 1 : tunables 54 27 8 : slabdata 162140 162140 0
dentry_cache 2632811 3007186 256 62 1 : tunables 120 60 8 : slabdata 48503 48503 0
I just copied and untarred some stuff I need to look at (~2GiB
data) and when that completed we now have:
linvfs_icache 590840 2813328 672 24 1 : tunables 54 27 8 : slabdata 117222 117222
dentry_cache 491984 2717708 256 62 1 : tunables 120 60 8 : slabdata 43834 43834
A few minutes later, with ppl doing normal work (rsync, kernel and
userspace package builds, tar, etc), a bit more had been reclaimed:
linvfs_icache 580589 2797992 672 24 1 : tunables 54 27 8 : slabdata 116583 116583 0
dentry_cache 412009 2418558 256 62 1 : tunables 120 60 8 : slabdata 39009 39009 0
We started with ~2.9GiB of active slab objects in ~210k pages
(3.3GiB RAM) in these two slabs. We've trimmed their active size
down to ~500MiB, but we still have 155k pages (2.5GiB) allocated to
the slabs.
I've seen much worse than this on build servers with more memory and
larger filesystems, especially after the filesystems have been
crawled by a backup program overnight and we've ended up with > 10
million objects in each of these caches.
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group
On Wed, Sep 14, 2005 at 06:34:04PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > > shortage happens ?
> > >
> > > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > > (when I'm not travelling and so it doesn't get rebooted). So
> > > reproducing it isn't utterly trivial, but it's does happen often
> > > enough that it should be possible to get the necessary data.
> > >
> > > > This is a problem that Bharata has been investigating at the moment.
> > > > But he hasn't seen anything that can't be cured by a small memory
> > > > pressure - IOW, dentries do get freed under memory pressure. So
> > > > your case might be very useful. Bharata is maintaing an instrumentation
> > > > patch to collect more information and an alternative dentry aging patch
> > > > (using rbtree). Perhaps you could try with those.
> > >
> > > Send it to me, and I'd be happy to try either the instrumentation
> > > patch or the dentry aging patch.
> > >
> >
> > Ted,
> >
> > I am sending two patches here.
> >
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
>
> Hi Bharata,
>
> +void get_dstat_info(void)
> +{
> + struct dentry *dentry;
> +
> + lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
> + lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
> +
> + spin_lock(&dcache_lock);
> + list_for_each_entry(dentry, &dentry_unused, d_lru) {
> + if (atomic_read(&dentry->d_count))
> + lru_dentry_stat.nr_inuse++;
>
> Dentries on dentry_unused list with d_count positive? Is that possible
> at all? As far as my limited understanding goes, only dentries with zero
> count can be part of the dentry_unused list.
As Dipankar mentioned, it's now possible to have dentries with a
positive d_count on the unused list. BTW, I think we need a better way
to get this data than walking the entire unused_list linearly, which
might not scale with a huge number of dentries.
>
> + if (dentry->d_flags & DCACHE_REFERENCED)
> + lru_dentry_stat.nr_ref++;
> + }
>
>
> @@ -393,6 +430,9 @@ static inline void prune_one_dentry(stru
>
> static void prune_dcache(int count)
> {
> + int nr_requested = count;
> + int nr_freed = 0;
> +
> spin_lock(&dcache_lock);
> for (; count ; count--) {
> struct dentry *dentry;
> @@ -427,8 +467,13 @@ static void prune_dcache(int count)
> continue;
> }
> prune_one_dentry(dentry);
> + nr_freed++;
> }
> spin_unlock(&dcache_lock);
> + spin_lock(&prune_dcache_lock);
> + lru_dentry_stat.dprune_req = nr_requested;
> + lru_dentry_stat.dprune_freed = nr_freed;
>
> Don't you mean "+=" ?
No. Actually here I am capturing the number of dentries freed
per invocation of prune_dcache.
Regards,
Bharata.
Martin J. Bligh wrote:
>
>If they're freeable, we should easily be able to move them, and therefore
>compact a fragmented slab. That way we can preserve the LRU'ness of it.
>Stage 1: free the oldest entries. Stage 2: compact the slab into whole
>pages. Stage 3: free whole pages back to teh page allocator.
>
>
That seems like the perfect solution to me. Freeing up 95% or more
gives us clean pages - and moving instead of actually freeing
everything avoids the cost of repopulating the cache later. :-)
Helge Hafting
On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> >
<snip>
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
>
> Bharata,
>
> Ideally one should move the "nr_requested/nr_freed" counters from your
> stats patch into "struct shrinker" (or somewhere else more appropriate
> in which per-shrinkable-cache stats are maintained), and use the
> "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
Yes, I agree that we should have the nr_requested and nr_freed type of
counters in an appropriate place, and "struct shrinker" is probably the
right place for them.
Essentially you are suggesting that we maintain per-CPU statistics
of 'requested to free' (scanned) slab objects and actually freed objects,
on a per-shrinkable-cache basis.
Is it OK to maintain these requested/freed counters as ever-growing
counters, or would it make more sense to have them reflect the statistics
from the latest attempt at shrinking the cache? And where would be the
right place to export this information? (/proc/slabinfo, since it already
gives details of all caches?)
If I understand correctly, "slabs_scanned" is the total number
of objects from all shrinkable caches scanned for possible freeing.
I didn't get why this is part of page_state, which mostly includes
page-related statistics.
Regards,
Bharata.
On Thu, Sep 15, 2005 at 03:09:45PM +0530, Bharata B Rao wrote:
> On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > >
> <snip>
> > > First is dentry_stats patch which collects some dcache statistics
> > > and puts it into /proc/meminfo. This patch provides information
> > > about how dentries are distributed in dcache slab pages, how many
> > > free and in use dentries are present in dentry_unused lru list and
> > > how prune_dcache() performs with respect to freeing the requested
> > > number of dentries.
> >
> > Bharata,
> >
> > Ideally one should move the "nr_requested/nr_freed" counters from your
> > stats patch into "struct shrinker" (or somewhere else more appropriate
> > in which per-shrinkable-cache stats are maintained), and use the
> > "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> > break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
>
> Yes, I agree that we should have the nr_requested and nr_freed type of
> counters in appropriate place. And "struct shrinker" is probably right
> place for it.
>
> Essentially you are suggesting that we maintain per cpu statistics
> of 'requested to free'(scanned) slab objects and actual freed objects.
> And this should be on per shrinkable cache basis.
Yep.
> Is it ok to maintain this requested/freed counters as growing counters
> or would it make more sense to have them reflect the statistics from
> the latest/last attempt of cache shrink ?
It makes a lot more sense to account for all shrink attempts: it is necessary
to know how the reclaiming process is behaving over time. That's why I wondered
about using "=" instead of "+=" in your patch.
> And where would be right place to export this information ?
> (/proc/slabinfo ?, since it already gives details of all caches)
My feeling is that changing /proc/slabinfo format might break userspace
applications.
> If I understand correctly, "slabs_scanned" is the sum total number
> of objects from all shrinkable caches scanned for possible freeeing.
Yep.
> I didn't get why this is part of page_state which mostly includes
> page related statistics.
Well, page_state contains most of the reclaiming statistics - its scope
is broader than "struct page" information.
To me it seems like the best place.
On Thu, Sep 15, 2005 at 10:29:10AM -0300, Marcelo Tosatti wrote:
> On Thu, Sep 15, 2005 at 03:09:45PM +0530, Bharata B Rao wrote:
> > On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> > > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > >
> > <snip>
> > > > First is dentry_stats patch which collects some dcache statistics
> > > > and puts it into /proc/meminfo. This patch provides information
> > > > about how dentries are distributed in dcache slab pages, how many
> > > > free and in use dentries are present in dentry_unused lru list and
> > > > how prune_dcache() performs with respect to freeing the requested
> > > > number of dentries.
> > >
> > > Bharata,
> > >
> > > Ideally one should move the "nr_requested/nr_freed" counters from your
> > > stats patch into "struct shrinker" (or somewhere else more appropriate
> > > in which per-shrinkable-cache stats are maintained), and use the
> > > "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> > > break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
> >
> > Yes, I agree that we should have the nr_requested and nr_freed type of
> > counters in appropriate place. And "struct shrinker" is probably right
> > place for it.
> >
> > Essentially you are suggesting that we maintain per cpu statistics
> > of 'requested to free'(scanned) slab objects and actual freed objects.
> > And this should be on per shrinkable cache basis.
>
> Yep.
>
> > Is it ok to maintain this requested/freed counters as growing counters
> > or would it make more sense to have them reflect the statistics from
> > the latest/last attempt of cache shrink ?
>
> It makes a lot more sense to account for all shrink attempts: it is necessary
> to know how the reclaiming process is behaving over time. Thats why I wondered
> about using "=" instead of "+=" in your patch.
>
> > And where would be right place to export this information ?
> > (/proc/slabinfo ?, since it already gives details of all caches)
>
> My feeling is that changing /proc/slabinfo format might break userspace
> applications.
>
> > If I understand correctly, "slabs_scanned" is the sum total number
> > of objects from all shrinkable caches scanned for possible freeeing.
>
> Yep.
>
> > I didn't get why this is part of page_state which mostly includes
> > page related statistics.
>
> Well, page_state contains most of the reclaiming statistics - its scope
> is broader than "struct page" information.
>
> To me it seems like the best place.
>
Marcelo,
The attached patch is an attempt to break the "slabs_scanned" into
meaningful pieces as you suggested.
But I couldn't do this cleanly because kmem_cache_t isn't defined
in a .h file and I didn't want to touch too many files in the first
attempt.
What I am doing here is making the "requested to free" and
"actual freed" counters part of struct shrinker. With this I can
update these statistics seamlessly from shrink_slab().
I don't have these as per-CPU counters because I wasn't sure if shrink_slab()
would have enough concurrent executions to warrant lockless per-CPU
counters for them.
I am displaying this information as part of /proc/slabinfo and I have
verified that it at least isn't breaking slabtop.
I thought about having this as part of /proc/vmstat and using
mod_page_state infrastructure as you suggested, but having the
"requested to free" and "actual freed" counters in struct page_state
for only those caches which set the shrinker function didn't look
good.
If you think that all this can be done in a better way, please
let me know.
Regards,
Bharata.
Bharata,
On Sun, Oct 02, 2005 at 10:02:29PM +0530, Bharata B Rao wrote:
>
> Marcelo,
>
> The attached patch is an attempt to break the "slabs_scanned" into
> meaningful pieces as you suggested.
>
> But I coudn't do this cleanly because kmem_cache_t isn't defined
> in a .h file and I didn't want to touch too many files in the first
> attempt.
>
> What I am doing here is making the "requested to free" and
> "actual freed" counters as part of struct shrinker. With this I can
> update these statistics seamlessly from shrink_slab().
>
> I don't have this as per cpu counters because I wasn't sure if shrink_slab()
> would have many concurrent executions warranting a lockless percpu
> counters for these.
Per-CPU counters are interesting because they avoid the atomic
operation _and_ potential cacheline bouncing. Given the fact that less
commonly used counters in the reclaim path are already per-CPU,
I think it might be worth doing it here too.
> I am displaying this information as part of /proc/slabinfo and I have
> verified that it atleast isn't breaking slabtop.
>
> I thought about having this as part of /proc/vmstat and using
> mod_page_state infrastructure as u suggested, but having the
> "requested to free" and "actual freed" counters in struct page_state
> for only those caches which set the shrinker function didn't look
> good.
OK... You could change the atomic counters to per-CPU variables
in "struct shrinker".
> If you think that all this can be done in a better way, please
> let me know.
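A sketch of the per-CPU variant Marcelo suggests, using the
alloc_percpu()/per_cpu_ptr() helpers that already exist in 2.6; the field
and function names are illustrative:

/* Illustrative only: per-CPU shrinker stats instead of plain counters.
 * struct shrinker would grow a "struct shrinker_stats *stats" member,
 * allocated with alloc_percpu() in set_shrinker(). */
struct shrinker_stats {
	unsigned long nr_requested;	/* objects shrink_slab() asked for */
	unsigned long nr_freed;		/* objects actually freed */
};

/* Lockless accounting from shrink_slab(): no atomics and no shared
 * cacheline bouncing between CPUs. */
static void shrinker_account(struct shrinker *s, int scanned, int freed)
{
	struct shrinker_stats *stats = per_cpu_ptr(s->stats, get_cpu());

	stats->nr_requested += scanned;
	if (freed > 0)
		stats->nr_freed += freed;
	put_cpu();
}

/* Reporting (e.g. the /proc/slabinfo columns below) sums over all CPUs. */
static unsigned long shrinker_freed_total(struct shrinker *s)
{
	unsigned long sum = 0;
	int cpu;

	for_each_cpu(cpu)
		sum += per_cpu_ptr(s->stats, cpu)->nr_freed;
	return sum;
}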
Marcelo,
Here's my next attempt at breaking the "slabs_scanned" from /proc/vmstat
into meaningful per-cache statistics. Now I have the statistics counters
as per-CPU counters. [One remaining issue is that there is more than one
cache as part of mbcache and they all share a common shrinker routine,
so I am displaying the collective shrinker stats on each of them in
/proc/slabinfo ==> some duplication.]
With this patch (and my earlier dcache stats patch) I observed some
interesting results with the following test scenario on an 8-CPU P3 box:
- Ran an application which consumes 40% of the total memory.
- Ran dbench on tmpfs with 128 clients twice (serially).
- Ran a find on an ext3 partition having ~9.5 million entries (files and
directories included)
At the end of this run, I have the following results:
[root@llm09 bharata]# cat /proc/meminfo
MemTotal: 3872528 kB
MemFree: 1420940 kB
Buffers: 714068 kB
Cached: 21536 kB
SwapCached: 2264 kB
Active: 1672680 kB
Inactive: 637460 kB
HighTotal: 3014616 kB
HighFree: 1411740 kB
LowTotal: 857912 kB
LowFree: 9200 kB
SwapTotal: 2096472 kB
SwapFree: 2051408 kB
Dirty: 172 kB
Writeback: 0 kB
Mapped: 1583680 kB
Slab: 119564 kB
CommitLimit: 4032736 kB
Committed_AS: 1647260 kB
PageTables: 2248 kB
VmallocTotal: 114680 kB
VmallocUsed: 1264 kB
VmallocChunk: 113384 kB
nr_dentries/page nr_pages nr_inuse
0 0 0
1 5 2
2 12 4
3 26 9
4 46 18
5 76 40
6 82 47
7 91 59
8 122 93
9 114 102
10 142 136
11 138 185
12 118 164
13 128 206
14 126 208
15 120 219
16 136 261
17 159 315
18 145 311
19 179 379
20 192 407
21 256 631
22 286 741
23 316 816
24 342 934
25 381 1177
26 664 2813
27 0 0
28 0 0
29 0 0
Total: 4402 10277
dcache lru: total 75369 inuse 3599
[Here,
nr_dentries/page - Number of dentries per page
nr_pages - Number of pages with given number of dentries
nr_inuse - Number of inuse dentries in those pages.
Eg: From the above data, there are 26 pages with 3 dentries each,
and out of the 78 total dentries in those 26 pages, 9 dentries are in use.]
[root@llm09 bharata]# grep shrinker /proc/slabinfo
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
ext3_xattr 0 0 48 78 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
dquot 0 0 160 24 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
inode_cache 1301 1390 400 10 1 : tunables 54 27 8 : slabdata 139 139 0 : shrinker stat 682752 681900
dentry_cache 82110 114452 152 26 1 : tunables 120 60 8 : slabdata 4402 4402 0 : shrinker stat 1557760 760100
[root@llm09 bharata]# grep slabs_scanned /proc/vmstat
slabs_scanned 2240512
[root@llm09 bharata]# cat /proc/sys/fs/dentry-state
82046 75369 45 0 3599 0
[The order of dentry-state o/p is like this:
total dentries in dentry hash list, total dentries in lru list, age limit,
want_pages, inuse dentries in lru list, dummy]
So we can see that under low memory pressure, even though the
shrinker runs on the dcache repeatedly, not many dentries are actually
freed, and the dcache LRU list still holds a huge number of free
dentries.
Regards,
Bharata.
Hi Bharata,
On Tue, Oct 04, 2005 at 07:06:35PM +0530, Bharata B Rao wrote:
> Marcelo,
>
> Here's my next attempt in breaking the "slabs_scanned" from /proc/vmstat
> into meaningful per cache statistics. Now I have the statistics counters
> as percpu. [an issue remaining is that there are more than one cache as
> part of mbcache and they all have a common shrinker routine and I am
> displaying the collective shrinker stats info on each of them in
> /proc/slabinfo ==> some kind of duplication]
Looks good to me! IMO it should be a candidate for -mm/mainline.
Nothing useful to suggest on the mbcache issue... sorry.
> With this patch (and my earlier dcache stats patch) I observed some
> interesting results with the following test scenario on a 8cpu p3 box:
>
> - Ran an application which consumes 40% of the total memory.
> - Ran dbench on tmpfs with 128 clients twice (serially).
> - Ran a find on a ext3 partition having ~9.5million entries (files and
> directories included)
>
> At the end of this run, I have the following results:
>
> [root@llm09 bharata]# cat /proc/meminfo
> MemTotal: 3872528 kB
> MemFree: 1420940 kB
> Buffers: 714068 kB
> Cached: 21536 kB
> SwapCached: 2264 kB
> Active: 1672680 kB
> Inactive: 637460 kB
> HighTotal: 3014616 kB
> HighFree: 1411740 kB
> LowTotal: 857912 kB
> LowFree: 9200 kB
> SwapTotal: 2096472 kB
> SwapFree: 2051408 kB
> Dirty: 172 kB
> Writeback: 0 kB
> Mapped: 1583680 kB
> Slab: 119564 kB
> CommitLimit: 4032736 kB
> Committed_AS: 1647260 kB
> PageTables: 2248 kB
> VmallocTotal: 114680 kB
> VmallocUsed: 1264 kB
> VmallocChunk: 113384 kB
> nr_dentries/page nr_pages nr_inuse
> 0 0 0
> 1 5 2
> 2 12 4
> 3 26 9
> 4 46 18
> 5 76 40
> 6 82 47
> 7 91 59
> 8 122 93
> 9 114 102
> 10 142 136
> 11 138 185
> 12 118 164
> 13 128 206
> 14 126 208
> 15 120 219
> 16 136 261
> 17 159 315
> 18 145 311
> 19 179 379
> 20 192 407
> 21 256 631
> 22 286 741
> 23 316 816
> 24 342 934
> 25 381 1177
> 26 664 2813
> 27 0 0
> 28 0 0
> 29 0 0
> Total: 4402 10277
> dcache lru: total 75369 inuse 3599
>
> [Here,
> nr_dentries/page - Number of dentries per page
> nr_pages - Number of pages with given number of dentries
> nr_inuse - Number of inuse dentries in those pages.
> Eg: From the above data, there are 26 pages with 3 dentries each
> and out of 78 total dentries in these 3 pages, 9 dentries are in use.]
>
> [root@llm09 bharata]# grep shrinker /proc/slabinfo
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
> ext3_xattr 0 0 48 78 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
> dquot 0 0 160 24 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
> inode_cache 1301 1390 400 10 1 : tunables 54 27 8 : slabdata 139 139 0 : shrinker stat 682752 681900
> dentry_cache 82110 114452 152 26 1 : tunables 120 60 8 : slabdata 4402 4402 0 : shrinker stat 1557760 760100
>
> [root@llm09 bharata]# grep slabs_scanned /proc/vmstat
> slabs_scanned 2240512
>
> [root@llm09 bharata]# cat /proc/sys/fs/dentry-state
> 82046 75369 45 0 3599 0
> [The order of dentry-state o/p is like this:
> total dentries in dentry hash list, total dentries in lru list, age limit,
> want_pages, inuse dentries in lru list, dummy]
>
> So, we can see that with low memory pressure, even though the
> shrinker runs on dcache repeatedly, not many dentries are freed
> by dcache. And dcache lru list still has huge number of free
> dentries.
The success/attempt ratio is about 1/2, which seems alright?
On Wed, Oct 05, 2005 at 06:25:51PM -0300, Marcelo Tosatti wrote:
> Hi Bharata,
>
> On Tue, Oct 04, 2005 at 07:06:35PM +0530, Bharata B Rao wrote:
> > Marcelo,
> >
> > Here's my next attempt in breaking the "slabs_scanned" from /proc/vmstat
> > into meaningful per cache statistics. Now I have the statistics counters
> > as percpu. [an issue remaining is that there are more than one cache as
> > part of mbcache and they all have a common shrinker routine and I am
> > displaying the collective shrinker stats info on each of them in
> > /proc/slabinfo ==> some kind of duplication]
>
> Looks good to me! IMO it should be a candidate for -mm/mainline.
>
> Nothing useful to suggest on the mbcache issue... sorry.
Thanks Marcelo for reviewing.
<snip>
> >
> > [root@llm09 bharata]# grep shrinker /proc/slabinfo
> > # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
> > ext3_xattr 0 0 48 78 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
> > dquot 0 0 160 24 1 : tunables 120 60 8 : slabdata 0 0 0 : shrinker stat 0 0
> > inode_cache 1301 1390 400 10 1 : tunables 54 27 8 : slabdata 139 139 0 : shrinker stat 682752 681900
> > dentry_cache 82110 114452 152 26 1 : tunables 120 60 8 : slabdata 4402 4402 0 : shrinker stat 1557760 760100
> >
> > [root@llm09 bharata]# grep slabs_scanned /proc/vmstat
> > slabs_scanned 2240512
> >
> > [root@llm09 bharata]# cat /proc/sys/fs/dentry-state
> > 82046 75369 45 0 3599 0
> > [The order of dentry-state o/p is like this:
> > total dentries in dentry hash list, total dentries in lru list, age limit,
> > want_pages, inuse dentries in lru list, dummy]
> >
> > So, we can see that with low memory pressure, even though the
> > shrinker runs on dcache repeatedly, not many dentries are freed
> > by dcache. And dcache lru list still has huge number of free
> > dentries.
>
> The success/attempt ratio is about 1/2, which seems alright?
>
Hmm... compared to inode_cache, I felt the dcache shrinker wasn't
doing a good job. Anyway, I will analyze further to see if things
can be made better with the existing shrinker.
Regards,
Bharata.