2008-03-06 04:41:42

by Kentaro Makita

Subject: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

[Summary]
Limit the size of dentry_unused to avoid soft lockups when mounting NFS
or remounting any filesystem.

[Descriptions]
- background
dentry_unused is a list of dentries that are not in use. It works
as a cache for non-existing files. dentry_unused grows when
directories or files are removed. This list can become *very* long
if there is no memory pressure, because there is no limit.

- what's the problem
When prune_dcache() is called, it scans *all* of dentry_unused linearly
under spin_lock(). This scan is very expensive when there are many entries.
For example, prune_dcache() is called when mounting NFS.
In our test, with 100,000,000 unused dentries, mounting
NFS took about 1 minute and almost all user programs hung during it.

100,000,000 is a possible number on large systems.

This problem has already happened on our system.
Therefore, we need to limit dentry_unused.

- How to fix
Limit the number of unused dentries to a suitable value.

The threshold is as follows:
dentry_unused_ratio: the default value is 10000 (%). When the number of
unused dentries reaches 10000% of the number of in-use dentries,
5% of the unused dentries are freed.

I feel we need more tests to determine a reasonable value for any system.
So, please test.

This patch is based on linux-2.6.25-rc4.

- Test Results

Result on 24GB boxes with excessive unused dentries.
Without patch:
# cat /proc/sys/fs/dentry-state
103327453 103313783 45 0 0
# time mount -t nfs 192.168.0.2:/export /mnt
real 1m4.698s
user 0m0.000s
sys 1m4.672s

With this patch:
# cat /proc/sys/fs/dentry-state
118681 117225 45 0 0 0
# time mount -t nfs 192.168.0.2:/export /mnt
real 0m0.103s
user 0m0.004s
sys 0m0.076s

Tested on Intel Itanium 2 9050 (dual-core) x12, MEM 24GB, kernel-2.6.25-rc4.
I found no performance regression in my tests.


Best Regards,
Kentaro Makita

Signed-off-by: Kentaro Makita <[email protected]>
---
fs/dcache.c | 7 +++++++
1 files changed, 7 insertions(+)
diff -rupN -X linux-2.6.25-rc4/Documentation/dontdiff linux-2.6.25-rc4/fs/dcache.c linux-2.6.25-rc4mod/fs/dcache.c
--- linux-2.6.25-rc4/fs/dcache.c 2008-03-05 13:33:54.000000000 +0900
+++ linux-2.6.25-rc4mod/fs/dcache.c 2008-03-05 16:47:18.000000000 +0900
@@ -42,6 +42,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOC

EXPORT_SYMBOL(dcache_lock);

+/* threshold to limit dentry_unused */
+unsigned int dentry_unused_ratio = 10000;
static struct kmem_cache *dentry_cache __read_mostly;

#define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname))
@@ -61,6 +63,7 @@ static unsigned int d_hash_mask __read_m
static unsigned int d_hash_shift __read_mostly;
static struct hlist_head *dentry_hashtable __read_mostly;
static LIST_HEAD(dentry_unused);
+static void prune_dcache(int count, struct super_block *sb);

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
@@ -214,6 +217,10 @@ repeat:
}
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
+ /* Prune unused dentry over threshold level */
+ int nr_in_use = (dentry_stat.nr_dentry - dentry_stat.nr_unused);
+ if (dentry_stat.nr_dentry > nr_in_use * dentry_unused_ratio / 100)
+ prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);
return;

unhash_it:


2008-03-06 05:54:49

by David Chinner

Subject: Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

On Thu, Mar 06, 2008 at 01:41:29PM +0900, Kentaro Makita wrote:
> [Summary]
> Limit the size of dentry_unused to avoid soft lockups when mounting NFS
> or remounting any filesystem.
>
> [Descriptions]
> - background
> dentry_unused is a list of dentries that are not in use. It works
> as a cache for non-existing files. dentry_unused grows when
> directories or files are removed. This list can become *very* long
> if there is no memory pressure, because there is no limit.
>
> - what's the problem
> When prune_dcache() is called, it scans *all* of dentry_unused linearly
> under spin_lock(). This scan is very expensive when there are many entries.
> For example, prune_dcache() is called when mounting NFS.
> In our test, with 100,000,000 unused dentries, mounting
> NFS took about 1 minute and almost all user programs hung during it.
>
> 100,000,000 is a possible number on large systems.
>
> This problem has already happened on our system.
> Therefore, we need to limit dentry_unused.

No, we need a smarter free list structure. There have been several attempts
at this in the past. Two that I can recall off the top of my head:

- per node unused LRUs
- per superblock unused LRUs

I guess we need to revisit this again, because limiting the size of
the cache like this is not an option.

> I feel we need more tests to determine a reasonable value for any system.
> So, please test.
.....
> Tested on Intel Itanium 2 9050 (dual-core) x12, MEM 24GB, kernel-2.6.25-rc4.
> I found no performance regression in my tests.

Try something that relies on leaving the working set on the unused
list, like NFS server benchmarks that have a working set of tens of
millions of files....

> Signed-off-by: Kentaro Makita <[email protected]>
> ---
> fs/dcache.c | 7 +++++++
> 1 files changed, 7 insertions(+)
> diff -rupN -X linux-2.6.25-rc4/Documentation/dontdiff linux-2.6.25-rc4/fs/dcache.c linux-2.6.25-rc4mod/fs/dcache.c
> --- linux-2.6.25-rc4/fs/dcache.c 2008-03-05 13:33:54.000000000 +0900
> +++ linux-2.6.25-rc4mod/fs/dcache.c 2008-03-05 16:47:18.000000000 +0900
> @@ -42,6 +42,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOC
>
> EXPORT_SYMBOL(dcache_lock);
>
> +/* threshold to limit dentry_unused */
> +unsigned int dentry_unused_ratio = 10000;
> static struct kmem_cache *dentry_cache __read_mostly;
>
> #define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname))
> @@ -61,6 +63,7 @@ static unsigned int d_hash_mask __read_m
> static unsigned int d_hash_shift __read_mostly;
> static struct hlist_head *dentry_hashtable __read_mostly;
> static LIST_HEAD(dentry_unused);
> +static void prune_dcache(int count, struct super_block *sb);
>
> /* Statistics gathering. */
> struct dentry_stat_t dentry_stat = {
> @@ -214,6 +217,10 @@ repeat:
> }
> spin_unlock(&dentry->d_lock);
> spin_unlock(&dcache_lock);
> + /* Prune unused dentry over threshold level */
> + int nr_in_use = (dentry_stat.nr_dentry - dentry_stat.nr_unused);
> + if (dentry_stat.nr_dentry > nr_in_use * dentry_unused_ratio / 100)
> + prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);

nr_in_use is going to overflow 32 bits with this calculation.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-03-06 07:15:58

by Kentaro Makita

Subject: Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

David Chinner wrote:
> On Thu, Mar 06, 2008 at 01:41:29PM +0900, Kentaro Makita wrote:
....
>> 100,000,000 is a possible number on large systems.
>>
>> This problem has already happened on our system.
>> Therefore, we need to limit dentry_unused.
>
> No, we need a smarter free list structure. There have been several attempts
> at this in the past. Two that I can recall off the top of my head:
>
> - per node unused LRUs
> - per superblock unused LRUs

I know there have been such attempts already, but they are not in mainline.
I think this is not a smart way, but it is a simple way to avoid this problem.
>
> I guess we need to revisit this again, because limiting the size of
> the cache like this is not an option.
>
>> I feel we need more tests to determine a reasonable value for any system.
>> So, please test.
> .....
>> Tested on Intel Itanium 2 9050 (dual-core) x12, MEM 24GB, kernel-2.6.25-rc4.
>> I found no performance regression in my tests.
>
> Try something that relies on leaving the working set on the unused
> list, like NFS server benchmarks that have a working set of tens of
> million of files....
>
Okay, I'll try some benchmarks and report results...
>> Signed-off-by: Kentaro Makita <[email protected]>
>> ---
>> fs/dcache.c | 7 +++++++
>> 1 files changed, 7 insertions(+)
>> diff -rupN -X linux-2.6.25-rc4/Documentation/dontdiff linux-2.6.25-rc4/fs/dcache.c linux-2.6.25-rc4mod/fs/dcache.c
>> --- linux-2.6.25-rc4/fs/dcache.c 2008-03-05 13:33:54.000000000 +0900
>> +++ linux-2.6.25-rc4mod/fs/dcache.c 2008-03-05 16:47:18.000000000 +0900
......
>> @@ -214,6 +217,10 @@ repeat:
>> }
>> spin_unlock(&dentry->d_lock);
>> spin_unlock(&dcache_lock);
>> + /* Prune unused dentry over threshold level */
>> + int nr_in_use = (dentry_stat.nr_dentry - dentry_stat.nr_unused);
>> + if (dentry_stat.nr_dentry > nr_in_use * dentry_unused_ratio / 100)
>> + prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);
>
> nr_in_use is going to overflow 32 bits with this calculation.
Oh, that was simply my mistake. I have fixed it in this post.
>
> Cheers,
>
> Dave.
Best Regards,
Kentaro Makita

Signed-off-by: Kentaro Makita <[email protected]>
---
dcache.c | 7 +++++++
1 files changed, 7 insertions(+)
diff -rupN -X linux-2.6.25-rc4/Documentation/dontdiff linux-2.6.25-rc4/fs/dcache.c linux-2.6.25-rc4mod/fs/dcache.c
--- linux-2.6.25-rc4/fs/dcache.c 2008-03-05 13:33:54.000000000 +0900
+++ linux-2.6.25-rc4mod/fs/dcache.c 2008-03-06 15:27:22.000000000 +0900
@@ -42,6 +42,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOC

EXPORT_SYMBOL(dcache_lock);

+/* threshold to limit dentry_unused */
+unsigned int dentry_unused_ratio = 10000;
static struct kmem_cache *dentry_cache __read_mostly;

#define DNAME_INLINE_LEN (sizeof(struct dentry)-offsetof(struct dentry,d_iname))
@@ -61,6 +63,7 @@ static unsigned int d_hash_mask __read_m
static unsigned int d_hash_shift __read_mostly;
static struct hlist_head *dentry_hashtable __read_mostly;
static LIST_HEAD(dentry_unused);
+static void prune_dcache(int count, struct super_block *sb);

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
@@ -214,6 +217,10 @@ repeat:
}
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
+ /* Prune unused dentry over threshold level */
+ int nr_in_use = (dentry_stat.nr_dentry - dentry_stat.nr_unused);
+ if (dentry_stat.nr_dentry > nr_in_use * (dentry_unused_ratio / 100))
+ prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);
return;

unhash_it:

2008-03-08 08:32:49

by KOSAKI Motohiro

Subject: Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

Hi makita-san

In general, I agree with many people that a hang of more than a minute must not be allowed.

> > No, we need a smarter free list structure. There have been several attempts
> > at this in the past. Two that I can recall off the top of my head:
> >
> > - per node unused LRUs
> > - per superblock unused LRUs
>
> I know there have been such attempts already, but they are not in mainline.
> I think this is not a smart way, but it is a simple way to avoid this problem.

I think the two improvements are not exclusive.
Your patch is nice, but we need David's patch too,
because the two patches have different purposes:

per-superblock LRU: improves typical performance.
limit on the unused list: prevents overly long hangs.

Even with a per-superblock LRU, a long hang can still happen in the worst case,
and limiting the unused list does not make unused-list traversal any faster.

I hope for both.


> >> Tested on Intel Itanium 2 9050 (dual-core) x12, MEM 24GB, kernel-2.6.25-rc4.
> >> I found no performance regression in my tests.
> >
> > Try something that relies on leaving the working set on the unused
> > list, like NFS server benchmarks that have a working set of tens of
> > million of files....
> >
> Okay, I'll try some benchmarks and report results...

good luck.


> spin_unlock(&dentry->d_lock);
> spin_unlock(&dcache_lock);
> + /* Prune unused dentry over threshold level */
> + int nr_in_use = (dentry_stat.nr_dentry - dentry_stat.nr_unused);
> + if (dentry_stat.nr_dentry > nr_in_use * (dentry_unused_ratio / 100))
> + prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);
> return;

Why don't you make dentry_unused_ratio adjustable via a sysctl interface?
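A rough sketch of what that might look like against 2.6.25's kernel/sysctl.c (untested; the name "dentry-unused-ratio" is just a suggestion, and dentry_unused_ratio is the variable the patch adds):

```c
/* Untested sketch: an entry that could be added to fs_table[] in
 * kernel/sysctl.c so the knob appears as /proc/sys/fs/dentry-unused-ratio. */
extern unsigned int dentry_unused_ratio;

{
	.ctl_name	= CTL_UNNUMBERED,
	.procname	= "dentry-unused-ratio",
	.data		= &dentry_unused_ratio,
	.maxlen		= sizeof(unsigned int),
	.mode		= 0644,
	.proc_handler	= &proc_dointvec,
},
```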


- kosaki

2008-03-14 05:15:45

by Kentaro Makita

Subject: Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

-------------------------------------------------------------------------------
Basic file operations :
w/o patch on local ext3:
target \ operations | create | delete | list | copy | move
-----------------------+----------------------------------------------------------------------------------
1000 dirs x 1000 files | 22m6.930s | 0m32.682s | 0m0.037s | 1m31.506s | 0m2.154s
1000000 files | 22m37.759s | 18m34.901s | 0m0.002s | 19m24.388s | 0m0.156s
(elapsed time : second(s))

with patch on local ext3:
target \ operations | create | delete | list | copy | move
-----------------------+---------------------------------------------------------------------------------
1000 dirs x 1000 files | 21m54.470s | 0m32.040s | 0m0.008s | 1m30.796s | 0m2.943s
1000000 files | 22m8.381s | 21m55.047s | 0m0.020s | 21m25.779s | 0m0.052s
(elapsed time : second(s))

w/o patch on nfs:
target \ operations | create | delete | list | copy | move
------------------------+----------------------------------------------------------------------------------
1000000 files | 140m7.649s | 293m46.285s | 0m0.098s | 432m7.720s | 0m0.674s
(elapsed time : second(s))

with patch on nfs:
target \ operations | create | delete | list | copy | move
------------------------+--------------------------------------------------------------------------------
1000000 files | 141m53.534s | 290m17.669s | 0m0.040s | 440m51.964s | 0m0.361s
(elapsed time : second(s))

IOzone:
# ./iozone -Ra > logfile
on ext3:
bytes / sec (Average)
w/o patch with patch
Writer Report 499,136 502,536 100.68%
Re-writer Report 1,774,772 1,790,133 100.87%
Reader Report 3,761,592 3,818,147 101.50%
Re-reader Report 5,723,402 6,020,088 105.18%
Random Read Report 5,343,096 5,588,652 104.60%
Random Write Report 2,054,678 2,102,237 102.31%
Backward Read Report 3,628,740 3,696,570 101.87%
Record Rewrite Report 3,697,344 3,760,118 101.70%
Stride Read Report 4,899,821 5,053,645 103.14%
Fwrite Report 493,434 493,464 100.01%
Re-fwrite Report 1,505,555 1,516,702 100.74%
Fread Report 3,330,627 3,363,825 101.00%
Re-fread Report 5,404,997 5,572,977 103.11%

on nfs:
bytes / sec (Average)
w/o patch with patch
Writer Report 2,397,539 2,495,369 104.08%
Re-writer Report 2,534,827 2,539,019 100.17%
Reader Report 3,692,377 3,711,528 100.52%
Re-reader Report 5,783,150 5,745,256 99.34%
Random Read Report 5,569,286 5,663,204 101.69%
Random Write Report 2,982,048 2,988,895 100.23%
Backward Read Report 3,694,922 3,710,797 100.43%
Record Rewrite Report 5,844,580 5,873,414 100.49%
Stride Read Report 5,043,812 5,060,472 100.33%
Fwrite Report 1,769,812 1,788,991 101.08%
Re-fwrite Report 1,964,384 1,978,361 100.71%
Fread Report 3,362,162 3,293,340 97.95%
Re-fread Report 5,441,776 5,441,807 100.00%

kernbench-0.42:
# kernbench -M
w/o patch on local ext3:
2.6.25-rc5
Average Half load -j 12 Run (std deviation):
Elapsed Time 105.354 (0.608383)
User Time 1072.59 (1.42999)
System Time 68.406 (0.540074)
Percent CPU 1082.4 (5.17687)
Context Switches 75067.2 (2425.63)
Sleeps 155188 (2167.44)

Average Optimal load -j 96 Run (std deviation):
Elapsed Time 69.028 (0.523374)
User Time 1106.83 (36.1126)
System Time 67.735 (0.82922)
Percent CPU 1416 (351.761)
Context Switches 105700 (32397.8)
Sleeps 161568 (7136.89)

with patch on local ext3:
2.6.25-rc5dentry
Average Half load -j 12 Run (std deviation):
Elapsed Time 104.962 (0.0630079)
User Time 1071.74 (0.374993)
System Time 68.578 (0.301032)
Percent CPU 1086 (0.707107)
Context Switches 77173.8 (513.063)
Sleeps 156710 (669.205)

Average Optimal load -j 96 Run (std deviation):
Elapsed Time 68.826 (0.942804)
User Time 1107.5 (37.7007)
System Time 67.901 (0.770086)
Percent CPU 1422.2 (354.748)
Context Switches 107092 (31559.1)
Sleeps 161884 (6220.1)

w/o patch on nfs:
2.6.25-rc5
Average Half load -j 12 Run (std deviation):
Elapsed Time 237.71 (6.4713)
User Time 1087.07 (1.42099)
System Time 190.306 (0.941637)
Percent CPU 537.2 (15.0233)
Context Switches 358822 (8395.04)
Sleeps 4.46148e+06 (53959.4)

Average Optimal load -j 96 Run (std deviation):
Elapsed Time 286.312 (4.8972)
User Time 1127.59 (42.7355)
System Time 304.32 (120.184)
Percent CPU 545.5 (14.6382)
Context Switches 603299 (257858)
Sleeps 9.21507e+06 (5.01086e+06)

with patch on nfs:
2.6.25-rc5dentry
Average Half load -j 12 Run (std deviation):
Elapsed Time 257.704 (8.20142)
User Time 1087.19 (0.992084)
System Time 191.294 (1.11267)
Percent CPU 496 (15.5885)
Context Switches 356975 (14893.6)
Sleeps 4.42764e+06 (68507.4)

Average Optimal load -j 96 Run (std deviation):
Elapsed Time 293.448 (2.64979)
User Time 1127.5 (42.5004)
System Time 308.478 (123.531)
Percent CPU 519.3 (26.9281)
Context Switches 601352 (258290)
Sleeps 9.2956e+06 (5.13148e+06)

dbench-3.04:
(on local and nfs directories)
# dbench 100

w/o patch on local ext3:
Throughput 186.4 MB/sec 100 procs

with patch on local ext3:
Throughput 215.831 MB/sec 100 procs

w/o patch on nfs:
Throughput 3.13253 MB/sec 100 procs

with patch on nfs:
Throughput 3.37892 MB/sec 100 procs
-----------------------------------------------------------------------------------



2008-03-14 06:44:20

by David Chinner

Subject: Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused

On Fri, Mar 14, 2008 at 02:15:28PM +0900, Kentaro Makita wrote:
> Hi David
> On Thu, 6 Mar 2008 16:54:16 +1100 David Chinner wrote:
> >> No, we need a smarter free list structure. There have been several attempts
> >> at this in the past. Two that I can recall off the top of my head:
> >>
> >> - per node unused LRUs
> >> - per superblock unused LRUs
> >> I guess we need to revisit this again, because limiting the size of
> >> the cache like this is not an option.
> I'm interested in your patch. I'll test the two patches above if there
> is a newer version based on the latest kernel.
>
> >> Try something that relies on leaving the working set on the unused
> >> list, like NFS server benchmarks that have a working set of tens of
> >> millions of files....
> >>
> I tested the following, and I found no regressions except in one case.
> - kernbench-0.24 on local ext3 and nfs
> - dbench-3.04 on local ext3 and nfs
> - IOzone-3.291 on local ext3 and nfs
> - Basic file operations (create/delete/list/copy/move) on local ext3 and nfs

None of those really demonstrate the potential effects of your
proposed change. Even 1 million file sequential create and delete
will not stress it. It won't be until you need to hold that
million dentries in memory to prevent disk lookups while an
application generates significant memory pressure that you will
notice the difference. Without the dentries pinning the inodes,
they'll get reclaimed and need to be fetched from disk again....

FWIW - in trying to understand this a little more, I just checked my
idle test box just after boot and realised something:

$ cat /proc/sys/fs/dentry-state
12723 8709 45 0 0 0
$

That means 12723 allocated dentries, 8709 unused, i.e. ~4000 in use.

If the limiting test you are using is:

if (dentry_stat.nr_dentry > nr_in_use * dentry_unused_ratio / 100)
prune_dcache(dentry_stat.nr_unused * 5 / 100 , NULL);

We need to have (4000 * 10000) / 100 = 400,000 allocated, unused, cached
dentries before they get pruned back. I.e. the working set of dentries I
can currently have is 400,000.

I've got 24GB RAM on this box, and often I want to cache 10,000,000 inodes.
Under this algorithm, I'll need to pin 100,000 dentries to allow the cache to
grow this large or tweak a knob. Therein lies the problem....

Effectively, the dentry_unused_ratio is saying that for every node in
the dentry tree, we allow (dentry_unused_ratio / 100) cached leaves
distributed throughout the tree. At dentry_unused_ratio = 10,000
that gives us 100 leaves per node in the tree.

i.e. if your directory hierarchy is deep, then you can cache lots and
lots of inodes because you pin lots of dentries as nodes in the
tree. But if you have a flat directory structure, there will be
relatively few nodes pinned and you can't cache as many inodes.

IOWs, the size limiting aspect of this algorithm is biased in
exactly the wrong direction. It grows without bound on filesystem
traversal (and hence fails to prevent the condition you want to avoid)
yet prevents caching lots of file dentries if you have a shallow
directory structure (can affect normal application performance).

To prevent the first, you need to tweak the knob in one direction,
and to prevent the second, you need to tweak the knob in the
other direction. We try to avoid adding knobs that require ppl
to tweak them all the time to get optimal performance.

I think we're better off trying to fix the traversal issue....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group