2014-04-16 11:24:17

by Zheng Liu

Subject: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

Hi all,

Here is the second version of the extent status tree shrinker
improvement. In this version I do some cleanups, add some statistics,
and implement the two approaches that we discussed at Napa to improve
the shrinker.

The first approach improves the current LRU algorithm by adding a new
list that tracks all reclaimable objects, so the shrinker no longer
burns CPU time scanning delayed extents. It also makes the LRU
algorithm more efficient when applications open a huge number of
files. The second approach, inspired by Jan Kara, drops the LRU
algorithm and uses a round-robin algorithm to shrink all reclaimable
extent caches: on every invocation the shrinker scans the list and
tries to shrink objects starting from the position where it stopped
last time. Please see the commit logs in the patches for more details.

From the results, the conclusion is that the round-robin algorithm
wins, especially when the applications open a large number of files.

In this patch set, patch 1 is pretty stable and can be queued in this
cycle. Patch 2 adds some statistics so that we can collect more
details about the state of the shrinker, but I am not sure whether we
should enable it by default; maybe we need a switch to turn it on/off
dynamically. Patches 3 and 4 improve the shrinker as described above.

There are further possible improvements for these approaches, such as
using RCU when the shrinker traverses the list (the shrinker no longer
needs to modify the list during that traversal), or making the
shrinker NUMA-aware. But before that, I believe this patch set should
be reviewed as soon as possible; the key problem now is to decide
which approach should be applied.

I use two test cases to compare these improvements. Test case A
simulates applications that generate a very fragmented extents status
tree, and test case B simulates applications that open a large number
of files, each with only a few extent caches. Each test case is run
3 times.

To get a fragmented extents status tree, I hack the code so that
ext4_es_can_be_merged() always returns 0, which disables merging in
the extents status tree. Meanwhile, to increase the memory pressure,
vm.dirty_background_ratio is set to 60 and vm.dirty_ratio is set to 80
in order to keep as many dirty pages in memory as possible.
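
For reference, the hack is just a one-line change along these lines
(not part of this series; sketched against the current function, so
treat it as illustrative rather than a tested patch):

static int ext4_es_can_be_merged(struct extent_status *es1,
                                 struct extent_status *es2)
{
        /* Testing hack: refuse every merge so that the extents
         * status tree stays maximally fragmented. */
        return 0;
}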

Environment
===========
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4799.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15

$ cat /proc/meminfo
MemTotal: 24677988 kB

$ df -ah
/dev/sdb1 183G 15G 159G 9% /mnt/sdb1 (HDD)

The Test Case A
===============

Script
------
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
filesize=100000g
size=600000g
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
rw=write
numjobs=100
nrfiles=5

Max Scan Time
-------------
x vanilla
+ lru
* rr
N Min Max Median Avg Stddev
x 3 22230 24607 23532 23456.333 1190.3051
+ 3 203 364 301 289.33333 81.13158
Difference at 95.0% confidence
-23167 +/- 1912.16
-98.7665% +/- 8.15199%
(Student's t, pooled s = 843.626)
* 3 165 248 172 195 46.032597
Difference at 95.0% confidence
-23261.3 +/- 1909.16
-99.1687% +/- 8.1392%
(Student's t, pooled s = 842.302)

Avg. Scan Time
-------------
x vanilla
+ lru
* rr
N Min Max Median Avg Stddev
x 220 204 15997 3976 5268.6773 4121.2038
+ 220 105 169 126 132.65 14.904881
Difference at 95.0% confidence
-5136.03 +/- 544.593
-97.4823% +/- 10.3364%
(Student's t, pooled s = 2914.15)
* 224 55 144 82 97.834821 27.811093
Difference at 95.0% confidence
-5170.84 +/- 539.706
-98.1431% +/- 10.2437%
(Student's t, pooled s = 2900.98)

The Test Case B
===============

Script
------
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap

[io]
rw=randwrite
numjobs=25
nrfiles=40000

[streamer]
rw=write
numjobs=1
filesize=1000g
size=1000g
nrfiles=1

Max Scan Time
-------------
x vanilla
+ lru
* rr
N Min Max Median Avg Stddev
x 3 390531 481463 393469 421821 51672.373
+ 3 106433 170801 130652 135962 32510.874
Difference at 95.0% confidence
-285859 +/- 97844.9
-67.7678% +/- 23.1958%
(Student's t, pooled s = 43168.2)
* 3 72569 156338 113704 114203.67 41886.735
Difference at 95.0% confidence
-307617 +/- 106609
-72.926% +/- 25.2734%
(Student's t, pooled s = 47034.7)

Avg. Scan Time
-------------
x vanilla
+ lru
* rr
N Min Max Median Avg Stddev
x 221 164 155601 19553 24630.968 22736.242
+ 207 44 49210 13633 16167.768 15087.729
Difference at 95.0% confidence
-8463.2 +/- 3681.22
-34.36% +/- 14.9455%
(Student's t, pooled s = 19417.6)
* 78 41 18043 166 808.85897 2605.2387
Difference at 95.0% confidence
-23822.1 +/- 5062.86
-96.7161% +/- 20.5548%
(Student's t, pooled s = 19613.2)

As always, feedback, comments, and ideas are welcome.

Regards,
- Zheng

Zheng Liu (4):
ext4: improve extents status tree trace point
ext4: track extent status tree shrinker delay statistics
ext4: improve extents status tree shrinker lru algorithm
ext4: use a round-robin algorithm to shrink extent cache

fs/ext4/ext4.h | 11 +-
fs/ext4/extents.c | 4 +-
fs/ext4/extents_status.c | 310 +++++++++++++++++++++++++++++--------------
fs/ext4/extents_status.h | 16 ++-
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 4 +-
fs/ext4/super.c | 22 ++-
include/trace/events/ext4.h | 59 ++++++--
8 files changed, 296 insertions(+), 134 deletions(-)

--
1.7.9.7



2014-04-16 11:24:20

by Zheng Liu

Subject: [RFC PATCH v2 1/4] ext4: improve extents status tree trace point

From: Zheng Liu <[email protected]>

This commit improves the trace points of the extents status tree. We
rename trace_ext4_es_shrink_enter in ext4_es_count() because the same
trace point is also used in ext4_es_scan(), and the two call sites
cannot be told apart in the trace output.

This commit also fixes a variable name in a trace point in order to
keep it consistent with the others.

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 6 +++---
include/trace/events/ext4.h | 28 ++++++++++++++++++++--------
2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 0a014a7..090e69a 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1015,7 +1015,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,

sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+ trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}

@@ -1028,14 +1028,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int ret, nr_shrunk;

ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
+ trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
return ret;

nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);

- trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
+ trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
return nr_shrunk;
}

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 010ea89..c1d7a7c 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2379,7 +2379,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
show_extent_status(__entry->found ? __entry->status : 0))
);

-TRACE_EVENT(ext4_es_shrink_enter,
+DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),

TP_ARGS(sb, nr_to_scan, cache_cnt),
@@ -2401,26 +2401,38 @@ TRACE_EVENT(ext4_es_shrink_enter,
__entry->nr_to_scan, __entry->cache_cnt)
);

-TRACE_EVENT(ext4_es_shrink_exit,
- TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+TRACE_EVENT(ext4_es_shrink_scan_exit,
+ TP_PROTO(struct super_block *sb, int nr_shrunk, int cache_cnt),

- TP_ARGS(sb, shrunk_nr, cache_cnt),
+ TP_ARGS(sb, nr_shrunk, cache_cnt),

TP_STRUCT__entry(
__field( dev_t, dev )
- __field( int, shrunk_nr )
+ __field( int, nr_shrunk )
__field( int, cache_cnt )
),

TP_fast_assign(
__entry->dev = sb->s_dev;
- __entry->shrunk_nr = shrunk_nr;
+ __entry->nr_shrunk = nr_shrunk;
__entry->cache_cnt = cache_cnt;
),

- TP_printk("dev %d,%d shrunk_nr %d cache_cnt %d",
+ TP_printk("dev %d,%d nr_shrunk %d cache_cnt %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
- __entry->shrunk_nr, __entry->cache_cnt)
+ __entry->nr_shrunk, __entry->cache_cnt)
);

TRACE_EVENT(ext4_collapse_range,
--
1.7.9.7


2014-04-16 11:24:23

by Zheng Liu

Subject: [RFC PATCH v2 2/4] ext4: track extent status tree shrinker delay statistics

From: Zheng Liu <[email protected]>

This commit adds some statistics to the extent status tree shrinker.
The purpose is to collect more details for when we encounter a stall
caused by the shrinker. We count the following statistics:
stats:
the number of all objects on all extent status trees
the number of reclaimable objects on lru list
the last sorted interval
the number of inodes on lru list
average:
scan time for shrinking some objects
the number of shrunk objects
maximum:
the inode that has max nr. of objects on lru list
the maximum scan time for shrinking some objects

The output looks like below:
$ cat /proc/fs/ext4/sda1/es_shrinker_info
stats:
28228 objects
6341 reclaimable objects
586 ms last sorted interval
250 inodes on lru list
average:
153 us scan time
128 shrunk objects
maximum:
255 inode (255 objects, 198 reclaimable)
125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
586 ms last sorted interval
If the lru list is empty, the following lines will not be printed
either:
250 inodes on lru list
...
maximum:
255 inode (255 objects, 198 reclaimable)
0 us max scan time

This commit also defines a new trace point that prints some details
from __ext4_es_shrink().
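
Note that the two "average" values above are not true means; they are
kept as a 3/4-weighted moving average, seeded by the first sample. A
minimal standalone sketch of the update rule (the same arithmetic as
in __ext4_es_shrink(); the helper name is made up):

#include <stdio.h>
#include <stdint.h>

/* Keep 3/4 of the old average and mix in 1/4 of the new sample; the
 * first sample seeds the average.  This mirrors the updates of
 * es_stats_scan_time and es_stats_shrunk. */
static uint64_t es_avg_update(uint64_t avg, uint64_t sample)
{
        return avg ? (sample + avg * 3) / 4 : sample;
}

int main(void)
{
        uint64_t samples[] = { 100, 200, 100, 1000 };
        uint64_t avg = 0;
        size_t i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                avg = es_avg_update(avg, samples[i]);
                printf("sample %4llu -> avg %llu\n",
                       (unsigned long long)samples[i],
                       (unsigned long long)avg);
        }
        return 0;
}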

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 4 +-
fs/ext4/extents_status.c | 176 ++++++++++++++++++++++++++++++++++++++++---
fs/ext4/extents_status.h | 11 ++-
fs/ext4/super.c | 17 ++---
include/trace/events/ext4.h | 31 ++++++++
5 files changed, 216 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f1c65dc..89f3dfa 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,7 @@ struct ext4_inode_info {
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
struct list_head i_es_lru;
+ unsigned int i_es_all_nr; /* protected by i_es_lock */
unsigned int i_es_lru_nr; /* protected by i_es_lock */
unsigned long i_touch_when; /* jiffies of last accessing */

@@ -1329,8 +1330,7 @@ struct ext4_sb_info {
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
struct list_head s_es_lru;
- unsigned long s_es_last_sorted;
- struct percpu_counter s_extent_cache_cnt;
+ struct ext4_es_stats s_es_stats;
struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 090e69a..9d977a3 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -11,6 +11,8 @@
*/
#include <linux/rbtree.h>
#include <linux/list_sort.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
#include "ext4.h"
#include "extents_status.h"

@@ -313,19 +315,27 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
*/
if (!ext4_es_is_delayed(es)) {
EXT4_I(inode)->i_es_lru_nr++;
- percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_inc(&EXT4_SB(inode->i_sb)->
+ s_es_stats.es_stats_lru_cnt);
}

+ EXT4_I(inode)->i_es_all_nr++;
+ percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
return es;
}

static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ EXT4_I(inode)->i_es_all_nr--;
+ percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
/* Decrease the lru counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
EXT4_I(inode)->i_es_lru_nr--;
- percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_dec(&EXT4_SB(inode->i_sb)->
+ s_es_stats.es_stats_lru_cnt);
}

kmem_cache_free(ext4_es_cachep, es);
@@ -927,11 +937,16 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei)
{
struct ext4_inode_info *ei;
+ struct ext4_es_stats *es_stats;
struct list_head *cur, *tmp;
LIST_HEAD(skipped);
+ ktime_t start_time;
+ u64 scan_time;
int nr_shrunk = 0;
int retried = 0, skip_precached = 1, nr_skipped = 0;

+ es_stats = &sbi->s_es_stats;
+ start_time = ktime_get();
spin_lock(&sbi->s_es_lru_lock);

retry:
@@ -942,7 +957,8 @@ retry:
* If we have already reclaimed all extents from extent
* status tree, just stop the loop immediately.
*/
- if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+ if (percpu_counter_read_positive(
+ &es_stats->es_stats_lru_cnt) == 0)
break;

ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
@@ -952,7 +968,7 @@ retry:
* time. Normally we try hard to avoid shrinking
* precached inodes, but we will as a last resort.
*/
- if ((sbi->s_es_last_sorted < ei->i_touch_when) ||
+ if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
(skip_precached && ext4_test_inode_state(&ei->vfs_inode,
EXT4_STATE_EXT_PRECACHED))) {
nr_skipped++;
@@ -986,7 +1002,7 @@ retry:
if ((nr_shrunk == 0) && nr_skipped && !retried) {
retried++;
list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
- sbi->s_es_last_sorted = jiffies;
+ es_stats->es_stats_last_sorted = jiffies;
ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
i_es_lru);
/*
@@ -1004,6 +1020,22 @@ retry:
if (locked_ei && nr_shrunk == 0)
nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);

+ scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+ if (likely(es_stats->es_stats_scan_time))
+ es_stats->es_stats_scan_time = (scan_time +
+ es_stats->es_stats_scan_time*3) / 4;
+ else
+ es_stats->es_stats_scan_time = scan_time;
+ if (scan_time > es_stats->es_stats_max_scan_time)
+ es_stats->es_stats_max_scan_time = scan_time;
+ if (likely(es_stats->es_stats_shrunk))
+ es_stats->es_stats_shrunk = (nr_shrunk +
+ es_stats->es_stats_shrunk*3) / 4;
+ else
+ es_stats->es_stats_shrunk = nr_shrunk;
+
+ trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+ nr_skipped, retried);
return nr_shrunk;
}

@@ -1014,7 +1046,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
struct ext4_sb_info *sbi;

sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
- nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+ nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}
@@ -1027,7 +1059,7 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk;

- ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+ ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
@@ -1039,19 +1071,143 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
return nr_shrunk;
}

-void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
+static void *ext4_es_seq_shrinker_info_start(struct seq_file *seq, loff_t *pos)
+{
+ return *pos ? NULL : SEQ_START_TOKEN;
+}
+
+static void *
+ext4_es_seq_shrinker_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ return NULL;
+}
+
+static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
+{
+ struct ext4_sb_info *sbi = seq->private;
+ struct ext4_es_stats *es_stats = &sbi->s_es_stats;
+ struct ext4_inode_info *ei, *max = NULL;
+ unsigned int inode_cnt = 0;
+
+ if (v != SEQ_START_TOKEN)
+ return 0;
+
+ /* here we just find an inode that has the max nr. of objects */
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+ inode_cnt++;
+ if (max && max->i_es_all_nr < ei->i_es_all_nr)
+ max = ei;
+ else if (!max)
+ max = ei;
+ }
+ spin_unlock(&sbi->s_es_lru_lock);
+
+ seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
+ percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
+ percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+ if (es_stats->es_stats_last_sorted != 0)
+ seq_printf(seq, " %u ms last sorted interval\n",
+ jiffies_to_msecs(jiffies -
+ es_stats->es_stats_last_sorted));
+ if (inode_cnt)
+ seq_printf(seq, " %d inodes on lru list\n", inode_cnt);
+
+ seq_printf(seq, "average:\n %llu us scan time\n",
+ div_u64(es_stats->es_stats_scan_time, 1000));
+ seq_printf(seq, " %lu shrunk objects\n", es_stats->es_stats_shrunk);
+ if (inode_cnt)
+ seq_printf(seq,
+ "maximum:\n %lu inode (%u objects, %u reclaimable)\n"
+ " %llu us max scan time\n",
+ max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+ div_u64(es_stats->es_stats_max_scan_time, 1000));
+
+ return 0;
+}
+
+static void ext4_es_seq_shrinker_info_stop(struct seq_file *seq, void *v)
{
+}
+
+static const struct seq_operations ext4_es_seq_shrinker_info_ops = {
+ .start = ext4_es_seq_shrinker_info_start,
+ .next = ext4_es_seq_shrinker_info_next,
+ .stop = ext4_es_seq_shrinker_info_stop,
+ .show = ext4_es_seq_shrinker_info_show,
+};
+
+static int
+ext4_es_seq_shrinker_info_open(struct inode *inode, struct file *file)
+{
+ int ret;
+
+ ret = seq_open(file, &ext4_es_seq_shrinker_info_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = PDE_DATA(inode);
+ }
+
+ return ret;
+}
+
+static int
+ext4_es_seq_shrinker_info_release(struct inode *inode, struct file *file)
+{
+ return seq_release(inode, file);
+}
+
+static const struct file_operations ext4_es_seq_shrinker_info_fops = {
+ .owner = THIS_MODULE,
+ .open = ext4_es_seq_shrinker_info_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = ext4_es_seq_shrinker_info_release,
+};
+
+int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
+{
+ int err;
+
INIT_LIST_HEAD(&sbi->s_es_lru);
spin_lock_init(&sbi->s_es_lru_lock);
- sbi->s_es_last_sorted = 0;
+ sbi->s_es_stats.es_stats_last_sorted = 0;
+ sbi->s_es_stats.es_stats_shrunk = 0;
+ sbi->s_es_stats.es_stats_scan_time = 0;
+ sbi->s_es_stats.es_stats_max_scan_time = 0;
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0);
+ if (err)
+ return err;
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0);
+ if (err)
+ goto err1;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
- register_shrinker(&sbi->s_es_shrinker);
+ err = register_shrinker(&sbi->s_es_shrinker);
+ if (err)
+ goto err2;
+
+ if (sbi->s_proc)
+ proc_create_data("es_shrinker_info", S_IRUGO, sbi->s_proc,
+ &ext4_es_seq_shrinker_info_fops, sbi);
+
+ return 0;
+
+err2:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+err1:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+ return err;
}

void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
+ if (sbi->s_proc)
+ remove_proc_entry("es_shrinker_info", sbi->s_proc);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
unregister_shrinker(&sbi->s_es_shrinker);
}

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f1b62a4..647c3c9 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -64,6 +64,15 @@ struct ext4_es_tree {
struct extent_status *cache_es; /* recently accessed extent */
};

+struct ext4_es_stats {
+ unsigned long es_stats_last_sorted;
+ unsigned long es_stats_shrunk;
+ u64 es_stats_scan_time;
+ u64 es_stats_max_scan_time;
+ struct percpu_counter es_stats_all_cnt;
+ struct percpu_counter es_stats_lru_cnt;
+};
+
extern int __init ext4_init_es(void);
extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
@@ -138,7 +147,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
(pb & ~ES_MASK));
}

-extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
+extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_lru_add(struct inode *inode);
extern void ext4_es_lru_del(struct inode *inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f3c6670..df0fb22 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -820,7 +820,6 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
- percpu_counter_destroy(&sbi->s_extent_cache_cnt);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < MAXQUOTAS; i++)
@@ -884,6 +883,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
INIT_LIST_HEAD(&ei->i_es_lru);
+ ei->i_es_all_nr = 0;
ei->i_es_lru_nr = 0;
ei->i_touch_when = 0;
ei->i_reserved_data_blocks = 0;
@@ -3890,10 +3890,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_err_report.data = (unsigned long) sb;

/* Register extent status tree shrinker */
- ext4_es_register_shrinker(sbi);
-
- err = percpu_counter_init(&sbi->s_freeclusters_counter,
- ext4_count_free_clusters(sb));
+ err = ext4_es_register_shrinker(sbi);
+ if (!err) {
+ err = percpu_counter_init(&sbi->s_freeclusters_counter,
+ ext4_count_free_clusters(sb));
+ }
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext4_count_free_inodes(sb));
@@ -3905,9 +3906,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!err) {
err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0);
}
- if (!err) {
- err = percpu_counter_init(&sbi->s_extent_cache_cnt, 0);
- }
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
goto failed_mount3;
@@ -4220,8 +4218,8 @@ failed_mount_wq:
jbd2_journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
}
-failed_mount3:
ext4_es_unregister_shrinker(sbi);
+failed_mount3:
del_timer_sync(&sbi->s_err_report);
if (sbi->s_flex_groups)
ext4_kvfree(sbi->s_flex_groups);
@@ -4229,7 +4227,6 @@ failed_mount3:
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
- percpu_counter_destroy(&sbi->s_extent_cache_cnt);
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
failed_mount2:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index c1d7a7c..996e58b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2460,6 +2460,37 @@ TRACE_EVENT(ext4_collapse_range,
__entry->offset, __entry->len)
);

+TRACE_EVENT(ext4_es_shrink,
+ TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
+ int skip_precached, int nr_skipped, int retried),
+
+ TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_shrunk )
+ __field( unsigned long long, scan_time )
+ __field( int, skip_precached )
+ __field( int, nr_skipped )
+ __field( int, retried )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_shrunk = nr_shrunk;
+ __entry->scan_time = div_u64(scan_time, 1000);
+ __entry->skip_precached = skip_precached;
+ __entry->nr_skipped = nr_skipped;
+ __entry->retried = retried;
+ ),
+
+ TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+ "nr_skipped %d retried %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
+ __entry->scan_time, __entry->skip_precached,
+ __entry->nr_skipped, __entry->retried)
+);
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
--
1.7.9.7


2014-04-16 11:24:27

by Zheng Liu

Subject: [RFC PATCH v2 4/4] ext4: use a round-robin algorithm to shrink extent cache

From: Zheng Liu <[email protected]>

This commit drops the LRU algorithm and uses a new round-robin
algorithm to reclaim the extent cache. The new algorithm scans the
extent caches and remembers the inode and offset where it stopped
last time, then rescans from there. If that inode has already been
reclaimed, the shrinker scans from the next inode.

In the first round the shrinker skips precached inodes; it rescans
the list to reclaim those precached inodes only if it could not
reclaim any objects in the first round.

In fact the shrinker does not need any new variables to save the
inode or offset. Each time, the inodes that have been scanned are
moved to the tail of the list, and inodes with newly cached extents
are also added at the tail. Thus the next scan automatically starts
from the inode and offset where the shrinker stopped last time, as
the sketch below illustrates.
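
The actual patch rotates the scanned inodes in one batch with
list_cut_position()/list_splice_tail() (see the hunk in
__ext4_es_shrink() below). The toy userspace program here rotates
node by node instead, but it shows the same resume-where-we-stopped
behaviour; all names in it are made up:

#include <stdio.h>

struct node {
        int id;
        struct node *prev, *next;
};

static void list_init(struct node *head)
{
        head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

static void list_del(struct node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
}

/* Scan up to 'budget' nodes from the front, moving each scanned node
 * to the tail, so the next scan resumes where this one stopped. */
static void scan(struct node *head, int budget)
{
        while (budget-- > 0 && head->next != head) {
                struct node *n = head->next;

                printf("scanning inode %d\n", n->id);
                list_del(n);
                list_add_tail(n, head);
        }
}

int main(void)
{
        struct node head, nodes[5];
        int i;

        list_init(&head);
        for (i = 0; i < 5; i++) {
                nodes[i].id = i;
                list_add_tail(&nodes[i], &head);
        }
        scan(&head, 3);         /* scans inodes 0, 1, 2 */
        scan(&head, 3);         /* resumes: scans inodes 3, 4, 0 */
        return 0;
}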

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 7 ++--
fs/ext4/extents.c | 4 +--
fs/ext4/extents_status.c | 82 ++++++++++++++++++++++------------------------
fs/ext4/extents_status.h | 9 +++--
fs/ext4/inode.c | 4 +--
fs/ext4/ioctl.c | 4 +--
fs/ext4/super.c | 5 ++-
7 files changed, 54 insertions(+), 61 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 89f3dfa..82238ba 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -889,10 +889,9 @@ struct ext4_inode_info {
/* extents status tree */
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
- struct list_head i_es_lru;
+ struct list_head i_es_list;
unsigned int i_es_all_nr; /* protected by i_es_lock */
unsigned int i_es_lru_nr; /* protected by i_es_lock */
- unsigned long i_touch_when; /* jiffies of last accessing */

/* ialloc */
ext4_group_t i_last_alloc_group;
@@ -1329,10 +1328,10 @@ struct ext4_sb_info {

/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
- struct list_head s_es_lru;
+ struct list_head s_es_list;
struct ext4_es_stats s_es_stats;
struct mb_cache *s_mb_cache;
- spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
+ spinlock_t s_es_lock ____cacheline_aligned_in_smp;

/* Ratelimit ext4 messages. */
struct ratelimit_state s_err_ratelimit_state;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 82df3ce..ed7da01 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4617,7 +4617,7 @@ out2:

trace_ext4_ext_map_blocks_exit(inode, flags, map,
err ? err : allocated);
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
return err ? err : allocated;
}

@@ -5159,7 +5159,7 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
error = ext4_fill_fiemap_extents(inode, start_blk,
len_blks, fieinfo);
}
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
return error;
}

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index fe33557..f05cb40 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -170,7 +170,7 @@ void ext4_exit_es(void)
void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
- INIT_HLIST_HEAD(&tree->evictable_list);
+ INIT_LIST_HEAD(&tree->evictable_list);
tree->cache_es = NULL;
}

@@ -309,7 +309,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
if (es == NULL)
return NULL;

- INIT_HLIST_NODE(&es->es_list);
+ INIT_LIST_HEAD(&es->es_list);
es->es_lblk = lblk;
es->es_len = len;
es->es_pblk = pblk;
@@ -321,7 +321,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ei->i_es_lru_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_lru_cnt);
- hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);
+ list_add_tail(&es->es_list, &ei->i_es_tree.evictable_list);
}

EXT4_I(inode)->i_es_all_nr++;
@@ -340,7 +340,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
/* Decrease the lru counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
BUG_ON(ei->i_es_lru_nr-- == 0);
- hlist_del_init(&es->es_list);
+ list_del_init(&es->es_list);
percpu_counter_dec(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_lru_cnt);
}
@@ -922,21 +922,20 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei)
{
- struct ext4_inode_info *ei;
struct ext4_es_stats *es_stats;
+ struct ext4_inode_info *ei;
+ LIST_HEAD(scanned);
ktime_t start_time;
u64 scan_time;
- int nr_shrunk = 0;
+ int nr_shrunk = 0, shrunk;
int retried = 0, skip_precached = 1, nr_skipped = 0;

es_stats = &sbi->s_es_stats;
start_time = ktime_get();
- spin_lock(&sbi->s_es_lru_lock);
+ spin_lock(&sbi->s_es_lock);

retry:
- list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
- int shrunk;
-
+ list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
/*
* If we have already reclaimed all extents from extent
* status tree, just stop the loop immediately.
@@ -950,9 +949,8 @@ retry:
* time. Normally we try hard to avoid shrinking
* precached inodes, but we will as a last resort.
*/
- if ((es_stats->es_stats_last_scanned < ei->i_touch_when) ||
- (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
- EXT4_STATE_EXT_PRECACHED))) {
+ if (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
+ EXT4_STATE_EXT_PRECACHED)) {
nr_skipped++;
continue;
}
@@ -970,12 +968,18 @@ retry:
break;
}

+ /* Move the scanned inodes into the tail of the list */
+ if (&ei->i_es_list != &sbi->s_es_list) {
+ struct ext4_inode_info *prev = list_prev_entry(ei, i_es_list);
+ list_cut_position(&scanned, &sbi->s_es_list, &prev->i_es_list);
+ list_splice_tail(&scanned, &sbi->s_es_list);
+ }
+
/*
* If we skipped any inodes, and we weren't able to make any
- * forward progress, update the last scanned time stamp and try again.
+ * forward progress, release precached inode.
*/
if ((nr_shrunk == 0) && nr_skipped && !retried) {
- es_stats->es_stats_last_scanned = jiffies;
retried++;
/*
* If the shrinker can not reclaim any objects at the
@@ -985,7 +989,7 @@ retry:
goto retry;
}

- spin_unlock(&sbi->s_es_lru_lock);
+ spin_unlock(&sbi->s_es_lock);

if (locked_ei && nr_shrunk == 0)
nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
@@ -1063,25 +1067,21 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
return 0;

/* here we just find an inode that has the max nr. of objects */
- spin_lock(&sbi->s_es_lru_lock);
- list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+ spin_lock(&sbi->s_es_lock);
+ list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
inode_cnt++;
if (max && max->i_es_all_nr < ei->i_es_all_nr)
max = ei;
else if (!max)
max = ei;
}
- spin_unlock(&sbi->s_es_lru_lock);
+ spin_unlock(&sbi->s_es_lock);

seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
- if (es_stats->es_stats_last_scanned != 0)
- seq_printf(seq, " %u ms last sorted interval\n",
- jiffies_to_msecs(jiffies -
- es_stats->es_stats_last_scanned));
if (inode_cnt)
- seq_printf(seq, " %d inodes on lru list\n", inode_cnt);
+ seq_printf(seq, " %d inodes on list\n", inode_cnt);

seq_printf(seq, "average:\n %llu us scan time\n",
div_u64(es_stats->es_stats_scan_time, 1000));
@@ -1139,9 +1139,8 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
{
int err;

- INIT_LIST_HEAD(&sbi->s_es_lru);
- spin_lock_init(&sbi->s_es_lru_lock);
- sbi->s_es_stats.es_stats_last_scanned = 0;
+ INIT_LIST_HEAD(&sbi->s_es_list);
+ spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
@@ -1181,31 +1180,29 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
unregister_shrinker(&sbi->s_es_shrinker);
}

-void ext4_es_lru_add(struct inode *inode)
+void ext4_es_list_add(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

- ei->i_touch_when = jiffies;
-
- if (!list_empty(&ei->i_es_lru))
+ if (!list_empty(&ei->i_es_list))
return;

- spin_lock(&sbi->s_es_lru_lock);
- if (list_empty(&ei->i_es_lru))
- list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
- spin_unlock(&sbi->s_es_lru_lock);
+ spin_lock(&sbi->s_es_lock);
+ if (list_empty(&ei->i_es_list))
+ list_add_tail(&ei->i_es_list, &sbi->s_es_list);
+ spin_unlock(&sbi->s_es_lock);
}

-void ext4_es_lru_del(struct inode *inode)
+void ext4_es_list_del(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

- spin_lock(&sbi->s_es_lru_lock);
- if (!list_empty(&ei->i_es_lru))
- list_del_init(&ei->i_es_lru);
- spin_unlock(&sbi->s_es_lru_lock);
+ spin_lock(&sbi->s_es_lock);
+ if (!list_empty(&ei->i_es_list))
+ list_del_init(&ei->i_es_list);
+ spin_unlock(&sbi->s_es_lock);
}

static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
@@ -1213,8 +1210,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
- struct extent_status *es;
- struct hlist_node *tmp;
+ struct extent_status *es, *tmp;
unsigned long nr_shrunk = 0;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -1226,7 +1222,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
__ratelimit(&_rs))
ext4_warning(inode->i_sb, "forced shrink of precached extents");

- hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
+ list_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
BUG_ON(ext4_es_is_delayed(es));
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(inode, es);
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f19ca17..3626d88 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,7 +54,7 @@ struct ext4_extent;

struct extent_status {
struct rb_node rb_node;
- struct hlist_node es_list;
+ struct list_head es_list;
ext4_lblk_t es_lblk; /* first logical block extent covers */
ext4_lblk_t es_len; /* length of extent in block */
ext4_fsblk_t es_pblk; /* first physical block */
@@ -62,12 +62,11 @@ struct extent_status {

struct ext4_es_tree {
struct rb_root root;
- struct hlist_head evictable_list;
+ struct list_head evictable_list;
struct extent_status *cache_es; /* recently accessed extent */
};

struct ext4_es_stats {
- unsigned long es_stats_last_scanned;
unsigned long es_stats_shrunk;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
@@ -151,7 +150,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,

extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
-extern void ext4_es_lru_add(struct inode *inode);
-extern void ext4_es_lru_del(struct inode *inode);
+extern void ext4_es_list_add(struct inode *inode);
+extern void ext4_es_list_del(struct inode *inode);

#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0171c19..dba2c1f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -523,7 +523,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,

/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
map->m_pblk = ext4_es_pblock(&es) +
map->m_lblk - es.es_lblk;
@@ -1532,7 +1532,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,

/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, iblock, &es)) {
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
if (ext4_es_is_hole(&es)) {
retval = 0;
down_read((&EXT4_I(inode)->i_data_sem));
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f2252e..25c9ef0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -78,8 +78,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
memswap(&ei1->i_disksize, &ei2->i_disksize, sizeof(ei1->i_disksize));
ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
- ext4_es_lru_del(inode1);
- ext4_es_lru_del(inode2);
+ ext4_es_list_del(inode1);
+ ext4_es_list_del(inode2);

isize = i_size_read(inode1);
i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index df0fb22..8d9cdf4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -882,10 +882,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
- INIT_LIST_HEAD(&ei->i_es_lru);
+ INIT_LIST_HEAD(&ei->i_es_list);
ei->i_es_all_nr = 0;
ei->i_es_lru_nr = 0;
- ei->i_touch_when = 0;
ei->i_reserved_data_blocks = 0;
ei->i_reserved_meta_blocks = 0;
ei->i_allocated_meta_blocks = 0;
@@ -974,7 +973,7 @@ void ext4_clear_inode(struct inode *inode)
dquot_drop(inode);
ext4_discard_preallocations(inode);
ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
- ext4_es_lru_del(inode);
+ ext4_es_list_del(inode);
if (EXT4_I(inode)->jinode) {
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
EXT4_I(inode)->jinode);
--
1.7.9.7


2014-04-16 11:24:24

by Zheng Liu

Subject: [RFC PATCH v2 3/4] ext4: improve extents status tree shrinker lru algorithm

From: Zheng Liu <[email protected]>

Currently there are two defects in the extents status tree shrinker:
non-reclaimable entries, and the lru list itself.

Under heavy memory pressure, the extents status tree shrinker scans
all inodes on sbi->s_es_lru and tries to reclaim entries from their
extents status trees. During this process it cannot reclaim delayed
entries, because ext4 needs them for delayed allocation space
reservation, seek_data/hole, etc. So if a system does a huge number
of writes whose dirty pages have not yet been written out, there will
be a lot of delayed entries in the extents status trees, and a
shrinker trying to reclaim memory from the trees burns CPU time
iterating over these non-reclaimable entries. Under some
circumstances this can cause excessive stall time. In this commit a
new list is used to track the reclaimable entries of an extents
status tree (i.e. written/unwritten/hole entries). The shrinker scans
reclaimable entries on this list only, so it never encounters a
delayed entry and no longer needs to spin over them. The downside is
an extra 1/3 of memory per entry: before this commit 'struct
extent_status' occupies 48 bytes on a 64-bit platform; after it, 64
bytes. :(

Another improvement in this commit is to make the lru list more
efficient. Currently, whenever we fail to shrink any entry from the
extents status tree, we sort the lru list and scan it again. This
takes too much time and causes a huge stall when an application opens
a large number of files. In this commit we instead use
ei->i_touch_when to decide whether the shrinker should reclaim
entries from an inode. This time stamp is updated in
ext4_es_lru_add(). Inodes whose i_touch_when is newer than
es_stats_last_scanned are skipped; older ones have their entries
reclaimed. When the shrinker cannot reclaim any entry,
es_stats_last_scanned is updated and the shrinker retries, this time
also reclaiming from precached inodes.
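
The bookkeeping for the first change boils down to two rules; this is
just a digest of the hunks below, not new code. Delayed entries never
appear on the per-inode evictable list, and reclaim walks only that
list instead of the whole rbtree:

        /* On insert: only non-delayed entries become evictable. */
        if (!ext4_es_is_delayed(es))
                hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);

        /* On reclaim: walk the evictable list only, so a delayed
         * entry is never even visited. */
        hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
                rb_erase(&es->rb_node, &tree->root);
                ext4_es_free_extent(inode, es);
                nr_shrunk++;
                if (--nr_to_scan == 0)
                        break;
        }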

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 98 ++++++++++++++--------------------------------
fs/ext4/extents_status.h | 4 +-
2 files changed, 33 insertions(+), 69 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9d977a3..fe33557 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -10,7 +10,6 @@
* Ext4 extents status tree core functions.
*/
#include <linux/rbtree.h>
-#include <linux/list_sort.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include "ext4.h"
@@ -171,6 +170,7 @@ void ext4_exit_es(void)
void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
+ INIT_HLIST_HEAD(&tree->evictable_list);
tree->cache_es = NULL;
}

@@ -302,10 +302,14 @@ static struct extent_status *
ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ext4_fsblk_t pblk)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
struct extent_status *es;
+
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
if (es == NULL)
return NULL;
+
+ INIT_HLIST_NODE(&es->es_list);
es->es_lblk = lblk;
es->es_len = len;
es->es_pblk = pblk;
@@ -314,9 +318,10 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
* We don't count delayed extent because we never try to reclaim them
*/
if (!ext4_es_is_delayed(es)) {
- EXT4_I(inode)->i_es_lru_nr++;
+ ei->i_es_lru_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_lru_cnt);
+ hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);
}

EXT4_I(inode)->i_es_all_nr++;
@@ -327,13 +332,15 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,

static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
EXT4_I(inode)->i_es_all_nr--;
percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);

/* Decrease the lru counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
- BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
- EXT4_I(inode)->i_es_lru_nr--;
+ BUG_ON(ei->i_es_lru_nr-- == 0);
+ hlist_del_init(&es->es_list);
percpu_counter_dec(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_lru_cnt);
}
@@ -912,34 +919,11 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
return err;
}

-static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
- struct list_head *b)
-{
- struct ext4_inode_info *eia, *eib;
- eia = list_entry(a, struct ext4_inode_info, i_es_lru);
- eib = list_entry(b, struct ext4_inode_info, i_es_lru);
-
- if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
- !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
- return 1;
- if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
- ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
- return -1;
- if (eia->i_touch_when == eib->i_touch_when)
- return 0;
- if (time_after(eia->i_touch_when, eib->i_touch_when))
- return 1;
- else
- return -1;
-}
-
static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei)
{
struct ext4_inode_info *ei;
struct ext4_es_stats *es_stats;
- struct list_head *cur, *tmp;
- LIST_HEAD(skipped);
ktime_t start_time;
u64 scan_time;
int nr_shrunk = 0;
@@ -950,7 +934,7 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
spin_lock(&sbi->s_es_lru_lock);

retry:
- list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+ list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
int shrunk;

/*
@@ -961,18 +945,15 @@ retry:
&es_stats->es_stats_lru_cnt) == 0)
break;

- ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
-
/*
- * Skip the inode that is newer than the last_sorted
+ * Skip the inode that is newer than the last_scanned
* time. Normally we try hard to avoid shrinking
* precached inodes, but we will as a last resort.
*/
- if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
+ if ((es_stats->es_stats_last_scanned < ei->i_touch_when) ||
(skip_precached && ext4_test_inode_state(&ei->vfs_inode,
EXT4_STATE_EXT_PRECACHED))) {
nr_skipped++;
- list_move_tail(cur, &skipped);
continue;
}

@@ -981,8 +962,6 @@ retry:

write_lock(&ei->i_es_lock);
shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
- if (ei->i_es_lru_nr == 0)
- list_del_init(&ei->i_es_lru);
write_unlock(&ei->i_es_lock);

nr_shrunk += shrunk;
@@ -991,27 +970,18 @@ retry:
break;
}

- /* Move the newer inodes into the tail of the LRU list. */
- list_splice_tail(&skipped, &sbi->s_es_lru);
- INIT_LIST_HEAD(&skipped);
-
/*
* If we skipped any inodes, and we weren't able to make any
- * forward progress, sort the list and try again.
+ * forward progress, update the last scanned time stamp and try again.
*/
if ((nr_shrunk == 0) && nr_skipped && !retried) {
+ es_stats->es_stats_last_scanned = jiffies;
retried++;
- list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
- es_stats->es_stats_last_sorted = jiffies;
- ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
- i_es_lru);
/*
- * If there are no non-precached inodes left on the
- * list, start releasing precached extents.
+ * If the shrinker can not reclaim any objects at the
+ * first round, we start to reclaim precached inodes.
*/
- if (ext4_test_inode_state(&ei->vfs_inode,
- EXT4_STATE_EXT_PRECACHED))
- skip_precached = 0;
+ skip_precached = 0;
goto retry;
}

@@ -1106,10 +1076,10 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
- if (es_stats->es_stats_last_sorted != 0)
+ if (es_stats->es_stats_last_scanned != 0)
seq_printf(seq, " %u ms last sorted interval\n",
jiffies_to_msecs(jiffies -
- es_stats->es_stats_last_sorted));
+ es_stats->es_stats_last_scanned));
if (inode_cnt)
seq_printf(seq, " %d inodes on lru list\n", inode_cnt);

@@ -1171,7 +1141,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)

INIT_LIST_HEAD(&sbi->s_es_lru);
spin_lock_init(&sbi->s_es_lru_lock);
- sbi->s_es_stats.es_stats_last_sorted = 0;
+ sbi->s_es_stats.es_stats_last_scanned = 0;
sbi->s_es_stats.es_stats_shrunk = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
@@ -1243,8 +1213,8 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
- struct rb_node *node;
struct extent_status *es;
+ struct hlist_node *tmp;
unsigned long nr_shrunk = 0;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -1256,21 +1226,13 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
__ratelimit(&_rs))
ext4_warning(inode->i_sb, "forced shrink of precached extents");

- node = rb_first(&tree->root);
- while (node != NULL) {
- es = rb_entry(node, struct extent_status, rb_node);
- node = rb_next(&es->rb_node);
- /*
- * We can't reclaim delayed extent from status tree because
- * fiemap, bigallic, and seek_data/hole need to use it.
- */
- if (!ext4_es_is_delayed(es)) {
- rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(inode, es);
- nr_shrunk++;
- if (--nr_to_scan == 0)
- break;
- }
+ hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
+ BUG_ON(ext4_es_is_delayed(es));
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(inode, es);
+ nr_shrunk++;
+ if (--nr_to_scan == 0)
+ break;
}
tree->cache_es = NULL;
return nr_shrunk;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 647c3c9..f19ca17 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ struct ext4_extent;

struct extent_status {
struct rb_node rb_node;
+ struct hlist_node es_list;
ext4_lblk_t es_lblk; /* first logical block extent covers */
ext4_lblk_t es_len; /* length of extent in block */
ext4_fsblk_t es_pblk; /* first physical block */
@@ -61,11 +62,12 @@ struct extent_status {

struct ext4_es_tree {
struct rb_root root;
+ struct hlist_head evictable_list;
struct extent_status *cache_es; /* recently accessed extent */
};

struct ext4_es_stats {
- unsigned long es_stats_last_sorted;
+ unsigned long es_stats_last_scanned;
unsigned long es_stats_shrunk;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
--
1.7.9.7


2014-04-16 15:19:44

by Theodore Ts'o

Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

Hi Zheng,

Thanks so much for your deep and detailed work on this.

I have two immediate reactions.

1) We should first fix __ext4_es_shrink so that we interpret
nr_to_scan correctly --- it's the number of objects to scan, not the
number of objects that we need to shrink. That should significantly
reduce the number of scans that we do, and fixing this could
potentially influence the metrics that we measure.

2) In addition to measuring the scan time, we should also measure how
many times we end up loading extents from disk when, during the course
of a normal workload under memory pressure, RR ends up throwing out an
extent that LRU would have kept, so that an additional metadata read
is required when using RR.

- Ted

2014-04-16 15:42:15

by Theodore Ts'o

Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Wed, Apr 16, 2014 at 11:19:38AM -0400, Theodore Ts'o wrote:
>
> 1) We should first fix __ext4_es_shrink so that we interpret
> nr_to_scan correctly --- it's the number of objects to scan, not the
> number of objects that we need to shrink. That should significantly
> reduce the number of scans that we do, and fixing this could
> potentially influence the metrics that we measure.

I've been looking at this more closely, and what we're doing isn't as
bad as I thought. We only return the number of extents that are not
subject to delayed allocation, and the number of items we shrink is equal
to the number of objects that we scan.

It may be, however, that the better way to do this is to return the
number of items in the extent status cache (i.e., including the
delalloc extents), and then skip the delalloc extents. That way the
VM knows how much work we are doing and can balance the amount of
work our shrinker is doing against the other shrinkers. If there
are no cache entries that can be freed (because they are all delalloc
entries), we could then return SHRINK_STOP.

That should help in particular with the really pathological workloads
where we have a large number of delalloc extents.

- Ted

2014-04-17 15:35:32

by Theodore Ts'o

Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

So I've been thinking about this some more, and it seems to me that
what we actually need is *both* an LRU and an RR scheme.

The real problem here is that we have workloads that are generating a
large number of "low value" extent cache entries. That is, they are
extremely unlikely to be used again, because they are small, and being
generated when you have a highly fragmented extent status cache, and
very often, the workload is a random read/write workload, so there is
no way the full "working set" of extent cache entries could be kept in
memory at the same time anyway. These less valuable cache entries are
being generated at a very high rate, and we want to make sure we don't
penalize the "valuable" cache entries.

There's a classic solution to this problem for garbage collectors, and
that's to have a "nursery" and "tenured" space. So what we could do
is to have two lists (as the proposed LRU improvement patch does), but
in the first list, we put the delalloc and "tenured" cache entries,
and in the second list we put the "nursery" cache entries.

The "nursery" cache items are cleaned using an RR scheme, and indeed,
we might want to have a system where we try to keep the "nursery"
cache items to a manageable level, even if we aren't under memory
pressure. If a cache item gets used a certain number of times, then
when we get to that item in the RR scheme, it gets "promoted" to the
"tenured" space.

The "tenured" space is then kept under control using some kind of LRU
scheme, and a target number of "tenured" items. (We might or might
not want to count delalloc entries for the purposes of this target.
That's TBD.)

The system should ideally automatically tune itself to control the
promotion rate from the nursery to tenured space based on the number
of uses required before a cache entry gets promoted, and there will be
a bunch of heuristics that we'll need to tune. But I think this
general approach should work pretty well.
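
Roughly, the bookkeeping I'm imagining would look something like this
(every name, field, and number below is hypothetical; it's only meant
to make the idea concrete):

        struct ext4_es_gens {
                struct list_head nursery;  /* new entries, RR reclaim */
                struct list_head tenured;  /* delalloc + promoted, LRU reclaim */
                unsigned long    nr_tenured;
                unsigned long    tenured_target;  /* soft cap, to be tuned */
        };

        #define ES_PROMOTE_THRESHOLD    4  /* uses before promotion; tunable */

        /* Called when the RR sweep reaches an entry in the nursery. */
        static void es_nursery_visit(struct ext4_es_gens *gens,
                                     struct extent_status *es)
        {
                if (es->es_nr_uses >= ES_PROMOTE_THRESHOLD) {
                        /* proven valuable: now managed by the LRU */
                        list_move_tail(&es->es_list, &gens->tenured);
                        gens->nr_tenured++;
                } else {
                        es_evict(es);   /* hypothetical helper */
                }
        }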

What do other people think?

- Ted


2014-04-21 13:43:46

by Zheng Liu

Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Thu, Apr 17, 2014 at 11:35:26AM -0400, Theodore Ts'o wrote:
> So I've been thinking about this some more, and it seems to me that
> what we actually need is *both* an LRU and an RR scheme.
>
> The real problem here is that we have workloads that are generating a
> large number of "low value" extent cache entries. That is, they are
> extremely unlikely to be used again, because they are small, and being
> generated when you have a highly fragmented extent status cache, and
> very often, the workload is a random read/write workload, so there is
> no way the full "working set" of extent cache entries could be kept in
> memory at the same time anyway. These less valuable cache entries are
> being generated at a very high rate, and we want to make sure we don't
> penalize the "valuable" cache entries.
>
> There's a classic solution to this problem for garbage collectors, and
> that's to have a "nursery" and "tenured" space. So what we could do
> is to have two lists (as the proposed LRU improvement patch does), but
> in the first list, we put the delalloc and "tenured" cache entries,
> and in the second list we put the "nursery" cache entries.
>
> The "nursery" cache items are cleaned using an RR scheme, and indeed,
> we might want to have a system where we try to keep the "nursery"
> cache items to a manageable level, even if we aren't under memory
> pressure. If a cache item gets used a certain number of times, then
> when we get to that item in the RR scheme, it gets "promoted" to the
> "tenured" space.
>
> The "tenured" space is then kept under control using some kind of LRU
> scheme, and a target number of "tenured" items. (We might or might
> not want to count delalloc entries for the purposes of this target.
> That's TBD.)
>
> The system should ideally automatically tune itself to control the
> promotion rate from the nursery to tenured space based on the number
> of uses required before a cache entry gets promoted, and there will be
> a bunch of heuristics that we'll need to tune. But I think this
> general approach should work pretty well.

Hi Ted,

Sorry for the late reply; I was on vacation. Thanks for thinking
about this so deeply.

First question is about 'nr_to_scan'. Do you want me to generate a
patch to fix it in this merge window? I can imagine that this patch
would be trivial and easy to review.

Second question is about your proposal. From your comment, it seems
that we need a replacement algorithm like the one in the memory
management subsystem. There we have an active list and an inactive
list which track pages. When the system reads/writes some pages, they
are put on the inactive list; if they are accessed again, they are
promoted to the active list. When the 'kswapd' thread tries to
reclaim some pages, it drops some pages from the inactive list and
demotes some pages from the active list to the inactive list. I am
happy to give it a try later.

Regards,
- Zheng

2014-04-21 14:05:28

by Theodore Ts'o

Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Mon, Apr 21, 2014 at 09:50:07PM +0800, Zheng Liu wrote:
> First question is about 'nr_to_scan'. Do you want me to generate a
> patch to fix it in this merge window? I can imagine that this patch
> would be trivial and easy to review.

I looked at trying to create such a patch for this development cycle,
and I decided not to try to fix this for 3.15 because it's a bit
subtle. What we have in nr_to_scan is not broken per se. I'm not
sure it's the best behaviour, but it is at least consistent.

Right now, when nr_to_scan is zero, we return to the shrinker the
number of items which _can_ be shrunk (i.e., we don't include the
extents subject to delalloc). Hence what we are effectively telling
the shrinker is that the extents that are in the extent cache due to
delalloc don't exist for the purposes of counting them. So if we
don't count them when we report the number of items in the cache, then
it's consistent that we don't count them when we tell the shrinker
that we've scanned that many items.

So if we change this, we should also change the number of items that
we report for the cache (i.e., when the shrinker is called with
nr_to_scan equal to zero) to include _all_ of the items in the cache,
and if we run out of items, we need to return SHRINK_STOP.
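
Roughly, in terms of the count/scan shrinker callbacks, that would
look something like this (untested sketch; total_extent_count() and
reclaim_extents() are made-up helpers standing in for whatever we
actually use, not the real ext4 functions):

    #include <linux/shrinker.h>

    static unsigned long es_count_objects(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
            /* Report *all* cached extents, delalloc entries included. */
            return total_extent_count();
    }

    static unsigned long es_scan_objects(struct shrinker *shrink,
                                         struct shrink_control *sc)
    {
            unsigned long nr_shrunk = reclaim_extents(sc->nr_to_scan);

            /* Only unreclaimable (delalloc) entries remain: stop scanning. */
            if (nr_shrunk == 0)
                    return SHRINK_STOP;
            return nr_shrunk;
    }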

I suspect we should make this change, but it's not as simple a patch as
I had initially thought, and I think we'll want to be very careful in
testing it to make sure it behaves correctly. So instead of including
this as a bug fix for 3.15, I suspect it may be better to code and
test it for 3.16, and if we're confident, we can include a cc:
[email protected] so it gets backported to 3.15.x.

What do other folks think?

> My second question is about the scheme you describe above. From your
> comment, it seems that we need a replacement algorithm like the one
> used in the memory management subsystem. There we have an active list
> and an inactive list that track pages. When the system reads or writes
> pages, they are first put on the inactive list. If some of those pages
> are accessed again, they are promoted to the active list. When the
> 'kswapd' thread tries to reclaim pages, it drops pages from the
> inactive list and demotes pages from the active list to the inactive
> list. I am happy to give it a try later.

Yes, although we'll need to be careful that we don't introduce
scalability problems caused by needing to take too many locks. So I
don't think we want to move items from the inactive to the active list
when an extent cache entry is referenced. Instead, we'll probably want to
bump a ref count in the cache entry, and then in the shrinker, when we
scan the inactive list, we can check the ref count and decide then to
move the item to the active list. That way, only the shrinker needs
to take the global lock.
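
As a completely untested sketch (the structure layout, the threshold,
and the helper names are all invented for illustration, not the actual
ext4 code):

    #include <linux/atomic.h>
    #include <linux/list.h>

    #define ES_PROMOTE_THRESHOLD    2       /* made-up promotion cutoff */

    struct es_entry {
            struct list_head        list;   /* on the inactive or active list */
            atomic_t                refs;   /* bumped on every cache hit */
            /* ... extent data ... */
    };

    /* Lookup path: no list manipulation and no global lock, just a bump. */
    static inline void es_cache_hit(struct es_entry *es)
    {
            atomic_inc(&es->refs);
    }

    /* Shrinker path: the only place that takes the lock and moves entries. */
    static void es_age_entry(struct es_entry *es, struct list_head *active)
    {
            if (atomic_read(&es->refs) >= ES_PROMOTE_THRESHOLD) {
                    atomic_set(&es->refs, 0);
                    list_move(&es->list, active);   /* promote */
            }
            /* otherwise the entry stays put and remains fair game to reclaim */
    }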

Cheers,

- Ted

2014-04-21 14:40:24

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Mon, Apr 21, 2014 at 10:05:23AM -0400, Theodore Ts'o wrote:
[...]
> > My second question is about the scheme you describe above. From your
> > comment, it seems that we need a replacement algorithm like the one
> > used in the memory management subsystem. There we have an active list
> > and an inactive list that track pages. When the system reads or writes
> > pages, they are first put on the inactive list. If some of those pages
> > are accessed again, they are promoted to the active list. When the
> > 'kswapd' thread tries to reclaim pages, it drops pages from the
> > inactive list and demotes pages from the active list to the inactive
> > list. I am happy to give it a try later.
>
> Yes, although we'll need to be careful that we don't introduce
> scalability problems caused by needing to take too many locks. So I
> don't think we want to move items from the inactive to active list
> when the extent cache is referenced. Instead, we'll probably want to
> bump a ref count in the cache entry, and then in the shrinker, when we
> scan the inactive list, we can check the ref count and decide then to
> move the item to the active list. That way, only the shrinker needs
> to take the global lock.

I am not sure that I fully understand your meaning. AFAIU, we have a
list that uses an RR scheme to shrink extent caches; it tracks
written/unwritten/hole extent caches. When the shrinker tries to
reclaim some objects, it scans this list and reclaims all extent caches
whose ref count is below a threshold. Meanwhile the ref counts of the
remaining caches are decreased, and those caches are moved to the
active list. The active list tracks all delayed extent caches,
precached extent caches, and other caches that have been promoted. The
shrinker uses an LRU algorithm to reclaim objects from the active list.
Please let me know if I missed something.
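
So the scan pass over the RR list would look roughly like this
(untested sketch; es_free() and the threshold are made up, and I am
assuming an es_entry with a list_head and an atomic ref count like you
sketched):

    static void es_scan_rr_list(struct list_head *rr,
                                struct list_head *active, int threshold)
    {
            struct es_entry *es, *next;

            list_for_each_entry_safe(es, next, rr, list) {
                    if (atomic_read(&es->refs) < threshold) {
                            list_del(&es->list);
                            es_free(es);                    /* reclaim */
                    } else {
                            atomic_dec(&es->refs);          /* age */
                            list_move(&es->list, active);   /* promote */
                    }
            }
    }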

Regards,
- Zheng

2014-04-21 14:54:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Mon, Apr 21, 2014 at 10:46:46PM +0800, Zheng Liu wrote:
> I am not sure that I fully understand your meaning. AFAIU, we have a
> list that uses an RR scheme to shrink extent caches; it tracks
> written/unwritten/hole extent caches. When the shrinker tries to
> reclaim some objects, it scans this list and reclaims all extent caches
> whose ref count is below a threshold. Meanwhile the ref counts of the
> remaining caches are decreased, and those caches are moved to the
> active list. The active list tracks all delayed extent caches,
> precached extent caches, and other caches that have been promoted. The
> shrinker uses an LRU algorithm to reclaim objects from the active list.
> Please let me know if I missed something.

Yes, that's right. I misunderstood your analogy to the MM. It is
much like the page aging algorithm in that the work done by the MMU is
very minimal, and most of the work is done in the scanning algorithm.
It might be that a pure "clock algorithm" would work better than
something closer to a GC-like scheme with "tenured" and "nursery"
spaces. It certainly would be simpler to implement.

So it's certainly something that's worth considering.

Cheers,

- Ted

2014-04-21 23:10:07

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Thu, Apr 17, 2014 at 11:35:26AM -0400, Theodore Ts'o wrote:
> So I've been thinking about this some more, and it seems to me that
> what we actually need is *both* an LRU and an RR scheme.

We already have shrinker implementations that do this. It would
probably take 10-15 lines of code to add it to any existing LRU
list based shrinker.....

> The real problem here is that we have workloads that are generating a
> large number of "low value" extent cache entries. That is, they are
> extremely unlikely to be used again, because they are small and are being
> generated when you have a highly fragmented extent status cache, and
> very often, the workload is a random read/write workload, so there is
> no way the full "working set" of extent cache entries could be kept in
> memory at the same time anyway. These less valuable cache entries are
> being generated at a very high rate, and we want to make sure we don't
> penalize the "valuable" cache entries.

Yup, an "object referenced" bit that gets set on a cache lookup hit.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-23 05:28:58

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Tue, Apr 22, 2014 at 09:10:02AM +1000, Dave Chinner wrote:
> On Thu, Apr 17, 2014 at 11:35:26AM -0400, Theodore Ts'o wrote:
> > So I've been thinking about this some more, and it seems to me that
> > what we actually need is *both* an LRU and an RR scheme.
>
> We already have shrinker implementations that do this. It would
> probably take 10-15 lines of code to add it to any existing LRU
> list based shrinker.....

Hi Dave,

I guess you are talking about the lru list in include/linux/list_lru.h.
Thanks for pointing it out; I will take a look at it.

Thanks,

- Zheng

2014-04-24 01:46:17

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH v2 0/4] ext4: extents status tree shrinker improvement

On Wed, Apr 23, 2014 at 01:35:21PM +0800, Zheng Liu wrote:
> On Tue, Apr 22, 2014 at 09:10:02AM +1000, Dave Chinner wrote:
> > On Thu, Apr 17, 2014 at 11:35:26AM -0400, Theodore Ts'o wrote:
> > > So I've been thinking about this some more, and it seems to me that
> > > what we actually need is *both* an LRU and an RR scheme.
> >
> > We already have shrinker implementations that do this. It would
> > probably take 10-15 lines of code to add it to any existing LRU
> > list based shrinker.....
>
> Hi Dave,
>
> I guess you are talking about the lru list in include/linux/list_lru.h.
> Thanks for pointing it out; I will take a look at it.

No, I'm not - that's just the linked list implementation.

I'm talking about the use of referenced bits on the objects
themselves, and how the shrinker treats them, i.e., the I_REFERENCED
state bit on the inode and DCACHE_REFERENCED on the dentry.
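
Roughly like this - untested, the flag name and helpers are made up,
and I'm assuming the extent cache object grows a flags field, but it's
the same pattern the inode and dentry shrinkers already use:

    #define ES_REFERENCED   0x01    /* set on a cache lookup hit */

    /* Lookup hit: just set the bit - no locks, no list movement. */
    static inline void es_lookup_hit(struct es_entry *es)
    {
            es->flags |= ES_REFERENCED;
    }

    /* Shrinker: a referenced entry gets one more trip around the list. */
    static bool es_can_reclaim(struct es_entry *es)
    {
            if (es->flags & ES_REFERENCED) {
                    es->flags &= ~ES_REFERENCED;
                    return false;
            }
            return true;    /* untouched since the last scan */
    }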

Cheers,

Dave.
--
Dave Chinner
[email protected]