2013-12-20 10:39:31

by Zheng Liu

Subject: [RFC PATCH 0/2] ext4: extents status tree shrinker improvement

Hi all,

I still have some ideas for improving the extents status tree shrinker,
but before pursuing them I think I should send out these patches for
discussion to make sure I am on the right track. I describe my work
below in four parts. The first part describes my improvement to the
shrinker. The second part describes the test case used for measurement.
The third part measures the performance and compares my work against the
current implementation in the mainline kernel. The last part is a TODO
list, where I describe some ideas that might further improve the
performance of the extents status tree shrinker.

Improvement
===========
According to this bug report [1], the extents status tree shrinker might
cause an excessive stall under heavy memory pressure. Because the report
lacks some details, I am not sure whether delalloc is enabled or not.
But if I remember correctly, Andreas has mentioned that he also noticed
the es shrinker taking too much time trying to reclaim entries from the
extents status tree, and he observed that the application does a lot of
buffered writes. So I suspect that the root cause is that under some
workloads the es shrinker currently has to scan a lot of delayed entries
that cannot be reclaimed, and hence it burns too much CPU time.

1. https://bugzilla.kernel.org/show_bug.cgi?id=62751

Why don't we simply discard delayed entries from the extents status
tree? Because we need them to reserve space for delayed allocation and
to support seek_data/hole, we cannot discard them.

This patch series contains two patches. Patch 1 just improves the trace
points of the extents status tree; later I will use them to calculate
the latency of each function. This is a trivial patch, so feel free to
pick it up.

Patch 2 makes the es shrinker avoid scanning delayed entries. It adds
an evictable list to track all reclaimable objects, and a hlist_node
member is added to 'struct extent_status' for that purpose. The defect
of the hlist_node is that it costs an extra 1/3 of memory: on a 64-bit
platform, struct extent_status currently occupies 48 bytes; after the
change it takes 64 bytes. With the list in place, the es shrinker can
traverse it directly and should never encounter a delayed entry, which
reduces the time spent holding sbi->s_es_lru_lock and ei->i_es_lock
because the shrinker no longer needs to scan all the delayed entries.

Workload
========
To evaluate the improvement, I use fio to generate a workload that
produces some fragmented files. To get an extremely fragmented tree, I
hack ext4_es_can_be_merged() to always return 0, which means we never
try to merge entries in the extents status tree. Meanwhile,
vm.dirty_ratio and vm.dirty_background_ratio are raised to 80 and 60 on
my machine in order to accumulate more delayed entries in the extents
status tree. The fio configuration file looks like this:

[global]
ioengine=psync
bs=4k
directory=/mnt/sda1
thread
group_reporting
fallocate=0
direct=0
filesize=10g
size=20g
runtime=300

[io]
rw=randwrite:32
rw_sequencer=sequential
numjobs=25
nrfiles=10

sysctl settings:

#!/bin/bash
#
# This script is used to adjust sysctl parameter to increase the memory
# pressure and accumulate some delayed entries.

sudo sysctl vm.dirty_ratio=80
sudo sysctl vm.dirty_background_ratio=60

Using this workload, the extents status tree will look like this:
[0, 1)[1, 2)...[31, 32)[32, 51)[51, 52)[52, 53)...[101, 102)...[111, 112)
[ D ][ D ]...[ D ][ H ][ D ][ D ]...[ W ]...[ U ]

D: Delayed entry
H: Hole entry
W: Written entry
U: Unwritten entry

*NOTE*
This workload could not reproduce the problem reported in [1].

I run this workload on my desktop, which has an Intel Core Duo CPU @ 3.0
GHz, 4G of memory, and an HDD. I will also run the same workload on a
server to measure the result of the improvement.

Comparison
===========

slabtop
-------
slabtop is used to measure the memory space that the extent_status slab
occupies. The following is the result of $(slabtop -o):

vanilla:
718890 688677 95% 0.04K 7730 93 30920K ext4_extent_status
patched:
718420 675933 94% 0.05K 10565 68 42260K ext4_extent_status

                     vanilla   patched   +/- (%)
ext4_extent_status    30920K    42260K     36.68

As I said above, adding the hlist_node costs ~1/3 extra memory.

perf record -g
--------------
The following is the result of $(perf diff), which compares the two runs
of $(perf record -g). Here I extract all of the *_es_* functions and
paste them below.

# Event 'cycles'
#
# Baseline Delta Shared Object Symbol
# ........ ....... .................. ............................................
#
1.82% +0.04% [ext4] [k] ext4_es_lookup_extent
0.55% +0.01% [ext4] [k] __es_tree_search
0.39% -0.05% [ext4] [k] __es_insert_extent
0.10% +0.02% [ext4] [k] ext4_es_insert_extent
0.08% +0.01% [ext4] [k] __es_remove_extent
0.07% +0.01% [ext4] [k] ext4_es_lru_add
0.01% -0.01% [ext4] [k] __es_try_to_reclaim_extents
0.00% [ext4] [k] ext4_es_free_extent
0.00% [ext4] [k] ext4_es_cache_extent

perf lock record -g
-------------------
Here is the result of $(perf lock report -k wait_max).

Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
vanilla:
&(&sbi->s_es_lru... 152 1 87275 87275 87275 87275
patched:
&(&sbi->s_es_lru... 148 0 0 0 0 0

'sbi->s_es_lru_lock' protects the whole LRU list of inodes that have
reclaimable entries. When the es shrinker tries to reclaim some entries,
this lock is held throughout the process. From the result we can see
that the improvement reduces the waiting time.

latency
-------
Here I calculate the latency of the following operations from the trace
point results. (Unit: *us*)

- es_find_delayed_extent_range -
x vanilla
+ patched
N Min Max Median Avg Stddev
x 4425 0 54 1 1.0280226 1.0898311
+ 7063 0 140 1 1.0127425 1.8662003

- es_insert_extent -
x vanilla
+ patched
N Min Max Median Avg Stddev
x 26995 1 1270 2 4.1053899 20.26818
+ 26559 1 440 2 3.0948454 7.6911348

- es_lookup_extent -
x vanilla
+ patched
N Min Max Median Avg Stddev
x 32025 0 104 1 1.6082748 1.385739
+ 31433 0 168 1 1.6219578 1.4846153

- es_shrink_scan -
x vanilla
+ patched
N Min Max Median Avg Stddev
x 1219 24 1872 171 278.35603 275.8815
+ 1405 17 339 86 95.057651 33.216144

The latency results show that the patch reduces the latency when the es
shrinker tries to reclaim entries, while the other operations are not
noticeably impacted.

TODO list
=========

Numa-aware shrinker
-------------------
The shrinker framework is now NUMA-aware, but the es shrinker does not
take advantage of it, so one improvement would be to make the es
shrinker NUMA-aware. The key issue is that extent_status objects might
not be on the same node as their inode. That means we would need a
list_lru on ext4_sb_info to make the s_es_lru list NUMA-aware, and a
list_lru on ext4_es_tree to make the evictable list NUMA-aware. I am
not sure it is worthwhile.

rcu-skiplist
------------
Chris Mason has sent out an rcu-skiplist patch set, so a potential
improvement here is to use the rcu-skiplist in the extents status tree
to track all entries. We currently use an rb-tree for this, but the
defect of the rb-tree is that we need to rotate the subtree when a node
is inserted or removed. That means the rb-tree cannot provide a very
stable latency while it rebalances the subtree. A skiplist does not
need to do this, so I suspect it would provide a smoother latency.

Regards,
- Zheng

Zheng Liu (2):
ext4: improve extents status tree trace point
ext4: improve extents status tree shrinker to avoid scanning delayed entries

fs/ext4/extents_status.c | 53 +++++++++++++++++++++++--------------------
fs/ext4/extents_status.h | 2 ++
include/trace/events/ext4.h | 46 +++++++++++++++++++++++++++++++++----
3 files changed, 71 insertions(+), 30 deletions(-)

--
1.7.9.7



2013-12-20 10:39:33

by Zheng Liu

Subject: [RFC PATCH 1/2] ext4: improve extents status tree trace point

From: Zheng Liu <[email protected]>

This commit improves the trace points of the extents status tree. To
improve the shrinker, we need to collect more details from the trace
points. First we add more enter/exit pairs to the insertion, deletion,
and caching code paths in order to measure the latency of these code
paths. Then we rename trace_ext4_es_shrink_enter in ext4_es_count(),
because the same trace point is also used in ext4_es_scan() and we
cannot tell them apart in the results.

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 15 ++++++++------
include/trace/events/ext4.h | 46 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 3981ff7..e842d74 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -660,7 +660,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
newes.es_len = len;
ext4_es_store_pblock(&newes, pblk);
ext4_es_store_status(&newes, status);
- trace_ext4_es_insert_extent(inode, &newes);
+ trace_ext4_es_insert_extent_enter(inode, &newes);

ext4_es_insert_extent_check(inode, &newes);

@@ -678,6 +678,7 @@ retry:

error:
write_unlock(&EXT4_I(inode)->i_es_lock);
+ trace_ext4_es_insert_extent_exit(inode, &newes);

ext4_es_print_tree(inode);

@@ -701,7 +702,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
newes.es_len = len;
ext4_es_store_pblock(&newes, pblk);
ext4_es_store_status(&newes, status);
- trace_ext4_es_cache_extent(inode, &newes);
+ trace_ext4_es_cache_extent_enter(inode, &newes);

if (!len)
return;
@@ -714,6 +715,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
if (!es || es->es_lblk > end)
__es_insert_extent(inode, &newes);
write_unlock(&EXT4_I(inode)->i_es_lock);
+ trace_ext4_es_cache_extent_exit(inode, &newes);
}

/*
@@ -887,7 +889,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end;
int err = 0;

- trace_ext4_es_remove_extent(inode, lblk, len);
+ trace_ext4_es_remove_extent_enter(inode, lblk, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
lblk, len, inode->i_ino);

@@ -901,6 +903,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
err = __es_remove_extent(inode, lblk, end);
write_unlock(&EXT4_I(inode)->i_es_lock);
ext4_es_print_tree(inode);
+ trace_ext4_es_remove_extent_exit(inode, lblk, len);
return err;
}

@@ -1017,7 +1020,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,

sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+ trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}

@@ -1030,14 +1033,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int ret, nr_shrunk;

ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
+ trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);

if (!nr_to_scan)
return ret;

nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);

- trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
+ trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
return nr_shrunk;
}

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 197d312..8b3ff82 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2221,19 +2221,31 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__entry->pblk, show_extent_status(__entry->status))
);

-DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
+DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_enter,
TP_PROTO(struct inode *inode, struct extent_status *es),

TP_ARGS(inode, es)
);

-DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent,
+DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_exit,
TP_PROTO(struct inode *inode, struct extent_status *es),

TP_ARGS(inode, es)
);

-TRACE_EVENT(ext4_es_remove_extent,
+DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_enter,
+ TP_PROTO(struct inode *inode, struct extent_status *es),
+
+ TP_ARGS(inode, es)
+);
+
+DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_exit,
+ TP_PROTO(struct inode *inode, struct extent_status *es),
+
+ TP_ARGS(inode, es)
+);
+
+DECLARE_EVENT_CLASS(ext4__es_remove_extent,
TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),

TP_ARGS(inode, lblk, len),
@@ -2258,6 +2270,18 @@ TRACE_EVENT(ext4_es_remove_extent,
__entry->lblk, __entry->len)
);

+DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_enter,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+
+ TP_ARGS(inode, lblk, len)
+);
+
+DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_exit,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
+
+ TP_ARGS(inode, lblk, len)
+);
+
TRACE_EVENT(ext4_es_find_delayed_extent_range_enter,
TP_PROTO(struct inode *inode, ext4_lblk_t lblk),

@@ -2366,7 +2390,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
show_extent_status(__entry->found ? __entry->status : 0))
);

-TRACE_EVENT(ext4_es_shrink_enter,
+DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),

TP_ARGS(sb, nr_to_scan, cache_cnt),
@@ -2388,7 +2412,19 @@ TRACE_EVENT(ext4_es_shrink_enter,
__entry->nr_to_scan, __entry->cache_cnt)
);

-TRACE_EVENT(ext4_es_shrink_exit,
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+TRACE_EVENT(ext4_es_shrink_scan_exit,
TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),

TP_ARGS(sb, shrunk_nr, cache_cnt),
--
1.7.9.7


2013-12-20 10:39:36

by Zheng Liu

Subject: [RFC PATCH 2/2] ext4: improve extents status tree shrinker to avoid scanning delayed entries

From: Zheng Liu <[email protected]>

The extents status tree shrinker scans all inodes on sbi->s_es_lru
under heavy memory pressure and tries to reclaim entries from their
extents status trees. During this process it cannot reclaim delayed
entries, because ext4 needs these entries to do delayed allocation
space reservation, seek_data/hole, etc. So if a system has done a huge
number of writes and these dirty pages have not been written out, there
will be a lot of delayed entries in the extents status tree. If the
shrinker tries to reclaim memory from the tree, it burns CPU time
iterating over these non-reclaimable entries, and in some circumstances
this can cause an excessive stall.

In this commit a new list is used to track the reclaimable entries of
the extents status tree (i.e. written/unwritten/hole entries). The
shrinker scans only the reclaimable entries on this list, so it never
encounters a delayed entry and does not need to spend too much time
spinning. The drawback is that every entry costs an extra 1/3 of
memory: before this commit, 'struct extent_status' occupies 48 bytes on
a 64-bit platform; after it, 64 bytes. :(

Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 38 +++++++++++++++++++-------------------
fs/ext4/extents_status.h | 2 ++
2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e842d74..11bdb2f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -169,6 +169,7 @@ void ext4_exit_es(void)
void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
+ INIT_HLIST_HEAD(&tree->evictable_list);
tree->cache_es = NULL;
}

@@ -300,10 +301,14 @@ static struct extent_status *
ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ext4_fsblk_t pblk)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
struct extent_status *es;
+
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
if (es == NULL)
return NULL;
+
+ INIT_HLIST_NODE(&es->es_list);
es->es_lblk = lblk;
es->es_len = len;
es->es_pblk = pblk;
@@ -312,8 +317,9 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
* We don't count delayed extent because we never try to reclaim them
*/
if (!ext4_es_is_delayed(es)) {
- EXT4_I(inode)->i_es_lru_nr++;
+ ei->i_es_lru_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);
}

return es;
@@ -321,10 +327,12 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,

static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
/* Decrease the lru counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
- BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
- EXT4_I(inode)->i_es_lru_nr--;
+ BUG_ON(ei->i_es_lru_nr-- == 0);
+ hlist_del_init(&es->es_list);
percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
}

@@ -1092,8 +1100,8 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
- struct rb_node *node;
struct extent_status *es;
+ struct hlist_node *tmp;
unsigned long nr_shrunk = 0;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -1105,21 +1113,13 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
__ratelimit(&_rs))
ext4_warning(inode->i_sb, "forced shrink of precached extents");

- node = rb_first(&tree->root);
- while (node != NULL) {
- es = rb_entry(node, struct extent_status, rb_node);
- node = rb_next(&es->rb_node);
- /*
- * We can't reclaim delayed extent from status tree because
- * fiemap, bigallic, and seek_data/hole need to use it.
- */
- if (!ext4_es_is_delayed(es)) {
- rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(inode, es);
- nr_shrunk++;
- if (--nr_to_scan == 0)
- break;
- }
+ hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
+ BUG_ON(ext4_es_is_delayed(es));
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(inode, es);
+ nr_shrunk++;
+ if (--nr_to_scan == 0)
+ break;
}
tree->cache_es = NULL;
return nr_shrunk;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 167f4ab8..38ca83e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ struct ext4_extent;

struct extent_status {
struct rb_node rb_node;
+ struct hlist_node es_list;
ext4_lblk_t es_lblk; /* first logical block extent covers */
ext4_lblk_t es_len; /* length of extent in block */
ext4_fsblk_t es_pblk; /* first physical block */
@@ -61,6 +62,7 @@ struct extent_status {

struct ext4_es_tree {
struct rb_root root;
+ struct hlist_head evictable_list;
struct extent_status *cache_es; /* recently accessed extent */
};

--
1.7.9.7


2013-12-23 08:39:04

by Jan Kara

Subject: Re: [RFC PATCH 1/2] ext4: improve extents status tree trace point

On Fri 20-12-13 18:42:44, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> This commit improves the trace points of the extents status tree. To
> improve the shrinker, we need to collect more details from the trace
> points. First we add more enter/exit pairs to the insertion, deletion,
> and caching code paths in order to measure the latency of these code
> paths. Then we rename trace_ext4_es_shrink_enter in ext4_es_count(),
> because the same trace point is also used in ext4_es_scan() and we
> cannot tell them apart in the results.
Umm, I think you can use the generic 'function tracing' infrastructure
to trace entry & exit of arbitrary functions in the kernel, so there
shouldn't be a need to introduce extra entry & exit tracepoints.

Honza

> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
> ---
> fs/ext4/extents_status.c | 15 ++++++++------
> include/trace/events/ext4.h | 46 ++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 50 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 3981ff7..e842d74 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -660,7 +660,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> newes.es_len = len;
> ext4_es_store_pblock(&newes, pblk);
> ext4_es_store_status(&newes, status);
> - trace_ext4_es_insert_extent(inode, &newes);
> + trace_ext4_es_insert_extent_enter(inode, &newes);
>
> ext4_es_insert_extent_check(inode, &newes);
>
> @@ -678,6 +678,7 @@ retry:
>
> error:
> write_unlock(&EXT4_I(inode)->i_es_lock);
> + trace_ext4_es_insert_extent_exit(inode, &newes);
>
> ext4_es_print_tree(inode);
>
> @@ -701,7 +702,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> newes.es_len = len;
> ext4_es_store_pblock(&newes, pblk);
> ext4_es_store_status(&newes, status);
> - trace_ext4_es_cache_extent(inode, &newes);
> + trace_ext4_es_cache_extent_enter(inode, &newes);
>
> if (!len)
> return;
> @@ -714,6 +715,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> if (!es || es->es_lblk > end)
> __es_insert_extent(inode, &newes);
> write_unlock(&EXT4_I(inode)->i_es_lock);
> + trace_ext4_es_cache_extent_exit(inode, &newes);
> }
>
> /*
> @@ -887,7 +889,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t end;
> int err = 0;
>
> - trace_ext4_es_remove_extent(inode, lblk, len);
> + trace_ext4_es_remove_extent_enter(inode, lblk, len);
> es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
> lblk, len, inode->i_ino);
>
> @@ -901,6 +903,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> err = __es_remove_extent(inode, lblk, end);
> write_unlock(&EXT4_I(inode)->i_es_lock);
> ext4_es_print_tree(inode);
> + trace_ext4_es_remove_extent_exit(inode, lblk, len);
> return err;
> }
>
> @@ -1017,7 +1020,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
>
> sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
> nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
> - trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
> + trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
> return nr;
> }
>
> @@ -1030,14 +1033,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
> int ret, nr_shrunk;
>
> ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
> - trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
> + trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
>
> if (!nr_to_scan)
> return ret;
>
> nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
>
> - trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
> + trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
> return nr_shrunk;
> }
>
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 197d312..8b3ff82 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2221,19 +2221,31 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
> __entry->pblk, show_extent_status(__entry->status))
> );
>
> -DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
> +DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_enter,
> TP_PROTO(struct inode *inode, struct extent_status *es),
>
> TP_ARGS(inode, es)
> );
>
> -DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent,
> +DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_exit,
> TP_PROTO(struct inode *inode, struct extent_status *es),
>
> TP_ARGS(inode, es)
> );
>
> -TRACE_EVENT(ext4_es_remove_extent,
> +DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_enter,
> + TP_PROTO(struct inode *inode, struct extent_status *es),
> +
> + TP_ARGS(inode, es)
> +);
> +
> +DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_exit,
> + TP_PROTO(struct inode *inode, struct extent_status *es),
> +
> + TP_ARGS(inode, es)
> +);
> +
> +DECLARE_EVENT_CLASS(ext4__es_remove_extent,
> TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
>
> TP_ARGS(inode, lblk, len),
> @@ -2258,6 +2270,18 @@ TRACE_EVENT(ext4_es_remove_extent,
> __entry->lblk, __entry->len)
> );
>
> +DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_enter,
> + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> +
> + TP_ARGS(inode, lblk, len)
> +);
> +
> +DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_exit,
> + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> +
> + TP_ARGS(inode, lblk, len)
> +);
> +
> TRACE_EVENT(ext4_es_find_delayed_extent_range_enter,
> TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
>
> @@ -2366,7 +2390,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
> show_extent_status(__entry->found ? __entry->status : 0))
> );
>
> -TRACE_EVENT(ext4_es_shrink_enter,
> +DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
> TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
>
> TP_ARGS(sb, nr_to_scan, cache_cnt),
> @@ -2388,7 +2412,19 @@ TRACE_EVENT(ext4_es_shrink_enter,
> __entry->nr_to_scan, __entry->cache_cnt)
> );
>
> -TRACE_EVENT(ext4_es_shrink_exit,
> +DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
> + TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
> +
> + TP_ARGS(sb, nr_to_scan, cache_cnt)
> +);
> +
> +DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
> + TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
> +
> + TP_ARGS(sb, nr_to_scan, cache_cnt)
> +);
> +
> +TRACE_EVENT(ext4_es_shrink_scan_exit,
> TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),
>
> TP_ARGS(sb, shrunk_nr, cache_cnt),
> --
> 1.7.9.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-12-23 08:54:19

by Jan Kara

Subject: Re: [RFC PATCH 2/2] ext4: improve extents status tree shrinker to avoid scanning delayed entries

On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> The extents status tree shrinker scans all inodes on sbi->s_es_lru
> under heavy memory pressure and tries to reclaim entries from their
> extents status trees. During this process it cannot reclaim delayed
> entries, because ext4 needs these entries to do delayed allocation
> space reservation, seek_data/hole, etc. So if a system has done a huge
> number of writes and these dirty pages have not been written out, there
> will be a lot of delayed entries in the extents status tree. If the
> shrinker tries to reclaim memory from the tree, it burns CPU time
> iterating over these non-reclaimable entries, and in some circumstances
> this can cause an excessive stall.
>
> In this commit a new list is used to track the reclaimable entries of
> the extents status tree (i.e. written/unwritten/hole entries). The
> shrinker scans only the reclaimable entries on this list, so it never
> encounters a delayed entry and does not need to spend too much time
> spinning. The drawback is that every entry costs an extra 1/3 of
> memory: before this commit, 'struct extent_status' occupies 48 bytes on
> a 64-bit platform; after it, 64 bytes. :(
This looks sensible. I was just wondering about one thing: one incorrect
thing the old extent shrinker does is that it tries to *reclaim*
'nr_to_scan' objects. That is wrong - it should *scan* 'nr_to_scan'
objects and reclaim whatever it can find among them. Now we shouldn't
always start scanning at the end of the LRU, because if delayed extents
accumulate there we would never reclaim anything. Rather we should cycle
through the list of entries we have. But that doesn't play well with the
fact that we have an LRU list and thus want to reclaim from the end of
the list. In the end what you do might be the best we can do, but I
wanted to mention the above in case someone has a better idea.

Honza

> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
> ---
> fs/ext4/extents_status.c | 38 +++++++++++++++++++-------------------
> fs/ext4/extents_status.h | 2 ++
> 2 files changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index e842d74..11bdb2f 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -169,6 +169,7 @@ void ext4_exit_es(void)
> void ext4_es_init_tree(struct ext4_es_tree *tree)
> {
> tree->root = RB_ROOT;
> + INIT_HLIST_HEAD(&tree->evictable_list);
> tree->cache_es = NULL;
> }
>
> @@ -300,10 +301,14 @@ static struct extent_status *
> ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> ext4_fsblk_t pblk)
> {
> + struct ext4_inode_info *ei = EXT4_I(inode);
> struct extent_status *es;
> +
> es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
> if (es == NULL)
> return NULL;
> +
> + INIT_HLIST_NODE(&es->es_list);
> es->es_lblk = lblk;
> es->es_len = len;
> es->es_pblk = pblk;
> @@ -312,8 +317,9 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> * We don't count delayed extent because we never try to reclaim them
> */
> if (!ext4_es_is_delayed(es)) {
> - EXT4_I(inode)->i_es_lru_nr++;
> + ei->i_es_lru_nr++;
> percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
> + hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);
> }
>
> return es;
> @@ -321,10 +327,12 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>
> static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
> {
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> /* Decrease the lru counter when this es is not delayed */
> if (!ext4_es_is_delayed(es)) {
> - BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
> - EXT4_I(inode)->i_es_lru_nr--;
> + BUG_ON(ei->i_es_lru_nr-- == 0);
> + hlist_del_init(&es->es_list);
> percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
> }
>
> @@ -1092,8 +1100,8 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> {
> struct inode *inode = &ei->vfs_inode;
> struct ext4_es_tree *tree = &ei->i_es_tree;
> - struct rb_node *node;
> struct extent_status *es;
> + struct hlist_node *tmp;
> unsigned long nr_shrunk = 0;
> static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
> @@ -1105,21 +1113,13 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> __ratelimit(&_rs))
> ext4_warning(inode->i_sb, "forced shrink of precached extents");
>
> - node = rb_first(&tree->root);
> - while (node != NULL) {
> - es = rb_entry(node, struct extent_status, rb_node);
> - node = rb_next(&es->rb_node);
> - /*
> - * We can't reclaim delayed extent from status tree because
> - * fiemap, bigallic, and seek_data/hole need to use it.
> - */
> - if (!ext4_es_is_delayed(es)) {
> - rb_erase(&es->rb_node, &tree->root);
> - ext4_es_free_extent(inode, es);
> - nr_shrunk++;
> - if (--nr_to_scan == 0)
> - break;
> - }
> + hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
> + BUG_ON(ext4_es_is_delayed(es));
> + rb_erase(&es->rb_node, &tree->root);
> + ext4_es_free_extent(inode, es);
> + nr_shrunk++;
> + if (--nr_to_scan == 0)
> + break;
> }
> tree->cache_es = NULL;
> return nr_shrunk;
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 167f4ab8..38ca83e 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -54,6 +54,7 @@ struct ext4_extent;
>
> struct extent_status {
> struct rb_node rb_node;
> + struct hlist_node es_list;
> ext4_lblk_t es_lblk; /* first logical block extent covers */
> ext4_lblk_t es_len; /* length of extent in block */
> ext4_fsblk_t es_pblk; /* first physical block */
> @@ -61,6 +62,7 @@ struct extent_status {
>
> struct ext4_es_tree {
> struct rb_root root;
> + struct hlist_head evictable_list;
> struct extent_status *cache_es; /* recently accessed extent */
> };
>
> --
> 1.7.9.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-12-23 09:03:34

by Jan Kara

Subject: Re: [RFC PATCH 0/2] ext4: extents status tree shrinker improvement

On Fri 20-12-13 18:42:43, Zheng Liu wrote:
> Hi all,
>
> I still have some ideas for improving the extents status tree shrinker.
> But before doing that, I think I should send out these patches for
> discussion because I want to be sure that I am on the right track. I
> describe my work below in four parts. The first part describes my
> improvement to the shrinker. In the second part I describe the test
> case used for measurement. In the third part I measure the performance
> and compare my work with the current implementation in the mainline
> kernel. The last part is a TODO list, describing some ideas that might
> improve the performance of the extents status tree shrinker.
>
> Improvement
> ===========
> According to this bug report [1], the extents status tree shrinker might
> cause excessive stalls under heavy memory pressure. Because the report
> lacks some details, I am not sure whether delalloc is enabled or not.
> But if I remember correctly, Andreas has mentioned that he also noticed
> the es shrinker taking too much time trying to reclaim entries from the
> extents status tree, and he observed that the app does a lot of
> buffered writes. So I suspect that the root cause is that under some
> workloads the es shrinker has to scan a lot of delayed entries that
> cannot be reclaimed. Hence it burns too much CPU time.
>
> 1. https://bugzilla.kernel.org/show_bug.cgi?id=62751
>
> Why don't we simply discard delayed entries from the extents status
> tree? We need the delayed entries to reserve space for delayed
> allocation and to support seek_data/hole, so we cannot discard them.
>
> This patch series contains two patches. Patch 1 just improves the
> trace points in the extents status tree. Later I will use them to
> calculate the latency of every function. This is a trivial patch, so
> feel free to pick it up.
>
> Patch 2 tries to make the es shrinker avoid scanning delayed entries.
> It adds an evictable list in order to track all reclaimable objects,
> and a hlist_node structure is added to the 'extent_status' structure.
> The drawback of using hlist_node is that it costs an extra 1/3 of
> memory. On a 64-bit platform the extent_status structure occupies 48
> bytes; with the patch it takes 64 bytes. The es shrinker can then
> traverse this list without encountering any delayed entries. This
> reduces the time spent holding sbi->s_es_lru_lock and ei->i_es_lock,
> because the shrinker no longer needs to scan all the delayed entries.
>
> Workload
> ========
> To measure the improvement, I use fio to generate a workload that
> creates some fragmented files. To get an extremely fragmented tree, I
> hack the code in ext4_es_can_be_merged() to always return 0. That
> means we never try to merge any entries in the extents status tree.
> Meanwhile vm.dirty_ratio and vm.dirty_background_ratio are raised to
> 80 and 60 on my machine. The purpose of this is to accumulate more
> delayed entries in the extents status tree. The fio config file is
> shown below:
>
> [global]
> ioengine=psync
> bs=4k
> directory=/mnt/sda1
> thread
> group_reporting
> fallocate=0
> direct=0
> filesize=10g
> size=20g
> runtime=300
>
> [io]
> rw=randwrite:32
> rw_sequencer=sequential
> numjobs=25
> nrfiles=10
Well, my suspicion is the stall can be caused when we have lots of
relatively small files and we go through all the pain of scanning the
inode list, then resorting it, and scanning again. Your workload doesn't
exercise this path at all if I understand right.

> sysctl settings:
>
> #!/bin/bash
> #
> # This script is used to adjust sysctl parameters to increase memory
> # pressure and accumulate some delayed entries.
>
> sudo sysctl vm.dirty_ratio=80
> sudo sysctl vm.dirty_background_ratio=60
>
> Using this workload, the extents status tree will look like this:
> [0, 1)[1, 2)...[31, 32)[32, 51)[51, 52)[52, 53)...[101, 102)...[111, 112)
> [ D ][ D ]...[ D ][ H ][ D ][ D ]...[ W ]...[ U ]
>
> D: Delayed entry
> H: Hole entry
> W: Written entry
> U: Unwritten entry
>
> *NOTE*
> This workload could not reproduce the problem reported in [1].
>
> I ran this workload on my desktop, which has an Intel Core Duo CPU @
> 3.0 GHz, 4G memory, and a HDD. I will run the same workload on a
> server to measure the result of the improvement.
>
> Comparison
> ===========
>
> slabtop
> -------
> slabtop is used to measure the memory space that the extent_status slab
> occupies. The following is the result of $(slabtop -o):
>
> vanilla:
> 718890 688677 95% 0.04K 7730 93 30920K ext4_extent_status
> patched:
> 718420 675933 94% 0.05K 10565 68 42260K ext4_extent_status
>
> vanilla patched +/- (%)
> ext4_extent_status 30920k 42260k 36.68
>
> As I said above, adding hlist_node adds ~1/3 extra memory space.
>
> perf record -g
> --------------
> The following is the result of $(perf diff), which compares the results
> of two $(perf record -g) runs. Here I have extracted all the *_es_*
> functions:
>
> # Event 'cycles'
> #
> # Baseline Delta Shared Object Symbol
> # ........ ....... .................. ............................................
> #
> 1.82% +0.04% [ext4] [k] ext4_es_lookup_extent
> 0.55% +0.01% [ext4] [k] __es_tree_search
> 0.39% -0.05% [ext4] [k] __es_insert_extent
> 0.10% +0.02% [ext4] [k] ext4_es_insert_extent
> 0.08% +0.01% [ext4] [k] __es_remove_extent
> 0.07% +0.01% [ext4] [k] ext4_es_lru_add
> 0.01% -0.01% [ext4] [k] __es_try_to_reclaim_extents
> 0.00% [ext4] [k] ext4_es_free_extent
> 0.00% [ext4] [k] ext4_es_cache_extent
>
> perf lock record -g
> -------------------
> Here is the result of $(perf lock report -k wait_max).
>
> Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
> vanilla:
> &(&sbi->s_es_lru... 152 1 87275 87275 87275 87275
> patched:
> &(&sbi->s_es_lru... 148 0 0 0 0 0
>
> 'sbi->s_es_lru_lock' protects the whole LRU list of inodes that have
> reclaimable entries. When the es shrinker tries to reclaim some
> entries, this lock is held during the process. From the result we can
> see that the improvement can reduce the waiting time.

Well, this doesn't say too much. During the whole run there was exactly one
case where we contended for the s_es_lru_lock. It may be just luck you
didn't hit the contention in the 'patched' kernel.

Also without having a reproducer it's a bit hard to judge whether we are
indeed improving a thing which needs improvement - i.e., whether the gain
in latency is worth the cost in memory usage.
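
For what it's worth, the memory-cost side of that tradeoff can be
cross-checked against the slabtop numbers quoted above. A quick
back-of-the-envelope sketch (the 16-byte hlist_node size assumes a
64-bit platform; the object count is taken from the patched run):

```python
# Estimate the extra slab space from adding an hlist_node (two pointers,
# i.e. 16 bytes on 64-bit) to every extent_status object, and compare it
# with the slab growth reported by slabtop earlier in the thread.
HLIST_NODE_BYTES = 16
total_objects = 718420  # objects in the patched run (from slabtop)

estimated_kib = total_objects * HLIST_NODE_BYTES / 1024
observed_kib = 42260 - 30920  # patched minus vanilla slab size

print(f"estimated extra space: {estimated_kib:.0f} KiB")  # ~11225 KiB
print(f"observed slab growth:  {observed_kib} KiB")       # 11340 KiB
```

The raw per-object overhead accounts for most of the observed growth;
the remainder comes from slab rounding of the larger object size.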


Honza

> latency
> -------
> Here I calculate the latency of the following operations from the
> result of trace points. (Unit: *us*)
>
> - es_find_delayed_extent_range -
> x vanilla
> + patched
> N Min Max Median Avg Stddev
> x 4425 0 54 1 1.0280226 1.0898311
> + 7063 0 140 1 1.0127425 1.8662003
>
> - es_insert_extent -
> x vanilla
> + patched
> N Min Max Median Avg Stddev
> x 26995 1 1270 2 4.1053899 20.26818
> + 26559 1 440 2 3.0948454 7.6911348
>
> - es_lookup_extent -
> x vanilla
> + patched
> N Min Max Median Avg Stddev
> x 32025 0 104 1 1.6082748 1.385739
> + 31433 0 168 1 1.6219578 1.4846153
>
> - es_shrink_scan -
> x vanilla
> + patched
> N Min Max Median Avg Stddev
> x 1219 24 1872 171 278.35603 275.8815
> + 1405 17 339 86 95.057651 33.216144
>
> The latency results show that the patch can reduce the latency when
> the es shrinker tries to reclaim entries. Meanwhile other operations
> are not impacted much.
>
> TODO list
> =========
>
> Numa-aware shrinker
> -------------------
> The generic shrinker is now NUMA-aware, but the es shrinker doesn't
> support this yet. So one improvement is to make the es shrinker
> NUMA-aware. The key issue is that extent_status objects might not be
> on the same node as their inode. That means we would need a lru_list
> in ext4_sb_info to make the s_es_lru list NUMA-aware, and a lru_list
> in ext4_es_tree to make the evictable list NUMA-aware. I am not sure
> it is worthwhile.
>
> rcu-skiplist
> ------------
> Chris Mason has sent out an rcu-skiplist patch set. So a potential
> improvement here is to use the rcu-skiplist in the extents status tree
> to track all entries. Now we use an rb-tree to do the same thing. The
> drawback of the rb-tree here is that we need to rotate the subtree
> when a node is inserted or removed. That means the rb-tree cannot
> provide very stable latency while it rebalances the subtree. The
> skiplist does not need to do this, so I suspect it would provide
> smoother latency.
>
> Regards,
> - Zheng
>
> Zheng Liu (2):
> ext4: improve extents status tree trace point
> ext4: improve extents status tree shrinker to avoid scanning delayed entries
>
> fs/ext4/extents_status.c | 53 +++++++++++++++++++++++--------------------
> fs/ext4/extents_status.h | 2 ++
> include/trace/events/ext4.h | 46 +++++++++++++++++++++++++++++++++----
> 3 files changed, 71 insertions(+), 30 deletions(-)
>
> --
> 1.7.9.7
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-12-25 03:17:40

by Zheng Liu

Subject: Re: [RFC PATCH 0/2] ext4: extents status tree shrinker improvement

On Mon, Dec 23, 2013 at 10:03:34AM +0100, Jan Kara wrote:
> On Fri 20-12-13 18:42:43, Zheng Liu wrote:
> > Hi all,
> >
> > I still have some ideas for improving the extents status tree shrinker.
> > But before doing that, I think I should send out these patches for
> > discussion because I want to be sure that I am on the right track. I
> > describe my work below in four parts. The first part describes my
> > improvement to the shrinker. In the second part I describe the test
> > case used for measurement. In the third part I measure the performance
> > and compare my work with the current implementation in the mainline
> > kernel. The last part is a TODO list, describing some ideas that might
> > improve the performance of the extents status tree shrinker.
> >
> > Improvement
> > ===========
> > According to this bug report [1], the extents status tree shrinker might
> > cause excessive stalls under heavy memory pressure. Because the report
> > lacks some details, I am not sure whether delalloc is enabled or not.
> > But if I remember correctly, Andreas has mentioned that he also noticed
> > the es shrinker taking too much time trying to reclaim entries from the
> > extents status tree, and he observed that the app does a lot of
> > buffered writes. So I suspect that the root cause is that under some
> > workloads the es shrinker has to scan a lot of delayed entries that
> > cannot be reclaimed. Hence it burns too much CPU time.
> >
> > 1. https://bugzilla.kernel.org/show_bug.cgi?id=62751
> >
> > Why don't we simply discard delayed entries from the extents status
> > tree? We need the delayed entries to reserve space for delayed
> > allocation and to support seek_data/hole, so we cannot discard them.
> >
> > This patch series contains two patches. Patch 1 just improves the
> > trace points in the extents status tree. Later I will use them to
> > calculate the latency of every function. This is a trivial patch, so
> > feel free to pick it up.
> >
> > Patch 2 tries to make the es shrinker avoid scanning delayed entries.
> > It adds an evictable list in order to track all reclaimable objects,
> > and a hlist_node structure is added to the 'extent_status' structure.
> > The drawback of using hlist_node is that it costs an extra 1/3 of
> > memory. On a 64-bit platform the extent_status structure occupies 48
> > bytes; with the patch it takes 64 bytes. The es shrinker can then
> > traverse this list without encountering any delayed entries. This
> > reduces the time spent holding sbi->s_es_lru_lock and ei->i_es_lock,
> > because the shrinker no longer needs to scan all the delayed entries.
> >
> > Workload
> > ========
> > To measure the improvement, I use fio to generate a workload that
> > creates some fragmented files. To get an extremely fragmented tree, I
> > hack the code in ext4_es_can_be_merged() to always return 0. That
> > means we never try to merge any entries in the extents status tree.
> > Meanwhile vm.dirty_ratio and vm.dirty_background_ratio are raised to
> > 80 and 60 on my machine. The purpose of this is to accumulate more
> > delayed entries in the extents status tree. The fio config file is
> > shown below:
> >
> > [global]
> > ioengine=psync
> > bs=4k
> > directory=/mnt/sda1
> > thread
> > group_reporting
> > fallocate=0
> > direct=0
> > filesize=10g
> > size=20g
> > runtime=300
> >
> > [io]
> > rw=randwrite:32
> > rw_sequencer=sequential
> > numjobs=25
> > nrfiles=10
> Well, my suspicion is the stall can be caused when we have lots of
> relatively small files and we go through all the pain of scanning the
> inode list, then resorting it, and scanning again. Your workload doesn't
> exercise this path at all if I understand right.

You are right. My workload doesn't exercise the path you suspect. But I
am happy to create such a workload and try again to look at the result
of the improvement.

>
> > sysctl settings:
> >
> > #!/bin/bash
> > #
> > # This script is used to adjust sysctl parameters to increase memory
> > # pressure and accumulate some delayed entries.
> >
> > sudo sysctl vm.dirty_ratio=80
> > sudo sysctl vm.dirty_background_ratio=60
> >
> > Using this workload, the extents status tree will look like this:
> > [0, 1)[1, 2)...[31, 32)[32, 51)[51, 52)[52, 53)...[101, 102)...[111, 112)
> > [ D ][ D ]...[ D ][ H ][ D ][ D ]...[ W ]...[ U ]
> >
> > D: Delayed entry
> > H: Hole entry
> > W: Written entry
> > U: Unwritten entry
> >
> > *NOTE*
> > This workload could not reproduce the problem reported in [1].
> >
> > I ran this workload on my desktop, which has an Intel Core Duo CPU @
> > 3.0 GHz, 4G memory, and a HDD. I will run the same workload on a
> > server to measure the result of the improvement.
> >
> > Comparison
> > ===========
> >
> > slabtop
> > -------
> > slabtop is used to measure the memory space that the extent_status slab
> > occupies. The following is the result of $(slabtop -o):
> >
> > vanilla:
> > 718890 688677 95% 0.04K 7730 93 30920K ext4_extent_status
> > patched:
> > 718420 675933 94% 0.05K 10565 68 42260K ext4_extent_status
> >
> > vanilla patched +/- (%)
> > ext4_extent_status 30920k 42260k 36.68
> >
> > As I said above, adding hlist_node adds ~1/3 extra memory space.
> >
> > perf record -g
> > --------------
> > The following is the result of $(perf diff), which compares the results
> > of two $(perf record -g) runs. Here I have extracted all the *_es_*
> > functions:
> >
> > # Event 'cycles'
> > #
> > # Baseline Delta Shared Object Symbol
> > # ........ ....... .................. ............................................
> > #
> > 1.82% +0.04% [ext4] [k] ext4_es_lookup_extent
> > 0.55% +0.01% [ext4] [k] __es_tree_search
> > 0.39% -0.05% [ext4] [k] __es_insert_extent
> > 0.10% +0.02% [ext4] [k] ext4_es_insert_extent
> > 0.08% +0.01% [ext4] [k] __es_remove_extent
> > 0.07% +0.01% [ext4] [k] ext4_es_lru_add
> > 0.01% -0.01% [ext4] [k] __es_try_to_reclaim_extents
> > 0.00% [ext4] [k] ext4_es_free_extent
> > 0.00% [ext4] [k] ext4_es_cache_extent
> >
> > perf lock record -g
> > -------------------
> > Here is the result of $(perf lock report -k wait_max).
> >
> > Name acquired contended avg wait (ns) total wait (ns) max wait (ns) min wait (ns)
> > vanilla:
> > &(&sbi->s_es_lru... 152 1 87275 87275 87275 87275
> > patched:
> > &(&sbi->s_es_lru... 148 0 0 0 0 0
> >
> > 'sbi->s_es_lru_lock' protects the whole LRU list of inodes that have
> > reclaimable entries. When the es shrinker tries to reclaim some
> > entries, this lock is held during the process. From the result we can
> > see that the improvement can reduce the waiting time.
>
> Well, this doesn't say too much. During the whole run there was exactly one
> case where we contended for the s_es_lru_lock. It may be just luck you
> didn't hit the contention in the 'patched' kernel.
>
> Also without having a reproducer it's a bit hard to judge whether we are
> indeed improving a thing which needs improvement - i.e., whether the gain
> in latency is worth the cost in memory usage.

Yes, I am also afraid of this issue. But to be honest, it is difficult
for us to create a very similar workload before we get more details.
So maybe we can create some workloads to test different corner cases,
and make sure that this improvement brings benefits in all of them.
Now we have two cases for measuring the two locks in the es shrinker:

1. create some big files that have a lot of delayed extents
2. create a huge number of small files that each have a few extents

The first case measures the overhead of s_es_lru_lock and i_es_lock.
The second one measures the overhead of s_es_lru_lock. More cases
are welcome.
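
As a starting point for the second case, here is a minimal shell sketch
that creates many small files with a few non-contiguous extents each
(TESTDIR and the counts are placeholders; a real run would point at an
ext4 mount and would likely be driven by fio, as above):

```sh
#!/bin/sh
# Workload sketch: many small files, each with a few separate extents.
# TESTDIR and NFILES are assumptions; adjust them for a real ext4 mount.
TESTDIR=${TESTDIR:-/tmp/es-small-files}
NFILES=${NFILES:-100}

mkdir -p "$TESTDIR"
i=0
while [ "$i" -lt "$NFILES" ]; do
    # Two small writes with a gap, so each file gets a few extents
    dd if=/dev/zero of="$TESTDIR/f$i" bs=4k count=2 2>/dev/null
    dd if=/dev/zero of="$TESTDIR/f$i" bs=4k count=2 seek=8 \
       conv=notrunc 2>/dev/null
    i=$((i + 1))
done
echo "created $(ls "$TESTDIR" | wc -l) files"
```

On tmpfs this only exercises the file-creation pattern; the extents
status tree behaviour of course needs the files on an ext4 filesystem.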

Thanks,
- Zheng

>
>
> Honza
>
> > latency
> > -------
> > Here I calculate the latency of the following operations from the
> > result of trace points. (Unit: *us*)
> >
> > - es_find_delayed_extent_range -
> > x vanilla
> > + patched
> > N Min Max Median Avg Stddev
> > x 4425 0 54 1 1.0280226 1.0898311
> > + 7063 0 140 1 1.0127425 1.8662003
> >
> > - es_insert_extent -
> > x vanilla
> > + patched
> > N Min Max Median Avg Stddev
> > x 26995 1 1270 2 4.1053899 20.26818
> > + 26559 1 440 2 3.0948454 7.6911348
> >
> > - es_lookup_extent -
> > x vanilla
> > + patched
> > N Min Max Median Avg Stddev
> > x 32025 0 104 1 1.6082748 1.385739
> > + 31433 0 168 1 1.6219578 1.4846153
> >
> > - es_shrink_scan -
> > x vanilla
> > + patched
> > N Min Max Median Avg Stddev
> > x 1219 24 1872 171 278.35603 275.8815
> > + 1405 17 339 86 95.057651 33.216144
> >
> > The result of latency shows that the patch can reduce the latency when es
> > shrinker tries to reclaim the entry. Meanwhile other operations don't be
> > impacted too much.
> >
> > TODO list
> > =========
> >
> > Numa-aware shrinker
> > -------------------
> > Now the shrinker has been numa-aware. But the es shrinker doesn't
> > support it. So an improvement is to make es shrinker numa-aware. The
> > key issue is that extent_status objects might not be on the same node
> > with its inode. So that means we need to use lru_list on ext4_sb_info in
> > order to make s_es_lru list numa-aware, and use lru_list on ext4_es_tree
> > for making evictable list numa-aware. I am not sure it is worthwhile.
> >
> > rcu-skiplist
> > ------------
> > Chris Mason has sent out a rcu-skiplist patch set. So an potential
> > improvement here is to use rcu-skiplist in extents status tree to track
> > all entries. Now we use rb-tree to do the same thing. But the defect
> > of using rb-tree here is that we need to rotate the subtree when a node
> > is inserted or removed. That means that rb-tree couldn't provide a very
> > stable latency when it tries to balance the subtree. The skiplist does
> > not need to do this. So I suspect that it would provide more smooth
> > latency.
> >
> > Regards,
> > - Zheng
> >
> > Zheng Liu (2):
> > ext4: improve extents status tree trace point
> > ext4: improve extents status tree shrinker to avoid scanning delayed entries
> >
> > fs/ext4/extents_status.c | 53 +++++++++++++++++++++++--------------------
> > fs/ext4/extents_status.h | 2 ++
> > include/trace/events/ext4.h | 46 +++++++++++++++++++++++++++++++++----
> > 3 files changed, 71 insertions(+), 30 deletions(-)
> >
> > --
> > 1.7.9.7
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2013-12-25 03:19:57

by Zheng Liu

Subject: Re: [RFC PATCH 1/2] ext4: improve extents status tree trace point

On Mon, Dec 23, 2013 at 09:39:03AM +0100, Jan Kara wrote:
> On Fri 20-12-13 18:42:44, Zheng Liu wrote:
> > From: Zheng Liu <[email protected]>
> >
> > This commit improves the trace points of the extents status tree. To
> > improve the shrinker, we need to collect more details from the trace
> > points. First we add more enter/exit pairs in the insertion, deletion,
> > and caching code paths in order to collect the latency of these code
> > paths. Then we rename trace_ext4_es_shrink_enter in ext4_es_count(),
> > because it is also used in ext4_es_scan() and we cannot distinguish
> > the two in the result.
> Umm, I think you can use the generic function tracing infrastructure to
> trace entry & exit of arbitrary functions in the kernel. So there
> shouldn't be a need to introduce extra entry & exit tracepoints.

Ah, thanks for reminding me. Yes, function tracing is extremely useful!
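
For the record, a rough sketch of how the function_graph tracer could
measure these latencies without any new tracepoints (this assumes root,
a kernel with ftrace enabled, and tracefs at /sys/kernel/debug/tracing;
the filter glob is an assumption about which functions are of interest):

```sh
# Sketch: time ext4 extents-status functions with ftrace's
# function_graph tracer instead of dedicated enter/exit tracepoints.
cd /sys/kernel/debug/tracing

echo 'ext4_es_*' > set_ftrace_filter   # restrict tracing to es functions
echo function_graph > current_tracer   # record per-call durations
echo 1 > tracing_on

# ... run the fio workload here ...

echo 0 > tracing_on
head -40 trace                         # per-call durations in the output
echo nop > current_tracer              # restore the default tracer
```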

- Zheng

>
> Honza
>
> > Cc: "Theodore Ts'o" <[email protected]>
> > Cc: Andreas Dilger <[email protected]>
> > Signed-off-by: Zheng Liu <[email protected]>
> > ---
> > fs/ext4/extents_status.c | 15 ++++++++------
> > include/trace/events/ext4.h | 46 ++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 50 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index 3981ff7..e842d74 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -660,7 +660,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> > newes.es_len = len;
> > ext4_es_store_pblock(&newes, pblk);
> > ext4_es_store_status(&newes, status);
> > - trace_ext4_es_insert_extent(inode, &newes);
> > + trace_ext4_es_insert_extent_enter(inode, &newes);
> >
> > ext4_es_insert_extent_check(inode, &newes);
> >
> > @@ -678,6 +678,7 @@ retry:
> >
> > error:
> > write_unlock(&EXT4_I(inode)->i_es_lock);
> > + trace_ext4_es_insert_extent_exit(inode, &newes);
> >
> > ext4_es_print_tree(inode);
> >
> > @@ -701,7 +702,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> > newes.es_len = len;
> > ext4_es_store_pblock(&newes, pblk);
> > ext4_es_store_status(&newes, status);
> > - trace_ext4_es_cache_extent(inode, &newes);
> > + trace_ext4_es_cache_extent_enter(inode, &newes);
> >
> > if (!len)
> > return;
> > @@ -714,6 +715,7 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
> > if (!es || es->es_lblk > end)
> > __es_insert_extent(inode, &newes);
> > write_unlock(&EXT4_I(inode)->i_es_lock);
> > + trace_ext4_es_cache_extent_exit(inode, &newes);
> > }
> >
> > /*
> > @@ -887,7 +889,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> > ext4_lblk_t end;
> > int err = 0;
> >
> > - trace_ext4_es_remove_extent(inode, lblk, len);
> > + trace_ext4_es_remove_extent_enter(inode, lblk, len);
> > es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
> > lblk, len, inode->i_ino);
> >
> > @@ -901,6 +903,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> > err = __es_remove_extent(inode, lblk, end);
> > write_unlock(&EXT4_I(inode)->i_es_lock);
> > ext4_es_print_tree(inode);
> > + trace_ext4_es_remove_extent_exit(inode, lblk, len);
> > return err;
> > }
> >
> > @@ -1017,7 +1020,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
> >
> > sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
> > nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
> > - trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
> > + trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
> > return nr;
> > }
> >
> > @@ -1030,14 +1033,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
> > int ret, nr_shrunk;
> >
> > ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
> > - trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
> > + trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
> >
> > if (!nr_to_scan)
> > return ret;
> >
> > nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
> >
> > - trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
> > + trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
> > return nr_shrunk;
> > }
> >
> > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> > index 197d312..8b3ff82 100644
> > --- a/include/trace/events/ext4.h
> > +++ b/include/trace/events/ext4.h
> > @@ -2221,19 +2221,31 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
> > __entry->pblk, show_extent_status(__entry->status))
> > );
> >
> > -DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
> > +DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_enter,
> > TP_PROTO(struct inode *inode, struct extent_status *es),
> >
> > TP_ARGS(inode, es)
> > );
> >
> > -DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent,
> > +DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent_exit,
> > TP_PROTO(struct inode *inode, struct extent_status *es),
> >
> > TP_ARGS(inode, es)
> > );
> >
> > -TRACE_EVENT(ext4_es_remove_extent,
> > +DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_enter,
> > + TP_PROTO(struct inode *inode, struct extent_status *es),
> > +
> > + TP_ARGS(inode, es)
> > +);
> > +
> > +DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent_exit,
> > + TP_PROTO(struct inode *inode, struct extent_status *es),
> > +
> > + TP_ARGS(inode, es)
> > +);
> > +
> > +DECLARE_EVENT_CLASS(ext4__es_remove_extent,
> > TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> >
> > TP_ARGS(inode, lblk, len),
> > @@ -2258,6 +2270,18 @@ TRACE_EVENT(ext4_es_remove_extent,
> > __entry->lblk, __entry->len)
> > );
> >
> > +DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_enter,
> > + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> > +
> > + TP_ARGS(inode, lblk, len)
> > +);
> > +
> > +DEFINE_EVENT(ext4__es_remove_extent, ext4_es_remove_extent_exit,
> > + TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
> > +
> > + TP_ARGS(inode, lblk, len)
> > +);
> > +
> > TRACE_EVENT(ext4_es_find_delayed_extent_range_enter,
> > TP_PROTO(struct inode *inode, ext4_lblk_t lblk),
> >
> > @@ -2366,7 +2390,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
> > show_extent_status(__entry->found ? __entry->status : 0))
> > );
> >
> > -TRACE_EVENT(ext4_es_shrink_enter,
> > +DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
> > TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
> >
> > TP_ARGS(sb, nr_to_scan, cache_cnt),
> > @@ -2388,7 +2412,19 @@ TRACE_EVENT(ext4_es_shrink_enter,
> > __entry->nr_to_scan, __entry->cache_cnt)
> > );
> >
> > -TRACE_EVENT(ext4_es_shrink_exit,
> > +DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
> > + TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
> > +
> > + TP_ARGS(sb, nr_to_scan, cache_cnt)
> > +);
> > +
> > +DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
> > + TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
> > +
> > + TP_ARGS(sb, nr_to_scan, cache_cnt)
> > +);
> > +
> > +TRACE_EVENT(ext4_es_shrink_scan_exit,
> > TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),
> >
> > TP_ARGS(sb, shrunk_nr, cache_cnt),
> > --
> > 1.7.9.7
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2013-12-25 03:31:23

by Zheng Liu

Subject: Re: [RFC PATCH 2/2] ext4: improve extents status tree shrinker to avoid scanning delayed entries

On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > From: Zheng Liu <[email protected]>
> >
> > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > under heavy memory pressure and try to reclaim entries from the extents
> > status tree. During this process it cannot reclaim delayed entries,
> > because ext4 needs them for delayed allocation space reservation,
> > seek_data/hole, etc. So if a system has done a huge number of writes
> > and the dirty pages have not been written out, there will be a lot of
> > delayed entries in the extents status tree. If the shrinker tries to
> > reclaim memory from the tree, it will burn CPU time iterating over
> > these non-reclaimable entries. Under some circumstances this can cause
> > excessive stall time.
> >
> > In this commit a new list is used to track the reclaimable entries of
> > the extents status tree (i.e. written/unwritten/hole entries). The
> > shrinker scans the reclaimable entries on this list, so it won't
> > encounter any delayed entries and doesn't need to spin for too long.
> > But the drawback is that each entry costs an extra 1/3 of memory:
> > before this commit 'struct extent_status' occupied 48 bytes on a
> > 64-bit platform; after it, it occupies 64 bytes. :(
> This looks sensible. I was just wondering about one thing: One incorrect
> thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> objects it can find. Now we shouldn't always start scanning at the end of
> the LRU because if delayed extents accumulate there we would never reclaim
> anything. Rather we should cycle through the list of entries we have. But
> that doesn't play well with the fact we have LRU list and thus want to
> reclaim from the end of the list. In the end what you do might be the best
> we can do but I wanted to mention the above just in case someone has some
> idea.

Ah, thanks for pointing it out. So maybe we can fix this issue first,
before we decide whether the new improvement is acceptable, because
the improvement lets us avoid scanning too many objects. What do you
think?

Regards,
- Zheng

>
> Honza
>
> > Cc: "Theodore Ts'o" <[email protected]>
> > Cc: Andreas Dilger <[email protected]>
> > Signed-off-by: Zheng Liu <[email protected]>
> > ---
> > fs/ext4/extents_status.c | 38 +++++++++++++++++++-------------------
> > fs/ext4/extents_status.h | 2 ++
> > 2 files changed, 21 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index e842d74..11bdb2f 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -169,6 +169,7 @@ void ext4_exit_es(void)
> > void ext4_es_init_tree(struct ext4_es_tree *tree)
> > {
> > tree->root = RB_ROOT;
> > + INIT_HLIST_HEAD(&tree->evictable_list);
> > tree->cache_es = NULL;
> > }
> >
> > @@ -300,10 +301,14 @@ static struct extent_status *
> > ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> > ext4_fsblk_t pblk)
> > {
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > struct extent_status *es;
> > +
> > es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
> > if (es == NULL)
> > return NULL;
> > +
> > + INIT_HLIST_NODE(&es->es_list);
> > es->es_lblk = lblk;
> > es->es_len = len;
> > es->es_pblk = pblk;
> > @@ -312,8 +317,9 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> > * We don't count delayed extent because we never try to reclaim them
> > */
> > if (!ext4_es_is_delayed(es)) {
> > - EXT4_I(inode)->i_es_lru_nr++;
> > + ei->i_es_lru_nr++;
> > percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
> > + hlist_add_head(&es->es_list, &ei->i_es_tree.evictable_list);
> > }
> >
> > return es;
> > @@ -321,10 +327,12 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> >
> > static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
> > {
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > /* Decrease the lru counter when this es is not delayed */
> > if (!ext4_es_is_delayed(es)) {
> > - BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
> > - EXT4_I(inode)->i_es_lru_nr--;
> > + BUG_ON(ei->i_es_lru_nr-- == 0);
> > + hlist_del_init(&es->es_list);
> > percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
> > }
> >
> > @@ -1092,8 +1100,8 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> > {
> > struct inode *inode = &ei->vfs_inode;
> > struct ext4_es_tree *tree = &ei->i_es_tree;
> > - struct rb_node *node;
> > struct extent_status *es;
> > + struct hlist_node *tmp;
> > unsigned long nr_shrunk = 0;
> > static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
> > DEFAULT_RATELIMIT_BURST);
> > @@ -1105,21 +1113,13 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> > __ratelimit(&_rs))
> > ext4_warning(inode->i_sb, "forced shrink of precached extents");
> >
> > - node = rb_first(&tree->root);
> > - while (node != NULL) {
> > - es = rb_entry(node, struct extent_status, rb_node);
> > - node = rb_next(&es->rb_node);
> > - /*
> > - * We can't reclaim delayed extent from status tree because
> > - * fiemap, bigallic, and seek_data/hole need to use it.
> > - */
> > - if (!ext4_es_is_delayed(es)) {
> > - rb_erase(&es->rb_node, &tree->root);
> > - ext4_es_free_extent(inode, es);
> > - nr_shrunk++;
> > - if (--nr_to_scan == 0)
> > - break;
> > - }
> > + hlist_for_each_entry_safe(es, tmp, &tree->evictable_list, es_list) {
> > + BUG_ON(ext4_es_is_delayed(es));
> > + rb_erase(&es->rb_node, &tree->root);
> > + ext4_es_free_extent(inode, es);
> > + nr_shrunk++;
> > + if (--nr_to_scan == 0)
> > + break;
> > }
> > tree->cache_es = NULL;
> > return nr_shrunk;
> > diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> > index 167f4ab8..38ca83e 100644
> > --- a/fs/ext4/extents_status.h
> > +++ b/fs/ext4/extents_status.h
> > @@ -54,6 +54,7 @@ struct ext4_extent;
> >
> > struct extent_status {
> > struct rb_node rb_node;
> > + struct hlist_node es_list;
> > ext4_lblk_t es_lblk; /* first logical block extent covers */
> > ext4_lblk_t es_len; /* length of extent in block */
> > ext4_fsblk_t es_pblk; /* first physical block */
> > @@ -61,6 +62,7 @@ struct extent_status {
> >
> > struct ext4_es_tree {
> > struct rb_root root;
> > + struct hlist_head evictable_list;
> > struct extent_status *cache_es; /* recently accessed extent */
> > };
> >
> > --
> > 1.7.9.7
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2013-12-30 21:09:20

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] ext4: improve extents status tree shrinker to avoid scanning delayed entries

On Wed 25-12-13 11:34:48, Zheng Liu wrote:
> On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> > On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > > From: Zheng Liu <[email protected]>
> > >
> > > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > > under heavy memory pressure, and try to reclaim the entry from extents
> > > status tree. During this process it couldn't reclaim the delayed entry
> > > because ext4 needs to use these entries to do delayed allocation space
> > > reservation, seek_data/hole, etc.... So if a system has done a huge
> > > number of writes and these dirty pages don't be written out. There will
> > > be a lot of delayed entries on extents status tree. If shrinker tries
> > > to reclaim memory from the tree, it will burn some CPU time to iterate
> > > on these non-reclaimable entries. At some circumstances it could cause
> > > excessive stall time.
> > >
> > > In this commit a new list is used to track reclaimable entries of extent
> > > status tree (e.g. written/unwritten/hole entries). The shrinker will
> > > scan reclaimable entry on this list. So it won't encouter any delayed
> > > entry and don't need to take too much time to spin. But the defect is
> > > that we need to cost extra 1/3 memory space for one entry. Before this
> > > commit, 'struct extent_status' occupies 48 bytes on a 64bits platform.
> > > After that it will occupy 64 bytes. :(
> > This looks sensible. I was just wondering about one thing: One incorrect
> > thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> > objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> > objects it can find. Now we shouldn't always start scanning at the end of
> > the LRU because if delayed extents accumulate there we would never reclaim
> > anything. Rather we should cycle through the list of entries we have. But
> > that doesn't play well with the fact we have LRU list and thus want to
> > reclaim from the end of the list. In the end what you do might be the best
> > we can do but I wanted to mention the above just in case someone has some
> > idea.
>
> Ah, thanks for pointing it out. So maybe we can fix this issue before
> we are sure that the new improvement is acceptable because it makes us
> avoid scanning too many objects. What do you think?
I'm sorry but I'm not sure I understand. By 'fix this issue' do you mean
using your patch or somehow fixing the problem that we try to reclaim
'nr_to_scan' objects instead of just trying to scan that many objects?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-12-31 02:47:20

by Zheng Liu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/2] ext4: improve extents status tree shrinker to avoid scanning delayed entries

On Mon, Dec 30, 2013 at 10:09:17PM +0100, Jan Kara wrote:
> On Wed 25-12-13 11:34:48, Zheng Liu wrote:
> > On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> > > On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > > > From: Zheng Liu <[email protected]>
> > > >
> > > > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > > > under heavy memory pressure, and try to reclaim the entry from extents
> > > > status tree. During this process it couldn't reclaim the delayed entry
> > > > because ext4 needs to use these entries to do delayed allocation space
> > > > reservation, seek_data/hole, etc.... So if a system has done a huge
> > > > number of writes and these dirty pages don't be written out. There will
> > > > be a lot of delayed entries on extents status tree. If shrinker tries
> > > > to reclaim memory from the tree, it will burn some CPU time to iterate
> > > > on these non-reclaimable entries. At some circumstances it could cause
> > > > excessive stall time.
> > > >
> > > > In this commit a new list is used to track reclaimable entries of extent
> > > > status tree (e.g. written/unwritten/hole entries). The shrinker will
> > > > scan reclaimable entry on this list. So it won't encouter any delayed
> > > > entry and don't need to take too much time to spin. But the defect is
> > > > that we need to cost extra 1/3 memory space for one entry. Before this
> > > > commit, 'struct extent_status' occupies 48 bytes on a 64bits platform.
> > > > After that it will occupy 64 bytes. :(
> > > This looks sensible. I was just wondering about one thing: One incorrect
> > > thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> > > objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> > > objects it can find. Now we shouldn't always start scanning at the end of
> > > the LRU because if delayed extents accumulate there we would never reclaim
> > > anything. Rather we should cycle through the list of entries we have. But
> > > that doesn't play well with the fact we have LRU list and thus want to
> > > reclaim from the end of the list. In the end what you do might be the best
> > > we can do but I wanted to mention the above just in case someone has some
> > > idea.
> >
> > Ah, thanks for pointing it out. So maybe we can fix this issue before
> > we are sure that the new improvement is acceptable because it makes us
> > avoid scanning too many objects. What do you think?
> I'm sorry but I'm not sure I understand. By 'fix this issue' do you mean
> using your patch or somehow fixing the problem that we try to reclaim
> 'nr_to_scan' objects instead of just trying to scan that many objects?

Sorry, let me clarify. I mean that we could have a single patch that
fixes the issue of trying to reclaim 'nr_to_scan' objects. After that
we would avoid scanning too many objects in the extent status tree. My
idea is to fix this issue with a standalone patch, so that it doesn't
need to wait for the other improvements, since we still need some time
to verify that those improvements are useful.

Thanks for the reply and happy new year :)!
- Zheng

2013-12-31 08:16:41

by Zheng Liu

[permalink] [raw]
Subject: [PATCH] ext4: make es shrinker handle nr_to_scan correctly (Re: [RFC PATCH 2/2] ext4: improve ...)

On Mon, Dec 30, 2013 at 10:09:17PM +0100, Jan Kara wrote:
> On Wed 25-12-13 11:34:48, Zheng Liu wrote:
> > On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> > > On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > > > From: Zheng Liu <[email protected]>
> > > >
> > > > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > > > under heavy memory pressure, and try to reclaim the entry from extents
> > > > status tree. During this process it couldn't reclaim the delayed entry
> > > > because ext4 needs to use these entries to do delayed allocation space
> > > > reservation, seek_data/hole, etc.... So if a system has done a huge
> > > > number of writes and these dirty pages don't be written out. There will
> > > > be a lot of delayed entries on extents status tree. If shrinker tries
> > > > to reclaim memory from the tree, it will burn some CPU time to iterate
> > > > on these non-reclaimable entries. At some circumstances it could cause
> > > > excessive stall time.
> > > >
> > > > In this commit a new list is used to track reclaimable entries of extent
> > > > status tree (e.g. written/unwritten/hole entries). The shrinker will
> > > > scan reclaimable entry on this list. So it won't encouter any delayed
> > > > entry and don't need to take too much time to spin. But the defect is
> > > > that we need to cost extra 1/3 memory space for one entry. Before this
> > > > commit, 'struct extent_status' occupies 48 bytes on a 64bits platform.
> > > > After that it will occupy 64 bytes. :(
> > > This looks sensible. I was just wondering about one thing: One incorrect
> > > thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> > > objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> > > objects it can find. Now we shouldn't always start scanning at the end of
> > > the LRU because if delayed extents accumulate there we would never reclaim
> > > anything. Rather we should cycle through the list of entries we have. But
> > > that doesn't play well with the fact we have LRU list and thus want to
> > > reclaim from the end of the list. In the end what you do might be the best
> > > we can do but I wanted to mention the above just in case someone has some
> > > idea.
> >
> > Ah, thanks for pointing it out. So maybe we can fix this issue before
> > we are sure that the new improvement is acceptable because it makes us
> > avoid scanning too many objects. What do you think?
> I'm sorry but I'm not sure I understand. By 'fix this issue' do you mean
> using your patch or somehow fixing the problem that we try to reclaim
> 'nr_to_scan' objects instead of just trying to scan that many objects?

This patch tries to make the es shrinker handle nr_to_scan properly.
After applying this patch, the es shrinker just scans nr_to_scan
objects. But __ext4_es_shrink() can't guarantee that it reclaims any
objects from the extent status tree, because it doesn't scan all
objects and it might only scan nr_to_scan delayed objects. So
ext4_es_insert/remove_extent() could return an ENOMEM error even
though there are reclaimable objects in the tree.
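
To illustrate the scan-vs-reclaim semantics this patch aims for, here is a
minimal userspace sketch (not the kernel code itself; the struct and function
names are made up for illustration). The key point is that the scan budget is
charged per object visited, while only reclaimable objects are actually freed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an extent cache entry: delayed entries must
 * be kept, everything else may be reclaimed. */
struct entry {
	int delayed;	/* 1 = delayed (unreclaimable) */
	int freed;	/* set once the entry has been reclaimed */
};

/*
 * Visit at most *nr_to_scan entries.  The budget is charged for every
 * entry looked at, not for every entry freed, which is the behavior
 * the patch introduces.  Returns the number of entries reclaimed.
 */
static int scan_entries(struct entry *entries, size_t n, int *nr_to_scan)
{
	int nr_shrunk = 0;
	size_t i;

	for (i = 0; i < n && *nr_to_scan > 0; i++) {
		(*nr_to_scan)--;		/* charge the visit itself */
		if (!entries[i].delayed) {
			entries[i].freed = 1;	/* reclaim this entry */
			nr_shrunk++;
		}
	}
	return nr_shrunk;
}
```

With four entries where only the second and fourth are reclaimable and a
budget of 3, the scan visits three entries but frees only one before the
budget runs out, mirroring how a run of delayed entries can eat the whole
budget without any reclaim.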

Another approach to solve the above problem is to add a new parameter,
called 'nr_to_discard', which tells the es shrinker to reclaim
nr_to_discard objects. But that makes things a bit more complicated,
and I am not sure whether it is worthwhile, because once we use a list
to track all reclaimable objects we no longer need the nr_to_discard
parameter.

Obviously, this patch is only a band-aid and might not be very useful
for solving the problem we face. But I attach it below in case someone
has an idea.

Regards,
- Zheng

Subject: [PATCH] ext4: make es shrinker handle nr_to_scan correctly

From: Zheng Liu <[email protected]>

nr_to_scan means that the shrinker should traverse nr_to_scan objects
rather than reclaim nr_to_scan objects. This commit makes the es
shrinker handle it correctly. But after this change __ext4_es_shrink()
can no longer guarantee that it reclaims any objects. Hence we need to
enlarge the value from 1 to 128 in ext4_es_insert/remove_extent() to
avoid returning an ENOMEM error. Until we use a list to track all
reclaimable objects, there is a tradeoff between returning ENOMEM and
discarding too many objects from the extent status tree.

Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index eb7ae61..87795d1 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -146,7 +146,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
- int nr_to_scan);
+ int *nr_to_scan);
static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei);

@@ -670,7 +670,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
goto error;
retry:
err = __es_insert_extent(inode, &newes);
- if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
+ if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
EXT4_I(inode)))
goto retry;
if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
@@ -824,7 +824,7 @@ retry:
es->es_lblk = orig_es.es_lblk;
es->es_len = orig_es.es_len;
if ((err == -ENOMEM) &&
- __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
+ __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
EXT4_I(inode)))
goto retry;
goto out;
@@ -938,8 +938,6 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,

retry:
list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
- int shrunk;
-
/*
* If we have already reclaimed all extents from extent
* status tree, just stop the loop immediately.
@@ -966,13 +964,11 @@ retry:
continue;

write_lock(&ei->i_es_lock);
- shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
+ nr_shrunk += __es_try_to_reclaim_extents(ei, &nr_to_scan);
if (ei->i_es_lru_nr == 0)
list_del_init(&ei->i_es_lru);
write_unlock(&ei->i_es_lock);

- nr_shrunk += shrunk;
- nr_to_scan -= shrunk;
if (nr_to_scan == 0)
break;
}
@@ -1003,8 +999,16 @@ retry:

spin_unlock(&sbi->s_es_lru_lock);

- if (locked_ei && nr_shrunk == 0)
- nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+ if (locked_ei && nr_shrunk == 0) {
+ /*
+ * We try to reclaim objects from locked inode. We don't
+ * want to discard too many objects from this inode because
+ * it might be accessed frequently.
+ */
+ if (!nr_to_scan)
+ nr_to_scan = 8;
+ nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &nr_to_scan);
+ }

return nr_shrunk;
}
@@ -1085,7 +1089,7 @@ void ext4_es_lru_del(struct inode *inode)
}

static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
- int nr_to_scan)
+ int *nr_to_scan)
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
@@ -1114,7 +1118,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
rb_erase(&es->rb_node, &tree->root);
ext4_es_free_extent(inode, es);
nr_shrunk++;
- if (--nr_to_scan == 0)
+ if (--(*nr_to_scan) == 0)
break;
}
}
--
1.7.9.7


2013-12-31 10:59:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext4: make es shrinker handle nr_to_scan correctly (Re: [RFC PATCH 2/2] ext4: improve ...)

On Tue 31-12-13 16:20:15, Zheng Liu wrote:
> On Mon, Dec 30, 2013 at 10:09:17PM +0100, Jan Kara wrote:
> > On Wed 25-12-13 11:34:48, Zheng Liu wrote:
> > > On Mon, Dec 23, 2013 at 09:54:19AM +0100, Jan Kara wrote:
> > > > On Fri 20-12-13 18:42:45, Zheng Liu wrote:
> > > > > From: Zheng Liu <[email protected]>
> > > > >
> > > > > The extents status tree shrinker will scan all inodes on sbi->s_es_lru
> > > > > under heavy memory pressure, and try to reclaim the entry from extents
> > > > > status tree. During this process it couldn't reclaim the delayed entry
> > > > > because ext4 needs to use these entries to do delayed allocation space
> > > > > reservation, seek_data/hole, etc.... So if a system has done a huge
> > > > > number of writes and these dirty pages don't be written out. There will
> > > > > be a lot of delayed entries on extents status tree. If shrinker tries
> > > > > to reclaim memory from the tree, it will burn some CPU time to iterate
> > > > > on these non-reclaimable entries. At some circumstances it could cause
> > > > > excessive stall time.
> > > > >
> > > > > In this commit a new list is used to track reclaimable entries of extent
> > > > > status tree (e.g. written/unwritten/hole entries). The shrinker will
> > > > > scan reclaimable entry on this list. So it won't encouter any delayed
> > > > > entry and don't need to take too much time to spin. But the defect is
> > > > > that we need to cost extra 1/3 memory space for one entry. Before this
> > > > > commit, 'struct extent_status' occupies 48 bytes on a 64bits platform.
> > > > > After that it will occupy 64 bytes. :(
> > > > This looks sensible. I was just wondering about one thing: One incorrect
> > > > thing the old extent shrinker does is that it tries to reclaim 'nr_to_scan'
> > > > objects. That is wrong - it should *scan* 'nr_to_scan' objects and reclaim
> > > > objects it can find. Now we shouldn't always start scanning at the end of
> > > > the LRU because if delayed extents accumulate there we would never reclaim
> > > > anything. Rather we should cycle through the list of entries we have. But
> > > > that doesn't play well with the fact we have LRU list and thus want to
> > > > reclaim from the end of the list. In the end what you do might be the best
> > > > we can do but I wanted to mention the above just in case someone has some
> > > > idea.
> > >
> > > Ah, thanks for pointing it out. So maybe we can fix this issue before
> > > we are sure that the new improvement is acceptable because it makes us
> > > avoid scanning too many objects. What do you think?
> > I'm sorry but I'm not sure I understand. By 'fix this issue' do you mean
> > using your patch or somehow fixing the problem that we try to reclaim
> > 'nr_to_scan' objects instead of just trying to scan that many objects?
>
> This patch tries to make es shrinker handle nr_to_scan properly. After
> applying this patch, es shrinker just scans nr_to_scan objects. But
> __ext4_es_shrink() couldn't guarantee that it can reclaim objects from
> extent status tree because it doesn't scan all objects and it might only
> scan nr_to_scan delayed objects. So it could return ENOMEM error from
> ext4_es_insert/remove_extent(), although there are some reclaimable
> objects in the tree.
>
> Another approach to solve above problem is that we add a new parameter
> called 'nr_to_discard', which tells es shrinker to reclaim nr_to_discard
> objects. But it makes thing a bit complicated. I am not sure whether
> it is worthwhile because if we use a list to track all reclaimable
> objects we no longer need nr_to_discard parameter.
>
> Obviously, this patch is only a band-aid and it might be not very useful
> for us to solve the problem that we faced. But I attach it below in
> case someone have some idea.
>
> Regards,
> - Zheng
>
> Subject: [PATCH] ext4: make es shrinker handle nr_to_scan correctly
>
> From: Zheng Liu <[email protected]>
>
> Nr_to_scan means that the shrinker needs to traverse nr_to_scan objects
> rather than reclaim nr_to_scan objects. This commit makes es shrinker
> handle it correctly. But after this __ext4_es_shrink() couldn't
> guarantee that it can reclaim objects. Hence we need to enlarge value
> from 1 to 128 in ext4_es_insert/remove_extent() to avoid returning
> ENOMEM error. Before we use a list to track all reclaimable objects,
> there is a tradeoff between returning ENOMEM and discarding too many
> objects from extent status tree.
But the way you've implemented this won't quite work - reclaim never asks
a slab cache to scan large number of objects in one round - the maximum is
128. So if you have 128 unreclaimable objects at the beginning of your
object list you would never reclaim anything - that is what I was speaking
about in my original email. So you have to implement some 'cycling' of head
pointer so that you can continue scanning when asked next time instead of
starting over from the beginning of the list. This logic doesn't play well
with the LRU approach we try to do in the extent cache. So I really see two
viable approaches

1) Trivial: Forget about LRU, just have a list of inodes with extents in
extent cache. Shrinker will remember ino + offset pair it stopped scanning
at and continues scanning the list of inodes from there (restarting the scan
from the beginning if inode got reclaimed in the mean time).

2) Complex: Keep the LRU algorithm and try to make it more CPU efficient.

Approach 1) can possibly reclaim extents which are still very useful. OTOH
these extents will be loaded quickly back and then it should take full
cycle over all other extents to get back to these and reclaim them again.
So it needn't be too bad.
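
Approach 1) above can be sketched in userspace C as follows. All names here
are hypothetical; in the real kernel the cursor would be the remembered
ino + offset pair, the array would be the per-sb inode list, and the scan
would have to restart if the inode it stopped at got reclaimed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one inode's extent cache. */
struct inode_cache {
	int delayed;	/* 1 = holds only delayed (unreclaimable) extents */
	int cached;	/* number of cached extents */
};

/* Cursor kept across shrinker rounds; stands in for the remembered
 * ino + offset pair.  Restart-on-inode-reclaim is omitted here. */
static size_t cursor;

/*
 * One shrinker round: visit up to nr_to_scan inodes starting at the
 * saved cursor and wrap around, so a run of unreclaimable inodes at
 * the head of the list cannot starve reclaim forever.
 */
static int shrink_round(struct inode_cache *inodes, size_t n, int nr_to_scan)
{
	int nr_shrunk = 0;

	while (nr_to_scan-- > 0) {
		struct inode_cache *ic = &inodes[cursor];

		if (!ic->delayed) {		/* reclaim this inode's extents */
			nr_shrunk += ic->cached;
			ic->cached = 0;
		}
		cursor = (cursor + 1) % n;	/* continue here next round */
	}
	return nr_shrunk;
}
```

If the first two inodes hold only delayed extents, a first round with a
budget of 2 reclaims nothing, but the next round resumes at the third
inode and succeeds, which a scan that always restarted from the head of
the list would never do.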

Honza

> Signed-off-by: Zheng Liu <[email protected]>
> ---
> fs/ext4/extents_status.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index eb7ae61..87795d1 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -146,7 +146,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
> static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t end);
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> - int nr_to_scan);
> + int *nr_to_scan);
> static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> struct ext4_inode_info *locked_ei);
>
> @@ -670,7 +670,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> goto error;
> retry:
> err = __es_insert_extent(inode, &newes);
> - if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> + if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
> EXT4_I(inode)))
> goto retry;
> if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
> @@ -824,7 +824,7 @@ retry:
> es->es_lblk = orig_es.es_lblk;
> es->es_len = orig_es.es_len;
> if ((err == -ENOMEM) &&
> - __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> + __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
> EXT4_I(inode)))
> goto retry;
> goto out;
> @@ -938,8 +938,6 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
>
> retry:
> list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> - int shrunk;
> -
> /*
> * If we have already reclaimed all extents from extent
> * status tree, just stop the loop immediately.
> @@ -966,13 +964,11 @@ retry:
> continue;
>
> write_lock(&ei->i_es_lock);
> - shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> + nr_shrunk += __es_try_to_reclaim_extents(ei, &nr_to_scan);
> if (ei->i_es_lru_nr == 0)
> list_del_init(&ei->i_es_lru);
> write_unlock(&ei->i_es_lock);
>
> - nr_shrunk += shrunk;
> - nr_to_scan -= shrunk;
> if (nr_to_scan == 0)
> break;
> }
> @@ -1003,8 +999,16 @@ retry:
>
> spin_unlock(&sbi->s_es_lru_lock);
>
> - if (locked_ei && nr_shrunk == 0)
> - nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
> + if (locked_ei && nr_shrunk == 0) {
> + /*
> + * We try to reclaim objects from locked inode. We don't
> + * want to discard too many objects from this inode because
> + * it might be accessed frequently.
> + */
> + if (!nr_to_scan)
> + nr_to_scan = 8;
> + nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &nr_to_scan);
> + }
>
> return nr_shrunk;
> }
> @@ -1085,7 +1089,7 @@ void ext4_es_lru_del(struct inode *inode)
> }
>
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> - int nr_to_scan)
> + int *nr_to_scan)
> {
> struct inode *inode = &ei->vfs_inode;
> struct ext4_es_tree *tree = &ei->i_es_tree;
> @@ -1114,7 +1118,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> rb_erase(&es->rb_node, &tree->root);
> ext4_es_free_extent(inode, es);
> nr_shrunk++;
> - if (--nr_to_scan == 0)
> + if (--(*nr_to_scan) == 0)
> break;
> }
> }
> --
> 1.7.9.7
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-01-15 02:58:36

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH] ext4: make es shrinker handle nr_to_scan correctly (Re: [RFC PATCH 2/2] ext4: improve ...)

On Tue, Dec 31, 2013 at 11:59:12AM +0100, Jan Kara wrote:
[...]
> > Subject: [PATCH] ext4: make es shrinker handle nr_to_scan correctly
> >
> > From: Zheng Liu <[email protected]>
> >
> > Nr_to_scan means that the shrinker needs to traverse nr_to_scan objects
> > rather than reclaim nr_to_scan objects. This commit makes es shrinker
> > handle it correctly. But after this __ext4_es_shrink() couldn't
> > guarantee that it can reclaim objects. Hence we need to enlarge value
> > from 1 to 128 in ext4_es_insert/remove_extent() to avoid returning
> > ENOMEM error. Before we use a list to track all reclaimable objects,
> > there is a tradeoff between returning ENOMEM and discarding too many
> > objects from extent status tree.
> But the way you've implemented this won't quite work - reclaim never asks
> a slab cache to scan large number of objects in one round - the maximum is
> 128. So if you have 128 unreclaimable objects at the beginning of your
> object list you would never reclaim anything - that is what I was speaking
> about in my original email. So you have to implement some 'cycling' of head
> pointer so that you can continue scanning when asked next time instead of
> staring over from the beginning of the list. This logic doesn't play well
> with the LRU approach we try to do in the extent cache. So I really see two
> viable approaches
>
> 1) Trivial: Forget about LRU, just have a list of inodes with extents in
> extent cache. Shrinker will remember ino + offset pair it stopped scanning
> at and continues scanning the list of inodes from there (restarting the scan
> from the beginning if inode got reclaimed in the mean time).
>
> 2) Complex: Keep the LRU algorithm and try to make it more CPU efficient.
>
> Approach 1) can possibly reclaim extents which are still very useful. OTOH
> these extents will be loaded quickly back and then it should take full
> cycle over all other extents to get back to these and reclaim them again.
> So it needn't be too bad.

Sorry for the delay. I have sent out two patches that add some
statistics to the extent status tree shrinker and clean up its
tracepoints. Thanks for your review.

As you mentioned above, we now have two approaches for improving the
shrinker. IMHO, I prefer the LRU list approach, but I don't have any
numbers to back that up. So my plan is to implement both of them,
measure their performance, and then pick one as the final solution.

Thanks,
- Zheng

>
> Honza
>
> > Signed-off-by: Zheng Liu <[email protected]>
> > ---
> > fs/ext4/extents_status.c | 28 ++++++++++++++++------------
> > 1 file changed, 16 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> > index eb7ae61..87795d1 100644
> > --- a/fs/ext4/extents_status.c
> > +++ b/fs/ext4/extents_status.c
> > @@ -146,7 +146,7 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
> > static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> > ext4_lblk_t end);
> > static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> > - int nr_to_scan);
> > + int *nr_to_scan);
> > static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> > struct ext4_inode_info *locked_ei);
> >
> > @@ -670,7 +670,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> > goto error;
> > retry:
> > err = __es_insert_extent(inode, &newes);
> > - if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> > + if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
> > EXT4_I(inode)))
> > goto retry;
> > if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
> > @@ -824,7 +824,7 @@ retry:
> > es->es_lblk = orig_es.es_lblk;
> > es->es_len = orig_es.es_len;
> > if ((err == -ENOMEM) &&
> > - __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> > + __ext4_es_shrink(EXT4_SB(inode->i_sb), 128,
> > EXT4_I(inode)))
> > goto retry;
> > goto out;
> > @@ -938,8 +938,6 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> >
> > retry:
> > list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> > - int shrunk;
> > -
> > /*
> > * If we have already reclaimed all extents from extent
> > * status tree, just stop the loop immediately.
> > @@ -966,13 +964,11 @@ retry:
> > continue;
> >
> > write_lock(&ei->i_es_lock);
> > - shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> > + nr_shrunk += __es_try_to_reclaim_extents(ei, &nr_to_scan);
> > if (ei->i_es_lru_nr == 0)
> > list_del_init(&ei->i_es_lru);
> > write_unlock(&ei->i_es_lock);
> >
> > - nr_shrunk += shrunk;
> > - nr_to_scan -= shrunk;
> > if (nr_to_scan == 0)
> > break;
> > }
> > @@ -1003,8 +999,16 @@ retry:
> >
> > spin_unlock(&sbi->s_es_lru_lock);
> >
> > - if (locked_ei && nr_shrunk == 0)
> > - nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
> > + if (locked_ei && nr_shrunk == 0) {
> > + /*
> > + * We try to reclaim objects from locked inode. We don't
> > + * want to discard too many objects from this inode because
> > + * it might be accessed frequently.
> > + */
> > + if (!nr_to_scan)
> > + nr_to_scan = 8;
> > + nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &nr_to_scan);
> > + }
> >
> > return nr_shrunk;
> > }
> > @@ -1085,7 +1089,7 @@ void ext4_es_lru_del(struct inode *inode)
> > }
> >
> > static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> > - int nr_to_scan)
> > + int *nr_to_scan)
> > {
> > struct inode *inode = &ei->vfs_inode;
> > struct ext4_es_tree *tree = &ei->i_es_tree;
> > @@ -1114,7 +1118,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> > rb_erase(&es->rb_node, &tree->root);
> > ext4_es_free_extent(inode, es);
> > nr_shrunk++;
> > - if (--nr_to_scan == 0)
> > + if (--(*nr_to_scan) == 0)
> > break;
> > }
> > }
> > --
> > 1.7.9.7
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR