Hi all,
Intro
=====
Here is the third version of the patch set to improve the extent status
tree shrinker. Sorry for the very late work, and thanks to Ted and Dave
for their suggestions. After revising the work, I think the extent
status tree shrinker needs to solve two issues. One is how to reclaim
objects from the tree as quickly as possible, and the other is how to
keep as many useful extent caches in the tree as possible. The first
issue can be divided into two problems: a) how to scan the list that
tracks all inodes and b) how to reclaim objects from an inode. This
patch set tries to fix these issues. The first two patches just do some
cleanups and add some statistics for measuring the improvements.
Patch 3 makes the extent status tree cache extent holes in the delalloc
code path to improve the cache hit ratio. Currently we don't cache an
extent hole when we do a delalloc write because this hole might soon be
converted into a delayed extent. The drawback is that we can miss some
extent holes in the extent cache, which means that subsequent writes
must look up the extent tree again to check whether a block has been
allocated or not.
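For illustration only, here is a tiny userspace model of that idea (the
struct and function names are made up; this is not the ext4 code): once
a lookup finds no mapping, the hole is remembered in the cache, so later
writes into the same range do not walk the extent tree again.

#include <stdio.h>
#include <stdbool.h>

struct cached_range { unsigned int lblk, len; bool is_hole; bool valid; };

static struct cached_range cache;       /* one-entry cache for brevity */
static unsigned int tree_lookups;       /* counts expensive tree walks */

static bool block_is_allocated(unsigned int lblk)
{
        if (cache.valid && lblk >= cache.lblk && lblk < cache.lblk + cache.len)
                return !cache.is_hole;  /* cache hit: no tree walk */

        tree_lookups++;                 /* models walking the extent tree */
        cache = (struct cached_range){ .lblk = lblk & ~7u, .len = 8,
                                       .is_hole = true, .valid = true };
        return false;                   /* block not allocated */
}

int main(void)
{
        for (unsigned int i = 0; i < 8; i++)    /* eight writes into one hole */
                block_is_allocated(i);
        printf("tree lookups: %u\n", tree_lookups);     /* 1 instead of 8 */
        return 0;
}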
Patch 4 makes the shrinker replace the LRU list with a normal list to
track all inodes, and the shrinker then uses a round-robin algorithm to
scan this list. The reason for discarding the LRU list in favor of
round-robin is that we no longer need to spend time maintaining a sorted
LRU list, so we save some scan time. Meanwhile this commit shrinks the
critical section, so the locking becomes more fine-grained and other
processes don't need to wait on the lock for a long time. The
improvement can be seen in test case (b).
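The following is a simplified userspace sketch of such a round-robin
walk with the list lock dropped while reclaiming; the types and helpers
are invented for the example and only approximate what the patch does
in __es_shrink():

#include <pthread.h>
#include <stdio.h>

#define NR_INODES 4

struct toy_inode {
        int nr_reclaimable;             /* objects we could free from this inode */
        struct toy_inode *next;
};

static struct toy_inode inodes[NR_INODES];
static struct toy_inode *head, *tail;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* caller must hold list_lock */
static void enqueue_tail(struct toy_inode *i)
{
        i->next = NULL;
        if (tail)
                tail->next = i;
        else
                head = i;
        tail = i;
}

static int shrink(int nr_to_scan)
{
        int nr_shrunk = 0, nr_to_walk;

        pthread_mutex_lock(&list_lock);
        nr_to_walk = NR_INODES;         /* walk the list at most once */
        while (head && nr_to_walk-- > 0 && nr_to_scan > 0) {
                struct toy_inode *i = head;

                head = i->next;
                if (!head)
                        tail = NULL;
                pthread_mutex_unlock(&list_lock);       /* reclaim outside the lock */

                if (i->nr_reclaimable == 0) {
                        pthread_mutex_lock(&list_lock);
                        enqueue_tail(i);                /* skipped: back to the tail */
                        continue;
                }
                nr_shrunk += i->nr_reclaimable;
                nr_to_scan -= i->nr_reclaimable;
                i->nr_reclaimable = 0;

                pthread_mutex_lock(&list_lock);
        }
        pthread_mutex_unlock(&list_lock);
        return nr_shrunk;
}

int main(void)
{
        pthread_mutex_lock(&list_lock);
        for (int i = 0; i < NR_INODES; i++) {
                inodes[i].nr_reclaimable = i * 10;
                enqueue_tail(&inodes[i]);
        }
        pthread_mutex_unlock(&list_lock);
        printf("shrunk %d objects\n", shrink(25));
        return 0;
}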
Patch 5 uses a list to track all reclaimable objects in an inode in
order to speed up reclaim. Currently, when the shrinker tries to reclaim
some objects, it has to scan the rb-tree of an inode, and this rb-tree
also contains non-reclaimable objects (delayed extents, which ext4 uses
to implement seeking data/hole, finding delayed ranges, etc.). That
means the shrinker wastes time skipping these non-reclaimable objects
during the scan. After applying this patch the shrinker can reclaim
objects directly. The improvement can be seen in test case (a).
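Below is a toy userspace model of this idea (the data structures are
made up for illustration): only the non-delayed entries are threaded
onto a separate reclaim list, so reclaiming never has to step over the
delayed entries in the tree:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct toy_extent {
        unsigned int lblk;
        bool delayed;                   /* delayed extents cannot be reclaimed */
        struct toy_extent *next_reclaimable;
};

struct toy_inode {
        struct toy_extent *tree;        /* stands in for the per-inode rb-tree */
        size_t nr;
        struct toy_extent *reclaim_list;        /* only non-delayed entries */
};

static void insert_extent(struct toy_inode *ino, size_t idx, bool delayed)
{
        struct toy_extent *es = &ino->tree[idx];

        es->lblk = (unsigned int)idx;
        es->delayed = delayed;
        if (!delayed) {                 /* thread it onto the reclaim list */
                es->next_reclaimable = ino->reclaim_list;
                ino->reclaim_list = es;
        }
}

static int reclaim(struct toy_inode *ino, int nr_to_scan)
{
        int nr_shrunk = 0;

        /* walk only reclaimable objects, no skipping of delayed ones */
        while (ino->reclaim_list && nr_to_scan-- > 0) {
                ino->reclaim_list = ino->reclaim_list->next_reclaimable;
                nr_shrunk++;
        }
        return nr_shrunk;
}

int main(void)
{
        struct toy_inode ino = { .tree = calloc(1000, sizeof(struct toy_extent)),
                                 .nr = 1000 };

        for (size_t i = 0; i < ino.nr; i++)
                insert_extent(&ino, i, i % 4 != 0);     /* 3 of 4 are delayed */
        printf("reclaimed %d objects\n", reclaim(&ino, 100));
        free(ino.tree);
        return 0;
}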
Patch 6 improves the list that tracks all reclaimable objects in order
to raise the cache hit ratio. After patch 5 is applied, the extent cache
can be flushed by a streaming workload because we have no way to
recognize which entries should be kept. This commit splits the list into
an active list and an inactive list, and defines a new '_ACCESSED' flag.
When an extent cache entry is accessed, this flag is set. When the
shrinker encounters the flag while scanning the list, the entry is moved
to the tail of the active list. These active entries are only reclaimed
when we are under high memory pressure and the shrinker fails to reclaim
any object in the first round. The improvement can be seen in test
case (c).
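The sketch below models this second-chance scheme in plain userspace C;
the names are invented and details such as locking and the per-inode
lists are omitted, so it only mirrors the policy described above:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct toy_es {
        bool accessed;                  /* set on lookup, models the _ACCESSED flag */
        struct toy_es *next;
};

struct toy_lists {
        struct toy_es *inactive, *active, *active_tail;
};

static void push_active_tail(struct toy_lists *l, struct toy_es *es)
{
        es->next = NULL;
        if (l->active_tail)
                l->active_tail->next = es;
        else
                l->active = es;
        l->active_tail = es;
}

static int scan_inactive(struct toy_lists *l, int nr_to_scan)
{
        int nr_shrunk = 0;

        while (l->inactive && nr_to_scan-- > 0) {
                struct toy_es *es = l->inactive;

                l->inactive = es->next;
                if (es->accessed) {     /* recently used: give it a second chance */
                        es->accessed = false;
                        push_active_tail(l, es);
                } else {
                        free(es);
                        nr_shrunk++;
                }
        }
        return nr_shrunk;
}

static int shrink(struct toy_lists *l, int nr_to_scan)
{
        int nr_shrunk = scan_inactive(l, nr_to_scan);

        if (nr_shrunk == 0) {
                /* last resort under memory pressure: reclaim active entries too */
                while (l->active && nr_to_scan-- > 0) {
                        struct toy_es *es = l->active;

                        l->active = es->next;
                        free(es);
                        nr_shrunk++;
                }
                if (!l->active)
                        l->active_tail = NULL;
        }
        return nr_shrunk;
}

int main(void)
{
        struct toy_lists lists = { 0 };

        for (int i = 0; i < 10; i++) {  /* ten cached extents, half accessed */
                struct toy_es *es = calloc(1, sizeof(*es));

                es->accessed = i % 2;
                es->next = lists.inactive;
                lists.inactive = es;
        }
        printf("first pass reclaimed %d\n", shrink(&lists, 10));
        return 0;
}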
Environment
===========
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2400.000
BogoMIPS: 4799.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-3,8-11
NUMA node1 CPU(s): 4-7,12-15
$ cat /proc/meminfo
MemTotal: 24677988 kB
$ df -ah
/dev/sdb1 453G 51G 380G 12% /mnt/sdb1 (HDD)
Test Case
=========
Test Case (a)
-------------
Test case (a) creates a huge number of extent caches in 500 inodes.
Running this test produces a very big rb-tree in every inode. The
purpose is to measure the latency when the shrinker reclaims objects
from an inode.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
filesize=100000g
size=600000g
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap
[io]
rw=write
numjobs=100
nrfiles=5
Test Case (b)
-------------
Test case (b) creates a very long inode list in the super block so that
we can measure the latency when the shrinker scans the list.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
fallocate=0
direct=0
runtime=300
create_on_open=1
create_serialize=0
create_fsync=0
norandommap
[io]
filesize=100000g
size=600000g
rw=write
numjobs=100
nrfiles=5
openfiles=10000
[rand]
filesize=1000g
size=600000g
rw=write:4k
numjobs=20
nrfiles=20000
Test Case (c)
-------------
Test case (c) measures the cache hit/miss ratio.
*NOTE*
I reduce the memory to 12g in order to make the shrinker work more
aggressively.
Fio script:
[global]
ioengine=psync
bs=4k
directory=/mnt/sdb1
group_reporting
direct=0
runtime=300
[read]
rw=read
numjobs=1
nrfiles=50
filesize=1g
size=600000g
[stream]
filesize=1000g
size=600000g
rw=write
numjobs=100
nrfiles=5
create_on_open=1
create_serialize=0
create_fsync=0
Note 1
------
*To get a very fragmented extent status tree, I use the following patch
to disallow merging of extent caches*
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b6c3366..f8c693e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -351,6 +351,7 @@ static void ext4_es_free_extent(struct inode *inode, struct
extent_status *es)
static int ext4_es_can_be_merged(struct extent_status *es1,
struct extent_status *es2)
{
+#if 0
if (ext4_es_status(es1) != ext4_es_status(es2))
return 0;
@@ -376,6 +377,7 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
/* we need to check delayed extent is without unwritten status */
if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
return 1;
+#endif
return 0;
}
Note 2
------
To cache as much data in memory as possible, two sysctl parameters are
changed:
sudo sysctl vm.dirty_ratio=80
sudo sysctl vm.dirty_background_ratio=60
Result
======
For every test case we collect 5 metrics:
1. The shrinker avg. scan time
2. The shrinker max scan time
3. The extent cache hit/miss
4. The userspace avg. latency
5. The userspace max latency
Because fio is used to run the tests, it is easy to get the userspace
latency. For review, the patch combinations are abbreviated as below:
* baseline: ext4/dev branch + patch 1, 2
* es-cache: baseline + patch 3
* es-slist: es-cache + patch 4
* es-ilist: es-slist + patch 5
* es-gc: es-ilist + patch 6
Every test case is run 3 times.
1. es avg. scan time
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 9402 11552 9633 10195.667 1180.284
+ 3 7069 10358 9931 9119.3333 1788.4301
* 3 3754 6854 4480 5029.3333 1621.3653
% 3 190 243 204 212.33333 27.465129
# 3 269 513 290 357.33333 135.21957
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 23047 36274 30195 29838.667 6620.6958
+ 3 2992 32867 14370 16743 15078.205
* 3 124 1281 173 526 654.30803
% 3 133 157 136 142 13.076697
# 3 124 315 140 193 105.95754
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 137 235 163 178.33333 50.767444
+ 3 57 9333 6383 5257.6667 4739.2853
* 3 48 54 49 50.333333 3.2145503
% 3 52 57 53 54 2.6457513
# 3 49 88 69 68.666667 19.502137
2. es max scan time
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 17503 22339 19329 19723.667 2442.0371
+ 3 26275 33930 31867 30690.667 3960.7545
* 3 170970 199678 188287 186311.67 14455.579
% 3 325 405 382 370.66667 41.186567
# 3 500 857 790 715.66667 189.75335
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 361903 384286 380877 375688.67 12059.8
+ 3 277470 313231 293883 294861.33 17900.562
* 3 112727 131888 114761 119792 10524.695
% 3 59466 76178 67195 67613 8363.8376
# 3 38747 84505 45682 56311.333 24661.421
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 337 462 355 384.66667 67.57465
+ 3 22136 31236 24031 25801 4801.2681
* 3 36105 56888 54150 49047.667 11291.972
% 3 152 191 158 167 21
# 3 115 154 139 136 19.672316
3. cache hit/miss
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 4673 4858 4832 4787.6667 100.15155
+ 3 7911368 8364372 8263210 8179650 237781.12
* 3 6657959 6900137 6830922 6796339.3 124737.79
% 3 6525957 6691967 6535884 6584602.7 93112.627
# 3 10061008 10704728 10501548 10422428 329072.7
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 1645455 1648083 1646239 1646592.3 1349.1588
+ 3 8531336 9081690 8570763 8727929.7 306999.03
* 3 6591288 6643048 6595983 6610106.3 28624.741
% 3 6501434 6556700 6537840 6531991.3 28093.378
# 3 9656351 9801225 9739132 9732236 72682.77
test case (c)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 10873 18200 15614 14895.667 3715.9433
+ 3 1781010 1925331 1888470 1864937 74983.26
* 3 1697486 1929501 1765950 1797645.7 119210.74
% 3 1703674 1870348 1743376 1772466 87061.626
# 3 1687157 1953796 1774275 1805076 135961.82
4. userspace avg. latency (usec)
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 2638.64 2798.02 2649.18 2695.28 89.131384
+ 3 2641.12 2817.16 2648.18 2702.1533 99.661231
* 3 2637.54 2812.82 2642.68 2697.68 99.747279
% 3 2850.76 2859.45 2852.68 2854.2967 4.5650009
# 3 2643.07 2817.62 2653.54 2704.7433 97.894135
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 3561.9 3580.1 3563.47 3568.49 10.085152
+ 3 3596.12 3631.57 3600.2 3609.2967 19.396846
* 3 3565.4 3605.12 3585.48 3585.3333 19.860406
% 3 3577.07 3597.76 3588.12 3587.65 10.353004
# 3 3593.63 3626.3 3602.66 3607.53 16.870682
test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 165.46 200.33 166.31 177.36667 19.891371
+ 3 165.11 218.36 165.16 182.87667 30.729478
* 3 165.14 199.36 165.36 176.62 19.693725
% 3 165.45 197.07 166.32 176.28 18.009922
# 3 165.13 228.35 165.71 186.39667 36.33381
write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 13804.25 15932.47 15148.61 14961.777 1076.3411
+ 3 13807.94 15922.01 15114.67 14948.207 1066.8203
* 3 13808.51 15936.08 15083.47 14942.687 1070.749
% 3 13807.08 15290.5 15093.27 14730.283 805.57632
# 3 13809.13 15972.27 15308.6 15030 1108.1548
5. userspace max latency (usec)
test case (a)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 849941 2720800 941754 1504165 1054636.4
+ 3 1080500 2199500 1090500 1456833.3 643187.63
* 3 863204 2663400 920104 1482236 1023313.6
% 3 879189 1156700 991861 1009250 139570.31
# 3 896181 1692100 984983 1191088 436155.04
test case (b)
-------------
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 1293500 4917500 3935800 3382266.7 1874338.1
+ 3 1027900 3232200 1505500 1921866.7 1159635.9
* 3 1025500 1109300 1033500 1056100 46245.865
% 3 983719 1188300 1119400 1097139.7 104091.25
# 3 869236 1118400 992344 993326.67 124584.91
test case (c)
-------------
read:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 147107 3594200 566804 1436037 1880767.7
+ 3 124889 3586600 402357 1371282 1923531.3
* 3 299597 3578300 327797 1401898 1884872.2
% 3 406603 3587600 729393 1574532 1750822.8
# 3 301480 3583600 997081 1627387 1729463
write:
x baseline
+ es-cache
* es-slist
% es-ilist
# es-gc
N Min Max Median Avg Stddev
x 3 28892400 29715800 29536400 29381533 432994.98
+ 3 28726600 29821200 29530000 29359267 566921.24
* 3 28859400 29681400 29528700 29356500 437219.2
% 3 28472300 29819700 28728200 29006733 715581.79
# 3 28807200 29681900 29516100 29335067 464601.79
Conclusion
==========
From the kernel side, both the max scan time and the avg. scan time are
improved, and the cache hit ratio also increases.
From the userspace side, the max latency of test cases (a) and (b) is
improved; the other results are not outstanding.
As always, any comments or feedback are welcome.
Thanks,
- Zheng
Zheng Liu (6):
ext4: improve extents status tree trace point
ext4: track extent status tree shrinker delay statistics
ext4: cache extent hole in extent status tree for
ext4_da_map_blocks()
ext4: change lru to round-robin in extent status tree shrinker
ext4: use a list to track all reclaimable objects for extent status
tree
ext4: use a garbage collection algorithm to manage object
fs/ext4/ext4.h | 18 +-
fs/ext4/extents.c | 27 ++-
fs/ext4/extents_status.c | 430 ++++++++++++++++++++++++++++++-------------
fs/ext4/extents_status.h | 47 ++++-
fs/ext4/inode.c | 10 +-
fs/ext4/ioctl.c | 4 +-
fs/ext4/super.c | 18 +-
include/trace/events/ext4.h | 59 +++++-
8 files changed, 419 insertions(+), 194 deletions(-)
--
1.7.9.7
From: Zheng Liu <[email protected]>
Currently the extent status tree doesn't cache an extent hole when a
write looks up the extent tree to check whether a block has been
allocated or not. We don't put the hole in the extent cache because the
extent might be removed later and a new delayed extent might be added
back. But this causes a defect when we do a lot of writes: if we don't
put the extent hole in the extent cache, the following writes also need
to access the extent tree to see whether or not the block has been
allocated, which brings cache misses. This commit fixes this defect.
Meanwhile, if an inode has no extent at all, the resulting hole is also
cached.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 4 +---
fs/ext4/extents.c | 23 +++++++++--------------
fs/ext4/inode.c | 6 ++----
include/trace/events/ext4.h | 3 +--
4 files changed, 13 insertions(+), 23 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6f294d3..7df9220 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -565,10 +565,8 @@ enum {
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
/* Do not take i_data_sem locking in ext4_map_blocks */
#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
- /* Do not put hole in extent cache */
-#define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0200
/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 76c2df3..6463d34 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2284,16 +2284,15 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
ext4_lblk_t block)
{
int depth = ext_depth(inode);
- unsigned long len = 0;
- ext4_lblk_t lblock = 0;
+ unsigned long len;
+ ext4_lblk_t lblock;
struct ext4_extent *ex;
ex = path[depth].p_ext;
if (ex == NULL) {
- /*
- * there is no extent yet, so gap is [0;-] and we
- * don't cache it
- */
+ /* there is no extent yet, so gap is [0;-] */
+ lblock = 0;
+ len = EXT_MAX_BLOCKS;
ext_debug("cache gap(whole file):");
} else if (block < le32_to_cpu(ex->ee_block)) {
lblock = block;
@@ -2302,9 +2301,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
block,
le32_to_cpu(ex->ee_block),
ext4_ext_get_actual_len(ex));
- if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
- ext4_es_insert_extent(inode, lblock, len, ~0,
- EXTENT_STATUS_HOLE);
} else if (block >= le32_to_cpu(ex->ee_block)
+ ext4_ext_get_actual_len(ex)) {
ext4_lblk_t next;
@@ -2318,14 +2314,14 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
block);
BUG_ON(next == lblock);
len = next - lblock;
- if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
- ext4_es_insert_extent(inode, lblock, len, ~0,
- EXTENT_STATUS_HOLE);
} else {
BUG();
}
ext_debug(" -> %u:%lu\n", lblock, len);
+ if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
+ ext4_es_insert_extent(inode, lblock, len, ~0,
+ EXTENT_STATUS_HOLE);
}
/*
@@ -4362,8 +4358,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* put just found gap into cache to speed up
* subsequent requests
*/
- if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
- ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
+ ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
goto out2;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 367a60c..d1ad9f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1485,11 +1485,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
map->m_flags |= EXT4_MAP_FROM_CLUSTER;
retval = 0;
} else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- retval = ext4_ext_map_blocks(NULL, inode, map,
- EXT4_GET_BLOCKS_NO_PUT_HOLE);
+ retval = ext4_ext_map_blocks(NULL, inode, map, 0);
else
- retval = ext4_ind_map_blocks(NULL, inode, map,
- EXT4_GET_BLOCKS_NO_PUT_HOLE);
+ retval = ext4_ind_map_blocks(NULL, inode, map, 0);
add_delayed:
if (retval == 0) {
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index ff4bd1b..9337d36 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -43,8 +43,7 @@ struct extent_status;
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
{ EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
- { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" }, \
- { EXT4_GET_BLOCKS_NO_PUT_HOLE, "NO_PUT_HOLE" })
+ { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" })
#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
1.7.9.7
From: Zheng Liu <[email protected]>
This commit discards the LRU algorithm because maintaining an LRU list
in the extent status tree shrinker takes a long time, and the shrinker
also takes a long time to scan this LRU list in order to reclaim some
objects.
To reduce the latency, this commit does two things. The first is to
replace the LRU with round-robin scanning. After that we no longer need
to maintain an LRU list, which means the list does not have to be
sorted when the shrinker cannot reclaim any objects in the first round.
The second is to shrink the length of the list. With the round-robin
algorithm, the shrinker takes the first inode in the list and handles
it. If this inode is skipped, it is moved to the tail of the list;
otherwise it is removed and added back when it is touched again.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 10 +-
fs/ext4/extents.c | 4 +-
fs/ext4/extents_status.c | 216 +++++++++++++++++++------------------------
fs/ext4/extents_status.h | 7 +-
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 4 +-
fs/ext4/super.c | 7 +-
include/trace/events/ext4.h | 11 +--
8 files changed, 115 insertions(+), 148 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7df9220..559af1b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -887,10 +887,9 @@ struct ext4_inode_info {
/* extents status tree */
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
- struct list_head i_es_lru;
+ struct list_head i_es_list;
unsigned int i_es_all_nr; /* protected by i_es_lock */
- unsigned int i_es_lru_nr; /* protected by i_es_lock */
- unsigned long i_touch_when; /* jiffies of last accessing */
+ unsigned int i_es_shk_nr; /* protected by i_es_lock */
/* ialloc */
ext4_group_t i_last_alloc_group;
@@ -1328,10 +1327,11 @@ struct ext4_sb_info {
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
- struct list_head s_es_lru;
+ struct list_head s_es_list;
+ long s_es_nr_inode;
struct ext4_es_stats s_es_stats;
struct mb_cache *s_mb_cache;
- spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
+ spinlock_t s_es_lock ____cacheline_aligned_in_smp;
/* Ratelimit ext4 messages. */
struct ratelimit_state s_err_ratelimit_state;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6463d34..61455d1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4621,7 +4621,7 @@ out2:
trace_ext4_ext_map_blocks_exit(inode, flags, map,
err ? err : allocated);
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
return err ? err : allocated;
}
@@ -5173,7 +5173,7 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
error = ext4_fill_fiemap_extents(inode, start_blk,
len_blks, fieinfo);
}
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
return error;
}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index b6c3366..8d76fd5 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -149,8 +149,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
int nr_to_scan);
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
- struct ext4_inode_info *locked_ei);
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+ struct ext4_inode_info *locked_ei);
int __init ext4_init_es(void)
{
@@ -314,9 +314,9 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
* We don't count delayed extent because we never try to reclaim them
*/
if (!ext4_es_is_delayed(es)) {
- EXT4_I(inode)->i_es_lru_nr++;
+ EXT4_I(inode)->i_es_shk_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->
- s_es_stats.es_stats_lru_cnt);
+ s_es_stats.es_stats_shk_cnt);
}
EXT4_I(inode)->i_es_all_nr++;
@@ -330,12 +330,12 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
EXT4_I(inode)->i_es_all_nr--;
percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
- /* Decrease the lru counter when this es is not delayed */
+ /* Decrease the shrink counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
- BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
- EXT4_I(inode)->i_es_lru_nr--;
+ BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
+ EXT4_I(inode)->i_es_shk_nr--;
percpu_counter_dec(&EXT4_SB(inode->i_sb)->
- s_es_stats.es_stats_lru_cnt);
+ s_es_stats.es_stats_shk_cnt);
}
kmem_cache_free(ext4_es_cachep, es);
@@ -685,8 +685,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
goto error;
retry:
err = __es_insert_extent(inode, &newes);
- if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
- EXT4_I(inode)))
+ if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
+ 1, EXT4_I(inode)))
goto retry;
if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
err = 0;
@@ -843,8 +843,8 @@ retry:
es->es_lblk = orig_es.es_lblk;
es->es_len = orig_es.es_len;
if ((err == -ENOMEM) &&
- __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
- EXT4_I(inode)))
+ __es_shrink(EXT4_SB(inode->i_sb),
+ 1, EXT4_I(inode)))
goto retry;
goto out;
}
@@ -923,114 +923,117 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
return err;
}
-static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
- struct list_head *b)
+static inline void __ext4_es_list_add(struct ext4_sb_info *sbi,
+ struct ext4_inode_info *ei)
{
- struct ext4_inode_info *eia, *eib;
- eia = list_entry(a, struct ext4_inode_info, i_es_lru);
- eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+ if (list_empty(&ei->i_es_list)) {
+ list_add_tail(&ei->i_es_list, &sbi->s_es_list);
+ sbi->s_es_nr_inode++;
+ }
+}
- if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
- !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
- return 1;
- if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
- ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
- return -1;
- if (eia->i_touch_when == eib->i_touch_when)
- return 0;
- if (time_after(eia->i_touch_when, eib->i_touch_when))
- return 1;
- else
- return -1;
+void ext4_es_list_add(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ if (!list_empty(&ei->i_es_list))
+ return;
+
+ spin_lock(&sbi->s_es_lock);
+ __ext4_es_list_add(sbi, ei);
+ spin_unlock(&sbi->s_es_lock);
}
-static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
- struct ext4_inode_info *locked_ei)
+void ext4_es_list_del(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ spin_lock(&sbi->s_es_lock);
+ if (!list_empty(&ei->i_es_list)) {
+ list_del_init(&ei->i_es_list);
+ WARN_ON_ONCE(sbi->s_es_nr_inode-- < 0);
+ }
+ spin_unlock(&sbi->s_es_lock);
+}
+
+static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
+ struct ext4_inode_info *locked_ei)
{
struct ext4_inode_info *ei;
struct ext4_es_stats *es_stats;
- struct list_head *cur, *tmp;
- LIST_HEAD(skipped);
ktime_t start_time;
u64 scan_time;
+ int nr_to_walk;
int nr_shrunk = 0;
- int retried = 0, skip_precached = 1, nr_skipped = 0;
+ int retried = 0, nr_skipped = 0;
es_stats = &sbi->s_es_stats;
start_time = ktime_get();
- spin_lock(&sbi->s_es_lru_lock);
retry:
- list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
+ spin_lock(&sbi->s_es_lock);
+ nr_to_walk = sbi->s_es_nr_inode;
+ while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
int shrunk;
- /*
- * If we have already reclaimed all extents from extent
- * status tree, just stop the loop immediately.
- */
- if (percpu_counter_read_positive(
- &es_stats->es_stats_lru_cnt) == 0)
- break;
+ ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
+ i_es_list);
- ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
+ list_del_init(&ei->i_es_list);
+ sbi->s_es_nr_inode--;
+ spin_unlock(&sbi->s_es_lock);
/*
- * Skip the inode that is newer than the last_sorted
- * time. Normally we try hard to avoid shrinking
- * precached inodes, but we will as a last resort.
+ * Normally we try hard to avoid shrinking precached inodes,
+ * but we will as a last resort.
*/
- if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
- (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
- EXT4_STATE_EXT_PRECACHED))) {
+ if (!retried && ext4_test_inode_state(&ei->vfs_inode,
+ EXT4_STATE_EXT_PRECACHED)) {
nr_skipped++;
- list_move_tail(cur, &skipped);
+ spin_lock(&sbi->s_es_lock);
+ __ext4_es_list_add(sbi, ei);
+ continue;
+ }
+
+ if (ei->i_es_shk_nr == 0) {
+ spin_lock(&sbi->s_es_lock);
continue;
}
- if (ei->i_es_lru_nr == 0 || ei == locked_ei ||
- !write_trylock(&ei->i_es_lock))
+ if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
+ nr_skipped++;
+ spin_lock(&sbi->s_es_lock);
+ __ext4_es_list_add(sbi, ei);
continue;
+ }
shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
- if (ei->i_es_lru_nr == 0)
- list_del_init(&ei->i_es_lru);
write_unlock(&ei->i_es_lock);
nr_shrunk += shrunk;
nr_to_scan -= shrunk;
+
if (nr_to_scan == 0)
- break;
+ goto out;
+ spin_lock(&sbi->s_es_lock);
}
-
- /* Move the newer inodes into the tail of the LRU list. */
- list_splice_tail(&skipped, &sbi->s_es_lru);
- INIT_LIST_HEAD(&skipped);
+ spin_unlock(&sbi->s_es_lock);
/*
* If we skipped any inodes, and we weren't able to make any
- * forward progress, sort the list and try again.
+ * forward progress, try again to scan precached inodes.
*/
if ((nr_shrunk == 0) && nr_skipped && !retried) {
retried++;
- list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
- es_stats->es_stats_last_sorted = jiffies;
- ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
- i_es_lru);
- /*
- * If there are no non-precached inodes left on the
- * list, start releasing precached extents.
- */
- if (ext4_test_inode_state(&ei->vfs_inode,
- EXT4_STATE_EXT_PRECACHED))
- skip_precached = 0;
goto retry;
}
- spin_unlock(&sbi->s_es_lru_lock);
-
if (locked_ei && nr_shrunk == 0)
nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+out:
scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
if (likely(es_stats->es_stats_scan_time))
es_stats->es_stats_scan_time = (scan_time +
@@ -1045,7 +1048,7 @@ retry:
else
es_stats->es_stats_shrunk = nr_shrunk;
- trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+ trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time,
nr_skipped, retried);
return nr_shrunk;
}
@@ -1057,7 +1060,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
struct ext4_sb_info *sbi;
sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
- nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+ nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}
@@ -1070,13 +1073,13 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk;
- ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
+ ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
if (!nr_to_scan)
return ret;
- nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
+ nr_shrunk = __es_shrink(sbi, nr_to_scan, NULL);
trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
return nr_shrunk;
@@ -1104,28 +1107,24 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
return 0;
/* here we just find an inode that has the max nr. of objects */
- spin_lock(&sbi->s_es_lru_lock);
- list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+ spin_lock(&sbi->s_es_lock);
+ list_for_each_entry(ei, &sbi->s_es_list, i_es_list) {
inode_cnt++;
if (max && max->i_es_all_nr < ei->i_es_all_nr)
max = ei;
else if (!max)
max = ei;
}
- spin_unlock(&sbi->s_es_lru_lock);
+ spin_unlock(&sbi->s_es_lock);
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
- percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+ percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
seq_printf(seq, " %lu/%lu cache hits/misses\n",
es_stats->es_stats_cache_hits,
es_stats->es_stats_cache_misses);
- if (es_stats->es_stats_last_sorted != 0)
- seq_printf(seq, " %u ms last sorted interval\n",
- jiffies_to_msecs(jiffies -
- es_stats->es_stats_last_sorted));
if (inode_cnt)
- seq_printf(seq, " %d inodes on lru list\n", inode_cnt);
+ seq_printf(seq, " %d inodes on list\n", inode_cnt);
seq_printf(seq, "average:\n %llu us scan time\n",
div_u64(es_stats->es_stats_scan_time, 1000));
@@ -1134,7 +1133,7 @@ static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq,
"maximum:\n %lu inode (%u objects, %u reclaimable)\n"
" %llu us max scan time\n",
- max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+ max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_shk_nr,
div_u64(es_stats->es_stats_max_scan_time, 1000));
return 0;
@@ -1183,9 +1182,9 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
{
int err;
- INIT_LIST_HEAD(&sbi->s_es_lru);
- spin_lock_init(&sbi->s_es_lru_lock);
- sbi->s_es_stats.es_stats_last_sorted = 0;
+ INIT_LIST_HEAD(&sbi->s_es_list);
+ sbi->s_es_nr_inode = 0;
+ spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
sbi->s_es_stats.es_stats_cache_hits = 0;
sbi->s_es_stats.es_stats_cache_misses = 0;
@@ -1194,7 +1193,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0);
if (err)
return err;
- err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0);
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_shk_cnt, 0);
if (err)
goto err1;
@@ -1212,7 +1211,7 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
return 0;
err2:
- percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
return err;
@@ -1223,37 +1222,10 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
if (sbi->s_proc)
remove_proc_entry("es_shrinker_info", sbi->s_proc);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
- percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
unregister_shrinker(&sbi->s_es_shrinker);
}
-void ext4_es_lru_add(struct inode *inode)
-{
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
- ei->i_touch_when = jiffies;
-
- if (!list_empty(&ei->i_es_lru))
- return;
-
- spin_lock(&sbi->s_es_lru_lock);
- if (list_empty(&ei->i_es_lru))
- list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
- spin_unlock(&sbi->s_es_lru_lock);
-}
-
-void ext4_es_lru_del(struct inode *inode)
-{
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-
- spin_lock(&sbi->s_es_lru_lock);
- if (!list_empty(&ei->i_es_lru))
- list_del_init(&ei->i_es_lru);
- spin_unlock(&sbi->s_es_lru_lock);
-}
-
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
int nr_to_scan)
{
@@ -1265,7 +1237,7 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- if (ei->i_es_lru_nr == 0)
+ if (ei->i_es_shk_nr == 0)
return 0;
if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index efd5f97..0e6a33e 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -65,14 +65,13 @@ struct ext4_es_tree {
};
struct ext4_es_stats {
- unsigned long es_stats_last_sorted;
unsigned long es_stats_shrunk;
unsigned long es_stats_cache_hits;
unsigned long es_stats_cache_misses;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
struct percpu_counter es_stats_all_cnt;
- struct percpu_counter es_stats_lru_cnt;
+ struct percpu_counter es_stats_shk_cnt;
};
extern int __init ext4_init_es(void);
@@ -151,7 +150,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
-extern void ext4_es_lru_add(struct inode *inode);
-extern void ext4_es_lru_del(struct inode *inode);
+extern void ext4_es_list_add(struct inode *inode);
+extern void ext4_es_list_del(struct inode *inode);
#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d1ad9f9..157ca44 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -494,7 +494,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
map->m_pblk = ext4_es_pblock(&es) +
map->m_lblk - es.es_lblk;
@@ -1431,7 +1431,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, iblock, &es)) {
- ext4_es_lru_add(inode);
+ ext4_es_list_add(inode);
if (ext4_es_is_hole(&es)) {
retval = 0;
down_read(&EXT4_I(inode)->i_data_sem);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f2252e..25c9ef0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -78,8 +78,8 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
memswap(&ei1->i_disksize, &ei2->i_disksize, sizeof(ei1->i_disksize));
ext4_es_remove_extent(inode1, 0, EXT_MAX_BLOCKS);
ext4_es_remove_extent(inode2, 0, EXT_MAX_BLOCKS);
- ext4_es_lru_del(inode1);
- ext4_es_lru_del(inode2);
+ ext4_es_list_del(inode1);
+ ext4_es_list_del(inode2);
isize = i_size_read(inode1);
i_size_write(inode1, i_size_read(inode2));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7a06eb4..b03ef81 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -883,10 +883,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
spin_lock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
- INIT_LIST_HEAD(&ei->i_es_lru);
+ INIT_LIST_HEAD(&ei->i_es_list);
ei->i_es_all_nr = 0;
- ei->i_es_lru_nr = 0;
- ei->i_touch_when = 0;
+ ei->i_es_shk_nr = 0;
ei->i_reserved_data_blocks = 0;
ei->i_reserved_meta_blocks = 0;
ei->i_allocated_meta_blocks = 0;
@@ -975,7 +974,7 @@ void ext4_clear_inode(struct inode *inode)
dquot_drop(inode);
ext4_discard_preallocations(inode);
ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
- ext4_es_lru_del(inode);
+ ext4_es_list_del(inode);
if (EXT4_I(inode)->jinode) {
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
EXT4_I(inode)->jinode);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9337d36..27918ca 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2451,15 +2451,14 @@ TRACE_EVENT(ext4_collapse_range,
TRACE_EVENT(ext4_es_shrink,
TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
- int skip_precached, int nr_skipped, int retried),
+ int nr_skipped, int retried),
- TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+ TP_ARGS(sb, nr_shrunk, scan_time, nr_skipped, retried),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( int, nr_shrunk )
__field( unsigned long long, scan_time )
- __field( int, skip_precached )
__field( int, nr_skipped )
__field( int, retried )
),
@@ -2468,16 +2467,14 @@ TRACE_EVENT(ext4_es_shrink,
__entry->dev = sb->s_dev;
__entry->nr_shrunk = nr_shrunk;
__entry->scan_time = div_u64(scan_time, 1000);
- __entry->skip_precached = skip_precached;
__entry->nr_skipped = nr_skipped;
__entry->retried = retried;
),
- TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+ TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu "
"nr_skipped %d retried %d",
MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
- __entry->scan_time, __entry->skip_precached,
- __entry->nr_skipped, __entry->retried)
+ __entry->scan_time, __entry->nr_skipped, __entry->retried)
);
#endif /* _TRACE_EXT4_H */
--
1.7.9.7
From: Zheng Liu <[email protected]>
This commit improves the trace points of the extent status tree. We
rename trace_ext4_es_shrink_enter in ext4_es_count() because it is also
used in ext4_es_scan() and we cannot tell them apart in the trace
output. Further, this commit fixes a variable name in a trace point in
order to keep it consistent with the others.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 6 +++---
include/trace/events/ext4.h | 28 ++++++++++++++++++++--------
2 files changed, 23 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 0b7e28e..47d60dc 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1021,7 +1021,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
+ trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}
@@ -1034,14 +1034,14 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int ret, nr_shrunk;
ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
- trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);
+ trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
if (!nr_to_scan)
return ret;
nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);
- trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
+ trace_ext4_es_shrink_scan_exit(sbi->s_sb, nr_shrunk, ret);
return nr_shrunk;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d4f70a7..849aaba 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2369,7 +2369,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
show_extent_status(__entry->found ? __entry->status : 0))
);
-TRACE_EVENT(ext4_es_shrink_enter,
+DECLARE_EVENT_CLASS(ext4__es_shrink_enter,
TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
TP_ARGS(sb, nr_to_scan, cache_cnt),
@@ -2391,26 +2391,38 @@ TRACE_EVENT(ext4_es_shrink_enter,
__entry->nr_to_scan, __entry->cache_cnt)
);
-TRACE_EVENT(ext4_es_shrink_exit,
- TP_PROTO(struct super_block *sb, int shrunk_nr, int cache_cnt),
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_count,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+DEFINE_EVENT(ext4__es_shrink_enter, ext4_es_shrink_scan_enter,
+ TP_PROTO(struct super_block *sb, int nr_to_scan, int cache_cnt),
+
+ TP_ARGS(sb, nr_to_scan, cache_cnt)
+);
+
+TRACE_EVENT(ext4_es_shrink_scan_exit,
+ TP_PROTO(struct super_block *sb, int nr_shrunk, int cache_cnt),
- TP_ARGS(sb, shrunk_nr, cache_cnt),
+ TP_ARGS(sb, nr_shrunk, cache_cnt),
TP_STRUCT__entry(
__field( dev_t, dev )
- __field( int, shrunk_nr )
+ __field( int, nr_shrunk )
__field( int, cache_cnt )
),
TP_fast_assign(
__entry->dev = sb->s_dev;
- __entry->shrunk_nr = shrunk_nr;
+ __entry->nr_shrunk = nr_shrunk;
__entry->cache_cnt = cache_cnt;
),
- TP_printk("dev %d,%d shrunk_nr %d cache_cnt %d",
+ TP_printk("dev %d,%d nr_shrunk %d cache_cnt %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
- __entry->shrunk_nr, __entry->cache_cnt)
+ __entry->nr_shrunk, __entry->cache_cnt)
);
TRACE_EVENT(ext4_collapse_range,
--
1.7.9.7
From: Zheng Liu <[email protected]>
This commit adds some statistics to the extent status tree shrinker.
The purpose is to collect more details when we encounter a stall caused
by the extent status tree shrinker. Here we count the following
statistics:
stats:
the number of all objects on all extent status trees
the number of reclaimable objects on lru list
cache hits/misses
the last sorted interval
the number of inodes on lru list
average:
scan time for shrinking some objects
the number of shrunk objects
maximum:
the inode that has max nr. of objects on lru list
the maximum scan time for shrinking some objects
The output looks like this:
$ cat /proc/fs/ext4/sda1/es_shrinker_info
stats:
28228 objects
6341 reclaimable objects
5281/631 cache hits/misses
586 ms last sorted interval
250 inodes on lru list
average:
153 us scan time
128 shrunk objects
maximum:
255 inode (255 objects, 198 reclaimable)
125723 us max scan time
If the lru list has never been sorted, the following line will not be
printed:
586 ms last sorted interval
If the lru list is empty, the following lines will not be printed
either:
250 inodes on lru list
...
maximum:
255 inode (255 objects, 198 reclaimable)
0 us max scan time
Meanwhile, this commit defines a new trace point to print some details
from __ext4_es_shrink().
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 4 +-
fs/ext4/extents_status.c | 186 ++++++++++++++++++++++++++++++++++++++++---
fs/ext4/extents_status.h | 13 ++-
fs/ext4/super.c | 11 +--
include/trace/events/ext4.h | 31 ++++++++
5 files changed, 224 insertions(+), 21 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5b19760..6f294d3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -890,6 +890,7 @@ struct ext4_inode_info {
struct ext4_es_tree i_es_tree;
rwlock_t i_es_lock;
struct list_head i_es_lru;
+ unsigned int i_es_all_nr; /* protected by i_es_lock */
unsigned int i_es_lru_nr; /* protected by i_es_lock */
unsigned long i_touch_when; /* jiffies of last accessing */
@@ -1330,8 +1331,7 @@ struct ext4_sb_info {
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
struct list_head s_es_lru;
- unsigned long s_es_last_sorted;
- struct percpu_counter s_extent_cache_cnt;
+ struct ext4_es_stats s_es_stats;
struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 47d60dc..b6c3366 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -11,6 +11,8 @@
*/
#include <linux/rbtree.h>
#include <linux/list_sort.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
#include "ext4.h"
#include "extents_status.h"
@@ -313,19 +315,27 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
*/
if (!ext4_es_is_delayed(es)) {
EXT4_I(inode)->i_es_lru_nr++;
- percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_inc(&EXT4_SB(inode->i_sb)->
+ s_es_stats.es_stats_lru_cnt);
}
+ EXT4_I(inode)->i_es_all_nr++;
+ percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
return es;
}
static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
+ EXT4_I(inode)->i_es_all_nr--;
+ percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
+
/* Decrease the lru counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
EXT4_I(inode)->i_es_lru_nr--;
- percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
+ percpu_counter_dec(&EXT4_SB(inode->i_sb)->
+ s_es_stats.es_stats_lru_cnt);
}
kmem_cache_free(ext4_es_cachep, es);
@@ -731,6 +741,7 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
struct extent_status *es)
{
struct ext4_es_tree *tree;
+ struct ext4_es_stats *stats;
struct extent_status *es1 = NULL;
struct rb_node *node;
int found = 0;
@@ -767,11 +778,15 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
}
out:
+ stats = &EXT4_SB(inode->i_sb)->s_es_stats;
if (found) {
BUG_ON(!es1);
es->es_lblk = es1->es_lblk;
es->es_len = es1->es_len;
es->es_pblk = es1->es_pblk;
+ stats->es_stats_cache_hits++;
+ } else {
+ stats->es_stats_cache_misses++;
}
read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -933,11 +948,16 @@ static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei)
{
struct ext4_inode_info *ei;
+ struct ext4_es_stats *es_stats;
struct list_head *cur, *tmp;
LIST_HEAD(skipped);
+ ktime_t start_time;
+ u64 scan_time;
int nr_shrunk = 0;
int retried = 0, skip_precached = 1, nr_skipped = 0;
+ es_stats = &sbi->s_es_stats;
+ start_time = ktime_get();
spin_lock(&sbi->s_es_lru_lock);
retry:
@@ -948,7 +968,8 @@ retry:
* If we have already reclaimed all extents from extent
* status tree, just stop the loop immediately.
*/
- if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
+ if (percpu_counter_read_positive(
+ &es_stats->es_stats_lru_cnt) == 0)
break;
ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
@@ -958,7 +979,7 @@ retry:
* time. Normally we try hard to avoid shrinking
* precached inodes, but we will as a last resort.
*/
- if ((sbi->s_es_last_sorted < ei->i_touch_when) ||
+ if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
(skip_precached && ext4_test_inode_state(&ei->vfs_inode,
EXT4_STATE_EXT_PRECACHED))) {
nr_skipped++;
@@ -992,7 +1013,7 @@ retry:
if ((nr_shrunk == 0) && nr_skipped && !retried) {
retried++;
list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
- sbi->s_es_last_sorted = jiffies;
+ es_stats->es_stats_last_sorted = jiffies;
ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
i_es_lru);
/*
@@ -1010,6 +1031,22 @@ retry:
if (locked_ei && nr_shrunk == 0)
nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+ scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+ if (likely(es_stats->es_stats_scan_time))
+ es_stats->es_stats_scan_time = (scan_time +
+ es_stats->es_stats_scan_time*3) / 4;
+ else
+ es_stats->es_stats_scan_time = scan_time;
+ if (scan_time > es_stats->es_stats_max_scan_time)
+ es_stats->es_stats_max_scan_time = scan_time;
+ if (likely(es_stats->es_stats_shrunk))
+ es_stats->es_stats_shrunk = (nr_shrunk +
+ es_stats->es_stats_shrunk*3) / 4;
+ else
+ es_stats->es_stats_shrunk = nr_shrunk;
+
+ trace_ext4_es_shrink(sbi->s_sb, nr_shrunk, scan_time, skip_precached,
+ nr_skipped, retried);
return nr_shrunk;
}
@@ -1020,7 +1057,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
struct ext4_sb_info *sbi;
sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
- nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+ nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
return nr;
}
@@ -1033,7 +1070,7 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
int nr_to_scan = sc->nr_to_scan;
int ret, nr_shrunk;
- ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
+ ret = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_lru_cnt);
trace_ext4_es_shrink_scan_enter(sbi->s_sb, nr_to_scan, ret);
if (!nr_to_scan)
@@ -1045,19 +1082,148 @@ static unsigned long ext4_es_scan(struct shrinker *shrink,
return nr_shrunk;
}
-void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
+static void *ext4_es_seq_shrinker_info_start(struct seq_file *seq, loff_t *pos)
+{
+ return *pos ? NULL : SEQ_START_TOKEN;
+}
+
+static void *
+ext4_es_seq_shrinker_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ return NULL;
+}
+
+static int ext4_es_seq_shrinker_info_show(struct seq_file *seq, void *v)
+{
+ struct ext4_sb_info *sbi = seq->private;
+ struct ext4_es_stats *es_stats = &sbi->s_es_stats;
+ struct ext4_inode_info *ei, *max = NULL;
+ unsigned int inode_cnt = 0;
+
+ if (v != SEQ_START_TOKEN)
+ return 0;
+
+ /* here we just find an inode that has the max nr. of objects */
+ spin_lock(&sbi->s_es_lru_lock);
+ list_for_each_entry(ei, &sbi->s_es_lru, i_es_lru) {
+ inode_cnt++;
+ if (max && max->i_es_all_nr < ei->i_es_all_nr)
+ max = ei;
+ else if (!max)
+ max = ei;
+ }
+ spin_unlock(&sbi->s_es_lru_lock);
+
+ seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
+ percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
+ percpu_counter_sum_positive(&es_stats->es_stats_lru_cnt));
+ seq_printf(seq, " %lu/%lu cache hits/misses\n",
+ es_stats->es_stats_cache_hits,
+ es_stats->es_stats_cache_misses);
+ if (es_stats->es_stats_last_sorted != 0)
+ seq_printf(seq, " %u ms last sorted interval\n",
+ jiffies_to_msecs(jiffies -
+ es_stats->es_stats_last_sorted));
+ if (inode_cnt)
+ seq_printf(seq, " %d inodes on lru list\n", inode_cnt);
+
+ seq_printf(seq, "average:\n %llu us scan time\n",
+ div_u64(es_stats->es_stats_scan_time, 1000));
+ seq_printf(seq, " %lu shrunk objects\n", es_stats->es_stats_shrunk);
+ if (inode_cnt)
+ seq_printf(seq,
+ "maximum:\n %lu inode (%u objects, %u reclaimable)\n"
+ " %llu us max scan time\n",
+ max->vfs_inode.i_ino, max->i_es_all_nr, max->i_es_lru_nr,
+ div_u64(es_stats->es_stats_max_scan_time, 1000));
+
+ return 0;
+}
+
+static void ext4_es_seq_shrinker_info_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations ext4_es_seq_shrinker_info_ops = {
+ .start = ext4_es_seq_shrinker_info_start,
+ .next = ext4_es_seq_shrinker_info_next,
+ .stop = ext4_es_seq_shrinker_info_stop,
+ .show = ext4_es_seq_shrinker_info_show,
+};
+
+static int
+ext4_es_seq_shrinker_info_open(struct inode *inode, struct file *file)
+{
+ int ret;
+
+ ret = seq_open(file, &ext4_es_seq_shrinker_info_ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = PDE_DATA(inode);
+ }
+
+ return ret;
+}
+
+static int
+ext4_es_seq_shrinker_info_release(struct inode *inode, struct file *file)
+{
+ return seq_release(inode, file);
+}
+
+static const struct file_operations ext4_es_seq_shrinker_info_fops = {
+ .owner = THIS_MODULE,
+ .open = ext4_es_seq_shrinker_info_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = ext4_es_seq_shrinker_info_release,
+};
+
+int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
{
+ int err;
+
INIT_LIST_HEAD(&sbi->s_es_lru);
spin_lock_init(&sbi->s_es_lru_lock);
- sbi->s_es_last_sorted = 0;
+ sbi->s_es_stats.es_stats_last_sorted = 0;
+ sbi->s_es_stats.es_stats_shrunk = 0;
+ sbi->s_es_stats.es_stats_cache_hits = 0;
+ sbi->s_es_stats.es_stats_cache_misses = 0;
+ sbi->s_es_stats.es_stats_scan_time = 0;
+ sbi->s_es_stats.es_stats_max_scan_time = 0;
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0);
+ if (err)
+ return err;
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_lru_cnt, 0);
+ if (err)
+ goto err1;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
- register_shrinker(&sbi->s_es_shrinker);
+ err = register_shrinker(&sbi->s_es_shrinker);
+ if (err)
+ goto err2;
+
+ if (sbi->s_proc)
+ proc_create_data("es_shrinker_info", S_IRUGO, sbi->s_proc,
+ &ext4_es_seq_shrinker_info_fops, sbi);
+
+ return 0;
+
+err2:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
+err1:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+ return err;
}
void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
+ if (sbi->s_proc)
+ remove_proc_entry("es_shrinker_info", sbi->s_proc);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_lru_cnt);
unregister_shrinker(&sbi->s_es_shrinker);
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f1b62a4..efd5f97 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -64,6 +64,17 @@ struct ext4_es_tree {
struct extent_status *cache_es; /* recently accessed extent */
};
+struct ext4_es_stats {
+ unsigned long es_stats_last_sorted;
+ unsigned long es_stats_shrunk;
+ unsigned long es_stats_cache_hits;
+ unsigned long es_stats_cache_misses;
+ u64 es_stats_scan_time;
+ u64 es_stats_max_scan_time;
+ struct percpu_counter es_stats_all_cnt;
+ struct percpu_counter es_stats_lru_cnt;
+};
+
extern int __init ext4_init_es(void);
extern void ext4_exit_es(void);
extern void ext4_es_init_tree(struct ext4_es_tree *tree);
@@ -138,7 +149,7 @@ static inline void ext4_es_store_pblock_status(struct extent_status *es,
(pb & ~ES_MASK));
}
-extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
+extern int ext4_es_register_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
extern void ext4_es_lru_add(struct inode *inode);
extern void ext4_es_lru_del(struct inode *inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 32b43ad..7a06eb4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -820,7 +820,6 @@ static void ext4_put_super(struct super_block *sb)
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
- percpu_counter_destroy(&sbi->s_extent_cache_cnt);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < MAXQUOTAS; i++)
@@ -885,6 +884,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
INIT_LIST_HEAD(&ei->i_es_lru);
+ ei->i_es_all_nr = 0;
ei->i_es_lru_nr = 0;
ei->i_touch_when = 0;
ei->i_reserved_data_blocks = 0;
@@ -3889,12 +3889,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_err_report.data = (unsigned long) sb;
/* Register extent status tree shrinker */
- ext4_es_register_shrinker(sbi);
-
- if ((err = percpu_counter_init(&sbi->s_extent_cache_cnt, 0)) != 0) {
- ext4_msg(sb, KERN_ERR, "insufficient memory");
+ if (ext4_es_register_shrinker(sbi))
goto failed_mount3;
- }
sbi->s_stripe = ext4_get_stripe_size(sbi);
sbi->s_extent_max_zeroout_kb = 32;
@@ -4224,10 +4220,9 @@ failed_mount_wq:
jbd2_journal_destroy(sbi->s_journal);
sbi->s_journal = NULL;
}
-failed_mount3:
ext4_es_unregister_shrinker(sbi);
+failed_mount3:
del_timer_sync(&sbi->s_err_report);
- percpu_counter_destroy(&sbi->s_extent_cache_cnt);
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
failed_mount2:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 849aaba..ff4bd1b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2450,6 +2450,37 @@ TRACE_EVENT(ext4_collapse_range,
__entry->offset, __entry->len)
);
+TRACE_EVENT(ext4_es_shrink,
+ TP_PROTO(struct super_block *sb, int nr_shrunk, u64 scan_time,
+ int skip_precached, int nr_skipped, int retried),
+
+ TP_ARGS(sb, nr_shrunk, scan_time, skip_precached, nr_skipped, retried),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( int, nr_shrunk )
+ __field( unsigned long long, scan_time )
+ __field( int, skip_precached )
+ __field( int, nr_skipped )
+ __field( int, retried )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nr_shrunk = nr_shrunk;
+ __entry->scan_time = div_u64(scan_time, 1000);
+ __entry->skip_precached = skip_precached;
+ __entry->nr_skipped = nr_skipped;
+ __entry->retried = retried;
+ ),
+
+ TP_printk("dev %d,%d nr_shrunk %d, scan_time %llu skip_precached %d "
+ "nr_skipped %d retried %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->nr_shrunk,
+ __entry->scan_time, __entry->skip_precached,
+ __entry->nr_skipped, __entry->retried)
+);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
1.7.9.7
From: Zheng Liu <[email protected]>
Currently the shrinker can take a long time to reclaim objects from an
extent status tree because the tree may contain a lot of delayed objects.
These objects cannot be reclaimed because ext4 uses them for seeking
data/hole, finding delayed ranges, and other work. If an rb-tree has a
large number of delayed objects, the shrinker has to scan more objects
and the latency becomes high. This commit uses a list to track all
reclaimable objects in order to reduce the latency when the shrinker
tries to reclaim objects from an extent status tree.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 74 ++++++++++++++++++++++++++--------------------
fs/ext4/extents_status.h | 2 ++
2 files changed, 44 insertions(+), 32 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 8d76fd5..2f81d1e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -148,7 +148,8 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
- int nr_to_scan);
+ struct list_head *freeable,
+ int *nr_to_scan);
static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei);
@@ -171,6 +172,7 @@ void ext4_exit_es(void)
void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
+ INIT_LIST_HEAD(&tree->list);
tree->cache_es = NULL;
}
@@ -302,6 +304,7 @@ static struct extent_status *
ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
ext4_fsblk_t pblk)
{
+ struct ext4_inode_info *ei = EXT4_I(inode);
struct extent_status *es;
es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
if (es == NULL)
@@ -314,12 +317,13 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
* We don't count delayed extent because we never try to reclaim them
*/
if (!ext4_es_is_delayed(es)) {
- EXT4_I(inode)->i_es_shk_nr++;
+ list_add_tail(&es->list, &ei->i_es_tree.list);
+ ei->i_es_shk_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_shk_cnt);
}
- EXT4_I(inode)->i_es_all_nr++;
+ ei->i_es_all_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
return es;
@@ -327,13 +331,15 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
- EXT4_I(inode)->i_es_all_nr--;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ ei->i_es_all_nr--;
percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
/* Decrease the shrink counter when this es is not delayed */
if (!ext4_es_is_delayed(es)) {
- BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
- EXT4_I(inode)->i_es_shk_nr--;
+ list_del(&es->list);
+ BUG_ON(ei->i_es_shk_nr-- == 0);
percpu_counter_dec(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_shk_cnt);
}
@@ -958,11 +964,24 @@ void ext4_es_list_del(struct inode *inode)
spin_unlock(&sbi->s_es_lock);
}
+static void dispose_list(struct inode *inode, struct list_head *head)
+{
+ while (!list_empty(head)) {
+ struct extent_status *es;
+
+ es = list_first_entry(head, struct extent_status, list);
+ list_del_init(&es->list);
+
+ ext4_es_free_extent(inode, es);
+ }
+}
+
static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei)
{
struct ext4_inode_info *ei;
struct ext4_es_stats *es_stats;
+ LIST_HEAD(freeable);
ktime_t start_time;
u64 scan_time;
int nr_to_walk;
@@ -976,8 +995,6 @@ retry:
spin_lock(&sbi->s_es_lock);
nr_to_walk = sbi->s_es_nr_inode;
while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
- int shrunk;
-
ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
i_es_list);
@@ -1009,11 +1026,10 @@ retry:
continue;
}
- shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
+ nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
+ &nr_to_scan);
write_unlock(&ei->i_es_lock);
-
- nr_shrunk += shrunk;
- nr_to_scan -= shrunk;
+ dispose_list(&ei->vfs_inode, &freeable);
if (nr_to_scan == 0)
goto out;
@@ -1030,8 +1046,11 @@ retry:
goto retry;
}
- if (locked_ei && nr_shrunk == 0)
- nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
+ if (locked_ei && nr_shrunk == 0) {
+ nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
+ &nr_to_scan);
+ dispose_list(&locked_ei->vfs_inode, &freeable);
+ }
out:
scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
@@ -1227,12 +1246,12 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
}
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
- int nr_to_scan)
+ struct list_head *freeable,
+ int *nr_to_scan)
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
- struct rb_node *node;
- struct extent_status *es;
+ struct extent_status *es, *tmp;
unsigned long nr_shrunk = 0;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
@@ -1244,21 +1263,12 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
__ratelimit(&_rs))
ext4_warning(inode->i_sb, "forced shrink of precached extents");
- node = rb_first(&tree->root);
- while (node != NULL) {
- es = rb_entry(node, struct extent_status, rb_node);
- node = rb_next(&es->rb_node);
- /*
- * We can't reclaim delayed extent from status tree because
- * fiemap, bigallic, and seek_data/hole need to use it.
- */
- if (!ext4_es_is_delayed(es)) {
- rb_erase(&es->rb_node, &tree->root);
- ext4_es_free_extent(inode, es);
- nr_shrunk++;
- if (--nr_to_scan == 0)
- break;
- }
+ list_for_each_entry_safe(es, tmp, &tree->list, list) {
+ rb_erase(&es->rb_node, &tree->root);
+ list_move_tail(&es->list, freeable);
+ nr_shrunk++;
+ if (--*nr_to_scan == 0)
+ break;
}
tree->cache_es = NULL;
return nr_shrunk;
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 0e6a33e..a45c7fe 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -54,6 +54,7 @@ struct ext4_extent;
struct extent_status {
struct rb_node rb_node;
+ struct list_head list;
ext4_lblk_t es_lblk; /* first logical block extent covers */
ext4_lblk_t es_len; /* length of extent in block */
ext4_fsblk_t es_pblk; /* first physical block */
@@ -61,6 +62,7 @@ struct extent_status {
struct ext4_es_tree {
struct rb_root root;
+ struct list_head list;
struct extent_status *cache_es; /* recently accessed extent */
};
--
1.7.9.7
From: Zheng Liu <[email protected]>
To keep useful extent caches in the tree, this commit uses a gc-like
algorithm to manage objects. A new flag called '_ACCESSED' is defined to
track whether an extent cache has been touched. When the shrinker tries
to reclaim some objects, an extent cache is moved from the inactive list
to the tail of the active list if this flag is set. Objects on the
active list are reclaimed only when we are under high memory pressure.
This way, frequently accessed extent caches are kept in the tree as long
as possible.
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 42 +++++++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 31 ++++++++++++++++++++++++-------
2 files changed, 57 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 2f81d1e..2f6bb538 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -149,7 +149,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
struct list_head *freeable,
- int *nr_to_scan);
+ int *nr_to_scan, int force);
static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
struct ext4_inode_info *locked_ei);
@@ -172,7 +172,8 @@ void ext4_exit_es(void)
void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
- INIT_LIST_HEAD(&tree->list);
+ INIT_LIST_HEAD(&tree->active_list);
+ INIT_LIST_HEAD(&tree->inactive_list);
tree->cache_es = NULL;
}
@@ -317,7 +318,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
* We don't count delayed extent because we never try to reclaim them
*/
if (!ext4_es_is_delayed(es)) {
- list_add_tail(&es->list, &ei->i_es_tree.list);
+ list_add_tail(&es->list, &ei->i_es_tree.inactive_list);
ei->i_es_shk_nr++;
percpu_counter_inc(&EXT4_SB(inode->i_sb)->
s_es_stats.es_stats_shk_cnt);
@@ -787,6 +788,7 @@ out:
stats = &EXT4_SB(inode->i_sb)->s_es_stats;
if (found) {
BUG_ON(!es1);
+ ext4_es_mark_accessed(es1);
es->es_lblk = es1->es_lblk;
es->es_len = es1->es_len;
es->es_pblk = es1->es_pblk;
@@ -1027,7 +1029,7 @@ retry:
}
nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
- &nr_to_scan);
+ &nr_to_scan, retried);
write_unlock(&ei->i_es_lock);
dispose_list(&ei->vfs_inode, &freeable);
@@ -1048,7 +1050,7 @@ retry:
if (locked_ei && nr_shrunk == 0) {
nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
- &nr_to_scan);
+ &nr_to_scan, 1);
dispose_list(&locked_ei->vfs_inode, &freeable);
}
@@ -1247,7 +1249,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
struct list_head *freeable,
- int *nr_to_scan)
+ int *nr_to_scan, int force)
{
struct inode *inode = &ei->vfs_inode;
struct ext4_es_tree *tree = &ei->i_es_tree;
@@ -1263,13 +1265,35 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
__ratelimit(&_rs))
ext4_warning(inode->i_sb, "forced shrink of precached extents");
- list_for_each_entry_safe(es, tmp, &tree->list, list) {
+ list_for_each_entry_safe(es, tmp, &tree->inactive_list, list) {
+ if (!*nr_to_scan)
+ goto done;
+ --*nr_to_scan;
+
+ if (ext4_es_is_accessed(es)) {
+ list_move_tail(&es->list, &tree->active_list);
+ continue;
+ } else {
+ rb_erase(&es->rb_node, &tree->root);
+ list_move_tail(&es->list, freeable);
+ nr_shrunk++;
+ }
+ }
+
+ if (!force)
+ goto done;
+
+ list_for_each_entry_safe(es, tmp, &tree->active_list, list) {
+ if (!*nr_to_scan)
+ goto done;
+ --*nr_to_scan;
+
rb_erase(&es->rb_node, &tree->root);
list_move_tail(&es->list, freeable);
nr_shrunk++;
- if (--*nr_to_scan == 0)
- break;
}
+
+done:
tree->cache_es = NULL;
return nr_shrunk;
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index a45c7fe..213e056 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -29,12 +29,12 @@
/*
* These flags live in the high bits of extent_status.es_pblk
*/
-#define ES_SHIFT 60
+#define ES_SHIFT 59
-#define EXTENT_STATUS_WRITTEN (1 << 3)
-#define EXTENT_STATUS_UNWRITTEN (1 << 2)
-#define EXTENT_STATUS_DELAYED (1 << 1)
-#define EXTENT_STATUS_HOLE (1 << 0)
+#define EXTENT_STATUS_WRITTEN (1 << 4)
+#define EXTENT_STATUS_UNWRITTEN (1 << 3)
+#define EXTENT_STATUS_DELAYED (1 << 2)
+#define EXTENT_STATUS_HOLE (1 << 1)
#define EXTENT_STATUS_FLAGS (EXTENT_STATUS_WRITTEN | \
EXTENT_STATUS_UNWRITTEN | \
@@ -45,9 +45,10 @@
#define ES_UNWRITTEN (1ULL << 62)
#define ES_DELAYED (1ULL << 61)
#define ES_HOLE (1ULL << 60)
+#define ES_ACCESSED (1ULL << 59)
#define ES_MASK (ES_WRITTEN | ES_UNWRITTEN | \
- ES_DELAYED | ES_HOLE)
+ ES_DELAYED | ES_HOLE | ES_ACCESSED)
struct ext4_sb_info;
struct ext4_extent;
@@ -62,7 +63,8 @@ struct extent_status {
struct ext4_es_tree {
struct rb_root root;
- struct list_head list;
+ struct list_head active_list;
+ struct list_head inactive_list;
struct extent_status *cache_es; /* recently accessed extent */
};
@@ -114,6 +116,21 @@ static inline int ext4_es_is_hole(struct extent_status *es)
return (es->es_pblk & ES_HOLE) != 0;
}
+static inline int ext4_es_is_accessed(struct extent_status *es)
+{
+ return (es->es_pblk & ES_ACCESSED) != 0;
+}
+
+static inline void ext4_es_mark_accessed(struct extent_status *es)
+{
+ es->es_pblk |= ES_ACCESSED;
+}
+
+static inline void ext4_es_clear_accessed(struct extent_status *es)
+{
+ es->es_pblk &= ~ES_ACCESSED;
+}
+
static inline unsigned int ext4_es_status(struct extent_status *es)
{
return es->es_pblk >> ES_SHIFT;
--
1.7.9.7
On Thu 07-08-14 11:35:49, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> This commit adds some statictics in extent status tree shrinker. The
> purpose to add these is that we want to collect more details when we
> encounter a stall caused by extent status tree shrinker. Here we count
> the following statictics:
> stats:
> the number of all objects on all extent status trees
> the number of reclaimable objects on lru list
> cache hits/misses
> the last sorted interval
> the number of inodes on lru list
> average:
> scan time for shrinking some objects
> the number of shrunk objects
> maximum:
> the inode that has max nr. of objects on lru list
> the maximum scan time for shrinking some objects
>
> The output looks like below:
> $ cat /proc/fs/ext4/sda1/es_shrinker_info
> stats:
> 28228 objects
> 6341 reclaimable objects
> 5281/631 cache hits/misses
> 586 ms last sorted interval
> 250 inodes on lru list
> average:
> 153 us scan time
> 128 shrunk objects
> maximum:
> 255 inode (255 objects, 198 reclaimable)
> 125723 us max scan time
When reading through this again, I realized that we probably don't want
this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
it's really a debugging interface and we don't want any tools to start
using it.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
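A minimal sketch of the debugfs variant, assuming the existing seq_file show
helper can be reused (the helper, dentry field and directory names below are
only illustrative):

/* Sketch only: expose es_shrinker_info under /sys/kernel/debug/ext4/<dev>
 * instead of /proc/fs/ext4/<dev>.  ext4_es_seq_shrinker_info_show(),
 * s_es_debugfs and ext4_debugfs_root are assumptions for illustration. */
static int ext4_es_seq_shrinker_info_open(struct inode *inode, struct file *file)
{
	return single_open(file, ext4_es_seq_shrinker_info_show,
			   inode->i_private);
}

static const struct file_operations ext4_es_seq_shrinker_info_fops = {
	.owner		= THIS_MODULE,
	.open		= ext4_es_seq_shrinker_info_open,
	.read		= seq_read,
	.llseek		= seq_lseek,
	.release	= single_release,
};

	/* at shrinker registration time, with ext4_debugfs_root created once
	 * via debugfs_create_dir("ext4", NULL): */
	sbi->s_es_debugfs = debugfs_create_file("es_shrinker_info", S_IRUSR,
						ext4_debugfs_root, sbi,
						&ext4_es_seq_shrinker_info_fops);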
On Thu 07-08-14 11:35:50, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> Currently extent status tree doesn't cache extent hole when a write
> looks up in extent tree to make sure whether a block has been allocated
> or not. In this case, we don't put extent hole in extent cache because
> later this extent might be removed and a new delayed extent might be
> added back. But it will cause a defect when we do a lot of writes.
> If we don't put extent hole in extent cache, the following writes also
> need to access extent tree to look at whether or not a block has been
> allocated. It brings a cache miss. This commit fixes this defect.
> Meanwhile, if an inode has no any extent, this extent hole also will
> be cached.
>
> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
So I agree with this change, it certainly doesn't make things worse. But
when looking into ext4_ext_put_gap_in_cache() I have one question: That
function uses ext4_find_delalloc_range() to check that the intended range
doesn't have any delalloc blocks in it. However it doesn't make any effort
to put a smaller hole in cache - for example if we have blocks allocated
like:
5-6 = delalloc
7-10 = allocated
and ext4_ext_put_gap_in_cache() is called for block 0, the function won't
put anything into cache although it could put range 0-4 in the cache as a
hole. Why is that?
Honza
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 76c2df3..6463d34 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2284,16 +2284,15 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
> ext4_lblk_t block)
> {
> int depth = ext_depth(inode);
> - unsigned long len = 0;
> - ext4_lblk_t lblock = 0;
> + unsigned long len;
> + ext4_lblk_t lblock;
> struct ext4_extent *ex;
>
> ex = path[depth].p_ext;
> if (ex == NULL) {
> - /*
> - * there is no extent yet, so gap is [0;-] and we
> - * don't cache it
> - */
> + /* there is no extent yet, so gap is [0;-] */
> + lblock = 0;
> + len = EXT_MAX_BLOCKS;
> ext_debug("cache gap(whole file):");
> } else if (block < le32_to_cpu(ex->ee_block)) {
> lblock = block;
> @@ -2302,9 +2301,6 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
> block,
> le32_to_cpu(ex->ee_block),
> ext4_ext_get_actual_len(ex));
> - if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> - ext4_es_insert_extent(inode, lblock, len, ~0,
> - EXTENT_STATUS_HOLE);
> } else if (block >= le32_to_cpu(ex->ee_block)
> + ext4_ext_get_actual_len(ex)) {
> ext4_lblk_t next;
> @@ -2318,14 +2314,14 @@ ext4_ext_put_gap_in_cache(struct inode *inode, struct ext4_ext_path *path,
> block);
> BUG_ON(next == lblock);
> len = next - lblock;
> - if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> - ext4_es_insert_extent(inode, lblock, len, ~0,
> - EXTENT_STATUS_HOLE);
> } else {
> BUG();
> }
>
> ext_debug(" -> %u:%lu\n", lblock, len);
> + if (!ext4_find_delalloc_range(inode, lblock, lblock + len - 1))
> + ext4_es_insert_extent(inode, lblock, len, ~0,
> + EXTENT_STATUS_HOLE);
> }
>
> /*
> @@ -4362,8 +4358,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> * put just found gap into cache to speed up
> * subsequent requests
> */
> - if ((flags & EXT4_GET_BLOCKS_NO_PUT_HOLE) == 0)
> - ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
> + ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
> goto out2;
> }
--
Jan Kara <[email protected]>
SUSE Labs, CR
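A rough sketch of how the gap could be clamped at the first delayed block
rather than dropped entirely (the lookup helper used here is only an
assumption; the real interface for finding the first delayed extent in a
range may differ):

	/* Sketch: instead of caching nothing when [lblock, lblock+len-1]
	 * contains delalloc blocks, cache the prefix up to the first delayed
	 * extent, e.g. 0-4 in the 5-6 delalloc / 7-10 allocated example
	 * above.  ext4_es_find_delayed_extent_range() is assumed here purely
	 * for illustration. */
	struct extent_status es;

	ext4_es_find_delayed_extent_range(inode, lblock, lblock + len - 1, &es);
	if (es.es_len)
		len = es.es_lblk > lblock ? es.es_lblk - lblock : 0;
	if (len)
		ext4_es_insert_extent(inode, lblock, len, ~0,
				      EXTENT_STATUS_HOLE);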
On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> In this commit we discard the lru algorithm because it should take a
> long time to keep a lru list in extent status tree shrinker and the
> shrinker should take a long time to scan this lru list in order to
> reclaim some objects.
>
> For reducing the latency, this commit does two works. The first one
> is to replace lru with round-robin. After that we never need to keep
> a lru list. That means that the list shouldn't be sorted if the
> shrinker can not reclaim any objects in the first round. The second
> one is to shrink the length of the list. After using round-robin
> algorithm, the shrinker takes the first inode in the list and handle
> it. If this inode is skipped, it will be moved into the tail of the
> list. Otherwise it will be added back when it is touched again.
I like this change. Some comments are below.
...
> @@ -685,8 +685,8 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> goto error;
> retry:
> err = __es_insert_extent(inode, &newes);
> - if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
> - EXT4_I(inode)))
> + if (err == -ENOMEM && __es_shrink(EXT4_SB(inode->i_sb),
> + 1, EXT4_I(inode)))
> goto retry;
> if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
> err = 0;
This comment is not directly related to this patch, but looking into the
code made me think about it. It seems ugly to call __es_shrink() from the
internals of ext4_es_insert_extent(). Thinking about the locking
implications also makes me shudder a bit, and finally this may make the
pressure on the extent cache artificially bigger because the MM subsystem
is not aware of the shrinking you do here. I would prefer to leave
shrinking to the slab subsystem itself.
Now, the GFP_ATOMIC allocation we use for the extent cache makes it hard
for the slab subsystem, and we could actually fairly easily use GFP_NOFS.
We can just allocate the structure with GFP_NOFS before grabbing i_es_lock
and, in case we don't need the structure, free it again. It may introduce
some overhead from unnecessary alloc/free, but things get simpler that way
(no need for that locked_ei argument for __es_shrink(), no need
for internal calls to __es_shrink() from within the filesystem).
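A minimal sketch of that preallocation pattern (the extra argument to
__es_insert_extent() is an assumption made only for illustration):

	/* Sketch: allocate with GFP_NOFS outside the lock, let the insert
	 * consume the entry if it needs one, and free it otherwise. */
	struct extent_status *prealloc;

	prealloc = kmem_cache_alloc(ext4_es_cachep, GFP_NOFS);
	if (!prealloc)
		return -ENOMEM;

	write_lock(&EXT4_I(inode)->i_es_lock);
	err = __es_insert_extent(inode, &newes, &prealloc); /* assumed extra arg */
	write_unlock(&EXT4_I(inode)->i_es_lock);

	if (prealloc)	/* the insert did not need a new node */
		kmem_cache_free(ext4_es_cachep, prealloc);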
...
> +static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> + struct ext4_inode_info *locked_ei)
> {
> struct ext4_inode_info *ei;
> struct ext4_es_stats *es_stats;
> - struct list_head *cur, *tmp;
> - LIST_HEAD(skipped);
> ktime_t start_time;
> u64 scan_time;
> + int nr_to_walk;
> int nr_shrunk = 0;
> - int retried = 0, skip_precached = 1, nr_skipped = 0;
> + int retried = 0, nr_skipped = 0;
>
> es_stats = &sbi->s_es_stats;
> start_time = ktime_get();
> - spin_lock(&sbi->s_es_lru_lock);
>
> retry:
> - list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
> + spin_lock(&sbi->s_es_lock);
> + nr_to_walk = sbi->s_es_nr_inode;
> + while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
> int shrunk;
>
> - /*
> - * If we have already reclaimed all extents from extent
> - * status tree, just stop the loop immediately.
> - */
> - if (percpu_counter_read_positive(
> - &es_stats->es_stats_lru_cnt) == 0)
> - break;
> + ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
> + i_es_list);
>
> - ei = list_entry(cur, struct ext4_inode_info, i_es_lru);
> + list_del_init(&ei->i_es_list);
> + sbi->s_es_nr_inode--;
> + spin_unlock(&sbi->s_es_lock);
Nothing seems to prevent reclaim from freeing the inode after we drop
s_es_lock, so we could end up using freed memory. I don't think we want to
pin the inode here by grabbing a refcount, since we don't want to deal with
iput() in the shrinker (that could mean having to delete the inode from
shrinker context). But what we could do is grab ei->i_es_lock before
dropping s_es_lock. Since ext4_es_remove_extent(), called from
ext4_clear_inode(), always grabs i_es_lock, we are protected from the inode
being freed while we hold that lock. But please add comments about this to
both __es_shrink() and ext4_es_remove_extent().
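In code, the suggested ordering would look roughly like this (sketch only,
error paths omitted):

	/* Sketch: take i_es_lock (trylock) while still holding s_es_lock, so
	 * that ext4_clear_inode() -> ext4_es_remove_extent(), which needs
	 * i_es_lock, cannot complete and let the inode be freed under us. */
	spin_lock(&sbi->s_es_lock);
	ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
			      i_es_list);
	list_del_init(&ei->i_es_list);
	sbi->s_es_nr_inode--;
	if (write_trylock(&ei->i_es_lock)) {
		spin_unlock(&sbi->s_es_lock);
		/* ... reclaim from ei, then write_unlock(&ei->i_es_lock) ... */
	} else {
		__ext4_es_list_add(sbi, ei);	/* requeue and move on */
		spin_unlock(&sbi->s_es_lock);
	}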
>
> /*
> - * Skip the inode that is newer than the last_sorted
> - * time. Normally we try hard to avoid shrinking
> - * precached inodes, but we will as a last resort.
> + * Normally we try hard to avoid shrinking precached inodes,
> + * but we will as a last resort.
> */
> - if ((es_stats->es_stats_last_sorted < ei->i_touch_when) ||
> - (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
> - EXT4_STATE_EXT_PRECACHED))) {
> + if (!retried && ext4_test_inode_state(&ei->vfs_inode,
> + EXT4_STATE_EXT_PRECACHED)) {
> nr_skipped++;
> - list_move_tail(cur, &skipped);
> + spin_lock(&sbi->s_es_lock);
> + __ext4_es_list_add(sbi, ei);
> + continue;
> + }
> +
> + if (ei->i_es_shk_nr == 0) {
> + spin_lock(&sbi->s_es_lock);
> continue;
> }
>
> - if (ei->i_es_lru_nr == 0 || ei == locked_ei ||
> - !write_trylock(&ei->i_es_lock))
> + if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> + nr_skipped++;
> + spin_lock(&sbi->s_es_lock);
> + __ext4_es_list_add(sbi, ei);
> continue;
> + }
These tests will need some reorg when we rely on ei->i_es_lock but it
should be easily doable.
>
> shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> - if (ei->i_es_lru_nr == 0)
> - list_del_init(&ei->i_es_lru);
> write_unlock(&ei->i_es_lock);
>
> nr_shrunk += shrunk;
> nr_to_scan -= shrunk;
> +
> if (nr_to_scan == 0)
> - break;
> + goto out;
> + spin_lock(&sbi->s_es_lock);
> }
> -
> - /* Move the newer inodes into the tail of the LRU list. */
> - list_splice_tail(&skipped, &sbi->s_es_lru);
> - INIT_LIST_HEAD(&skipped);
> + spin_unlock(&sbi->s_es_lock);
>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 07-08-14 11:35:52, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> Currently the shrinker needs to take a long time to reclaim some objects
> from a extent status tree because there are a lot of objects that are
> delayed. These objects could not be reclaimed because ext4 uses them to
> finish seeking data/hole, finding delayed range and other works. If a
> rb-tree has a large number of delayed objects, shrinker should scan more
> objects and the latency will be high. This commit uses a list to track
> all reclaimble objects in order to reduce the latency when the shrinker
> tries to reclaim some objects from a extent status tree.
>
> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
Hmm, this will grow the structure for caching a single extent from 5 to 7
longs. That is quite noticeable. Since having lots of delalloc extents
within an inode isn't really a common case, I would prefer to avoid that.
What we could do to limit the latency caused by scanning unreclaimable
extents is to change the shrinker to really stop after inspecting
nr_to_scan extents, regardless of how many extents we actually reclaimed -
this is how slab shrinkers are designed to work. We would also have to
record the logical block where the shrinker stopped scanning the inode, so
the next shrinker invocation can start scanning at that offset - it is
enough to store this offset in the superblock, we just have to zero it out
when we remove the first inode from the list of inodes with cached extents.
What do people think about this?
Honza
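A rough sketch of what that could look like inside the per-inode scan
(s_es_shrink_lblk is a made-up field name, and __es_tree_search() is assumed
to be the existing rb-tree lookup helper):

	/* Sketch: charge every inspected entry against nr_to_scan and
	 * remember where we stopped so the next invocation resumes there. */
	es = __es_tree_search(&tree->root, sbi->s_es_shrink_lblk);
	while (es && (*nr_to_scan)-- > 0) {
		struct rb_node *node = rb_next(&es->rb_node);

		if (!ext4_es_is_delayed(es)) {
			rb_erase(&es->rb_node, &tree->root);
			ext4_es_free_extent(inode, es);
			nr_shrunk++;
		}
		es = node ? rb_entry(node, struct extent_status, rb_node) : NULL;
	}
	/* zeroed again when the first inode is dropped from s_es_list */
	sbi->s_es_shrink_lblk = es ? es->es_lblk : 0;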
> ---
> fs/ext4/extents_status.c | 74 ++++++++++++++++++++++++++--------------------
> fs/ext4/extents_status.h | 2 ++
> 2 files changed, 44 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 8d76fd5..2f81d1e 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -148,7 +148,8 @@ static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
> static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t end);
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> - int nr_to_scan);
> + struct list_head *freeable,
> + int *nr_to_scan);
> static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> struct ext4_inode_info *locked_ei);
>
> @@ -171,6 +172,7 @@ void ext4_exit_es(void)
> void ext4_es_init_tree(struct ext4_es_tree *tree)
> {
> tree->root = RB_ROOT;
> + INIT_LIST_HEAD(&tree->list);
> tree->cache_es = NULL;
> }
>
> @@ -302,6 +304,7 @@ static struct extent_status *
> ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> ext4_fsblk_t pblk)
> {
> + struct ext4_inode_info *ei = EXT4_I(inode);
> struct extent_status *es;
> es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
> if (es == NULL)
> @@ -314,12 +317,13 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> * We don't count delayed extent because we never try to reclaim them
> */
> if (!ext4_es_is_delayed(es)) {
> - EXT4_I(inode)->i_es_shk_nr++;
> + list_add_tail(&es->list, &ei->i_es_tree.list);
> + ei->i_es_shk_nr++;
> percpu_counter_inc(&EXT4_SB(inode->i_sb)->
> s_es_stats.es_stats_shk_cnt);
> }
>
> - EXT4_I(inode)->i_es_all_nr++;
> + ei->i_es_all_nr++;
> percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
>
> return es;
> @@ -327,13 +331,15 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
>
> static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
> {
> - EXT4_I(inode)->i_es_all_nr--;
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + ei->i_es_all_nr--;
> percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_es_stats.es_stats_all_cnt);
>
> /* Decrease the shrink counter when this es is not delayed */
> if (!ext4_es_is_delayed(es)) {
> - BUG_ON(EXT4_I(inode)->i_es_shk_nr == 0);
> - EXT4_I(inode)->i_es_shk_nr--;
> + list_del(&es->list);
> + BUG_ON(ei->i_es_shk_nr-- == 0);
> percpu_counter_dec(&EXT4_SB(inode->i_sb)->
> s_es_stats.es_stats_shk_cnt);
> }
> @@ -958,11 +964,24 @@ void ext4_es_list_del(struct inode *inode)
> spin_unlock(&sbi->s_es_lock);
> }
>
> +static void dispose_list(struct inode *inode, struct list_head *head)
> +{
> + while (!list_empty(head)) {
> + struct extent_status *es;
> +
> + es = list_first_entry(head, struct extent_status, list);
> + list_del_init(&es->list);
> +
> + ext4_es_free_extent(inode, es);
> + }
> +}
> +
> static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> struct ext4_inode_info *locked_ei)
> {
> struct ext4_inode_info *ei;
> struct ext4_es_stats *es_stats;
> + LIST_HEAD(freeable);
> ktime_t start_time;
> u64 scan_time;
> int nr_to_walk;
> @@ -976,8 +995,6 @@ retry:
> spin_lock(&sbi->s_es_lock);
> nr_to_walk = sbi->s_es_nr_inode;
> while (!list_empty(&sbi->s_es_list) && nr_to_walk-- > 0) {
> - int shrunk;
> -
> ei = list_first_entry(&sbi->s_es_list, struct ext4_inode_info,
> i_es_list);
>
> @@ -1009,11 +1026,10 @@ retry:
> continue;
> }
>
> - shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
> + nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
> + &nr_to_scan);
> write_unlock(&ei->i_es_lock);
> -
> - nr_shrunk += shrunk;
> - nr_to_scan -= shrunk;
> + dispose_list(&ei->vfs_inode, &freeable);
>
> if (nr_to_scan == 0)
> goto out;
> @@ -1030,8 +1046,11 @@ retry:
> goto retry;
> }
>
> - if (locked_ei && nr_shrunk == 0)
> - nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);
> + if (locked_ei && nr_shrunk == 0) {
> + nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
> + &nr_to_scan);
> + dispose_list(&locked_ei->vfs_inode, &freeable);
> + }
>
> out:
> scan_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> @@ -1227,12 +1246,12 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
> }
>
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> - int nr_to_scan)
> + struct list_head *freeable,
> + int *nr_to_scan)
> {
> struct inode *inode = &ei->vfs_inode;
> struct ext4_es_tree *tree = &ei->i_es_tree;
> - struct rb_node *node;
> - struct extent_status *es;
> + struct extent_status *es, *tmp;
> unsigned long nr_shrunk = 0;
> static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
> DEFAULT_RATELIMIT_BURST);
> @@ -1244,21 +1263,12 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> __ratelimit(&_rs))
> ext4_warning(inode->i_sb, "forced shrink of precached extents");
>
> - node = rb_first(&tree->root);
> - while (node != NULL) {
> - es = rb_entry(node, struct extent_status, rb_node);
> - node = rb_next(&es->rb_node);
> - /*
> - * We can't reclaim delayed extent from status tree because
> - * fiemap, bigallic, and seek_data/hole need to use it.
> - */
> - if (!ext4_es_is_delayed(es)) {
> - rb_erase(&es->rb_node, &tree->root);
> - ext4_es_free_extent(inode, es);
> - nr_shrunk++;
> - if (--nr_to_scan == 0)
> - break;
> - }
> + list_for_each_entry_safe(es, tmp, &tree->list, list) {
> + rb_erase(&es->rb_node, &tree->root);
> + list_move_tail(&es->list, freeable);
> + nr_shrunk++;
> + if (--*nr_to_scan == 0)
> + break;
> }
> tree->cache_es = NULL;
> return nr_shrunk;
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index 0e6a33e..a45c7fe 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -54,6 +54,7 @@ struct ext4_extent;
>
> struct extent_status {
> struct rb_node rb_node;
> + struct list_head list;
> ext4_lblk_t es_lblk; /* first logical block extent covers */
> ext4_lblk_t es_len; /* length of extent in block */
> ext4_fsblk_t es_pblk; /* first physical block */
> @@ -61,6 +62,7 @@ struct extent_status {
>
> struct ext4_es_tree {
> struct rb_root root;
> + struct list_head list;
> struct extent_status *cache_es; /* recently accessed extent */
> };
>
> --
> 1.7.9.7
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 07-08-14 11:35:53, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> For keeping useful extent cache in the tree, this commit uses a gc-like
> algorithm to manage object. A new flag called '_ACCESSED' is defined to
> track whether an extent cache is touched or not. When the shrinker tries
> to reclaim some ojbects, an extent cache will be moved to the tail of
> active list from inactive list if this flag is marked. The object in
> active list will be reclaimed when we are under a high memory pressure.
> After that, the aged extent cache should be kept as many as possible.
So I like the idea of basic aging logic. However, the active / inactive
lists make it necessary to have a list_head in the extent, and I'd like to
avoid that. It also seems like overkill for a relatively simple structure
like the extent cache. What we could do is have only the ACCESSED bit - it
gets set when a cache entry is used. When the shrinker sees a cache entry
with the ACCESSED bit, it clears the bit and skips the entry. An entry
without the ACCESSED bit is reclaimed. This way frequently accessed entries
have a higher chance of being kept in the cache. Again, the scanning
latency is limited by the fact that we stop scanning after inspecting
nr_to_scan entries.
Honza
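A sketch of that second-chance scheme, reusing the _ACCESSED helpers from the
patch but scanning the rb-tree directly so no extra list_head is needed:

	/* Sketch: entries with the ACCESSED bit get one more round; the scan
	 * always stops after nr_to_scan entries, reclaimed or not. */
	node = rb_first(&tree->root);
	while (node && (*nr_to_scan)-- > 0) {
		es = rb_entry(node, struct extent_status, rb_node);
		node = rb_next(node);

		if (ext4_es_is_delayed(es))
			continue;			/* never reclaimable */
		if (ext4_es_is_accessed(es)) {
			ext4_es_clear_accessed(es);	/* second chance */
			continue;
		}
		rb_erase(&es->rb_node, &tree->root);
		ext4_es_free_extent(inode, es);
		nr_shrunk++;
	}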
>
> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
> ---
> fs/ext4/extents_status.c | 42 +++++++++++++++++++++++++++++++++---------
> fs/ext4/extents_status.h | 31 ++++++++++++++++++++++++-------
> 2 files changed, 57 insertions(+), 16 deletions(-)
>
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 2f81d1e..2f6bb538 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -149,7 +149,7 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
> ext4_lblk_t end);
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> struct list_head *freeable,
> - int *nr_to_scan);
> + int *nr_to_scan, int force);
> static int __es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
> struct ext4_inode_info *locked_ei);
>
> @@ -172,7 +172,8 @@ void ext4_exit_es(void)
> void ext4_es_init_tree(struct ext4_es_tree *tree)
> {
> tree->root = RB_ROOT;
> - INIT_LIST_HEAD(&tree->list);
> + INIT_LIST_HEAD(&tree->active_list);
> + INIT_LIST_HEAD(&tree->inactive_list);
> tree->cache_es = NULL;
> }
>
> @@ -317,7 +318,7 @@ ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
> * We don't count delayed extent because we never try to reclaim them
> */
> if (!ext4_es_is_delayed(es)) {
> - list_add_tail(&es->list, &ei->i_es_tree.list);
> + list_add_tail(&es->list, &ei->i_es_tree.inactive_list);
> ei->i_es_shk_nr++;
> percpu_counter_inc(&EXT4_SB(inode->i_sb)->
> s_es_stats.es_stats_shk_cnt);
> @@ -787,6 +788,7 @@ out:
> stats = &EXT4_SB(inode->i_sb)->s_es_stats;
> if (found) {
> BUG_ON(!es1);
> + ext4_es_mark_accessed(es1);
> es->es_lblk = es1->es_lblk;
> es->es_len = es1->es_len;
> es->es_pblk = es1->es_pblk;
> @@ -1027,7 +1029,7 @@ retry:
> }
>
> nr_shrunk += __es_try_to_reclaim_extents(ei, &freeable,
> - &nr_to_scan);
> + &nr_to_scan, retried);
> write_unlock(&ei->i_es_lock);
> dispose_list(&ei->vfs_inode, &freeable);
>
> @@ -1048,7 +1050,7 @@ retry:
>
> if (locked_ei && nr_shrunk == 0) {
> nr_shrunk = __es_try_to_reclaim_extents(locked_ei, &freeable,
> - &nr_to_scan);
> + &nr_to_scan, 1);
> dispose_list(&locked_ei->vfs_inode, &freeable);
> }
>
> @@ -1247,7 +1249,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
>
> static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> struct list_head *freeable,
> - int *nr_to_scan)
> + int *nr_to_scan, int force)
> {
> struct inode *inode = &ei->vfs_inode;
> struct ext4_es_tree *tree = &ei->i_es_tree;
> @@ -1263,13 +1265,35 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> __ratelimit(&_rs))
> ext4_warning(inode->i_sb, "forced shrink of precached extents");
>
> - list_for_each_entry_safe(es, tmp, &tree->list, list) {
> + list_for_each_entry_safe(es, tmp, &tree->inactive_list, list) {
> + if (!*nr_to_scan)
> + goto done;
> + --*nr_to_scan;
> +
> + if (ext4_es_is_accessed(es)) {
> + list_move_tail(&es->list, &tree->active_list);
> + continue;
> + } else {
> + rb_erase(&es->rb_node, &tree->root);
> + list_move_tail(&es->list, freeable);
> + nr_shrunk++;
> + }
> + }
> +
> + if (!force)
> + goto done;
> +
> + list_for_each_entry_safe(es, tmp, &tree->active_list, list) {
> + if (!*nr_to_scan)
> + goto done;
> + --*nr_to_scan;
> +
> rb_erase(&es->rb_node, &tree->root);
> list_move_tail(&es->list, freeable);
> nr_shrunk++;
> - if (--*nr_to_scan == 0)
> - break;
> }
> +
> +done:
> tree->cache_es = NULL;
> return nr_shrunk;
> }
> diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
> index a45c7fe..213e056 100644
> --- a/fs/ext4/extents_status.h
> +++ b/fs/ext4/extents_status.h
> @@ -29,12 +29,12 @@
> /*
> * These flags live in the high bits of extent_status.es_pblk
> */
> -#define ES_SHIFT 60
> +#define ES_SHIFT 59
>
> -#define EXTENT_STATUS_WRITTEN (1 << 3)
> -#define EXTENT_STATUS_UNWRITTEN (1 << 2)
> -#define EXTENT_STATUS_DELAYED (1 << 1)
> -#define EXTENT_STATUS_HOLE (1 << 0)
> +#define EXTENT_STATUS_WRITTEN (1 << 4)
> +#define EXTENT_STATUS_UNWRITTEN (1 << 3)
> +#define EXTENT_STATUS_DELAYED (1 << 2)
> +#define EXTENT_STATUS_HOLE (1 << 1)
>
> #define EXTENT_STATUS_FLAGS (EXTENT_STATUS_WRITTEN | \
> EXTENT_STATUS_UNWRITTEN | \
> @@ -45,9 +45,10 @@
> #define ES_UNWRITTEN (1ULL << 62)
> #define ES_DELAYED (1ULL << 61)
> #define ES_HOLE (1ULL << 60)
> +#define ES_ACCESSED (1ULL << 59)
>
> #define ES_MASK (ES_WRITTEN | ES_UNWRITTEN | \
> - ES_DELAYED | ES_HOLE)
> + ES_DELAYED | ES_HOLE | ES_ACCESSED)
>
> struct ext4_sb_info;
> struct ext4_extent;
> @@ -62,7 +63,8 @@ struct extent_status {
>
> struct ext4_es_tree {
> struct rb_root root;
> - struct list_head list;
> + struct list_head active_list;
> + struct list_head inactive_list;
> struct extent_status *cache_es; /* recently accessed extent */
> };
>
> @@ -114,6 +116,21 @@ static inline int ext4_es_is_hole(struct extent_status *es)
> return (es->es_pblk & ES_HOLE) != 0;
> }
>
> +static inline int ext4_es_is_accessed(struct extent_status *es)
> +{
> + return (es->es_pblk & ES_ACCESSED) != 0;
> +}
> +
> +static inline void ext4_es_mark_accessed(struct extent_status *es)
> +{
> + es->es_pblk |= ES_ACCESSED;
> +}
> +
> +static inline void ext4_es_clear_accessed(struct extent_status *es)
> +{
> + es->es_pblk &= ~ES_ACCESSED;
> +}
> +
> static inline unsigned int ext4_es_status(struct extent_status *es)
> {
> return es->es_pblk >> ES_SHIFT;
> --
> 1.7.9.7
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Aug 07, 2014 at 11:35:48AM +0800, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> This commit improves the trace point of extents status tree. We rename
> trace_ext4_es_shrink_enter in ext4_es_count() because it is also used
> in ext4_es_scan() and we can not identify them from the result.
>
> Further this commit fixes a variable name in trace point in order to
> keep consistency with others.
>
> Cc: "Theodore Ts'o" <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Jan Kara <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
Applied to the ext4 tree, thanks.
- Ted
On Thu, Aug 07, 2014 at 11:35:50AM +0800, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> Currently extent status tree doesn't cache extent hole when a write
> looks up in extent tree to make sure whether a block has been allocated
> or not. In this case, we don't put extent hole in extent cache because
> later this extent might be removed and a new delayed extent might be
> added back. But it will cause a defect when we do a lot of writes.
> If we don't put extent hole in extent cache, the following writes also
> need to access extent tree to look at whether or not a block has been
> allocated. It brings a cache miss. This commit fixes this defect.
> Meanwhile, if an inode has no any extent, this extent hole also will
> be cached.
Hi Zheng,
I thought the reason we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
is that in ext4_da_map_blocks(), if there is a hole, we immediately
follow it up with a call to ext4_es_insert_extent() to fill in the hole
with the EXTENT_STATUS_DELAYED flag. The only time we don't is if we run
into an ENOSPC error.
Am I missing something?
- Ted
On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> This comment is not directly related to this patch but looking into the
> code made me think about it. It seems ugly to call __es_shrink() from
> internals of ext4_es_insert_extent(). Also thinking about locking
> implications makes me shudder a bit and finally this may make the pressure
> on the extent cache artificially bigger because MM subsystem is not aware
> of the shrinking you do here. I would prefer to leave shrinking on
> the slab subsystem itself.
If we fail the allocation, we only try to free at most one extent, so
I don't think it's going to make the slab system that confused; it's
the equivalent of freeing an entry and then allocating it again.
> Now GFP_ATOMIC allocation we use for extent cache makes it hard for the
> slab subsystem and actually we could fairly easily use GFP_NOFS. We can just
> allocate the structure before grabbing i_es_lock with GFP_NOFS allocation and
> in case we don't need the structure, we can just free it again. It may
> introduce some overhead from unnecessary alloc/free but things get simpler
> that way (no need for that locked_ei argument for __es_shrink(), no need
> for internal calls to __es_shrink() from within the filesystem).
The tricky bit is that even __es_remove_extent() can require a memory
allocation, and in the worst case, it's possible that
ext4_es_insert_extent() can require *two* allocations. For example,
if you start with a single large extent, and then need to insert a
subregion with a different set of flags into the already existing
extent, thus resulting in three extents where you started with one.
And in some cases, no allocation is required at all....
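To make the worst case concrete, a small illustration (block ranges made up):

	/*
	 * One cached extent split by an insert with different flags:
	 *
	 *   before:  [0..99]   WRITTEN       (1 node)
	 *   insert:  [40..49]  UNWRITTEN
	 *   after:   [0..39]   WRITTEN       (existing node, trimmed)
	 *            [40..49]  UNWRITTEN     (new node #1)
	 *            [50..99]  WRITTEN       (new node #2)
	 *
	 * so __es_remove_extent() plus __es_insert_extent() can need two new
	 * extent_status structures for a single logical insert.
	 */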
One thing that can help is that, so long as we haven't done something
critical such as erasing a delalloc region, we always release the write
lock, retry the allocation with GFP_NOFS, and then try the operation
again.
So we may need to think a bit about the best way to improve this,
although it is a separate topic from making the shrinker less
heavyweight.
> Nothing seems to prevent reclaim from freeing the inode after we drop
> s_es_lock. So we could use freed memory. I don't think we want to pin the
> inode here by grabbing a refcount since we don't want to deal with iput()
> in the shrinker (that could mean having to delete the inode from shrinker
> context). But what we could do it to grab ei->i_es_lock before dropping
> s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
> always grabs i_es_lock, we are protected from inode being freed while we
> hold that lock. But please add comments about this both to the
> __es_shrink() and ext4_es_remove_extent().
Something like this should work, yes?
- Ted
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 25da1bf..4768f7f 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -981,32 +981,27 @@ retry:
list_del_init(&ei->i_es_list);
sbi->s_es_nr_inode--;
- spin_unlock(&sbi->s_es_lock);
+ if (ei->i_es_shk_nr == 0)
+ continue;
/*
* Normally we try hard to avoid shrinking precached inodes,
* but we will as a last resort.
*/
- if (!retried && ext4_test_inode_state(&ei->vfs_inode,
- EXT4_STATE_EXT_PRECACHED)) {
+ if ((!retried && ext4_test_inode_state(&ei->vfs_inode,
+ EXT4_STATE_EXT_PRECACHED)) ||
+ ei == locked_ei ||
+ !write_trylock(&ei->i_es_lock)) {
nr_skipped++;
- spin_lock(&sbi->s_es_lock);
- __ext4_es_list_add(sbi, ei);
- continue;
- }
-
- if (ei->i_es_shk_nr == 0) {
- spin_lock(&sbi->s_es_lock);
- continue;
- }
-
- if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
- nr_skipped++;
- spin_lock(&sbi->s_es_lock);
__ext4_es_list_add(sbi, ei);
+ if (spin_is_contended(&sbi->s_es_lock)) {
+ spin_unlock(&sbi->s_es_lock);
+ spin_lock(&sbi->s_es_lock);
+ }
continue;
}
On Wed, Aug 27, 2014 at 05:13:54PM +0200, Jan Kara wrote:
>
> What we could do to limit latency caused by scanning of unreclaimable
> extents is to change the shrinker to really stop after inspecting
> nr_to_scan extents regardless of how many extents did we really reclaim -
> this is actually how slab shrinkers are designed to work. We would also
> have to record the logical block where the shrinker stopped scanning the
> inode and the next shrinker invocation will start scanning at that offset -
> it is enough to store this offset in the superblock, we just have to zero
> it out when we remove the first inode from the list of inodes with cached
> extents.
>
> What do people think about this?
Yes, I agree this would be a better approach.
- Ted
On Tue 02-09-14 23:37:38, Ted Tso wrote:
> On Wed, Aug 27, 2014 at 05:01:21PM +0200, Jan Kara wrote:
> > On Thu 07-08-14 11:35:51, Zheng Liu wrote:
> > This comment is not directly related to this patch but looking into the
> > code made me think about it. It seems ugly to call __es_shrink() from
> > internals of ext4_es_insert_extent(). Also thinking about locking
> > implications makes me shudder a bit and finally this may make the pressure
> > on the extent cache artificially bigger because MM subsystem is not aware
> > of the shrinking you do here. I would prefer to leave shrinking on
> > the slab subsystem itself.
>
> If we fail, the allocation we only try to free at most one extent, so
> I don't think it's going to make the slab system that confused; it's
> the equivalent of freeing an entry and then using allocating it again.
>
> > Now GFP_ATOMIC allocation we use for extent cache makes it hard for the
> > slab subsystem and actually we could fairly easily use GFP_NOFS. We can just
> > allocate the structure before grabbing i_es_lock with GFP_NOFS allocation and
> > in case we don't need the structure, we can just free it again. It may
> > introduce some overhead from unnecessary alloc/free but things get simpler
> > that way (no need for that locked_ei argument for __es_shrink(), no need
> > for internal calls to __es_shrink() from within the filesystem).
>
> The tricky bit is that even __es_remove_extent() can require a memory
> allocation, and in the worst case, it's possible that
> ext4_es_insert_extent() can require *two* allocations. For example,
> if you start with a single large extent, and then need to insert a
> subregion with a different set of flags into the already existing
> extent, thus resulting in three extents where you started with one.
Right, I didn't realize that.
> And in some cases, no allocation is required at all....
>
> One thing that can help is that so long as we haven't done something
> critical, such as erase a delalloc region, we always release the write
> lock and retry the allocation with GFP_NOFS, and the try the operation
> again.
Yeah, maybe we could use mempools for this. It should make the code less
clumsy.
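For reference, a minimal mempool sketch on top of the existing slab cache
(the reserve size is arbitrary):

	/* Sketch: a small reserve of extent_status objects so inserts can
	 * make progress even when the slab allocation would fail. */
	mempool_t *ext4_es_pool;

	ext4_es_pool = mempool_create_slab_pool(64 /* arbitrary */, ext4_es_cachep);
	if (!ext4_es_pool)
		return -ENOMEM;

	/* allocation and free sites would then become: */
	es = mempool_alloc(ext4_es_pool, GFP_NOFS);
	/* ... use es ... */
	mempool_free(es, ext4_es_pool);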
> So we may need to think a bit about what's the best way to improve
> this, although it is separate topic from making the shrinker be less
> heavyweight.
Agreed, it's a separate topic.
> > Nothing seems to prevent reclaim from freeing the inode after we drop
> > s_es_lock. So we could use freed memory. I don't think we want to pin the
> > inode here by grabbing a refcount since we don't want to deal with iput()
> > in the shrinker (that could mean having to delete the inode from shrinker
> > context). But what we could do it to grab ei->i_es_lock before dropping
> > s_es_lock. Since ext4_es_remove_extent() called from ext4_clear_inode()
> > always grabs i_es_lock, we are protected from inode being freed while we
> > hold that lock. But please add comments about this both to the
> > __es_shrink() and ext4_es_remove_extent().
>
> Something like this should work, yes?
Yes, this should work. I would just add a comment to
ext4_es_remove_extent() about the fact that ext4_clear_inode() requires
grabbing i_es_lock so that we don't do some clever optimization in future
and break these lifetime rules...
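Something along these lines, perhaps (wording is only a suggestion):

/*
 * ext4_es_remove_extent() is called from ext4_clear_inode() and always takes
 * i_es_lock.  The extent status shrinker relies on that: it grabs i_es_lock
 * (while still holding s_es_lock) to keep the inode from being freed while
 * it is scanned, so don't "optimize" this locking without also revisiting
 * __es_shrink().
 */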
Also one question:
> - if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> - nr_skipped++;
> - spin_lock(&sbi->s_es_lock);
> __ext4_es_list_add(sbi, ei);
> + if (spin_is_contended(&sbi->s_es_lock)) {
> + spin_unlock(&sbi->s_es_lock);
> + spin_lock(&sbi->s_es_lock);
> + }
Why not cond_resched_lock(&sbi->s_es_lock)?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> Also one question:
>
> > - if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > - nr_skipped++;
> > - spin_lock(&sbi->s_es_lock);
> > __ext4_es_list_add(sbi, ei);
> > + if (spin_is_contended(&sbi->s_es_lock)) {
> > + spin_unlock(&sbi->s_es_lock);
> > + spin_lock(&sbi->s_es_lock);
> > + }
> Why not cond_resched_lock(&sbi->s_es_lock)?
I didn't think we were allowed to reschedule or sleep while in
shrinker context?
- Ted
On Wed 03-09-14 16:00:39, Ted Tso wrote:
> On Wed, Sep 03, 2014 at 05:31:22PM +0200, Jan Kara wrote:
> > Also one question:
> >
> > > - if (ei == locked_ei || !write_trylock(&ei->i_es_lock)) {
> > > - nr_skipped++;
> > > - spin_lock(&sbi->s_es_lock);
> > > __ext4_es_list_add(sbi, ei);
> > > + if (spin_is_contended(&sbi->s_es_lock)) {
> > > + spin_unlock(&sbi->s_es_lock);
> > > + spin_lock(&sbi->s_es_lock);
> > > + }
> > Why not cond_resched_lock(&sbi->s_es_lock)?
>
> I didn't think we were allowed to reschedule or sleep while in
> shrinker context?
I believe we are allowed to sleep in the shrinker if the appropriate gfp
flags are set (__GFP_WAIT), and we enter the extent cache shrinker only if
__GFP_FS is set, which guarantees __GFP_WAIT.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Sep 04, 2014 at 12:14:02AM +0200, Jan Kara wrote:
> > I didn't think we were allowed to reschedule or sleep while in
> > shrinker context?
> I believe we are allowed to sleep in the shrinker if appropriate gfp
> flags are set (__GFP_WAIT) and we enter extent cache shrinker only if
> __GFP_FS is set which guarantees __GFP_WAIT.
I must be missing something. How is this guaranteed?
I can see how we can determine what gfp_mask was used in the
allocation which triggered the shrinker callback, by checking
shrink->gfp_mask, but I don't see anything that guarantees that the
extent cache shrinker is only entered if __GFP_FS is set.
I guess we could add something like:
if ((shrink->gfp_mask & __GFP_WAIT) == 0)
return 0;
to the beginning of ext4_es_scan(), but we're not doing that at the
moment.
- Ted
On Wed, Aug 27, 2014 at 03:26:07PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:49, Zheng Liu wrote:
> > From: Zheng Liu <[email protected]>
> >
> > This commit adds some statictics in extent status tree shrinker. The
> > purpose to add these is that we want to collect more details when we
> > encounter a stall caused by extent status tree shrinker. Here we count
> > the following statictics:
> > stats:
> > the number of all objects on all extent status trees
> > the number of reclaimable objects on lru list
> > cache hits/misses
> > the last sorted interval
> > the number of inodes on lru list
> > average:
> > scan time for shrinking some objects
> > the number of shrunk objects
> > maximum:
> > the inode that has max nr. of objects on lru list
> > the maximum scan time for shrinking some objects
> >
> > The output looks like below:
> > $ cat /proc/fs/ext4/sda1/es_shrinker_info
> > stats:
> > 28228 objects
> > 6341 reclaimable objects
> > 5281/631 cache hits/misses
> > 586 ms last sorted interval
> > 250 inodes on lru list
> > average:
> > 153 us scan time
> > 128 shrunk objects
> > maximum:
> > 255 inode (255 objects, 198 reclaimable)
> > 125723 us max scan time
> When reading through this again, I realized that we probably don't want
> this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
> it's really a debugging interface and we don't want any tools to start
> using it.
Fair enough. I will fix it in the next version.
Thanks,
- Zheng
On Mon, Sep 01, 2014 at 10:43:50PM -0400, Theodore Ts'o wrote:
> On Thu, Aug 07, 2014 at 11:35:50AM +0800, Zheng Liu wrote:
> > From: Zheng Liu <[email protected]>
> >
> > Currently extent status tree doesn't cache extent hole when a write
> > looks up in extent tree to make sure whether a block has been allocated
> > or not. In this case, we don't put extent hole in extent cache because
> > later this extent might be removed and a new delayed extent might be
> > added back. But it will cause a defect when we do a lot of writes.
> > If we don't put extent hole in extent cache, the following writes also
> > need to access extent tree to look at whether or not a block has been
> > allocated. It brings a cache miss. This commit fixes this defect.
> > Meanwhile, if an inode has no any extent, this extent hole also will
> > be cached.
>
> Hi Zheng,
>
> I thought the reason why we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
> is because in ext4_da_map_blocks(), if there is a hole, we will be
> immediately following it up with a call to ext4_es_insert_extent() to
> fill in the hole with the EXTENT_STATUS_DELAYED flag. The only time
> we don't is if we run into an ENOSPC error.
>
> Am I missing something?
Yes, you are right. The flag exists to do the work you described
above. But as the commit log describes, this flag brings a huge number of
cache misses when an empty file is written. So it might be worth
putting a hole in the cache when delalloc is enabled.
Regards,
- Zheng
On Wed, Aug 27, 2014 at 03:55:16PM +0200, Jan Kara wrote:
> On Thu 07-08-14 11:35:50, Zheng Liu wrote:
> > From: Zheng Liu <[email protected]>
> >
> > Currently extent status tree doesn't cache extent hole when a write
> > looks up in extent tree to make sure whether a block has been allocated
> > or not. In this case, we don't put extent hole in extent cache because
> > later this extent might be removed and a new delayed extent might be
> > added back. But it will cause a defect when we do a lot of writes.
> > If we don't put extent hole in extent cache, the following writes also
> > need to access extent tree to look at whether or not a block has been
> > allocated. It brings a cache miss. This commit fixes this defect.
> > Meanwhile, if an inode has no any extent, this extent hole also will
> > be cached.
> >
> > Cc: "Theodore Ts'o" <[email protected]>
> > Cc: Andreas Dilger <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Signed-off-by: Zheng Liu <[email protected]>
> So I agree with this change, it certainly doesn't make things worse. But
> when looking into ext4_ext_put_gap_in_cache() I have one question: That
> function uses ext4_find_delalloc_range() to check that the intended range
> doesn't have any delalloc blocks in it. However it doesn't make any effort
> to put a smaller hole in cache - for example if we have blocks allocated
> like:
>
> 5-6 = delalloc
> 7-10 = allocated
>
> and ext4_ext_put_gap_in_cache() is called for block 0, the function won't
> put anything into cache although it could put range 0-4 in the cache as a
> hole. Why is that?
Oops, it should put range 0-4 in the cache. Thanks for pointing it out.
I will fix it.
Regards,
- Zheng
On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
> Ah, sorry. I was mistaken and thought we do check for __GFP_FS in
> ext4_es_scan() but we don't and we don't need to. But thinking about it
> again - if we're going to always scan at most nr_to_scan cache entries,
> there's probably no need to reduce s_es_lock latency by playing with
> spinlock_contended(), right?
I'm more generally worried about contention on s_es_lock, since it's a
filesystem-wide spinlock that is grabbed whenever we need to add or remove
an inode from the es_list. So if someone were to run the AIM7
benchmark on a large core-count machine with an ext4 file system mounted
on a ramdisk, this lock would likely show up.
Now, this might not be a realistic scenario, but it's a common way to
test fs scalability without a super-expensive RAID array, and you can
see it used in FAST papers over the last couple of years, for example...
So my thinking was that if we do run into contention, the shrinker
thread should always yield, since if it gets slowed down slightly,
there's no harm done. Hmmm.... OTOH, the extra cache line bounce
could potentially be worse, so maybe it would be better to let the
shrinker thread do its thing and then get out of there.
- Ted
On Thu, Sep 04, 2014 at 08:10:48PM +0800, Zheng Liu wrote:
> > When reading through this again, I realized that we probably don't want
> > this file in /proc/fs/ext4 but rather in /sys/kernel/debug/ext4 because
> > it's really a debugging interface and we don't want any tools to start
> > using it.
>
> Fair enough. I will fix it in next version.
I was actually OK with leaving it in /proc, since we have a number of
other debugging files in /proc/fs/ext4/<dev> already, and I'm not sure
spreading ext4's footprint so that we have some files in /proc, some
in /sys, and some in debugfs is worth the small decrease in the chance
that some tool would start depending on the contents of that file --- I
can't really imagine why non-fs developers would ever care about such
statistics.
I probably *would* be in favor of creating a new config option which
disables /proc/fs/ext4/<dev>/mballoc_groups and
/proc/fs/ext4/<dev>/es_shrinker_info, to help the people who care about
shrinking the total footprint of the Linux kernel for embedded
applications (kernel tinification), though.
- Ted
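If such a config option ever materialized, the guard could look roughly like the sketch below. CONFIG_EXT4_DEBUG_FILES and ext4_es_shrinker_info_fops are hypothetical names used only for illustration; proc_create_data() is the usual procfs registration helper.

#ifdef CONFIG_EXT4_DEBUG_FILES	/* hypothetical option, for illustration only */
static void ext4_register_es_debug_file(struct super_block *sb)
{
	/* s_proc is the per-device /proc/fs/ext4/<dev> directory. */
	proc_create_data("es_shrinker_info", S_IRUGO, EXT4_SB(sb)->s_proc,
			 &ext4_es_shrinker_info_fops, sb);
}
#else
static inline void ext4_register_es_debug_file(struct super_block *sb) { }
#endif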
On Thu, Sep 04, 2014 at 09:04:48PM +0800, Zheng Liu wrote:
> > I thought the reason why we have the EXT4_GET_BLOCKS_NO_PUT_HOLE flag
> > is because in ext4_da_map_blocks(), if there is a hole, we will be
> > immediately following it up with a call to ext4_es_insert_extent() to
> > fill in the hole with the EXTENT_STATUS_DELAYED flag. The only time
> > we don't is if we run into an ENOSPC error.
> >
> > Am I missing something?
>
> Yes, you are right. The flag is used to do the work you described
> above. As the commit log describes, this flag brings a huge number of
> cache misses when an empty file is written.
Right, and I was trying to figure out why that was happening, because
it looked like in the normal case we would immediately fill in the
hole afterwards. I was going to try doing some tracing to understand
why this was resulting in a huge number of cache misses, but I haven't
had time to do that investigation yet. Can you sketch out the
scenario where this was leading to so many cache misses?
Thanks,
- Ted
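To make the scenario easier to discuss, here is a heavily simplified sketch of the delalloc write path Ted describes. Locking, the ENOSPC retry path and the real ext4_da_map_blocks() details are omitted, and the helper name is made up; only the ordering of the hole lookup and the delayed-extent insertion matters here.

static int da_map_blocks_sketch(struct inode *inode, struct ext4_map_blocks *map)
{
	int ret;

	/* Look up the block without allocating; NO_PUT_HOLE keeps a hole
	 * found here out of the extent status cache. */
	ret = ext4_map_blocks(NULL, inode, map, EXT4_GET_BLOCKS_NO_PUT_HOLE);
	if (ret != 0)
		return ret;	/* already mapped, or an error */

	/* It is a hole: reserve delalloc space ... */
	ret = ext4_da_reserve_space(inode, map->m_lblk);
	if (ret)
		return ret;	/* e.g. ENOSPC: the hole never gets cached */

	/* ... and immediately cache the block as a delayed extent. */
	return ext4_es_insert_extent(inode, map->m_lblk, 1, ~0,
				     EXTENT_STATUS_DELAYED);
}

In the normal case the hole is replaced by a delayed entry right away, which is why the reported cache misses are surprising and worth tracing.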
On Thu 04-09-14 11:44:59, Ted Tso wrote:
> On Thu, Sep 04, 2014 at 09:15:53AM +0200, Jan Kara wrote:
> > Ah, sorry. I was mistaken and thought we do check for __GFP_FS in
> > ext4_es_scan() but we don't and we don't need to. But thinking about it
> > again - if we're going to always scan at most nr_to_scan cache entries,
> > there's probably no need to reduce s_es_lock latency by playing with
> > spinlock_contended(), right?
>
> I'm more generally worried about contention on s_es_lock, since it's a file
> system-wide spinlock that is grabbed whenever we need to add or remove
> an inode from the es_list. So if someone were to try to run AIM7
> benchmark on a large core count machine on an ext4 file system mounted
> on a ramdisk, this lock would likely show up.
>
> Now, this might not be a realistic scenario, but it's a common way to
> test for fs scalability without having a super-expensive RAID array,
> so you see it quite often if you look at FAST papers from the last couple
> of years, for example.
>
> So my thinking was that if we do run into contention, the shrinker
> thread should always yield, since if it gets slowed down slightly,
> there's no harm done. Hmmm.... OTOH, the extra cache line bounce
> could potentially be worse, so maybe it would be better to let the
> shrinker thread do its thing and then get out of there.
Yeah. I think cache line bouncing limits scalability in a similar way the
spinlock itself does, so there's no big win in shortening the lock hold times. If
someone is concerned about scalability of our extent cache LRU, we could
use some more fancy LRU implementation like the one implemented in
mm/list_lru.c and used for other fs objects. But I would see that as a
separate step and only once someone can show a benefit...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
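For reference, a minimal sketch of the direction Jan mentions, using the mm/list_lru.c API as of roughly the 3.18 era. The structure, callback and names are illustrative only, not a proposal for the actual code.

#include <linux/list_lru.h>

static struct list_lru es_inode_lru;		/* would replace the plain s_es_list */

static enum lru_status es_lru_isolate(struct list_head *item,
				      spinlock_t *lru_lock, void *cb_arg)
{
	/* 'item' would embed the per-inode list hook; reclaim a few extents
	 * from that inode, then rotate it to the tail of the LRU. */
	return LRU_ROTATE;
}

static unsigned long es_lru_scan_sketch(unsigned long nr_to_scan)
{
	/* list_lru keeps per-node lists and locks, so this walk scales
	 * better than a single s_es_lock-protected list. */
	return list_lru_walk(&es_inode_lru, es_lru_isolate, NULL, nr_to_scan);
}

static int es_lru_setup(void)
{
	return list_lru_init(&es_inode_lru);	/* inodes added with list_lru_add() */
}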
Hi Zheng,
It's been a while since you've posted a revised version of this patch
series. I believe Jan had suggested some changes which you said you
would fix in the next set of patches. Do you know when you might be
able to get that done?
Note that the first two patches in this series have already been
queued for Linus for this merge window.
Thanks,
- Ted
Hello,
On Mon 20-10-14 10:48:49, Ted Tso wrote:
> It's been a while since you've posted a revised version of this patch
> series. I believe Jan had suggested some changes which you said you
> would fix in the next set of patches. Do you know when you might be
> able to get that done?
Yeah, actually if you don't have time just say so. I can push it to
completion somehow (there wasn't that much work left). I need the thing
completed reasonably quickly since I already have some user reports for
openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
I cannot leave those unresolved for long...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Ted, Jan and others,
I am deeply sorry for my delayed work. I don't have any objection to
Jan's suggestions. There is still some other work keeping me busy, and I
can see that I won't have time to finish this for this merge
window. It's a shame for me!
Jan, I would really appreciate it if you are willing to push this patch set
to completion. Thanks!!!
Regards,
- Zheng
> On Oct 21, 2014, at 6:22 PM, Jan Kara <[email protected]> wrote:
>
> Hello,
>
> On Mon 20-10-14 10:48:49, Ted Tso wrote:
>> It's been a while since you've posted a revised version of this patch
>> series. I believe Jan had suggested some changes which you said you
>> would fix in the next set of patches. Do you know when you might be
>> able to get that done?
> Yeah, actually if you don't have time just say so. I can push it to
> completion somehow (there wasn't that much work left). I need the thing
> completed reasonably quickly since I already have some user reports for
> openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
> I cannot leave those unresolved for long...
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
Hello,
On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> I am deeply sorry for my delayed work. I don't have any objection to
> Jan's suggestions. There is still some other work keeping me busy, and I
> can see that I won't have time to finish this for this merge
> window. It's a shame for me!
>
> Jan, I would really appreciate it if you are willing to push this patch set
> to completion. Thanks!!!
OK, I have updated the patches according to the review Ted and I did. They
survive a basic fsstress run. How were you testing your patches? I should
probably also gather some statistics etc...
Honza
> > On Oct 21, 2014, at 6:22 PM, Jan Kara <[email protected]> wrote:
> >
> > Hello,
> >
> > On Mon 20-10-14 10:48:49, Ted Tso wrote:
> >> It's been a while since you've posted a revised version of this patch
> >> series. I believe Jan had suggested some changes which you said you
> >> would fix in the next set of patches. Do you know when you might be
> >> able to get that done?
> > Yeah, actually if you don't have time just say so. I can push it to
> > completion somehow (there wasn't that much work left). I need the thing
> > completed reasonably quickly since I already have some user reports for
> > openSUSE and I'm somewhat afraid reports for SLES will come pretty soon and
> > I cannot leave those unresolved for long...
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Nov 03, 2014 at 05:10:46PM +0100, Jan Kara wrote:
> Hello,
>
> On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> > I am deeply sorry for my delayed work. I don't have any objection to
> > Jan's suggestions. There is still some other work keeping me busy, and I
> > can see that I won't have time to finish this for this merge
> > window. It's a shame for me!
> >
> > Jan, I would really appreciate it if you are willing to push this patch set
> > to completion. Thanks!!!
> OK, I have updated the patches according to the review Ted and I did. They
> survive a basic fsstress run. How were you testing your patches? I should
> probably also gather some statistics etc...
Thanks!!
Here are my test cases for performance.
case 1:
[global]
ioengine=psync
bs=4k
directory=/mnt/sda1
thread
group_reporting
fallocate=0
direct=0
filesize=10g
size=20g
runtime=300
[io]
rw=randwrite:32
rw_sequencer=sequential
numjobs=25
nrfiles=10
case 2:
[global]
ioengine=psync
bs=4k
directory=/mnt/sda1
group_reporting
fallocate=0
direct=0
filesize=10g
size=20g
runtime=300
[io]
rw=write:4k
numjobs=15
nrfiles=20000
To get a really fragmented extent status tree, I disable extent
status tree merging while running these test cases. The patch looks
like this:
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 09fd576..0946f50 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -351,6 +351,7 @@ static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
 static int ext4_es_can_be_merged(struct extent_status *es1,
                                  struct extent_status *es2)
 {
+#if 0
        if (ext4_es_status(es1) != ext4_es_status(es2))
                return 0;
@@ -376,6 +377,7 @@ static int ext4_es_can_be_merged(struct extent_status *es1,
        /* we need to check delayed extent is without unwritten status */
        if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
                return 1;
+#endif
        return 0;
 }
Meanwhile, the following sysctl parameters are adjusted to keep as
much dirty data in memory as possible.
sudo sysctl vm.dirty_ratio=80
sudo sysctl vm.dirty_background_ratio=60
Thanks,
- Zheng
On Mon, Nov 03, 2014 at 05:10:46PM +0100, Jan Kara wrote:
> Hello,
>
> On Tue 21-10-14 23:58:10, 刘峥(文卿) wrote:
> > I am deeply sorry for my delayed work. I don't have any objection to
> > Jan's suggestions. There is still some other work keeping me busy, and I
> > can see that I won't have time to finish this for this merge
> > window. It's a shame for me!
> >
> > Jan, I would really appreciate it if you are willing to push this patch set
> > to completion. Thanks!!!
> OK, I have updated the patches according to the review Ted and I did. They
> survive a basic fsstress run. How were you testing your patches? I should
> probably also gather some statistics etc...
Hi Jan,
I was wondering how your updated extent status tree shrinker patches
are coming along. Are you comfortable sending them out for review/merging?
Sorry for nagging, but we're already at 3.18-rc4, and it would be great
if we could get these merged for this merge window.
Thanks!
- Ted