2020-08-22 11:35:50

by Ritesh Harjani

[permalink] [raw]
Subject: [PATCHv2 0/3] Optimize ext4 DAX overwrites

Hello,

RFC -> v2
1. Addressed comments from Jan.
2. Added xfstests results in cover letter.

In case of dax writes, currently we start a journal txn irrespective of whether
it's an overwrite or not. In case of an overwrite we don't need to start a
jbd2 txn since the blocks are already allocated.
So this patch optimizes away the txn start in case of DAX overwrites.
This could significantly boost performance for multi-threaded writes
specially random writes (overwrite).
Fio script used to collect perf numbers is mentioned below.

Below numbers were calculated on a QEMU setup on ppc64 box with simulated
pmem device.

Didn't observe any new failures with this patch in xfstests "-g quick,dax"

Performance numbers with different threads - (~10x improvement)
==========================================

vanilla_kernel(kIOPS) (randomwrite)
60 +-+------+-------+--------+--------+--------+-------+------+-+
| + + + +** + + |
55 +-+ ** +-+
| ** ** |
| ** ** |
50 +-+ ** ** +-+
| ** ** |
45 +-+ ** ** +-+
| ** ** |
| ** ** |
40 +-+ ** ** +-+
| ** ** |
35 +-+ ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** ** |
30 +-+ ** ** ** ** ** ** +-+
| ** +** +** +** ** +** |
25 +-+------**------+**------+**------+**------**------+**----+-+
1 2 4 8 12 16
Threads
patched_kernel(kIOPS) (randomwrite)
600 +-+-----+--------+--------+-------+--------+-------+------+-+
| + + + + + +** |
| ** |
500 +-+ ** +-+
| ** |
| ** ** |
400 +-+ ** ** +-+
| ** ** |
300 +-+ ** ** ** +-+
| ** ** ** |
| ** ** ** |
200 +-+ ** ** ** +-+
| ** ** ** ** |
| ** ** ** ** |
100 +-+ ** ** ** ** ** +-+
| ** ** ** ** ** |
| +** +** ** +** +** +** |
0 +-+-----+**------+**------**------+**------+**-----+**----+-+
1 2 4 8 12 16
Threads
fio script
==========
[global]
rw=randwrite
norandommap=1
invalidate=0
bs=4k
numjobs=16 --> changed this for different thread options
time_based=1
ramp_time=30
runtime=60
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32
directory=/mnt/

[name]
directory=/mnt/
direct=1

NOTE:
======
1. Looking at ~10x perf delta, I probed a bit deeper to understand what's causing
this scalability problem. It seems when we are starting a jbd2 txn then slab
alloc code is observing some serious contention around spinlock.

Even though the spinlock contention could be related to some other
issue (looking into it internally). But I could still see the perf improvement
of close to ~2x on QEMU setup on x86 with simulated pmem device with the
patched_kernel v/s vanilla_kernel with same fio workload.

perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
=======================================================================

47.86% fio [kernel.vmlinux] [k] do_raw_spin_lock
|
---do_raw_spin_lock
|
|--19.43%--_raw_spin_lock
| |
| --19.31%--0
| |
| |--9.77%--deactivate_slab.isra.61
| | ___slab_alloc
| | __slab_alloc
| | kmem_cache_alloc
| | jbd2__journal_start
| | __ext4_journal_start_sb
<...>

2. This problem was reported by Dan Williams at [1]

Links
======
[1]: https://lore.kernel.org/linux-ext4/[email protected]/T/

Ritesh Harjani (3):
ext4: Refactor ext4_overwrite_io() to take ext4_map_blocks as argument
ext4: Extend ext4_overwrite_io() for dax path
ext4: Optimize ext4 DAX overwrites

fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 28 ++++++++++++++++++----------
fs/ext4/inode.c | 11 +++++++++--
3 files changed, 29 insertions(+), 12 deletions(-)

--
2.25.4


2020-08-22 11:36:04

by Ritesh Harjani

[permalink] [raw]
Subject: [PATCHv2 2/3] ext4: Extend ext4_overwrite_io() for dax path

DAX uses ->iomap_begin path which gets called to map/allocate
extents. In order to avoid starting journal txn where extent
allocation is not required, we need to check if ext4_map_blocks()
has returned any mapped extent entry.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 42f5060f3cdf..8a1b468bfb49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3232,6 +3232,8 @@ extern const struct dentry_operations ext4_dentry_ops;
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map,
+ bool is_dax);

/* inline.c */
extern int ext4_get_max_inline_size(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 84f73ed91af2..6c252498334b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,7 +188,8 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
}

/* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map)
+bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map,
+ bool is_dax)
{
unsigned int blkbits = inode->i_blkbits;
loff_t end = (map->m_lblk + map->m_len) << blkbits;
@@ -198,12 +199,19 @@ static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map)
return false;

err = ext4_map_blocks(NULL, inode, map, 0);
+
/*
+ * In case of dax to avoid starting a transaction in ext4_iomap_begin()
+ * we check if ext4_map_blocks() can return any mapped extent.
+ *
* 'err==len' means that all of the blocks have been preallocated,
* regardless of whether they have been initialized or not. To exclude
* unwritten extents, we need to check m_flags.
*/
- return err == blklen && (map->m_flags & EXT4_MAP_MAPPED);
+ if (is_dax)
+ return err > 0 && (map->m_flags & EXT4_MAP_MAPPED);
+ else
+ return err == blklen && (map->m_flags & EXT4_MAP_MAPPED);
}

static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
@@ -427,7 +435,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
* in file_modified().
*/
if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
- !ext4_overwrite_io(inode, &map))) {
+ !ext4_overwrite_io(inode, &map, false))) {
inode_unlock_shared(inode);
*ilock_shared = false;
inode_lock(inode);
--
2.25.4

2020-08-22 11:38:46

by Ritesh Harjani

[permalink] [raw]
Subject: [PATCHv2 3/3] ext4: Optimize ext4 DAX overwrites

Currently in case of DAX, we are starting a journal txn everytime for
IOMAP_WRITE case. This can be optimized away in case of an overwrite
(where the blocks were already allocated).

This could give a significant performance boost for multi-threaded writes
specially random writes.
On PPC64 VM with simulated pmem device, ~10x perf improvement could be
seen in random writes (overwrite). Also bcoz this optimizes away the
spinlock contention during jbd2 slab cache allocation (jbd2_journal_handle)
On x86 VM, ~2x perf improvement was observed.

Reported-by: Dan Williams <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..c18009c91e68 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3437,6 +3437,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;

+ /*
+ * In case of DAX write, we check if this is overwrite request, to avoid
+ * starting a journal txn in ext4_iomap_alloc()
+ */
+ if ((flags & IOMAP_WRITE) && IS_DAX(inode) &&
+ ext4_overwrite_io(inode, &map, true))
+ goto out_set;
+
if (flags & IOMAP_WRITE)
ret = ext4_iomap_alloc(inode, &map, flags);
else
@@ -3444,9 +3452,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,

if (ret < 0)
return ret;
-
+out_set:
ext4_set_iomap(inode, iomap, &map, offset, length);
-
return 0;
}

--
2.25.4