2016-04-29 16:28:16

by Waiman Long

Subject: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

v4->v5:
- Change patch 1 to disable the i_dio_count update in dax_do_io().

v3->v4:
- For patch 1, add the DIO_SKIP_DIO_COUNT flag to dax_do_io() calls
only to address issue raised by Dave Chinner.

v2->v3:
- Remove the percpu_stats helper functions and use percpu_counters
instead.

v1->v2:
- Remove percpu_stats_reset() which is not really needed in this
patchset.
- Move some percpu_stats* functions to the newly created
lib/percpu_stats.c.
- Add a new patch to support 64-bit statistics counts in 32-bit
architectures.
- Rearrange the patches by moving the percpu_stats patches to the
front followed by the ext4 patches.

This patchset aims to improve parallel I/O performance of the ext4
filesystem on DAX.

Patch 1 disables updates of i_dio_count, as all DAX I/O is synchronous
and should be protected by whatever locking is done by the filesystem
caller or within dax_do_io() for reads (DIO_LOCKING).

Patch 2 converts some ext4 statistics counts into percpu counts using
the helper functions.

Waiman Long (2):
dax: Don't touch i_dio_count in dax_do_io()
ext4: Make cache hits/misses per-cpu counts

fs/dax.c | 14 ++++++--------
fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
3 files changed, 37 insertions(+), 19 deletions(-)


2016-04-29 16:29:56

by Waiman Long

Subject: [PATCH v5 2/2] ext4: Make cache hits/misses per-cpu counts

This patch changes the es_stats_cache_hits and es_stats_cache_misses
statistics counts to percpu counters to reduce cacheline contention
when multiple threads try to update those counts simultaneously.

With a 38-thread fio I/O test on 2 shared files (on DAX-mounted
NVDIMM) running on a 4-socket Haswell-EX server with a 4.6-rc1 kernel,
the aggregated bandwidths before and after the patch were:

Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 16499MB/s 17215MB/s +4.3%
Read-write 4361MB/s 4794MB/s +9.9%

Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e38b987..92ca56d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -770,6 +770,15 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
}

/*
+ * For a pure statistics count, use a large batch size to make sure that
+ * the update takes the percpu fast path as much as possible.
+ */
+static inline void ext4_es_stats_inc(struct percpu_counter *fbc)
+{
+ __percpu_counter_add(fbc, 1, (1 << 30));
+}
+
+/*
* ext4_es_lookup_extent() looks up an extent in extent status tree.
*
* ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
@@ -825,9 +834,9 @@ out:
es->es_pblk = es1->es_pblk;
if (!ext4_es_is_referenced(es1))
ext4_es_set_referenced(es1);
- stats->es_stats_cache_hits++;
+ ext4_es_stats_inc(&stats->es_stats_cache_hits);
} else {
- stats->es_stats_cache_misses++;
+ ext4_es_stats_inc(&stats->es_stats_cache_misses);
}

read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -1113,9 +1122,9 @@ int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
- seq_printf(seq, " %lu/%lu cache hits/misses\n",
- es_stats->es_stats_cache_hits,
- es_stats->es_stats_cache_misses);
+ seq_printf(seq, " %lld/%lld cache hits/misses\n",
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_hits),
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_misses));
if (inode_cnt)
seq_printf(seq, " %d inodes on list\n", inode_cnt);

@@ -1142,8 +1151,6 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
sbi->s_es_nr_inode = 0;
spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
- sbi->s_es_stats.es_stats_cache_hits = 0;
- sbi->s_es_stats.es_stats_cache_misses = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
@@ -1153,15 +1160,26 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err1;

+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_hits, 0, GFP_KERNEL);
+ if (err)
+ goto err2;
+
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_misses, 0, GFP_KERNEL);
+ if (err)
+ goto err3;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
err = register_shrinker(&sbi->s_es_shrinker);
if (err)
- goto err2;
+ goto err4;

return 0;
-
+err4:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
+err3:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
err2:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
@@ -1173,6 +1191,8 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
unregister_shrinker(&sbi->s_es_shrinker);
}

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f7aa24f..d537868 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -69,10 +69,10 @@ struct ext4_es_tree {

struct ext4_es_stats {
unsigned long es_stats_shrunk;
- unsigned long es_stats_cache_hits;
- unsigned long es_stats_cache_misses;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
+ struct percpu_counter es_stats_cache_hits;
+ struct percpu_counter es_stats_cache_misses;
struct percpu_counter es_stats_all_cnt;
struct percpu_counter es_stats_shk_cnt;
};
--
1.7.1

2016-04-29 16:30:29

by Waiman Long

Subject: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

The purpose of i_dio_count is to protect against truncation while an
I/O operation is in progress. As dax_do_io() only does synchronous
I/O, the locking performed by the caller or within dax_do_io() for
reads should be enough to protect against truncation. There is no
need to touch i_dio_count.

Eliminating these two atomic operations can give a noticeable
improvement in I/O performance, as NVDIMM is much faster than
conventional disk devices.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
fs/dax.c | 14 ++++++--------
1 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 90322eb..1b4b500 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -253,8 +253,12 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
* If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
* caller for writes. For reads, we take and release the i_mutex ourselves.
* If DIO_LOCKING is not set, the filesystem takes care of its own locking.
- * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
- * is in progress.
+ *
+ * The do_blockdev_direct_IO() function increments i_dio_count while the I/O
+ * is in progress. However, dax_do_io() always does synchronous I/O. The
+ * locking done by the caller or within dax_do_io() for reads (DIO_LOCKING)
+ * should be enough to protect against concurrent truncation. We don't really
+ * need to touch i_dio_count here.
*/
ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
struct iov_iter *iter, loff_t pos, get_block_t get_block,
@@ -277,10 +281,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
}
}

- /* Protects against truncate */
- if (!(flags & DIO_SKIP_DIO_COUNT))
- inode_dio_begin(inode);
-
retval = dax_io(inode, iter, pos, end, get_block, &bh);

if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
@@ -294,8 +294,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
retval = err;
}

- if (!(flags & DIO_SKIP_DIO_COUNT))
- inode_dio_end(inode);
out:
return retval;
}
--
1.7.1

2016-04-29 16:38:38

by Waiman Long

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On 04/29/2016 12:27 PM, Waiman Long wrote:
> v4->v5:
> - Change patch 1 to disable the i_dio_count update in dax_do_io().
>
> v3->v4:
> - For patch 1, add the DIO_SKIP_DIO_COUNT flag to dax_do_io() calls
> only to address issue raised by Dave Chinner.
>
> v2->v3:
> - Remove the percpu_stats helper functions and use percpu_counters
> instead.
>
> v1->v2:
> - Remove percpu_stats_reset() which is not really needed in this
> patchset.
> - Move some percpu_stats* functions to the newly created
> lib/percpu_stats.c.
> - Add a new patch to support 64-bit statistics counts in 32-bit
> architectures.
> - Rearrange the patches by moving the percpu_stats patches to the
> front followed by the ext4 patches.
>
> This patchset aims to improve parallel I/O performance of the ext4
> filesystem on DAX.
>
> Patch 1 disables updates of i_dio_count, as all DAX I/O is synchronous
> and should be protected by whatever locking is done by the filesystem
> caller or within dax_do_io() for reads (DIO_LOCKING).
>
> Patch 2 converts some ext4 statistics counts into percpu counts using
> the helper functions.
>
> Waiman Long (2):
> dax: Don't touch i_dio_count in dax_do_io()
> ext4: Make cache hits/misses per-cpu counts
>
> fs/dax.c | 14 ++++++--------
> fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
> fs/ext4/extents_status.h | 4 ++--
> 3 files changed, 37 insertions(+), 19 deletions(-)
>

From my testing, it looks like parallel overwrites to the same file
in an ext4 filesystem on DAX can proceed concurrently even if their
ranges overlap, mainly because the code drops the i_mutex before the
write. That means the overlapped blocks can end up containing garbage.
I think this is a problem, but I am not enough of an expert in the
ext4 filesystem to say for sure. I would like to know your thoughts
on that.

Thanks,
Longman