2016-04-29 16:27:54

by Waiman Long

Subject: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

v4->v5:
- Change patch 1 to disable the i_dio_count update in dax_do_io().

v3->v4:
- For patch 1, add the DIO_SKIP_DIO_COUNT flag to dax_do_io() calls
only to address an issue raised by Dave Chinner.

v2->v3:
- Remove the percpu_stats helper functions and use percpu_counters
instead.

v1->v2:
- Remove percpu_stats_reset() which is not really needed in this
patchset.
- Move some percpu_stats* functions to the newly created
lib/percpu_stats.c.
- Add a new patch to support 64-bit statistics counts in 32-bit
architectures.
- Rearrange the patches by moving the percpu_stats patches to the
front followed by the ext4 patches.

This patchset aims to improve parallel I/O performance of the ext4
filesystem on DAX.

Patch 1 disables updates of i_dio_count, as all DAX I/O is synchronous
and should be protected by whatever locking is done by the filesystem
caller or within dax_do_io() for reads (DIO_LOCKING).

Patch 2 converts some ext4 statistics counts into percpu counters using
the percpu_counter helper functions.

Waiman Long (2):
dax: Don't touch i_dio_count in dax_do_io()
ext4: Make cache hits/misses per-cpu counts

fs/dax.c | 14 ++++++--------
fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
3 files changed, 37 insertions(+), 19 deletions(-)


2016-04-29 16:28:17

by Waiman Long

Subject: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

The purpose of i_dio_count is to protect against truncation while an
I/O operation is in progress. As dax_do_io() only does synchronous
I/O, the locking performed by the caller, or within dax_do_io() for
reads, should be enough to protect against truncation. There is no
need to touch i_dio_count.

Eliminating two atomic operations can sometimes give a noticeable
improvement in I/O performance as NVDIMM is much faster than other
disk devices.
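
For reference, the two atomic operations are those in the
inode_dio_begin()/inode_dio_end() helpers, roughly as follows (a
simplified sketch of the 4.6-era helpers in include/linux/fs.h, not a
verbatim copy):

	static inline void inode_dio_begin(struct inode *inode)
	{
		atomic_inc(&inode->i_dio_count);	/* 1st atomic op */
	}

	static inline void inode_dio_end(struct inode *inode)
	{
		/* 2nd atomic op, plus a wakeup for inode_dio_wait() */
		if (atomic_dec_and_test(&inode->i_dio_count))
			wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
	}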

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
fs/dax.c | 14 ++++++--------
1 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 90322eb..1b4b500 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -253,8 +253,12 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
* If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
* caller for writes. For reads, we take and release the i_mutex ourselves.
* If DIO_LOCKING is not set, the filesystem takes care of its own locking.
- * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
- * is in progress.
+ *
+ * The do_blockdev_direct_IO() function increments i_dio_count while the I/O
+ * is in progress. However, dax_do_io() always does synchronous I/O. The
+ * locking done by the caller or within dax_do_io() for reads (DIO_LOCKING)
+ * should be enough to protect against concurrent truncation. We don't really
+ * need to touch i_dio_count here.
*/
ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
struct iov_iter *iter, loff_t pos, get_block_t get_block,
@@ -277,10 +281,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
}
}

- /* Protects against truncate */
- if (!(flags & DIO_SKIP_DIO_COUNT))
- inode_dio_begin(inode);

2016-04-29 16:27:56

by Waiman Long

Subject: [PATCH v5 2/2] ext4: Make cache hits/misses per-cpu counts

This patch changes the es_stats_cache_hits and es_stats_cache_misses
statistics counts to percpu counters to reduce cacheline contention
issues when multiple threads are trying to update those counts
simultaneously.
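
A percpu_counter keeps a per-cpu delta that is folded into the shared
count only when it exceeds a batch threshold, so the hot path does not
touch a shared cacheline. A minimal usage sketch against the 4.6-era
API (simplified, error handling elided):

	struct percpu_counter hits;

	percpu_counter_init(&hits, 0, GFP_KERNEL);

	/*
	 * Hot path: normally a cpu-local add; the shared count is
	 * only updated when the local delta exceeds the batch.
	 */
	__percpu_counter_add(&hits, 1, 1 << 30);

	/* slow path: sum the shared count plus all per-cpu deltas */
	pr_info("%lld hits\n", percpu_counter_sum_positive(&hits));

	percpu_counter_destroy(&hits);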

With a 38-thread fio I/O test on 2 shared files (on a DAX-mounted
NVDIMM) running on a 4-socket Haswell-EX server with a 4.6-rc1 kernel,
the aggregate bandwidths before and after the patch were:

Test          W/O patch     With patch     % change
----          ---------     ----------     --------
Read-only     16499MB/s     17215MB/s      +4.3%
Read-write     4361MB/s      4794MB/s      +9.9%

Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e38b987..92ca56d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -770,6 +770,15 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
}

/*
+ * For a pure statistics count, use a large batch size to make sure that
+ * updates stay within the per-cpu counters as much as possible.
+ */
+static inline void ext4_es_stats_inc(struct percpu_counter *fbc)
+{
+ __percpu_counter_add(fbc, 1, (1 << 30));
+}
+
+/*
* ext4_es_lookup_extent() looks up an extent in extent status tree.
*
* ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
@@ -825,9 +834,9 @@ out:
es->es_pblk = es1->es_pblk;
if (!ext4_es_is_referenced(es1))
ext4_es_set_referenced(es1);
- stats->es_stats_cache_hits++;
+ ext4_es_stats_inc(&stats->es_stats_cache_hits);
} else {
- stats->es_stats_cache_misses++;
+ ext4_es_stats_inc(&stats->es_stats_cache_misses);
}

read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -1113,9 +1122,9 @@ int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
- seq_printf(seq, " %lu/%lu cache hits/misses\n",
- es_stats->es_stats_cache_hits,
- es_stats->es_stats_cache_misses);
+ seq_printf(seq, " %lld/%lld cache hits/misses\n",
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_hits),
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_misses));
if (inode_cnt)
seq_printf(seq, " %d inodes on list\n", inode_cnt);

@@ -1142,8 +1151,6 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
sbi->s_es_nr_inode = 0;
spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
- sbi->s_es_stats.es_stats_cache_hits = 0;
- sbi->s_es_stats.es_stats_cache_misses = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
@@ -1153,15 +1160,26 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err1;

+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_hits, 0, GFP_KERNEL);
+ if (err)
+ goto err2;
+
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_misses, 0, GFP_KERNEL);
+ if (err)
+ goto err3;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
err = register_shrinker(&sbi->s_es_shrinker);
if (err)
- goto err2;
+ goto err4;

return 0;
-
+err4:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
+err3:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
err2:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
@@ -1173,6 +1191,8 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
unregister_shrinker(&sbi->s_es_shrinker);
}

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f7aa24f..d537868 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -69,10 +69,10 @@ struct ext4_es_tree {

struct ext4_es_stats {
unsigned long es_stats_shrunk;
- unsigned long es_stats_cache_hits;
- unsigned long es_stats_cache_misses;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
+ struct percpu_counter es_stats_cache_hits;
+ struct percpu_counter es_stats_cache_misses;
struct percpu_counter es_stats_all_cnt;
struct percpu_counter es_stats_shk_cnt;
};
--
1.7.1

2016-04-29 16:38:20

by Waiman Long

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On 04/29/2016 12:27 PM, Waiman Long wrote:
> v4->v5:
> - Change patch 1 to disable the i_dio_count update in dax_do_io().
>
> v3->v4:
> - For patch 1, add the DIO_SKIP_DIO_COUNT flag to dax_do_io() calls
> only to address an issue raised by Dave Chinner.
>
> v2->v3:
> - Remove the percpu_stats helper functions and use percpu_counters
> instead.
>
> v1->v2:
> - Remove percpu_stats_reset() which is not really needed in this
> patchset.
> - Move some percpu_stats* functions to the newly created
> lib/percpu_stats.c.
> - Add a new patch to support 64-bit statistics counts in 32-bit
> architectures.
> - Rearrange the patches by moving the percpu_stats patches to the
> front followed by the ext4 patches.
>
> This patchset aims to improve parallel I/O performance of the ext4
> filesystem on DAX.
>
> Patch 1 disables updates of i_dio_count, as all DAX I/O is synchronous
> and should be protected by whatever locking is done by the filesystem
> caller or within dax_do_io() for reads (DIO_LOCKING).
>
> Patch 2 converts some ext4 statistics counts into percpu counters using
> the percpu_counter helper functions.
>
> Waiman Long (2):
> dax: Don't touch i_dio_count in dax_do_io()
> ext4: Make cache hits/misses per-cpu counts
>
> fs/dax.c | 14 ++++++--------
> fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
> fs/ext4/extents_status.h | 4 ++--
> 3 files changed, 37 insertions(+), 19 deletions(-)
>

From my testing, it looks like parallel overwrites to the same file
in an ext4 filesystem on DAX can proceed concurrently even if their
ranges overlap. That is mainly because the code drops the i_mutex
before the write, which means the overlapped blocks can get garbage.
I think this is a problem, but I am not enough of an expert in the
ext4 filesystem to say for sure. I would like to know your thoughts
on that.

Thanks,
Longman

2016-05-01 17:28:59

by Christoph Hellwig

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On Fri, Apr 29, 2016 at 12:38:20PM -0400, Waiman Long wrote:
> From my testing, it looks like parallel overwrites to the same file in
> an ext4 filesystem on DAX can proceed concurrently even if their ranges
> overlap. That is mainly because the code drops the i_mutex before the
> write, which means the overlapped blocks can get garbage. I think this is a
> problem, but I am not enough of an expert in the ext4 filesystem to say for
> sure. I would like to know your thoughts on that.

That's another issue with dax I/O pretending to be direct I/O. Because
it isn't, we'll need to synchronize it like buffered I/O and not like
direct I/O in all file systems.

2016-05-02 17:45:20

by Waiman Long

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On 05/01/2016 01:28 PM, Christoph Hellwig wrote:
> On Fri, Apr 29, 2016 at 12:38:20PM -0400, Waiman Long wrote:
>> From my testing, it looks like parallel overwrites to the same file in
>> an ext4 filesystem on DAX can proceed concurrently even if their ranges
>> overlap. That is mainly because the code drops the i_mutex before the
>> write, which means the overlapped blocks can get garbage. I think this is a
>> problem, but I am not enough of an expert in the ext4 filesystem to say for
>> sure. I would like to know your thoughts on that.
> That's another issue with dax I/O pretending to be direct I/O. Because
> it isn't, we'll need to synchronize it like buffered I/O and not like
> direct I/O in all file systems.

From what I saw in the code, filemap_write_and_wait_range() should
prevent concurrent overwrites from stepping on each other for non-DAX
I/O. However, it is essentially a no-op for DAX I/O, so that
protection is gone.

I am planning to send out a patch to disable mutex dropping for DAX
overwrites. There is still an issue on the read side: if the journal
is disabled and the dioread_nolock mount option is used, reads are
done without locking. Again, the filemap_write_and_wait_range() check
on the read side will not protect against writes.
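
The mutex dropping I am referring to is the DIO/DAX overwrite fast
path, which looks roughly like this (a simplified sketch of the
4.6-era ext4_ext_direct_IO(), not the exact code):

	if (overwrite) {
		/*
		 * Blocks are already allocated: take only the lighter
		 * i_data_sem and drop i_mutex, so overlapping
		 * overwrites can now run concurrently.
		 */
		down_read(&EXT4_I(inode)->i_data_sem);
		mutex_unlock(&inode->i_mutex);
	}
	if (IS_DAX(inode))
		ret = dax_do_io(iocb, inode, iter, offset,
				ext4_get_block, NULL, dio_flags);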

Cheers,
Longman


2016-05-05 01:58:12

by Dave Chinner

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On Sun, May 01, 2016 at 10:28:54AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 29, 2016 at 12:38:20PM -0400, Waiman Long wrote:
> > From my testing, it looks like parallel overwrites to the same file in
> > an ext4 filesystem on DAX can proceed concurrently even if their ranges
> > overlap. That is mainly because the code drops the i_mutex before the
> > write, which means the overlapped blocks can get garbage. I think this is a
> > problem, but I am not enough of an expert in the ext4 filesystem to say for
> > sure. I would like to know your thoughts on that.
>
> That's another issue with dax I/O pretending to be direct I/O. Because
> it isn't, we'll need to synchronize it like buffered I/O and not like
> direct I/O in all file systems.

We did this intentionally. DAX IO needs to have the same parallel
write semantics as direct IO, because otherwise a single writer
prevents any IO concurrency, and that's a bigger problem for DAX than
for traditional storage due to the access speed and bandwidth
available.

This was always intended to be fixed by the introduction of proper
range locking for IO, not by reverting to total exclusion for write
IOs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-05-05 14:03:29

by Jan Kara

Subject: Re: [PATCH v5 2/2] ext4: Make cache hits/misses per-cpu counts

On Fri 29-04-16 12:27:56, Waiman Long wrote:
> This patch changes the es_stats_cache_hits and es_stats_cache_misses
> statistics counts to percpu counters to reduce cacheline contention
> issues when multiple threads are trying to update those counts
> simultaneously.
>
> With a 38-thread fio I/O test on 2 shared files (on a DAX-mounted
> NVDIMM) running on a 4-socket Haswell-EX server with a 4.6-rc1 kernel,
> the aggregate bandwidths before and after the patch were:
>
> Test          W/O patch     With patch     % change
> ----          ---------     ----------     --------
> Read-only     16499MB/s     17215MB/s      +4.3%
> Read-write     4361MB/s      4794MB/s      +9.9%
>
> Signed-off-by: Waiman Long <[email protected]>

The patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-05-05 14:16:37

by Jan Kara

Subject: Re: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

On Fri 29-04-16 12:27:55, Waiman Long wrote:
> The purpose of i_dio_count is to protect against truncation while an
> I/O operation is in progress. As dax_do_io() only does synchronous
> I/O, the locking performed by the caller, or within dax_do_io() for
> reads, should be enough to protect against truncation. There is no
> need to touch i_dio_count.
>
> Eliminating two atomic operations can sometimes give a noticeable
> improvement in I/O performance as NVDIMM is much faster than other
> disk devices.
>
> Suggested-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>

We cannot easily do this currently - the reason is that in several places we
wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
logic with this patch.
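
For example, the truncate path relies on this (a simplified sketch of
the ext4_setattr() pattern, not verbatim kernel code):

	mutex_lock(&inode->i_mutex);
	inode_dio_wait(inode);	/* wait for i_dio_count to drop to 0 */
	/* no DIO / DAX IO can be in flight now, safe to truncate */
	truncate_setsize(inode, attr->ia_size);
	mutex_unlock(&inode->i_mutex);

If dax_do_io() no longer bumps i_dio_count, inode_dio_wait() returns
immediately and any DAX IO submitted outside i_mutex is no longer
waited for.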

If we indeed put all writes under i_mutex, this problem would go away but
as Dave explains in his email, we consciously do as much IO as we can
without i_mutex to allow reasonable scalability of multiple writers into
the same file.

The downside of that is that overwrites and writes vs reads are not atomic
wrt each other as POSIX requires. It has been that way for direct IO in the
XFS case for a long time; with DAX this non-conforming behavior is
proliferating more. I agree that's not ideal, but serializing all writes on
a file is rather harsh for persistent memory as well...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-05-05 14:19:35

by Christoph Hellwig

Subject: Re: [PATCH v5 0/2] ext4: Improve parallel I/O performance on NVDIMM

On Thu, May 05, 2016 at 11:57:30AM +1000, Dave Chinner wrote:
> We did this intentionally. DAX IO needs to have the same parallel
> write semantics as direct IO, because otherwise a single writer
> prevents any IO concurrency, and that's a bigger problem for DAX than
> for traditional storage due to the access speed and bandwidth
> available.

But this is secondary: even buffered I/O would benefit from not
having the global lock. The difference is that POSIX requires
exclusive writers, and we can't just break those semantics because
we're sitting on whizz-bang fancy backend storage.


2016-05-05 14:27:48

by Christoph Hellwig

Subject: Re: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> We cannot easily do this currently - the reason is that in several places we
> wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
> logic with this patch.
>
> If we indeed put all writes under i_mutex, this problem would go away but
> as Dave explains in his email, we consciously do as much IO as we can
> without i_mutex to allow reasonable scalability of multiple writers into
> the same file.

So the above should be fine for xfs, but you're telling me that ext4
is doing DAX I/O without any inode lock at all? In that case it's
indeed not going to work.

> The downside of that is that overwrites and writes vs reads are not atomic
> wrt each other as POSIX requires. It has been that way for direct IO in the
> XFS case for a long time; with DAX this non-conforming behavior is
> proliferating more. I agree that's not ideal, but serializing all writes on
> a file is rather harsh for persistent memory as well...

For non-O_DIRECT I/O it's simply required.

2016-05-05 15:48:17

by Jan Kara

Subject: Re: [PATCH v5 1/2] dax: Don't touch i_dio_count in dax_do_io()

On Thu 05-05-16 07:27:48, Christoph Hellwig wrote:
> On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> > We cannot easily do this currently - the reason is that in several places we
> > wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> > holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
> > logic with this patch.
> >
> > If we indeed put all writes under i_mutex, this problem would go away but
> > as Dave explains in his email, we consciously do as much IO as we can
> > without i_mutex to allow reasonable scalability of multiple writers into
> > the same file.
>
> So the above should be fine for xfs, but you're telling me that ext4
> is doing DAX I/O without any inode lock at all? In that case it's
> indeed not going to work.

By default ext4 uses i_mutex to serialize both direct (and thus dax)
reads and writes. However, with the dioread_nolock mount option, we use
only i_data_sem (an ext4-local rwsem) for direct reads and overwrites.
That is enough to guarantee ext4 metadata consistency and gives you
better scalability, but you lose write vs read and write vs write
atomicity (essentially you get the same behavior as for XFS direct IO).

> > The downside of that is that overwrites and writes vs reads are not atomic
> > wrt each other as POSIX requires. It has been that way for direct IO in the
> > XFS case for a long time; with DAX this non-conforming behavior is
> > proliferating more. I agree that's not ideal, but serializing all writes on
> > a file is rather harsh for persistent memory as well...
>
> For non-O_DIRECT I/O it's simply required.

Well, we already break write vs read atomicity for buffered IO for
all filesystems except XFS, which has its own special locking, so
that's not a new thing. I agree that also breaking write vs write
atomicity for 'normal' IO is a new thing, and in a way more serious,
as the corrupted result ends up being stored on disk and some
applications may be broken by that. So we should fix that.

I was hoping that Davidlohr would come up with a more scalable
range-locking implementation than my original RB-tree based one that
we could use, but that seems to be taking longer than I originally
expected...
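
The rough idea is an interval lock, so that only IOs whose ranges
overlap serialize against each other. A purely conceptual sketch with
a hypothetical API (neither my old RB-tree version nor Davidlohr's
work):

	struct range_lock rl;

	range_lock_init(&rl, pos, pos + len - 1);
	/* blocks only if an overlapping range is already locked */
	range_lock(&inode->i_io_range_tree, &rl);
	/* ... do the read or write for [pos, pos + len) ... */
	range_unlock(&inode->i_io_range_tree, &rl);

Non-overlapping writers would then proceed in parallel while
overlapping ones still get POSIX-style exclusion.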

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR