2016-04-12 18:12:53

by Waiman Long

[permalink] [raw]
Subject: [PATCH v3 0/2] ext4: Improve parallel I/O performance on NVDIMM

v2->v3:
- Remove the percpu_stats helper functions and use percpu_counters
instead.

v1->v2:
- Remove percpu_stats_reset() which is not really needed in this
patchset.
- Move some percpu_stats* functions to the newly created
lib/percpu_stats.c.
- Add a new patch to support 64-bit statistics counts in 32-bit
architectures.
- Rearrange the patches by moving the percpu_stats patches to the
front followed by the ext4 patches.

This patchset aims to improve parallel I/O performance of the ext4
filesystem on fast storage devices like NVDIMM.

Patch 1 eliminates duplicated inode_dio_begin()/inode_dio_end() calls.

Patch 2 converts some ext4 statistics counts into percpu counts using
the helper functions.

Waiman Long (2):
ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called
ext4: Make cache hits/misses per-cpu counts

fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
fs/ext4/indirect.c | 10 ++++++++--
fs/ext4/inode.c | 12 +++++++++---
4 files changed, 48 insertions(+), 16 deletions(-)


2016-04-12 18:12:54

by Waiman Long

[permalink] [raw]
Subject: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

When performing direct I/O, the current ext4 code does
not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
inode_dio_begin()/inode_dio_end() internally. This doubling of
inode_dio_begin()/inode_dio_end() calls are wasteful.

This patch removes the extra internal inode_dio_begin()/inode_dio_end()
calls when those calls are being issued by the caller directly. For
really fast storage systems like NVDIMM, the removal of the extra
inode_dio_begin()/inode_dio_end() can give a meaningful boost to
I/O performance.

On a 4-socket Haswell-EX system (72 cores) running 4.6-rc1 kernel,
fio with 38 threads doing parallel I/O on two shared files on an
NVDIMM with DAX gave the following aggregrate bandwidth with and
without the patch:

Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 8688MB/s 10173MB/s +17.1%
Read-write 2687MB/s 2830MB/s +5.3%

Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/indirect.c | 10 ++++++++--
fs/ext4/inode.c | 12 +++++++++---
2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa6..4304be6 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -706,14 +706,20 @@ retry:
inode_dio_end(inode);
goto locked;
}
+ /*
+ * Need to pass in DIO_SKIP_DIO_COUNT to prevent
+ * duplicated inode_dio_begin/inode_dio_end sequence.
+ */
if (IS_DAX(inode))
ret = dax_do_io(iocb, inode, iter, offset,
- ext4_dio_get_block, NULL, 0);
+ ext4_dio_get_block, NULL,
+ DIO_SKIP_DIO_COUNT);
else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter,
offset, ext4_dio_get_block,
- NULL, NULL, 0);
+ NULL, NULL,
+ DIO_SKIP_DIO_COUNT);
inode_dio_end(inode);
} else {
locked:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2..779aa33 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3358,9 +3358,15 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
* Make all waiters for direct IO properly wait also for extent
* conversion. This also disallows race between truncate() and
* overwrite DIO as i_dio_count needs to be incremented under i_mutex.
+ *
+ * Both dax_do_io() and __blockdev_direct_IO() will unnecessarily
+ * call inode_dio_begin()/inode_dio_end() again if the
+ * DIO_SKIP_DIO_COUNT flag is not set.
*/
- if (iov_iter_rw(iter) == WRITE)
+ if (iov_iter_rw(iter) == WRITE) {
+ dio_flags = DIO_SKIP_DIO_COUNT;
inode_dio_begin(inode);
+ }

/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);
@@ -3393,10 +3399,10 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
get_block_func = ext4_dio_get_block_overwrite;
else if (is_sync_kiocb(iocb)) {
get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
} else {
get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
}
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
--
1.7.1

2016-04-12 18:13:20

by Waiman Long

[permalink] [raw]
Subject: [PATCH v3 2/2] ext4: Make cache hits/misses per-cpu counts

This patch changes the es_stats_cache_hits and es_stats_cache_misses
statistics counts to per-cpu variables to reduce cacheline contention
issues whem multiple threads are trying to update those counts
simultaneously. It uses the new per-cpu stats APIs provided by the
percpu_stats.h header file.

With a 38-threads fio I/O test with 2 shared files (on DAX-mount
NVDIMM) running on a 4-socket Haswell-EX server with 4.6-rc1 kernel,
the aggregated bandwidths before and after the patch were:

Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 10173MB/s 16141MB/s +58.7%
Read-write 2830MB/s 4315MB/s +52.5%

Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/extents_status.c | 38 +++++++++++++++++++++++++++++---------
fs/ext4/extents_status.h | 4 ++--
2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e38b987..92ca56d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -770,6 +770,15 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
}

/*
+ * For pure statistics count, use a large batch size to make sure that
+ * it does percpu update as much as possible.
+ */
+static inline void ext4_es_stats_inc(struct percpu_counter *fbc)
+{
+ __percpu_counter_add(fbc, 1, (1 << 30));
+}
+
+/*
* ext4_es_lookup_extent() looks up an extent in extent status tree.
*
* ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
@@ -825,9 +834,9 @@ out:
es->es_pblk = es1->es_pblk;
if (!ext4_es_is_referenced(es1))
ext4_es_set_referenced(es1);
- stats->es_stats_cache_hits++;
+ ext4_es_stats_inc(&stats->es_stats_cache_hits);
} else {
- stats->es_stats_cache_misses++;
+ ext4_es_stats_inc(&stats->es_stats_cache_misses);
}

read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -1113,9 +1122,9 @@ int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
- seq_printf(seq, " %lu/%lu cache hits/misses\n",
- es_stats->es_stats_cache_hits,
- es_stats->es_stats_cache_misses);
+ seq_printf(seq, " %lld/%lld cache hits/misses\n",
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_hits),
+ percpu_counter_sum_positive(&es_stats->es_stats_cache_misses));
if (inode_cnt)
seq_printf(seq, " %d inodes on list\n", inode_cnt);

@@ -1142,8 +1151,6 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
sbi->s_es_nr_inode = 0;
spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
- sbi->s_es_stats.es_stats_cache_hits = 0;
- sbi->s_es_stats.es_stats_cache_misses = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
@@ -1153,15 +1160,26 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err1;

+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_hits, 0, GFP_KERNEL);
+ if (err)
+ goto err2;
+
+ err = percpu_counter_init(&sbi->s_es_stats.es_stats_cache_misses, 0, GFP_KERNEL);
+ if (err)
+ goto err3;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
err = register_shrinker(&sbi->s_es_shrinker);
if (err)
- goto err2;
+ goto err4;

return 0;
-
+err4:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
+err3:
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
err2:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
@@ -1173,6 +1191,8 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_hits);
+ percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
unregister_shrinker(&sbi->s_es_shrinker);
}

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f7aa24f..d537868 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -69,10 +69,10 @@ struct ext4_es_tree {

struct ext4_es_stats {
unsigned long es_stats_shrunk;
- unsigned long es_stats_cache_hits;
- unsigned long es_stats_cache_misses;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
+ struct percpu_counter es_stats_cache_hits;
+ struct percpu_counter es_stats_cache_misses;
struct percpu_counter es_stats_all_cnt;
struct percpu_counter es_stats_shk_cnt;
};
--
1.7.1


2016-04-14 03:16:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> When performing direct I/O, the current ext4 code does
> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> inode_dio_begin()/inode_dio_end() internally. This doubling of
> inode_dio_begin()/inode_dio_end() calls are wasteful.
>
> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> calls when those calls are being issued by the caller directly. For
> really fast storage systems like NVDIMM, the removal of the extra
> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> I/O performance.

Doesn't this break truncate IO serialisation?

i.e. it appears to me that the ext4 use of inode_dio_begin()/
inode_dio_end() does not cover AIO, where the IO is still in flight
when submission returns. i.e. the inode_dio_end() call
needs to be in IO completion, not in the submitter context. The only
reason it doesn't break right now is that the duplicate accounting
in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
accounting will cause AIO writes to race with truncate.

Same AIO vs truncate problem occurs with the indirect read case you
modified to skip the direct IO layer accounting.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-14 16:21:13

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/13/2016 11:16 PM, Dave Chinner wrote:
> On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
>> When performing direct I/O, the current ext4 code does
>> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
>> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
>> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
>> inode_dio_begin()/inode_dio_end() internally. This doubling of
>> inode_dio_begin()/inode_dio_end() calls are wasteful.
>>
>> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
>> calls when those calls are being issued by the caller directly. For
>> really fast storage systems like NVDIMM, the removal of the extra
>> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
>> I/O performance.
> Doesn't this break truncate IO serialisation?
>
> i.e. it appears to me that the ext4 use of inode_dio_begin()/
> inode_dio_end() does not cover AIO, where the IO is still in flight
> when submission returns. i.e. the inode_dio_end() call
> needs to be in IO completion, not in the submitter context. The only
> reason it doesn't break right now is that the duplicate accounting
> in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> accounting will cause AIO writes to race with truncate.
>
> Same AIO vs truncate problem occurs with the indirect read case you
> modified to skip the direct IO layer accounting.

I don't quite understand how the duplicate accounting is correct wrt
AIO. Both the direct and indirect paths are something like:

inode_dio_begin()
...
inode_dio_begin()
...
inode_dio_end()
...
inode_dio_end()

What the patch does is to eliminate the innermost inode_dio_begin/end
pair. Unless there is a difference between a dio count of 2 vs. 1, I
can't see how the code correctness differ with and without my patch.

Cheers,
Longman

2016-04-15 08:17:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> On 04/13/2016 11:16 PM, Dave Chinner wrote:
> >On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> >>When performing direct I/O, the current ext4 code does
> >>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> >>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> >>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> >>inode_dio_begin()/inode_dio_end() internally. This doubling of
> >>inode_dio_begin()/inode_dio_end() calls are wasteful.
> >>
> >>This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> >>calls when those calls are being issued by the caller directly. For
> >>really fast storage systems like NVDIMM, the removal of the extra
> >>inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> >>I/O performance.
> >Doesn't this break truncate IO serialisation?
> >
> >i.e. it appears to me that the ext4 use of inode_dio_begin()/
> >inode_dio_end() does not cover AIO, where the IO is still in flight
> >when submission returns. i.e. the inode_dio_end() call
> >needs to be in IO completion, not in the submitter context. The only
> >reason it doesn't break right now is that the duplicate accounting
> >in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> >accounting will cause AIO writes to race with truncate.
> >
> >Same AIO vs truncate problem occurs with the indirect read case you
> >modified to skip the direct IO layer accounting.
>
> I don't quite understand how the duplicate accounting is correct wrt
> AIO. Both the direct and indirect paths are something like:
>
> inode_dio_begin()
> ...
> inode_dio_begin()
> ...
> inode_dio_end()
> ...
> inode_dio_end()

With AIO:

inode_dio_begin()
...
inode_dio_begin()
<submit IO, no wait>
...
inode_dio_end()
<ext4 returns to userspace with AIO+DIO in progress>

<some time later DIO completes>
dio_complete
inode_dio_end()

IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
does not wait for IO completion before returning.

> What the patch does is to eliminate the innermost
> inode_dio_begin/end pair.

Yes, and with that change inode_dio_wait() no longer waits for
AIO+DIO writes on ext4, hence breaking truncate IO barrier
requirements of inode_dio_wait().

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-15 17:17:50

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/15/2016 04:17 AM, Dave Chinner wrote:
> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>> On 04/13/2016 11:16 PM, Dave Chinner wrote:
>>> On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
>>>> When performing direct I/O, the current ext4 code does
>>>> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
>>>> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
>>>> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
>>>> inode_dio_begin()/inode_dio_end() internally. This doubling of
>>>> inode_dio_begin()/inode_dio_end() calls are wasteful.
>>>>
>>>> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
>>>> calls when those calls are being issued by the caller directly. For
>>>> really fast storage systems like NVDIMM, the removal of the extra
>>>> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
>>>> I/O performance.
>>> Doesn't this break truncate IO serialisation?
>>>
>>> i.e. it appears to me that the ext4 use of inode_dio_begin()/
>>> inode_dio_end() does not cover AIO, where the IO is still in flight
>>> when submission returns. i.e. the inode_dio_end() call
>>> needs to be in IO completion, not in the submitter context. The only
>>> reason it doesn't break right now is that the duplicate accounting
>>> in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
>>> accounting will cause AIO writes to race with truncate.
>>>
>>> Same AIO vs truncate problem occurs with the indirect read case you
>>> modified to skip the direct IO layer accounting.
>> I don't quite understand how the duplicate accounting is correct wrt
>> AIO. Both the direct and indirect paths are something like:
>>
>> inode_dio_begin()
>> ...
>> inode_dio_begin()
>> ...
>> inode_dio_end()
>> ...
>> inode_dio_end()
> With AIO:
>
> inode_dio_begin()
> ...
> inode_dio_begin()
> <submit IO, no wait>
> ...
> inode_dio_end()
> <ext4 returns to userspace with AIO+DIO in progress>
>
> <some time later DIO completes>
> dio_complete
> inode_dio_end()
>
> IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
> does not wait for IO completion before returning.
>
>> What the patch does is to eliminate the innermost
>> inode_dio_begin/end pair.
> Yes, and with that change inode_dio_wait() no longer waits for
> AIO+DIO writes on ext4, hence breaking truncate IO barrier
> requirements of inode_dio_wait().
>
> Cheers,
>
> Dave.

You are right and thank for pointing this out to me. I think I focus too
much on the dax_do_io() internal and didn't realize that inode_dio_end()
can be deferred in __blockdev_direct_IO(). I will update my patch to
eliminate the extra inode_dio_begin/end pair only for dax_do_io().

Cheers,
Longman

2016-04-15 22:19:18

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
> On 04/15/2016 04:17 AM, Dave Chinner wrote:
> >On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> >>On 04/13/2016 11:16 PM, Dave Chinner wrote:
> >>>On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> >>>>When performing direct I/O, the current ext4 code does
> >>>>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> >>>>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> >>>>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> >>>>inode_dio_begin()/inode_dio_end() internally. This doubling of
> >>>>inode_dio_begin()/inode_dio_end() calls are wasteful.
> >>>>
> >>>>This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> >>>>calls when those calls are being issued by the caller directly. For
> >>>>really fast storage systems like NVDIMM, the removal of the extra
> >>>>inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> >>>>I/O performance.
> >>>Doesn't this break truncate IO serialisation?
> >>>
> >>>i.e. it appears to me that the ext4 use of inode_dio_begin()/
> >>>inode_dio_end() does not cover AIO, where the IO is still in flight
> >>>when submission returns. i.e. the inode_dio_end() call
> >>>needs to be in IO completion, not in the submitter context. The only
> >>>reason it doesn't break right now is that the duplicate accounting
> >>>in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> >>>accounting will cause AIO writes to race with truncate.
> >>>
> >>>Same AIO vs truncate problem occurs with the indirect read case you
> >>>modified to skip the direct IO layer accounting.
> >>I don't quite understand how the duplicate accounting is correct wrt
> >>AIO. Both the direct and indirect paths are something like:
> >>
> >> inode_dio_begin()
> >> ...
> >> inode_dio_begin()
> >> ...
> >> inode_dio_end()
> >> ...
> >> inode_dio_end()
> >With AIO:
> >
> > inode_dio_begin()
> > ...
> > inode_dio_begin()
> > <submit IO, no wait>
> > ...
> > inode_dio_end()
> ><ext4 returns to userspace with AIO+DIO in progress>
> >
> ><some time later DIO completes>
> > dio_complete
> > inode_dio_end()
> >
> >IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
> >does not wait for IO completion before returning.
> >
> >>What the patch does is to eliminate the innermost
> >>inode_dio_begin/end pair.
> >Yes, and with that change inode_dio_wait() no longer waits for
> >AIO+DIO writes on ext4, hence breaking truncate IO barrier
> >requirements of inode_dio_wait().
> >
> >Cheers,
> >
> >Dave.
>
> You are right and thank for pointing this out to me. I think I focus too
> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
> the extra inode_dio_begin/end pair only for dax_do_io().

Even there there is the risk that a future change will break ext4.
the ext4 code needs fixing first, then you can look at skipping the
DIO based counting everywhere.

i.e. fix the root cause of the problem, don't hack around it or
throw band-aids over it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-18 19:46:46

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/15/2016 06:19 PM, Dave Chinner wrote:
> On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
>> On 04/15/2016 04:17 AM, Dave Chinner wrote:
>>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>>>> What the patch does is to eliminate the innermost
>>>> inode_dio_begin/end pair.
>>> Yes, and with that change inode_dio_wait() no longer waits for
>>> AIO+DIO writes on ext4, hence breaking truncate IO barrier
>>> requirements of inode_dio_wait().
>>>
>>> Cheers,
>>>
>>> Dave.
>> You are right and thank for pointing this out to me. I think I focus too
>> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
>> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
>> the extra inode_dio_begin/end pair only for dax_do_io().
> Even there there is the risk that a future change will break ext4.
> the ext4 code needs fixing first, then you can look at skipping the
> DIO based counting everywhere.
>
> i.e. fix the root cause of the problem, don't hack around it or
> throw band-aids over it.
>
> Cheers,
>
> Dave.

I agree that the ext4 code needs fixing w.r.t. the problem that you
found. That will take more time and testing. In the mean time, I think
it is OK to pick the low-hanging fruits that are handled by my patch.

Cheers,
Longman

2016-04-19 23:02:31

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote:
> On 04/15/2016 06:19 PM, Dave Chinner wrote:
> >On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
> >>On 04/15/2016 04:17 AM, Dave Chinner wrote:
> >>>On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> >>>>What the patch does is to eliminate the innermost
> >>>>inode_dio_begin/end pair.
> >>>Yes, and with that change inode_dio_wait() no longer waits for
> >>>AIO+DIO writes on ext4, hence breaking truncate IO barrier
> >>>requirements of inode_dio_wait().
> >>>
> >>>Cheers,
> >>>
> >>>Dave.
> >>You are right and thank for pointing this out to me. I think I focus too
> >>much on the dax_do_io() internal and didn't realize that inode_dio_end() can
> >>be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
> >>the extra inode_dio_begin/end pair only for dax_do_io().
> >Even there there is the risk that a future change will break ext4.
> >the ext4 code needs fixing first, then you can look at skipping the
> >DIO based counting everywhere.
> >
> >i.e. fix the root cause of the problem, don't hack around it or
> >throw band-aids over it.
>
> I agree that the ext4 code needs fixing w.r.t. the problem that you
> found. That will take more time and testing. In the mean time, I
> think it is OK to pick the low-hanging fruits that are handled by my
> patch.

IOWs, you're saying that you won't fix the problem, because all you
care about is scalability results. This is how we end up with code
that breaks randomly in future because if it doesn't get fixed now,
nobody will fix the underlying problem. So, fix it now, fix it
properly and you still get your scalability improvement without
leaving a landmine that will explode on someone else in future.

Fix it now, fix it properly.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-20 15:59:36

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/19/2016 07:01 PM, Dave Chinner wrote:
> On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote:
>> On 04/15/2016 06:19 PM, Dave Chinner wrote:
>>> On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
>>>> On 04/15/2016 04:17 AM, Dave Chinner wrote:
>>>>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>>>>>> What the patch does is to eliminate the innermost
>>>>>> inode_dio_begin/end pair.
>>>>> Yes, and with that change inode_dio_wait() no longer waits for
>>>>> AIO+DIO writes on ext4, hence breaking truncate IO barrier
>>>>> requirements of inode_dio_wait().
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>> You are right and thank for pointing this out to me. I think I focus too
>>>> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
>>>> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
>>>> the extra inode_dio_begin/end pair only for dax_do_io().
>>> Even there there is the risk that a future change will break ext4.
>>> the ext4 code needs fixing first, then you can look at skipping the
>>> DIO based counting everywhere.
>>>
>>> i.e. fix the root cause of the problem, don't hack around it or
>>> throw band-aids over it.
>> I agree that the ext4 code needs fixing w.r.t. the problem that you
>> found. That will take more time and testing. In the mean time, I
>> think it is OK to pick the low-hanging fruits that are handled by my
>> patch.
> IOWs, you're saying that you won't fix the problem, because all you
> care about is scalability results. This is how we end up with code
> that breaks randomly in future because if it doesn't get fixed now,
> nobody will fix the underlying problem. So, fix it now, fix it
> properly and you still get your scalability improvement without
> leaving a landmine that will explode on someone else in future.
>
> Fix it now, fix it properly.

I am not saying that I will not fix it. I am just saying that I need
more time to fully understand what code changes need to be done. I am
not that well versed in the filesystem internal, though it will be a
good learning experience for me.

Cheers,
Longman

2016-04-20 20:58:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

FYI, none of the Dax code even needs to ever touch the dio_count,
as dax I/O can't be asynchronous, and we thus don't need it to protect
against truncate. I'd suggest to remove it and then end_io callback
from the DAX code entirely as a start and then move from there.

2016-04-21 18:15:24

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
> FYI, none of the Dax code even needs to ever touch the dio_count,
> as dax I/O can't be asynchronous, and we thus don't need it to protect
> against truncate. I'd suggest to remove it and then end_io callback
> from the DAX code entirely as a start and then move from there.

Yes, it seems like we may not need to change the dio_count in
dax_do_io() after all. BTW, what do mean by using end_io callback as a
start?

Cheers,
Longman

2016-04-25 11:48:02

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote:
> On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
> >FYI, none of the Dax code even needs to ever touch the dio_count,
> >as dax I/O can't be asynchronous, and we thus don't need it to protect
> >against truncate. I'd suggest to remove it and then end_io callback
> >from the DAX code entirely as a start and then move from there.
>
> Yes, it seems like we may not need to change the dio_count in dax_do_io()
> after all. BTW, what do mean by using end_io callback as a start?

I mean to remove both the i_dio_count manipulation, and the unessecary
end_io callback from dax_do_io.

2016-04-26 16:32:17

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

On 04/25/2016 07:48 AM, Christoph Hellwig wrote:
> On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote:
>> On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
>>> FYI, none of the Dax code even needs to ever touch the dio_count,
>>> as dax I/O can't be asynchronous, and we thus don't need it to protect
>>> against truncate. I'd suggest to remove it and then end_io callback
>> >from the DAX code entirely as a start and then move from there.
>>
>> Yes, it seems like we may not need to change the dio_count in dax_do_io()
>> after all. BTW, what do mean by using end_io callback as a start?
> I mean to remove both the i_dio_count manipulation, and the unessecary
> end_io callback from dax_do_io.

Thanks for the clarification.

Since DAX I/O is always synchronous, the locking done by the caller or
in the dax_do_Io() for read should be enough to prevent truncation from
happening at the same time. So we don't need to use i_dio_count for that
purpose.

However, I have not understood enough of the block IO layer to determine
if the end_io callback is really redundant. I am not confident enough to
touch the end_io callback.

Cheers,
Longman