This patchset aims to improve parallel I/O performance of the ext4
filesystem on fast storage devices like NVDIMM.
Patch 1 eliminates duplicated inode_dio_begin()/inode_dio_end() calls.
Patch 2 provides a set of simple percpu statistics count helper functions.
Patch 3 converts some ext4 statistics counts into percpu counts using
the helper functions.
Waiman Long (3):
ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called
percpu_stats: Simple per-cpu statistics count helper functions
ext4: Make cache hits/misses per-cpu counts
fs/ext4/extents_status.c | 20 +++++---
fs/ext4/extents_status.h | 11 ++++-
fs/ext4/indirect.c | 10 +++-
fs/ext4/inode.c | 12 ++++-
include/linux/percpu_stats.h | 103 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 141 insertions(+), 15 deletions(-)
create mode 100644 include/linux/percpu_stats.h
This patch changes the es_stats_cache_hits and es_stats_cache_misses
statistics counts to per-cpu variables to reduce cacheline contention
when multiple threads update those counts simultaneously. It uses the
new per-cpu stats APIs provided by the percpu_stats.h header file.
With a 38-thread fio I/O test on 2 shared files (on a DAX-mounted
NVDIMM) running on a 4-socket Haswell-EX server with a 4.6-rc1 kernel,
the aggregate bandwidths before and after the patch were:
Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 10173MB/s 16141MB/s +58.7%
Read-write 2830MB/s 4315MB/s +52.5%
Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/extents_status.c | 20 ++++++++++++--------
fs/ext4/extents_status.h | 11 +++++++++--
2 files changed, 21 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e38b987..01f8436 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -825,9 +825,9 @@ out:
es->es_pblk = es1->es_pblk;
if (!ext4_es_is_referenced(es1))
ext4_es_set_referenced(es1);
- stats->es_stats_cache_hits++;
+ percpu_stats_inc(&stats->es_stats, es_stats_cache_hits);
} else {
- stats->es_stats_cache_misses++;
+ percpu_stats_inc(&stats->es_stats, es_stats_cache_misses);
}
read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -1114,8 +1114,8 @@ int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v)
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
seq_printf(seq, " %lu/%lu cache hits/misses\n",
- es_stats->es_stats_cache_hits,
- es_stats->es_stats_cache_misses);
+ percpu_stats_sum(&es_stats->es_stats, es_stats_cache_hits),
+ percpu_stats_sum(&es_stats->es_stats, es_stats_cache_misses));
if (inode_cnt)
seq_printf(seq, " %d inodes on list\n", inode_cnt);
@@ -1142,8 +1142,6 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
sbi->s_es_nr_inode = 0;
spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
- sbi->s_es_stats.es_stats_cache_hits = 0;
- sbi->s_es_stats.es_stats_cache_misses = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
@@ -1153,15 +1151,20 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err1;
+ err = percpu_stats_init(&sbi->s_es_stats.es_stats, es_stats_cnt);
+ if (err)
+ goto err2;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
err = register_shrinker(&sbi->s_es_shrinker);
if (err)
- goto err2;
+ goto err3;
return 0;
-
+err3:
+ percpu_stats_destroy(&sbi->s_es_stats.es_stats);
err2:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
@@ -1173,6 +1176,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
+ percpu_stats_destroy(&sbi->s_es_stats.es_stats);
unregister_shrinker(&sbi->s_es_shrinker);
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f7aa24f..c163189 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -11,6 +11,8 @@
#ifndef _EXT4_EXTENTS_STATUS_H
#define _EXT4_EXTENTS_STATUS_H
+#include <linux/percpu_stats.h>
+
/*
* Turn on ES_DEBUG__ to get lots of info about extent status operations.
*/
@@ -67,10 +69,15 @@ struct ext4_es_tree {
struct extent_status *cache_es; /* recently accessed extent */
};
+enum ext4_es_stat_type {
+ es_stats_cache_hits,
+ es_stats_cache_misses,
+ es_stats_cnt,
+};
+
struct ext4_es_stats {
unsigned long es_stats_shrunk;
- unsigned long es_stats_cache_hits;
- unsigned long es_stats_cache_misses;
+ struct percpu_stats es_stats;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
struct percpu_counter es_stats_all_cnt;
--
1.7.1
This patch introduces a set of simple per-cpu statistics count helper
functions that other kernel subsystems can use to keep track of the
number of events that happen. The counts are per-cpu based to reduce
update overhead and improve accuracy; the full percpu_counter facility
is usually overkill for this purpose.
The following APIs are provided (a brief usage sketch follows the list):
- int percpu_stats_init(struct percpu_stats *pcs, int num)
Initialize the per-cpu statistics counts structure which should have
the given number of statistics counts. Return -ENOMEM on error.
- void percpu_stats_destroy(struct percpu_stats *pcs)
Free the percpu memory allocated.
- void percpu_stats_inc(struct percpu_stats *pcs, int stat)
void percpu_stats_dec(struct percpu_stats *pcs, int stat)
Increment and decrement the given per-cpu statistics count.
- unsigned long percpu_stats_sum(struct percpu_stats *pcs, int stat)
Return the current aggregated sum of the given statistics count.
- void percpu_stats_reset(struct percpu_stats *pcs)
Clear all the statistics counts defined in the given percpu_stats
structure.
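A brief usage sketch of the API (the foo_cache names below are invented
purely for illustration and are not part of the patch; error handling
beyond the init return value is omitted):

enum foo_stat {
        foo_stat_hits,
        foo_stat_misses,
        foo_stat_cnt,                   /* number of counters */
};

struct foo_cache {
        struct percpu_stats stats;
        /* ... */
};

static int foo_cache_init(struct foo_cache *c)
{
        /* allocate one array of foo_stat_cnt counters per cpu */
        return percpu_stats_init(&c->stats, foo_stat_cnt);
}

static void foo_cache_lookup_done(struct foo_cache *c, bool hit)
{
        /* cheap per-cpu increment on the lookup fast path */
        percpu_stats_inc(&c->stats, hit ? foo_stat_hits : foo_stat_misses);
}

static unsigned long foo_cache_hits(struct foo_cache *c)
{
        /* slow path: sum the per-cpu slots for reporting */
        return percpu_stats_sum(&c->stats, foo_stat_hits);
}

static void foo_cache_exit(struct foo_cache *c)
{
        percpu_stats_destroy(&c->stats);
}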
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/percpu_stats.h | 103 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 103 insertions(+), 0 deletions(-)
create mode 100644 include/linux/percpu_stats.h
diff --git a/include/linux/percpu_stats.h b/include/linux/percpu_stats.h
new file mode 100644
index 0000000..a4f715e
--- /dev/null
+++ b/include/linux/percpu_stats.h
@@ -0,0 +1,103 @@
+#ifndef _LINUX_PERCPU_STATS_H
+#define _LINUX_PERCPU_STATS_H
+/*
+ * Simple per-cpu statistics counts that have less overhead than the
+ * per-cpu counters.
+ */
+#include <linux/percpu.h>
+#include <linux/types.h>
+
+struct percpu_stats {
+ unsigned long __percpu *stats;
+ int nstats; /* Number of statistics counts in stats array */
+};
+
+/*
+ * Reset all the statistics counts to 0 in the percpu_stats structure
+ */
+static inline void percpu_stats_reset(struct percpu_stats *pcs)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
+ int stat;
+
+ for (stat = 0; stat < pcs->nstats; stat++, pstats++)
+ *pstats = 0;
+ }
+
+ /*
+ * If a statistics count is in the middle of being updated, it
+ * is possible that the above clearing may not work. So we need
+ * to double check again to make sure that the counters are really
+ * cleared. Still, there is a very small chance that the
+ * second clearing does not work.
+ */
+ for_each_possible_cpu(cpu) {
+ unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
+ int stat;
+
+ for (stat = 0; stat < pcs->nstats; stat++, pstats++)
+ if (*pstats)
+ *pstats = 0;
+ }
+}
+
+static inline int percpu_stats_init(struct percpu_stats *pcs, int num)
+{
+ pcs->nstats = num;
+ pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
+ __alignof__(unsigned long));
+ if (!pcs->stats)
+ return -ENOMEM;
+
+ percpu_stats_reset(pcs);
+ return 0;
+}
+
+static inline void percpu_stats_destroy(struct percpu_stats *pcs)
+{
+ free_percpu(pcs->stats);
+ pcs->stats = NULL;
+ pcs->nstats = 0;
+}
+
+static inline void
+__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
+{
+ unsigned long *pstat;
+
+ if ((unsigned int)stat >= pcs->nstats)
+ return;
+ preempt_disable();
+ pstat = this_cpu_ptr(&pcs->stats[stat]);
+ *pstat += cnt;
+ preempt_enable();
+}
+
+static inline void percpu_stats_inc(struct percpu_stats *pcs, int stat)
+{
+ __percpu_stats_add(pcs, stat, 1);
+}
+
+static inline void percpu_stats_dec(struct percpu_stats *pcs, int stat)
+{
+ __percpu_stats_add(pcs, stat, -1);
+}
+
+static inline unsigned long
+percpu_stats_sum(struct percpu_stats *pcs, int stat)
+{
+ int cpu;
+ unsigned long sum = 0;
+
+ if ((unsigned int)stat >= pcs->nstats)
+ return sum;
+
+ for_each_possible_cpu(cpu)
+ sum += per_cpu(pcs->stats[stat], cpu);
+ return sum;
+}
+
+#endif /* _LINUX_PERCPU_STATS_H */
--
1.7.1
When performing direct I/O, the current ext4 code does
not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
inode_dio_begin()/inode_dio_end() internally. This doubling of
inode_dio_begin()/inode_dio_end() calls is wasteful.
This patch removes the extra internal inode_dio_begin()/inode_dio_end()
calls when the caller has already issued them directly. For really
fast storage like NVDIMM, removing the extra calls can give a
meaningful boost to I/O performance. The resulting call pattern is
sketched below.
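Condensed from the hunks that follow, purely for illustration (not a
literal hunk), the caller-managed pattern looks like:

        inode_dio_begin(inode);
        if (IS_DAX(inode))
                ret = dax_do_io(iocb, inode, iter, offset,
                                ext4_dio_get_block, NULL,
                                DIO_SKIP_DIO_COUNT);
        else
                ret = __blockdev_direct_IO(iocb, inode,
                                           inode->i_sb->s_bdev, iter,
                                           offset, ext4_dio_get_block,
                                           NULL, NULL,
                                           DIO_SKIP_DIO_COUNT);
        inode_dio_end(inode);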
On a 4-socket Haswell-EX system (72 cores) running a 4.6-rc1 kernel,
fio with 38 threads doing parallel I/O on two shared files on an
NVDIMM with DAX gave the following aggregate bandwidth with and
without the patch:
Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 8688MB/s 10173MB/s +17.1%
Read-write 2687MB/s 2830MB/s +5.3%
Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/indirect.c | 10 ++++++++--
fs/ext4/inode.c | 12 +++++++++---
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa6..4304be6 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -706,14 +706,20 @@ retry:
inode_dio_end(inode);
goto locked;
}
+ /*
+ * Need to pass in DIO_SKIP_DIO_COUNT to prevent
+ * duplicated inode_dio_begin/inode_dio_end sequence.
+ */
if (IS_DAX(inode))
ret = dax_do_io(iocb, inode, iter, offset,
- ext4_dio_get_block, NULL, 0);
+ ext4_dio_get_block, NULL,
+ DIO_SKIP_DIO_COUNT);
else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter,
offset, ext4_dio_get_block,
- NULL, NULL, 0);
+ NULL, NULL,
+ DIO_SKIP_DIO_COUNT);
inode_dio_end(inode);
} else {
locked:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2..779aa33 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3358,9 +3358,15 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
* Make all waiters for direct IO properly wait also for extent
* conversion. This also disallows race between truncate() and
* overwrite DIO as i_dio_count needs to be incremented under i_mutex.
+ *
+ * Both dax_do_io() and __blockdev_direct_IO() will unnecessarily
+ * call inode_dio_begin()/inode_dio_end() again if the
+ * DIO_SKIP_DIO_COUNT flag is not set.
*/
- if (iov_iter_rw(iter) == WRITE)
+ if (iov_iter_rw(iter) == WRITE) {
+ dio_flags = DIO_SKIP_DIO_COUNT;
inode_dio_begin(inode);
+ }
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);
@@ -3393,10 +3399,10 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
get_block_func = ext4_dio_get_block_overwrite;
else if (is_sync_kiocb(iocb)) {
get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
} else {
get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
}
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
--
1.7.1
On 04/02/2016 06:09 AM, Waiman Long wrote:
> This patch introduces a set of simple per-cpu statictics count helper
> functions that can be used by other kernel subsystems for keeping
> track of the number of events that happens. It is per-cpu based to
> reduce overhead and improve accuracy of the counter. Using per-cpu
> counter is usually overkill for such purpose.
>
> The following APIs are provided:
>
> - int percpu_stats_init(struct percpu_stats *pcs, int num)
> Initialize the per-cpu statictics counts structure which should have
> the given number of statistics counts. Return -ENOMEM on error.
>
> - void percpu_stats_destroy(struct percpu_stats *pcs)
> Free the percpu memory allocated.
>
> - void percpu_stats_inc(struct percpu_stats *pcs, int stat)
> void percpu_stats_dec(struct percpu_stats *pcs, int stat)
> Increment and decrement the given per-cpu statistics count.
>
> - unsigned long percpu_stats_sum(struct percpu_stats *pcs, int stat)
> Return the current aggregated sum of the given statistics count.
>
> - void percpu_stats_reset(struct percpu_stats *pcs)
> Clear all the statistics counts defined in the given percpu_stats
> structure.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/percpu_stats.h | 103 ++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 103 insertions(+), 0 deletions(-)
> create mode 100644 include/linux/percpu_stats.h
Just one minor nit below.
[..]
> +static inline void
> +__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
> +{
> + unsigned long *pstat;
> +
> + if ((unsigned int)stat >= pcs->nstats)
> + return;
> + preempt_disable();
> + pstat = this_cpu_ptr(&pcs->stats[stat]);
> + *pstat += cnt;
> + preempt_enable();
> +}
pstat = get_cpu_ptr(&pcs->stats[stat]);
*pstat += cnt;
put_cpu_ptr(&pcs->stats[stat]);
It will generate identical code but this one uses APIs, making the
intention clearer. But as I said this is just a minor nit.
you can add my Reviewed-by: Nikolay Borisov <[email protected]> for this
particular patch.
Hello,
On Fri, Apr 01, 2016 at 11:09:37PM -0400, Waiman Long wrote:
...
> +struct percpu_stats {
> + unsigned long __percpu *stats;
I'm not sure ulong is the best choice here. Atomic reads on 32bit are
nice but people often need 64bit counters for stats. It probably is a
better idea to use u64_stats_sync.
> +/*
> + * Reset the all statistics counts to 0 in the percpu_stats structure
Proper function description please.
> + */
> +static inline void percpu_stats_reset(struct percpu_stats *pcs)
Why is this function inline?
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
^^
> + int stat;
> +
> + for (stat = 0; stat < pcs->nstats; stat++, pstats++)
> + *pstats = 0;
> + }
> +
> + /*
> + * If a statistics count is in the middle of being updated, it
> + * is possible that the above clearing may not work. So we need
> + * to double check again to make sure that the counters are really
> + * cleared. Still there is a still a very small chance that the
> + * second clearing does not work.
> + */
> + for_each_possible_cpu(cpu) {
> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
> + int stat;
> +
> + for (stat = 0; stat < pcs->nstats; stat++, pstats++)
> + if (*pstats)
> + *pstats = 0;
> + }
I don't think this is acceptable.
> +}
> +
> +static inline int percpu_stats_init(struct percpu_stats *pcs, int num)
> +{
> + pcs->nstats = num;
> + pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
> + __alignof__(unsigned long));
> + if (!pcs->stats)
> + return -ENOMEM;
> +
> + percpu_stats_reset(pcs);
> + return 0;
> +}
> +
> +static inline void percpu_stats_destroy(struct percpu_stats *pcs)
> +{
> + free_percpu(pcs->stats);
> + pcs->stats = NULL;
> + pcs->nstats = 0;
> +}
Why inline the above functions?
> +static inline void
> +__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
> +{
> + unsigned long *pstat;
> +
> + if ((unsigned int)stat >= pcs->nstats)
> + return;
This is a critical bug. Please don't fail silently. BUG_ON(),
please.
> + preempt_disable();
> + pstat = this_cpu_ptr(&pcs->stats[stat]);
> + *pstat += cnt;
> + preempt_enable();
> +}
this_cpu_add() is atomic w.r.t. local operations.
> +static inline unsigned long
> +percpu_stats_sum(struct percpu_stats *pcs, int stat)
> +{
> + int cpu;
> + unsigned long sum = 0;
> +
> + if ((unsigned int)stat >= pcs->nstats)
> + return sum;
Ditto.
> + for_each_possible_cpu(cpu)
> + sum += per_cpu(pcs->stats[stat], cpu);
> + return sum;
> +}
Thanks.
--
tejun
On 04/04/2016 03:36 AM, Nikolay Borisov wrote:
>
> On 04/02/2016 06:09 AM, Waiman Long wrote:
>> This patch introduces a set of simple per-cpu statictics count helper
>> functions that can be used by other kernel subsystems for keeping
>> track of the number of events that happens. It is per-cpu based to
>> reduce overhead and improve accuracy of the counter. Using per-cpu
>> counter is usually overkill for such purpose.
>>
>> The following APIs are provided:
>>
>> - int percpu_stats_init(struct percpu_stats *pcs, int num)
>> Initialize the per-cpu statictics counts structure which should have
>> the given number of statistics counts. Return -ENOMEM on error.
>>
>> - void percpu_stats_destroy(struct percpu_stats *pcs)
>> Free the percpu memory allocated.
>>
>> - void percpu_stats_inc(struct percpu_stats *pcs, int stat)
>> void percpu_stats_dec(struct percpu_stats *pcs, int stat)
>> Increment and decrement the given per-cpu statistics count.
>>
>> - unsigned long percpu_stats_sum(struct percpu_stats *pcs, int stat)
>> Return the current aggregated sum of the given statistics count.
>>
>> - void percpu_stats_reset(struct percpu_stats *pcs)
>> Clear all the statistics counts defined in the given percpu_stats
>> structure.
>>
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> include/linux/percpu_stats.h | 103 ++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 103 insertions(+), 0 deletions(-)
>> create mode 100644 include/linux/percpu_stats.h
> Just one minor nit below.
> [..]
>> +static inline void
>> +__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
>> +{
>> + unsigned long *pstat;
>> +
>> + if ((unsigned int)stat>= pcs->nstats)
>> + return;
>> + preempt_disable();
>> + pstat = this_cpu_ptr(&pcs->stats[stat]);
>> + *pstat += cnt;
>> + preempt_enable();
>> +}
> pstat = get_cpu_ptr(&pcs->stats[stat]);
> *pstat += cnt;
> put_cpu_ptr(&pcs->stats[stat]);
>
> It will generate identical code but this one uses APIs, making the
> intention clearer. But as I said this is just a minor nit.
>
> you can add my Reviewed-by: Nikolay Borisov<[email protected]> for this
> particular patch.
Yes, that will certainly make it look nicer. I will update the patch
once I get feedback from my other ext4 patches.
Cheers,
Longman
On Mon, 4 Apr 2016, Waiman Long wrote:
> > > + if ((unsigned int)stat>= pcs->nstats)
> > > + return;
> > > + preempt_disable();
> > > + pstat = this_cpu_ptr(&pcs->stats[stat]);
> > > + *pstat += cnt;
> > > + preempt_enable();
> > > +}
> > pstat = get_cpu_ptr(&pcs->stats[stat]);
> > *pstat += cnt;
> > put_cpu_ptr(&pcs->stats[stat]);
> >
> > It will generate identical code but this one uses APIs, making the
> > intention clearer. But as I said this is just a minor nit.
> >
> > you can add my Reviewed-by: Nikolay Borisov<[email protected]> for this
> > particular patch.
>
> Yes, that will certainly make it look nicer. I will update the patch once I
> get feedback from my other ext4 patches.
Why not
this_cpu_add(pci->stats[stat], cnt)
This is a single instruction on x86.
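With that suggestion the update helper collapses to something like the
following sketch (keeping the range check from the posted patch; shown
for illustration only):

static inline void
__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
{
        if ((unsigned int)stat >= pcs->nstats)
                return;
        /*
         * this_cpu_add() is atomic w.r.t. local operations, so no
         * explicit preempt_disable()/preempt_enable() is needed.
         */
        this_cpu_add(pcs->stats[stat], cnt);
}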
On 04/04/2016 12:02 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Apr 01, 2016 at 11:09:37PM -0400, Waiman Long wrote:
> ...
>> +struct percpu_stats {
>> + unsigned long __percpu *stats;
> I'm not sure ulong is the best choice here. Atomic reads on 32bit are
> nice but people often need 64bit counters for stats. It probably is a
> better idea to use u64_stats_sync.
>
>> +/*
>> + * Reset the all statistics counts to 0 in the percpu_stats structure
> Proper function description please.
>
>> + */
>> +static inline void percpu_stats_reset(struct percpu_stats *pcs)
> Why is this function inline?
>
>> +{
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
> ^^
>> + int stat;
>> +
>> + for (stat = 0; stat< pcs->nstats; stat++, pstats++)
>> + *pstats = 0;
>> + }
>> +
>> + /*
>> + * If a statistics count is in the middle of being updated, it
>> + * is possible that the above clearing may not work. So we need
>> + * to double check again to make sure that the counters are really
>> + * cleared. Still there is a still a very small chance that the
>> + * second clearing does not work.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
>> + int stat;
>> +
>> + for (stat = 0; stat< pcs->nstats; stat++, pstats++)
>> + if (*pstats)
>> + *pstats = 0;
>> + }
> I don't think this is acceptable.
>
>> +}
>> +
>> +static inline int percpu_stats_init(struct percpu_stats *pcs, int num)
>> +{
>> + pcs->nstats = num;
>> + pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
>> + __alignof__(unsigned long));
>> + if (!pcs->stats)
>> + return -ENOMEM;
>> +
>> + percpu_stats_reset(pcs);
>> + return 0;
>> +}
>> +
>> +static inline void percpu_stats_destroy(struct percpu_stats *pcs)
>> +{
>> + free_percpu(pcs->stats);
>> + pcs->stats = NULL;
>> + pcs->nstats = 0;
>> +}
> Why inline the above functions?
>
>> +static inline void
>> +__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
>> +{
>> + unsigned long *pstat;
>> +
>> + if ((unsigned int)stat>= pcs->nstats)
>> + return;
> This is a critical bug. Please don't fail silently. BUG_ON(),
> please.
Sure.
>
>> + preempt_disable();
>> + pstat = this_cpu_ptr(&pcs->stats[stat]);
>> + *pstat += cnt;
>> + preempt_enable();
>> +}
> this_cpu_add() is atomic w.r.t. local operations.
Will use this_cpu_add().
>> +static inline unsigned long
>> +percpu_stats_sum(struct percpu_stats *pcs, int stat)
>> +{
>> + int cpu;
>> + unsigned long sum = 0;
>> +
>> + if ((unsigned int)stat>= pcs->nstats)
>> + return sum;
> Ditto.
>
>> + for_each_possible_cpu(cpu)
>> + sum += per_cpu(pcs->stats[stat], cpu);
>> + return sum;
>> +}
> Thanks.
>
Cheers,
Longman
On 04/04/2016 12:02 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Apr 01, 2016 at 11:09:37PM -0400, Waiman Long wrote:
> ...
>> +struct percpu_stats {
>> + unsigned long __percpu *stats;
> I'm not sure ulong is the best choice here. Atomic reads on 32bit are
> nice but people often need 64bit counters for stats. It probably is a
> better idea to use u64_stats_sync.
Got that, will incorporate 64-bit counter support for 32-bit architecture.
>> +/*
>> + * Reset the all statistics counts to 0 in the percpu_stats structure
> Proper function description please.
Sure. Will do that for all the functions.
>> + */
>> +static inline void percpu_stats_reset(struct percpu_stats *pcs)
> Why is this function inline?
It doesn't need to be inlined, but I need to add a lib/percpu_stats.c
file to hold the function which I will do in my v2 patch.
>
>> +{
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
> ^^
>> + int stat;
>> +
>> + for (stat = 0; stat< pcs->nstats; stat++, pstats++)
>> + *pstats = 0;
>> + }
>> +
>> + /*
>> + * If a statistics count is in the middle of being updated, it
>> + * is possible that the above clearing may not work. So we need
>> + * to double check again to make sure that the counters are really
>> + * cleared. Still there is a still a very small chance that the
>> + * second clearing does not work.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
>> + int stat;
>> +
>> + for (stat = 0; stat< pcs->nstats; stat++, pstats++)
>> + if (*pstats)
>> + *pstats = 0;
>> + }
> I don't think this is acceptable.
I am not sure what you mean here by not acceptable. Please enlighten me
on that.
>> +}
>> +
>> +static inline int percpu_stats_init(struct percpu_stats *pcs, int num)
>> +{
>> + pcs->nstats = num;
>> + pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
>> + __alignof__(unsigned long));
>> + if (!pcs->stats)
>> + return -ENOMEM;
>> +
>> + percpu_stats_reset(pcs);
>> + return 0;
>> +}
>> +
>> +static inline void percpu_stats_destroy(struct percpu_stats *pcs)
>> +{
>> + free_percpu(pcs->stats);
>> + pcs->stats = NULL;
>> + pcs->nstats = 0;
>> +}
> Why inline the above functions?
Will move this function to lib/percpu_stats.c.
>> +static inline void
>> +__percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
>> +{
>> + unsigned long *pstat;
>> +
>> + if ((unsigned int)stat>= pcs->nstats)
>> + return;
> This is a critical bug. Please don't fail silently. BUG_ON(),
> please.
Sure.
>
>> + preempt_disable();
>> + pstat = this_cpu_ptr(&pcs->stats[stat]);
>> + *pstat += cnt;
>> + preempt_enable();
>> +}
> this_cpu_add() is atomic w.r.t. local operations.
Will use this_cpu_add().
>> +static inline unsigned long
>> +percpu_stats_sum(struct percpu_stats *pcs, int stat)
>> +{
>> + int cpu;
>> + unsigned long sum = 0;
>> +
>> + if ((unsigned int)stat>= pcs->nstats)
>> + return sum;
> Ditto.
>
>> + for_each_possible_cpu(cpu)
>> + sum += per_cpu(pcs->stats[stat], cpu);
>> + return sum;
>> +}
> Thanks.
>
Cheers,
Longman
On 04/04/2016 03:09 PM, Christoph Lameter wrote:
> On Mon, 4 Apr 2016, Waiman Long wrote:
>
>>>> + if ((unsigned int)stat>= pcs->nstats)
>>>> + return;
>>>> + preempt_disable();
>>>> + pstat = this_cpu_ptr(&pcs->stats[stat]);
>>>> + *pstat += cnt;
>>>> + preempt_enable();
>>>> +}
>>> pstat = get_cpu_ptr(&pcs->stats[stat]);
>>> *pstat += cnt;
>>> put_cpu_ptr(&pcs->stats[stat]);
>>>
>>> It will generate identical code but this one uses APIs, making the
>>> intention clearer. But as I said this is just a minor nit.
>>>
>>> you can add my Reviewed-by: Nikolay Borisov<[email protected]> for this
>>> particular patch.
>> Yes, that will certainly make it look nicer. I will update the patch once I
>> get feedback from my other ext4 patches.
> Why not
>
> this_cpu_add(pci->stats[stat], cnt)
>
> This is a single instruction on x86.
>
Yes, using this_cpu_add() will be even simpler.
Cheers,
Longman
Hello,
On Wed, Apr 06, 2016 at 05:51:45PM -0400, Waiman Long wrote:
> >>+ /*
> >>+ * If a statistics count is in the middle of being updated, it
> >>+ * is possible that the above clearing may not work. So we need
> >>+ * to double check again to make sure that the counters are really
> >>+ * cleared. Still there is a still a very small chance that the
> >>+ * second clearing does not work.
> >>+ */
> >>+ for_each_possible_cpu(cpu) {
> >>+ unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
> >>+ int stat;
> >>+
> >>+ for (stat = 0; stat< pcs->nstats; stat++, pstats++)
> >>+ if (*pstats)
> >>+ *pstats = 0;
> >>+ }
> >I don't think this is acceptable.
>
> I am not sure what you mean here by not acceptable. Please enlighten me on
> that.
Hmmm... I thought that was pretty clear. Try-twice-and-we-are-probably-okay
is simply not acceptable. Please make it watertight.
Thanks.
--
tejun
On 04/06/2016 06:54 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 06, 2016 at 05:51:45PM -0400, Waiman Long wrote:
>>>> + /*
>>>> + * If a statistics count is in the middle of being updated, it
>>>> + * is possible that the above clearing may not work. So we need
>>>> + * to double check again to make sure that the counters are really
>>>> + * cleared. Still there is a still a very small chance that the
>>>> + * second clearing does not work.
>>>> + */
>>>> + for_each_possible_cpu(cpu) {
>>>> + unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
>>>> + int stat;
>>>> +
>>>> + for (stat = 0; stat< pcs->nstats; stat++, pstats++)
>>>> + if (*pstats)
>>>> + *pstats = 0;
>>>> + }
>>> I don't think this is acceptable.
>> I am not sure what you mean here by not acceptable. Please enlighten me on
>> that.
> Hmmm... I thought that was pretty clear. Try-twice-and-we-are-probably-okay
> is simply not acceptable. Please make it watertight.
>
> Thanks.
OK, I got it now.
We can certainly make it watertight. However, that will certainly
require adding performance overhead in the percpu stats update fast path
which I am not willing to pay.
The purpose of this stat counters reset functionality is to allow
developers to reset the stat counters, run certain workload and see how
things are going in the kernel when the workload completes assuming that
those stat counters are exposed via sysfs, debugfs, etc. The developers
can certainly check the stat counters after the reset to make sure that
they are properly reset. So I don't think we need an airtight way of
doing it. If you have scenarios in your mind that require airtight reset
of the stat counters, please let me know and I will see what I can do
about it.
Cheers,
Longman
Hello, Waiman.
On Thu, Apr 07, 2016 at 11:58:13AM -0400, Waiman Long wrote:
> We can certainly make it watertight. However, that will certainly require
> adding performance overhead in the percpu stats update fast path which I am
> not willing to pay.
There are multiple options depending on the specific balance you want
to hit. Reset can be made very heavy (involving RCU sync operation)
to make hot path overhead minimal, local locking or atomic ops can
also be used which while more expensive than this_cpu_*() ops still
avoids cacheline bouncing.
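As one illustration of the atomic-ops option (a sketch only, with
invented percpu_stats_atomic_* names, not code from this thread):

struct percpu_stats_atomic {
        atomic_long_t __percpu *stats;  /* nstats counters per cpu */
        int nstats;
};

static void percpu_stats_atomic_add(struct percpu_stats_atomic *pcs,
                                    int stat, long cnt)
{
        atomic_long_t *pstat = get_cpu_ptr(&pcs->stats[stat]);

        atomic_long_add(cnt, pstat);
        put_cpu_ptr(&pcs->stats[stat]);
}

static void percpu_stats_atomic_reset(struct percpu_stats_atomic *pcs)
{
        int cpu, stat;

        /*
         * A concurrent atomic_long_add() is serialized against this
         * store, so the reset can no longer be overwritten by a stale
         * pre-reset value the way a plain read-modify-write can.
         */
        for_each_possible_cpu(cpu)
                for (stat = 0; stat < pcs->nstats; stat++)
                        atomic_long_set(per_cpu_ptr(&pcs->stats[stat], cpu), 0);
}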
> The purpose of this stat counters reset functionality is to allow developers
> to reset the stat counters, run certain workload and see how things are
> going in the kernel when the workload completes assuming that those stat
> counters are exposed via sysfs, debugfs, etc. The developers can certainly
> check the stat counters after the reset to make sure that they are properly
> reset. So I don't think we need an airtight way of doing it. If you have
> scenarios in your mind that require airtight reset of the stat counters,
> please let me know and I will see what I can do about it.
No matter what, don't create something which can yield a completely
surprising result once in a blue moon. You might think it's okay
because the likelihood is low but that just means that the resulting
malfunctions will be that much more obscure and difficult to
reproduce.
Thanks.
--
tejun
Hello, Waiman.
On Thu, Apr 07, 2016 at 02:52:33PM -0400, Waiman Long wrote:
> As long as atomic reset is an optional feature that caller can choose at
> init time, I am OK to provide this functionality. I just don't want it to be
> the default because of the performance overhead.
Please take a look at how percpu-ref coordinates global
synchronization. The hot path overhead is one branch which is
extremely easy to predict and shouldn't show up anywhere. If you're
gonna provide reset at all (which btw always kinda baffles me, what's
wrong with taking a snapshot value and taking delta from there?), you
need to make it actually work reliably.
Thanks.
--
tejun
On 04/07/2016 12:06 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Apr 07, 2016 at 11:58:13AM -0400, Waiman Long wrote:
>> We can certainly make it watertight. However, that will certainly require
>> adding performance overhead in the percpu stats update fast path which I am
>> not willing to pay.
> There are multiple options depending on the specific balance you want
> to hit. Reset can be made very heavy (involving RCU sync operation)
> to make hot path overhead minimal, local locking or atomic ops can
> also be used which while more expensive than this_cpu_*() ops still
> avoids cacheline bouncing.
>
>> The purpose of this stat counters reset functionality is to allow developers
>> to reset the stat counters, run certain workload and see how things are
>> going in the kernel when the workload completes assuming that those stat
>> counters are exposed via sysfs, debugfs, etc. The developers can certainly
>> check the stat counters after the reset to make sure that they are properly
>> reset. So I don't think we need an airtight way of doing it. If you have
>> scenarios in your mind that require airtight reset of the stat counters,
>> please let me know and I will see what I can do about it.
> No matter what, don't create something which can yield a completely
> surprising result once in a blue moon. You might think it's okay
> because the likelihood is low but that just means that the resulting
> malfunctions will be that much more obscure and difficult to
> reproduce.
>
> Thanks.
>
As long as atomic reset is an optional feature that caller can choose at
init time, I am OK to provide this functionality. I just don't want it
to be the default because of the performance overhead.
Cheers,
Longman
On 04/07/2016 02:58 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Apr 07, 2016 at 02:52:33PM -0400, Waiman Long wrote:
>> As long as atomic reset is an optional feature that caller can choose at
>> init time, I am OK to provide this functionality. I just don't want it to be
>> the default because of the performance overhead.
> Please take a look at how percpu-ref coordinates global
> synchronization. The hot path overhead is one branch which is
> extremely easy to predict and shouldn't show up anywhere. If you're
> gonna provide reset at all (which btw always kinda baffles me, what's
> wrong with taking a snapshot value and taking delta from there?), you
> need to make it actually work reliably.
>
> Thanks.
>
I would say that because I am lazy, I don't want to compute the deltas
every time I want to see the effect of running a certain type of
workload on the statistics counts. I have a use case where I need to
track 10 or so statistics counts and monitor their changes after
running a job. It is much more convenient to do a reset and see what
you get than to do manual subtractions to find out.
I had taken a look at percpu-refcount.[ch]. I think the synchronization
code is a bit overkill for this purpose, as no one really needs very
precise statistics counts or a precise atomic reset. I would prefer
providing an optional atomic reset feature with a slower statistics
count update path for the time being. If we come across a use case
where we need atomic reset with negligible slowdown, we could then
refactor the code to use something similar to what the percpu-refcount
code is doing.
Cheers,
Longman
Hello, Waiman.
On Thu, Apr 07, 2016 at 04:37:06PM -0400, Waiman Long wrote:
> I would say that because I am lazy, I don't want compute the deltas every
> time I want to see the effect of running a certain type of workload on the
> statistics counts. I have use case that I need to track 10 or so statistics
> counts and monitor their changes after running a job. It is much more
> convenient to do a reset and see what you get than doing manual subtractions
> to find out.
I don't know. Write a simple script? Even if you wanna keep it in
kernel, you can just have a base counter which offsets the summed up
value on read.
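For reference, the base-counter idea could sit on top of the posted
API as in this sketch (names invented here; allocation of the base
array and decrementing counters are not handled):

struct percpu_stats_snap {
        struct percpu_stats stats;
        unsigned long *base;    /* sums recorded at the last "reset" */
};

/* "Reset" just records the current sums... */
static void percpu_stats_snap_reset(struct percpu_stats_snap *s)
{
        int stat;

        for (stat = 0; stat < s->stats.nstats; stat++)
                s->base[stat] = percpu_stats_sum(&s->stats, stat);
}

/* ...and readers report the delta since then. */
static unsigned long
percpu_stats_snap_read(struct percpu_stats_snap *s, int stat)
{
        return percpu_stats_sum(&s->stats, stat) - s->base[stat];
}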
> I had taken a look at percpu-refcount.[ch]. I think the synchronization code
> is a bit overkill for this purpose as no one really need a very precise
> statistics counts nor precise atomic reset. I would prefer providing an
> optional atomic reset feature with slower statistics count update path for
> the time being. If we come across a use case where we need atomic reset with
> negligible slowdown, we could then refactor the code to use something
> similar to what the percpu-refcount code is doing.
Please either drop reset or make it actually work; otherwise, I don't
think this should go in.
Thanks.
--
tejun
On 04/07/2016 04:41 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Apr 07, 2016 at 04:37:06PM -0400, Waiman Long wrote:
>> I would say that because I am lazy, I don't want compute the deltas every
>> time I want to see the effect of running a certain type of workload on the
>> statistics counts. I have use case that I need to track 10 or so statistics
>> counts and monitor their changes after running a job. It is much more
>> convenient to do a reset and see what you get than doing manual subtractions
>> to find out.
> I don't know. Write a simple script? Even if you wanna keep it in
> kernel, you can just have a base counter which offsets the summed up
> value on read.
>
>> I had taken a look at percpu-refcount.[ch]. I think the synchronization code
>> is a bit overkill for this purpose as no one really need a very precise
>> statistics counts nor precise atomic reset. I would prefer providing an
>> optional atomic reset feature with slower statistics count update path for
>> the time being. If we come across a use case where we need atomic reset with
>> negligible slowdown, we could then refactor the code to use something
>> similar to what the percpu-refcount code is doing.
> Please either drop reset or make it actually work; otherwise, I don't
> think this should go in.
>
> Thanks.
>
In this case, I think I will drop this reset functionality. It is not
really needed for this patchset.
Thanks for the feedback!
Cheers,
Longman