v1->v2:
- Remove percpu_stats_reset() which is not really needed in this
patchset.
- Move some percpu_stats* functions to the newly created
lib/percpu_stats.c.
- Add a new patch to support 64-bit statistics counts in 32-bit
architectures.
- Rearrange the patches by moving the percpu_stats patches to the
front, followed by the ext4 patches.
This patchset aims to improve parallel I/O performance of the ext4
filesystem on fast storage devices like NVDIMM. It also introduces a
set of helper functions that simplify the management of percpu
statistics counts.
Patch 1 provides a set of simple percpu statistics count helper
functions.
Patch 2 enables the use of 64-bit counts for 32-bit architectures.
Patch 3 eliminates duplicated inode_dio_begin()/inode_dio_end() calls.
Patch 4 converts some ext4 statistics counts into percpu counts using
the helper functions.
Waiman Long (4):
percpu_stats: Simple per-cpu statistics count helper functions
percpu_stats: Enable 64-bit counts in 32-bit architectures
ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called
ext4: Make cache hits/misses per-cpu counts
fs/ext4/extents_status.c | 22 ++++---
fs/ext4/extents_status.h | 11 +++-
fs/ext4/indirect.c | 10 +++-
fs/ext4/inode.c | 12 +++-
include/linux/percpu_stats.h | 64 +++++++++++++++++++++
lib/Makefile | 2 +-
lib/percpu_stats.c | 128 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 232 insertions(+), 17 deletions(-)
create mode 100644 include/linux/percpu_stats.h
create mode 100644 lib/percpu_stats.c
This patch introduces a set of simple per-cpu statistics count helper
functions that other kernel subsystems can use to keep track of the
number of events that happen. The counts are per-cpu based to reduce
update overhead and to improve accuracy, as concurrent updates are not
lost. Using the existing percpu_counter construct is usually overkill
for such a purpose.
The following APIs are provided:
- int percpu_stats_init(struct percpu_stats *pcs, int num)
Initialize the per-cpu statistics counts structure to hold the given
number of statistics counts. Return -ENOMEM on error.
- void percpu_stats_destroy(struct percpu_stats *pcs)
Free the percpu memory allocated.
- void percpu_stats_inc(struct percpu_stats *pcs, int stat)
void percpu_stats_dec(struct percpu_stats *pcs, int stat)
void percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
Increment, decrement and add to the given per-cpu statistics count.
- uint64_t percpu_stats_sum(struct percpu_stats *pcs, int stat)
Return the current aggregated sum of the given statistics count.
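For illustration, a minimal usage sketch (the foo_* names below are
hypothetical):

#include <linux/percpu_stats.h>

enum foo_stat_types {
        foo_stats_reads,        /* hypothetical event: read ops */
        foo_stats_writes,       /* hypothetical event: write ops */
        foo_stats_cnt,          /* number of statistics counts */
};

static struct percpu_stats foo_stats;

static int foo_setup(void)
{
        /* Allocate and zero foo_stats_cnt percpu counts */
        return percpu_stats_init(&foo_stats, foo_stats_cnt);
}

/* Hot path: a cheap percpu increment, no locking */
static void foo_account_read(void)
{
        percpu_stats_inc(&foo_stats, foo_stats_reads);
}

/* Slow path: sum the percpu values across all possible CPUs */
static void foo_show_stats(void)
{
        pr_info("reads: %llu\n", (unsigned long long)
                percpu_stats_sum(&foo_stats, foo_stats_reads));
}

static void foo_teardown(void)
{
        percpu_stats_destroy(&foo_stats);
}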
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/percpu_stats.h | 42 +++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/percpu_stats.c | 64 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+), 1 deletions(-)
create mode 100644 include/linux/percpu_stats.h
create mode 100644 lib/percpu_stats.c
diff --git a/include/linux/percpu_stats.h b/include/linux/percpu_stats.h
new file mode 100644
index 0000000..ed6e8ac
--- /dev/null
+++ b/include/linux/percpu_stats.h
@@ -0,0 +1,42 @@
+#ifndef _LINUX_PERCPU_STATS_H
+#define _LINUX_PERCPU_STATS_H
+/*
+ * Simple per-cpu statistics counts that have less overhead than the
+ * per-cpu counters.
+ */
+#include <linux/percpu.h>
+#include <linux/types.h>
+
+struct percpu_stats {
+ unsigned long __percpu *stats;
+ int nstats; /* Number of statistics counts in stats array */
+};
+
+extern void percpu_stats_destroy(struct percpu_stats *pcs);
+extern int percpu_stats_init(struct percpu_stats *pcs, int num);
+extern uint64_t percpu_stats_sum(struct percpu_stats *pcs, int stat);
+
+/**
+ * percpu_stats_add - Add the given value to a statistics count
+ * @pcs: Pointer to percpu_stats structure
+ * @stat: The statistics count that needs to be updated
+ * @cnt: The value to be added to the statistics count
+ */
+static inline void
+percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
+{
+ BUG_ON((unsigned int)stat >= pcs->nstats);
+ this_cpu_add(pcs->stats[stat], cnt);
+}
+
+static inline void percpu_stats_inc(struct percpu_stats *pcs, int stat)
+{
+ percpu_stats_add(pcs, stat, 1);
+}
+
+static inline void percpu_stats_dec(struct percpu_stats *pcs, int stat)
+{
+ percpu_stats_add(pcs, stat, -1);
+}
+
+#endif /* _LINUX_PERCPU_STATS_H */
diff --git a/lib/Makefile b/lib/Makefile
index 7bd6fd4..5037c62 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
- once.o
+ once.o percpu_stats.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += hexdump.o
diff --git a/lib/percpu_stats.c b/lib/percpu_stats.c
new file mode 100644
index 0000000..bc9f26d
--- /dev/null
+++ b/lib/percpu_stats.c
@@ -0,0 +1,64 @@
+/*
+ * Simple per-cpu statistics counts that have less overhead than the
+ * per-cpu counters.
+ */
+#include <linux/percpu_stats.h>
+#include <linux/bug.h>
+
+/**
+ * percpu_stats_init - allocate memory for the percpu statistics counts
+ * @pcs: Pointer to percpu_stats structure
+ * @num: Number of statistics counts to be used
+ * Return: 0 if successful, -ENOMEM if memory allocation fails.
+ */
+int percpu_stats_init(struct percpu_stats *pcs, int num)
+{
+ int cpu;
+
+ pcs->nstats = num;
+ pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
+ __alignof__(unsigned long));
+ if (!pcs->stats)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
+ int stat;
+
+ for (stat = 0; stat < pcs->nstats; stat++, pstats++)
+ *pstats = 0;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(percpu_stats_init);
+
+/**
+ * percpu_stats_destroy - free the memory used by the statistics counts
+ * @pcs: Pointer to percpu_stats structure
+ */
+void percpu_stats_destroy(struct percpu_stats *pcs)
+{
+ free_percpu(pcs->stats);
+ pcs->stats = NULL;
+ pcs->nstats = 0;
+}
+EXPORT_SYMBOL(percpu_stats_destroy);
+
+/**
+ * percpu_stats_sum - compute the percpu sum of the given statistics count
+ * @pcs : Pointer to percpu_stats structure
+ * @stat : The statistics count whose sum needs to be computed
+ * Return: Sum of percpu count values
+ */
+uint64_t percpu_stats_sum(struct percpu_stats *pcs, int stat)
+{
+ int cpu;
+ uint64_t sum = 0;
+
+ BUG_ON((unsigned int)stat >= pcs->nstats);
+
+ for_each_possible_cpu(cpu)
+ sum += per_cpu(pcs->stats[stat], cpu);
+ return sum;
+}
+EXPORT_SYMBOL(percpu_stats_sum);
--
1.7.1
This patch changes the es_stats_cache_hits and es_stats_cache_misses
statistics counts to per-cpu variables to reduce the cacheline
contention that arises when multiple threads try to update those
counts simultaneously. It uses the new per-cpu stats APIs provided by
the percpu_stats.h header file.
With a 38-thread fio I/O test on two shared files (on a DAX-mounted
NVDIMM) running on a 4-socket Haswell-EX server with a 4.6-rc1 kernel,
the aggregated bandwidths before and after the patch were:
Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 10173MB/s 16141MB/s +58.7%
Read-write 2830MB/s 4315MB/s +52.5%
Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/extents_status.c | 22 +++++++++++++---------
fs/ext4/extents_status.h | 11 +++++++++--
2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e38b987..bfa8d63 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -825,9 +825,9 @@ out:
es->es_pblk = es1->es_pblk;
if (!ext4_es_is_referenced(es1))
ext4_es_set_referenced(es1);
- stats->es_stats_cache_hits++;
+ percpu_stats_inc(&stats->es_stats, es_stats_cache_hits);
} else {
- stats->es_stats_cache_misses++;
+ percpu_stats_inc(&stats->es_stats, es_stats_cache_misses);
}
read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -1113,9 +1113,9 @@ int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v)
seq_printf(seq, "stats:\n %lld objects\n %lld reclaimable objects\n",
percpu_counter_sum_positive(&es_stats->es_stats_all_cnt),
percpu_counter_sum_positive(&es_stats->es_stats_shk_cnt));
- seq_printf(seq, " %lu/%lu cache hits/misses\n",
- es_stats->es_stats_cache_hits,
- es_stats->es_stats_cache_misses);
+ seq_printf(seq, " %llu/%llu cache hits/misses\n",
+ percpu_stats_sum(&es_stats->es_stats, es_stats_cache_hits),
+ percpu_stats_sum(&es_stats->es_stats, es_stats_cache_misses));
if (inode_cnt)
seq_printf(seq, " %d inodes on list\n", inode_cnt);
@@ -1142,8 +1142,6 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
sbi->s_es_nr_inode = 0;
spin_lock_init(&sbi->s_es_lock);
sbi->s_es_stats.es_stats_shrunk = 0;
- sbi->s_es_stats.es_stats_cache_hits = 0;
- sbi->s_es_stats.es_stats_cache_misses = 0;
sbi->s_es_stats.es_stats_scan_time = 0;
sbi->s_es_stats.es_stats_max_scan_time = 0;
err = percpu_counter_init(&sbi->s_es_stats.es_stats_all_cnt, 0, GFP_KERNEL);
@@ -1153,15 +1151,20 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
if (err)
goto err1;
+ err = percpu_stats_init(&sbi->s_es_stats.es_stats, es_stats_cnt, 0);
+ if (err)
+ goto err2;
+
sbi->s_es_shrinker.scan_objects = ext4_es_scan;
sbi->s_es_shrinker.count_objects = ext4_es_count;
sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
err = register_shrinker(&sbi->s_es_shrinker);
if (err)
- goto err2;
+ goto err3;
return 0;
-
+err3:
+ percpu_stats_destroy(&sbi->s_es_stats.es_stats);
err2:
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
err1:
@@ -1173,6 +1176,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
+ percpu_stats_destroy(&sbi->s_es_stats.es_stats);
unregister_shrinker(&sbi->s_es_shrinker);
}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index f7aa24f..c163189 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -11,6 +11,8 @@
#ifndef _EXT4_EXTENTS_STATUS_H
#define _EXT4_EXTENTS_STATUS_H
+#include <linux/percpu_stats.h>
+
/*
* Turn on ES_DEBUG__ to get lots of info about extent status operations.
*/
@@ -67,10 +69,15 @@ struct ext4_es_tree {
struct extent_status *cache_es; /* recently accessed extent */
};
+enum ext4_es_stat_type {
+ es_stats_cache_hits,
+ es_stats_cache_misses,
+ es_stats_cnt,
+};
+
struct ext4_es_stats {
unsigned long es_stats_shrunk;
- unsigned long es_stats_cache_hits;
- unsigned long es_stats_cache_misses;
+ struct percpu_stats es_stats;
u64 es_stats_scan_time;
u64 es_stats_max_scan_time;
struct percpu_counter es_stats_all_cnt;
--
1.7.1
When performing direct I/O, the current ext4 code does
not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
inode_dio_begin()/inode_dio_end() a second time internally. This
doubling of the inode_dio_begin()/inode_dio_end() calls is wasteful.
This patch removes the extra internal inode_dio_begin()/inode_dio_end()
calls when those calls are being issued by the caller directly. For
really fast storage systems like NVDIMM, the removal of the extra
inode_dio_begin()/inode_dio_end() can give a meaningful boost to
I/O performance.
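Schematically, the resulting pattern in the caller looks like this (a
simplified sketch, not the exact ext4 code):

        /* Take the DIO reference once in the caller ... */
        inode_dio_begin(inode);

        /* ... and tell the generic DIO code not to take it again */
        ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
                                   offset, ext4_dio_get_block, NULL, NULL,
                                   DIO_SKIP_DIO_COUNT);

        inode_dio_end(inode);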
On a 4-socket Haswell-EX system (72 cores) running a 4.6-rc1 kernel,
fio with 38 threads doing parallel I/O on two shared files on an
NVDIMM with DAX gave the following aggregate bandwidths with and
without the patch:
Test W/O patch With patch % change
---- --------- ---------- --------
Read-only 8688MB/s 10173MB/s +17.1%
Read-write 2687MB/s 2830MB/s +5.3%
Signed-off-by: Waiman Long <[email protected]>
---
fs/ext4/indirect.c | 10 ++++++++--
fs/ext4/inode.c | 12 +++++++++---
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa6..4304be6 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -706,14 +706,20 @@ retry:
inode_dio_end(inode);
goto locked;
}
+ /*
+ * Need to pass in DIO_SKIP_DIO_COUNT to prevent
+ * duplicated inode_dio_begin/inode_dio_end sequence.
+ */
if (IS_DAX(inode))
ret = dax_do_io(iocb, inode, iter, offset,
- ext4_dio_get_block, NULL, 0);
+ ext4_dio_get_block, NULL,
+ DIO_SKIP_DIO_COUNT);
else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter,
offset, ext4_dio_get_block,
- NULL, NULL, 0);
+ NULL, NULL,
+ DIO_SKIP_DIO_COUNT);
inode_dio_end(inode);
} else {
locked:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2..779aa33 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3358,9 +3358,15 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
* Make all waiters for direct IO properly wait also for extent
* conversion. This also disallows race between truncate() and
* overwrite DIO as i_dio_count needs to be incremented under i_mutex.
+ *
+ * Both dax_do_io() and __blockdev_direct_IO() will unnecessarily
+ * call inode_dio_begin()/inode_dio_end() again if the
+ * DIO_SKIP_DIO_COUNT flag is not set.
*/
- if (iov_iter_rw(iter) == WRITE)
+ if (iov_iter_rw(iter) == WRITE) {
+ dio_flags = DIO_SKIP_DIO_COUNT;
inode_dio_begin(inode);
+ }
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);
@@ -3393,10 +3399,10 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
get_block_func = ext4_dio_get_block_overwrite;
else if (is_sync_kiocb(iocb)) {
get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
} else {
get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
+ dio_flags |= DIO_LOCKING;
}
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
--
1.7.1
The unsigned long type on 32-bit architectures is only 32 bits wide.
This may not be enough for statistics counts that may well exceed 2^32
over time. This patch optionally enables the use of 64-bit counts on
32-bit architectures, though doing so adds a bit of performance
overhead when enabled.
This patch adds a flags argument to the percpu_stats_init() function:
int percpu_stats_init(struct percpu_stats *pcs, int num, int flags)
Currently, the following two flags are supported:
1) PCPU_STAT_64BIT - enable 64-bit counts
2) PCPU_STAT_INTSAFE - make the 64-bit count updates interrupt safe
The second flag has no effect unless the first one is also set.
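For example, a caller that needs interrupt-safe 64-bit counts on
32-bit kernels would initialize the structure as below (foo_stats and
foo_stats_cnt are hypothetical; on 64-bit architectures both flags are
simply ignored, as the native counts are already 64 bits wide):

        struct percpu_stats foo_stats;
        int err;

        /* 64-bit counts whose updates are safe against interrupts */
        err = percpu_stats_init(&foo_stats, foo_stats_cnt,
                                PCPU_STAT_64BIT | PCPU_STAT_INTSAFE);
        if (err)
                return err;     /* -ENOMEM */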
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/percpu_stats.h | 28 +++++++++++-
lib/percpu_stats.c | 96 +++++++++++++++++++++++++++++++++++-------
2 files changed, 105 insertions(+), 19 deletions(-)
diff --git a/include/linux/percpu_stats.h b/include/linux/percpu_stats.h
index ed6e8ac..641f211 100644
--- a/include/linux/percpu_stats.h
+++ b/include/linux/percpu_stats.h
@@ -6,15 +6,34 @@
*/
#include <linux/percpu.h>
#include <linux/types.h>
+#include <linux/u64_stats_sync.h>
+
+/*
+ * Supported flags for percpu_stats_init()
+ */
+#define PCPU_STAT_64BIT 1 /* Use 64-bit statistics count */
+#define PCPU_STAT_INTSAFE 2 /* Make percpu_add interrupt safe */
struct percpu_stats {
- unsigned long __percpu *stats;
+ union {
+ unsigned long __percpu *stats;
+ uint64_t __percpu *stats64;
+ };
+ struct u64_stats_sync sync;
int nstats; /* Number of statistics counts in stats array */
+ int flags;
};
extern void percpu_stats_destroy(struct percpu_stats *pcs);
-extern int percpu_stats_init(struct percpu_stats *pcs, int num);
+extern int percpu_stats_init(struct percpu_stats *pcs, int num, int flags);
extern uint64_t percpu_stats_sum(struct percpu_stats *pcs, int stat);
+extern void __percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt);
+
+#ifdef CONFIG_64BIT
+#define PERCPU_STATS_FLAGS(pcs) false
+#else
+#define PERCPU_STATS_FLAGS(pcs) ((pcs)->flags)
+#endif
/**
* percpu_stats_add - Add the given value to a statistics count
@@ -26,7 +45,10 @@ static inline void
percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
{
BUG_ON((unsigned int)stat >= pcs->nstats);
- this_cpu_add(pcs->stats[stat], cnt);
+ if (unlikely(PERCPU_STATS_FLAGS(pcs)))
+ __percpu_stats_add(pcs, stat, cnt);
+ else
+ this_cpu_add(pcs->stats[stat], cnt);
}
static inline void percpu_stats_inc(struct percpu_stats *pcs, int stat)
diff --git a/lib/percpu_stats.c b/lib/percpu_stats.c
index bc9f26d..2ec739e 100644
--- a/lib/percpu_stats.c
+++ b/lib/percpu_stats.c
@@ -5,29 +5,47 @@
#include <linux/percpu_stats.h>
#include <linux/bug.h>
+#ifdef CONFIG_64BIT
+/*
+ * Ignore PCPU_STAT_64BIT & PCPU_STAT_INTSAFE flags for 64-bit architectures
+ * as 64-bit count is the default.
+ */
+#define IS_STATS64(pcs) false
+#define GET_FLAGS(f) ((f) & ~(PCPU_STAT_64BIT | PCPU_STAT_INTSAFE))
+#else
+#define IS_STATS64(pcs) ((pcs)->flags & PCPU_STAT_64BIT)
+#define GET_FLAGS(f) (f)
+#endif
+
/**
* percpu_stats_init - allocate memory for the percpu statistics counts
- * @pcs: Pointer to percpu_stats structure
- * @num: Number of statistics counts to be used
+ * @pcs : Pointer to percpu_stats structure
+ * @num : Number of statistics counts to be used
+ * @flags: Optional feature bits
* Return: 0 if successful, -ENOMEM if memory allocation fails.
*/
-int percpu_stats_init(struct percpu_stats *pcs, int num)
+int percpu_stats_init(struct percpu_stats *pcs, int num, int flags)
{
- int cpu;
+ int cpu, size;
+ pcs->flags = GET_FLAGS(flags);
pcs->nstats = num;
- pcs->stats = __alloc_percpu(sizeof(unsigned long) * num,
- __alignof__(unsigned long));
- if (!pcs->stats)
- return -ENOMEM;
+ if (IS_STATS64(pcs)) {
+ size = sizeof(uint64_t) * num;
+ pcs->stats64 = __alloc_percpu(size, __alignof__(uint64_t));
+ if (!pcs->stats64)
+ return -ENOMEM;
+ u64_stats_init(&pcs->sync);
+ } else {
+ size = sizeof(unsigned long) * num;
+ pcs->stats = __alloc_percpu(size, __alignof__(unsigned long));
+ if (!pcs->stats)
+ return -ENOMEM;
+ }
- for_each_possible_cpu(cpu) {
- unsigned long *pstats = per_cpu_ptr(pcs->stats, cpu);
- int stat;
+ for_each_possible_cpu(cpu)
+ memset(per_cpu_ptr(pcs->stats, cpu), 0, size);
- for (stat = 0; stat < pcs->nstats; stat++, pstats++)
- *pstats = 0;
- }
return 0;
}
EXPORT_SYMBOL(percpu_stats_init);
@@ -57,8 +75,54 @@ uint64_t percpu_stats_sum(struct percpu_stats *pcs, int stat)
BUG_ON((unsigned int)stat >= pcs->nstats);
- for_each_possible_cpu(cpu)
- sum += per_cpu(pcs->stats[stat], cpu);
+ if (IS_STATS64(pcs)) {
+ for_each_possible_cpu(cpu) {
+ uint64_t val;
+ unsigned int seq;
+
+ do {
+ seq = u64_stats_fetch_begin(&pcs->sync);
+ val = per_cpu(pcs->stats64[stat], cpu);
+ } while (u64_stats_fetch_retry(&pcs->sync, seq));
+ sum += val;
+ }
+ } else {
+ for_each_possible_cpu(cpu)
+ sum += per_cpu(pcs->stats[stat], cpu);
+ }
return sum;
}
EXPORT_SYMBOL(percpu_stats_sum);
+
+/**
+ * __percpu_stats_add - add given count to percpu value
+ * @pcs : Pointer to percpu_stats structure
+ * @stat: The statistics count that needs to be updated
+ * @cnt: The value to be added to the statistics count
+ */
+void __percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
+{
+ /*
+ * u64_stats_update_begin/u64_stats_update_end alone are not safe
+ * against recursive add on the same CPU caused by interrupt.
+ * So we need to set the PCPU_STAT_INTSAFE flag if this is required.
+ */
+ if (IS_STATS64(pcs)) {
+ uint64_t *pstats64;
+ unsigned long flags;
+
+ pstats64 = get_cpu_ptr(pcs->stats64);
+ if (pcs->flags & PCPU_STAT_INTSAFE)
+ local_irq_save(flags);
+
+ u64_stats_update_begin(&pcs->sync);
+ pstats64[stat] += cnt;
+ u64_stats_update_end(&pcs->sync);
+
+ if (pcs->flags & PCPU_STAT_INTSAFE)
+ local_irq_restore(flags);
+
+ put_cpu_ptr(pcs->stats64);
+ }
+}
+EXPORT_SYMBOL(__percpu_stats_add);
--
1.7.1
Hi Waiman,
[auto build test ERROR on ext4/dev]
[also build test ERROR on v4.6-rc2 next-20160408]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url: https://github.com/0day-ci/linux/commits/Waiman-Long/ext4-Improve-parallel-I-O-performance-on-NVDIMM/20160409-002128
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: alpha-defconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=alpha
Note: the linux-review/Waiman-Long/ext4-Improve-parallel-I-O-performance-on-NVDIMM/20160409-002128 HEAD 712a939b92b9178cb79df4050bba8e6b1d03ca63 builds fine.
It only hurts bisectibility.
All errors (new ones prefixed by >>):
In file included from lib/percpu_stats.c:5:0:
include/linux/percpu_stats.h: In function 'percpu_stats_add':
>> include/linux/percpu_stats.h:29:2: error: implicit declaration of function 'raw_local_irq_save' [-Werror=implicit-function-declaration]
this_cpu_add(pcs->stats[stat], cnt);
^
>> include/linux/percpu_stats.h:29:2: error: implicit declaration of function 'raw_local_irq_restore' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
vim +/raw_local_irq_save +29 include/linux/percpu_stats.h
23 * @cnt: The value to be added to the statistics count
24 */
25 static inline void
26 percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
27 {
28 BUG_ON((unsigned int)stat >= pcs->nstats);
> 29 this_cpu_add(pcs->stats[stat], cnt);
30 }
31
32 static inline void percpu_stats_inc(struct percpu_stats *pcs, int stat)
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hello, Waiman.
On Fri, Apr 08, 2016 at 12:16:20PM -0400, Waiman Long wrote:
> +/**
> + * __percpu_stats_add - add given count to percpu value
> + * @pcs : Pointer to percpu_stats structure
> + * @stat: The statistics count that needs to be updated
> + * @cnt: The value to be added to the statistics count
> + */
> +void __percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
> +{
> + /*
> + * u64_stats_update_begin/u64_stats_update_end alone are not safe
> + * against recursive add on the same CPU caused by interrupt.
> + * So we need to set the PCPU_STAT_INTSAFE flag if this is required.
> + */
> + if (IS_STATS64(pcs)) {
> + uint64_t *pstats64;
> + unsigned long flags;
> +
> + pstats64 = get_cpu_ptr(pcs->stats64);
> + if (pcs->flags & PCPU_STAT_INTSAFE)
> + local_irq_save(flags);
> +
> + u64_stats_update_begin(&pcs->sync);
> + pstats64[stat] += cnt;
> + u64_stats_update_end(&pcs->sync);
> +
> + if (pcs->flags & PCPU_STAT_INTSAFE)
> + local_irq_restore(flags);
> +
> + put_cpu_ptr(pcs->stats64);
> + }
> +}
Heh, that's a handful, and, right, u64_stats needs separate irq
protection. I'm not sure. If we have to do the above, it's likely
that it'll perform worse than percpu_counter on 32bits. On 64bits,
percpu_counter would incur extra preempt_disable/enable() operations
but that comes from it not using this_cpu_add_return(). I wonder
whether it'd be better to either use percpu_counter instead or if
necessary extend it to handle multiple counters. What do you think?
Thanks.
--
tejun
On 04/08/2016 12:47 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Apr 08, 2016 at 12:16:20PM -0400, Waiman Long wrote:
>> +/**
>> + * __percpu_stats_add - add given count to percpu value
>> + * @pcs : Pointer to percpu_stats structure
>> + * @stat: The statistics count that needs to be updated
>> + * @cnt: The value to be added to the statistics count
>> + */
>> +void __percpu_stats_add(struct percpu_stats *pcs, int stat, int cnt)
>> +{
>> + /*
>> + * u64_stats_update_begin/u64_stats_update_end alone are not safe
>> + * against recursive add on the same CPU caused by interrupt.
>> + * So we need to set the PCPU_STAT_INTSAFE flag if this is required.
>> + */
>> + if (IS_STATS64(pcs)) {
>> + uint64_t *pstats64;
>> + unsigned long flags;
>> +
>> + pstats64 = get_cpu_ptr(pcs->stats64);
>> + if (pcs->flags & PCPU_STAT_INTSAFE)
>> + local_irq_save(flags);
>> +
>> + u64_stats_update_begin(&pcs->sync);
>> + pstats64[stat] += cnt;
>> + u64_stats_update_end(&pcs->sync);
>> +
>> + if (pcs->flags & PCPU_STAT_INTSAFE)
>> + local_irq_restore(flags);
>> +
>> + put_cpu_ptr(pcs->stats64);
>> + }
>> +}
> Heh, that's a handful, and, right, u64_stats needs separate irq
> protection. I'm not sure. If we have to do the above, it's likely
> that it'll perform worse than percpu_counter on 32bits. On 64bits,
> percpu_counter would incur extra preempt_disable/enable() operations
> but that comes from it not using this_cpu_add_return(). I wonder
> whether it'd be better to either use percpu_counter instead or if
> necessary extend it to handle multiple counters. What do you think?
>
> Thanks.
>
Yes, I think it will be more efficient to use percpu_counter in this
case. The preempt_disable/enable() calls are pretty cheap. Once in a
while, you need to take the lock and update the global count. How about
I change the 2nd patch to use percpu_counter internally when 64-bit
counts are needed in 32-bit archs, but use the regular percpu counts on
64-bit archs? If you are OK with that, I can update the patch accordingly.
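Roughly, I am thinking of something like the following untested sketch
(the counters[] member is hypothetical):

/* Untested sketch -- hide the 32-bit case behind the same API */
static inline void percpu_stats_inc(struct percpu_stats *pcs, int stat)
{
#ifdef CONFIG_64BIT
        this_cpu_inc(pcs->stats[stat]);           /* native 64-bit count */
#else
        percpu_counter_inc(&pcs->counters[stat]); /* percpu_counter per stat */
#endif
}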
Cheers,
Longman
Hello,
On Fri, Apr 08, 2016 at 01:32:52PM -0400, Waiman Long wrote:
> Yes, I think it will be more efficient to use percpu_counter in this case.
> The preempt_disable/enable() calls are pretty cheap. Once in a while, you
> need to take the lock and update the global count. How about I change the
> 2nd patch to use percpu_counter internally when 64-bit counts are needed in
> 32-bit archs, but use the regular percpu counts on 64-bit archs? If you are
> OK with that, I can update the patch accordingly.
Does having percpu_stats as a separate construct make sense after
that? Just use percpu_counter directly? You end up wasting a bit
more space that way, but most of the space overhead for these things
is in the percpu part anyway, so in proportion it shouldn't make that
much of a difference.
Thanks.
--
tejun
Hi Waiman,
[auto build test ERROR on ext4/dev]
[also build test ERROR on v4.6-rc2 next-20160408]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url: https://github.com/0day-ci/linux/commits/Waiman-Long/ext4-Improve-parallel-I-O-performance-on-NVDIMM/20160409-002128
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: um-x86_64_defconfig (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=um SUBARCH=x86_64
Note: the linux-review/Waiman-Long/ext4-Improve-parallel-I-O-performance-on-NVDIMM/20160409-002128 HEAD 712a939b92b9178cb79df4050bba8e6b1d03ca63 builds fine.
It only hurts bisectibility.
All error/warnings (new ones prefixed by >>):
In file included from arch/um/include/generated/asm/percpu.h:1:0,
from include/linux/percpu.h:12,
from include/linux/percpu_stats.h:7,
from lib/percpu_stats.c:5:
include/linux/percpu_stats.h: In function 'percpu_stats_add':
>> include/asm-generic/percpu.h:120:2: error: implicit declaration of function 'raw_local_irq_save' [-Werror=implicit-function-declaration]
raw_local_irq_save(__flags); \
^
>> include/asm-generic/percpu.h:322:34: note: in expansion of macro 'this_cpu_generic_to_op'
#define this_cpu_add_1(pcp, val) this_cpu_generic_to_op(pcp, val, +=)
^
>> include/linux/percpu-defs.h:364:11: note: in expansion of macro 'this_cpu_add_1'
case 1: stem##1(variable, __VA_ARGS__);break; \
^
include/linux/percpu-defs.h:496:33: note: in expansion of macro '__pcpu_size_call'
#define this_cpu_add(pcp, val) __pcpu_size_call(this_cpu_add_, pcp, val)
^
>> include/linux/percpu_stats.h:29:2: note: in expansion of macro 'this_cpu_add'
this_cpu_add(pcs->stats[stat], cnt);
^
>> include/asm-generic/percpu.h:122:2: error: implicit declaration of function 'raw_local_irq_restore' [-Werror=implicit-function-declaration]
raw_local_irq_restore(__flags); \
^
>> include/asm-generic/percpu.h:322:34: note: in expansion of macro 'this_cpu_generic_to_op'
#define this_cpu_add_1(pcp, val) this_cpu_generic_to_op(pcp, val, +=)
^
>> include/linux/percpu-defs.h:364:11: note: in expansion of macro 'this_cpu_add_1'
case 1: stem##1(variable, __VA_ARGS__);break; \
^
include/linux/percpu-defs.h:496:33: note: in expansion of macro '__pcpu_size_call'
#define this_cpu_add(pcp, val) __pcpu_size_call(this_cpu_add_, pcp, val)
^
>> include/linux/percpu_stats.h:29:2: note: in expansion of macro 'this_cpu_add'
this_cpu_add(pcs->stats[stat], cnt);
^
cc1: some warnings being treated as errors
vim +/raw_local_irq_save +120 include/asm-generic/percpu.h
eba117889a Tejun Heo 2014-06-17 114 __ret; \
9c28278a24 Tejun Heo 2014-06-17 115 })
9c28278a24 Tejun Heo 2014-06-17 116
eba117889a Tejun Heo 2014-06-17 117 #define this_cpu_generic_to_op(pcp, val, op) \
9c28278a24 Tejun Heo 2014-06-17 118 do { \
eba117889a Tejun Heo 2014-06-17 119 unsigned long __flags; \
eba117889a Tejun Heo 2014-06-17 @120 raw_local_irq_save(__flags); \
9c28278a24 Tejun Heo 2014-06-17 121 *raw_cpu_ptr(&(pcp)) op val; \
eba117889a Tejun Heo 2014-06-17 @122 raw_local_irq_restore(__flags); \
9c28278a24 Tejun Heo 2014-06-17 123 } while (0)
9c28278a24 Tejun Heo 2014-06-17 124
eba117889a Tejun Heo 2014-06-17 125 #define this_cpu_generic_add_return(pcp, val) \
:::::: The code at line 120 was first introduced by commit
:::::: eba117889ac444bea6e8270049cbaeed48169889 percpu: preffity percpu header files
:::::: TO: Tejun Heo <[email protected]>
:::::: CC: Tejun Heo <[email protected]>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 04/08/2016 01:46 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Apr 08, 2016 at 01:32:52PM -0400, Waiman Long wrote:
>> Yes, I think it will be more efficient to use percpu_counter in this case.
>> The preempt_disable/enable() calls are pretty cheap. Once in a while, you
>> need to take the lock and update the global count. How about I change the
>> 2nd patch to use percpu_counter internally when 64-bit counts are needed in
>> 32-bit archs, but use the regular percpu counts on 64-bit archs? If you are
>> OK with that, I can update the patch accordingly.
> Does having percpu_stats as a separate construct make sense after
> that? Just use percpu_counter directly? You end up wasting a bit
> more space that way but most of space overhead for these things are in
> percpu part anyway, so in proportion it shouldn't make that much of a
> difference.
>
> Thanks.
>
The percpu_stats construct keeps the overhead of maintaining
statistics counts to a minimum. The percpu_counter construct, on the
other hand, has a higher performance overhead and is a bit more complex
to set up and tear down when more than one statistics count is needed.
In fact, my first draft of the ext4 patch used percpu_counter for this
purpose. However, I just feel that percpu_counter is overkill if all we
want is to keep some counts and have their sums returned on request.
Those are the main reasons for creating a separate percpu_stats.
Of course, if you think we don't need a separate percpu_stats
construct, I am fine with that and can rework the ext4 patch to use
percpu_counter instead.
Cheers,
Longman
Hello, Waiman.
On Fri, Apr 08, 2016 at 02:45:38PM -0400, Waiman Long wrote:
> The percpu_stats construct keeps the overhead of maintaining statistics
> counts to a minimum. The percpu_counter construct, on the other hand, has a
> higher performance overhead and is a bit more complex to set up and tear down when
If you're referring to the preemption on/off, as I wrote before, we'll
probably be able to improve that with this_cpu_add_return so that the
only extra overhead is an easily predictable branch which is extremely
cheap. It's better to improve common constructs anyway.
> more than one statistics count is needed. In fact, my first draft of the
And yeah, it can be cumbersome to set up and tear down multiple
percpu_counters. If there are enough consumers, we can extend
percpu_counter to handle multiple counters, right?
Thanks.
--
tejun
On 04/11/2016 06:17 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Apr 08, 2016 at 02:45:38PM -0400, Waiman Long wrote:
>> The percpu_stats construct keeps the overhead of maintaining statistics
>> counts to a minimum. The percpu_counter construct, on the other hand, has a
>> higher performance overhead and is a bit more complex to set up and tear down when
> If you're referring to the preemption on/off, as I wrote before, we'll
> probably be able to improve that with this_cpu_add_return so that the
> only extra overhead is an easily predictable branch which is extremely
> cheap. It's better to improve common constructs anyway.
>
>> more than one statistics count is needed. In fact, my first draft of the
> And yeah, it can be cumbersome to set up and tear down multiple
> percpu_counters. If there are enough consumers, we can extend
> percpu_counter to handle multiple counters, right?
>
> Thanks.
>
I have updated my patch to use percpu_counter instead. Thanks for all
the comments and review so far.
Cheers,
Longman