2016-03-05 02:52:04

by Waiman Long

[permalink] [raw]
Subject: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

This patchset allows the degeneration of per-cpu counters back to
global counters when:

1) The number of CPUs in the system is large, hence a high cost for
calling percpu_counter_sum().
2) The initial count value is small so that it has a high chance of
excessive percpu_counter_sum() calls.

When the above 2 conditions are true, this patchset allows the user of
per-cpu counters to selectively degenerate them into global counters
with lock. This is done by calling the new percpu_counter_set_limit()
API after percpu_counter_set(). Without this call, there is no change
in the behavior of the per-cpu counters.

Patch 1 implements the new percpu_counter_set_limit() API.

Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
per-cpu counters.

Waiman Long (2):
percpu_counter: Allow falling back to global counter on large system
xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_mount.h | 5 +++
fs/xfs/xfs_super.c | 6 +++
include/linux/percpu_counter.h | 10 +++++
lib/percpu_counter.c | 72 +++++++++++++++++++++++++++++++++++++++-
5 files changed, 92 insertions(+), 2 deletions(-)


2016-03-05 02:52:10

by Waiman Long

[permalink] [raw]
Subject: [RFC PATCH 2/2] xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

Small XFS filesystems on systems with large number of CPUs can incur a
significant overhead due to excessive calls to the percpu_counter_sum()
function which needs to walk through a large number of different
cachelines.

This patch uses the newly added percpu_counter_set_limit() API to
potentially switch the m_fdblocks and m_ifree per-cpu counters to
a global counter with locks at filesystem mount time if its size
is small relatively to the number of CPUs available.

A possible use case is the use of the NVDIMM as an application scratch
storage area for log file and other small files. Current battery-backed
NVDIMMs are pretty small in size, e.g. 8G per DIMM. So we cannot create
large filesystem on top of them.

On a 4-socket 80-thread system running 4.5-rc6 kernel, this patch can
improve the throughput of the AIM7 XFS disk workload by 25%. Before
the patch, the perf profile was:

18.68% 0.08% reaim [k] __percpu_counter_compare
18.05% 9.11% reaim [k] __percpu_counter_sum
0.37% 0.36% reaim [k] __percpu_counter_add

After the patch, the perf profile was:

0.73% 0.36% reaim [k] __percpu_counter_add
0.27% 0.27% reaim [k] __percpu_counter_compare

Signed-off-by: Waiman Long <[email protected]>
---
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_mount.h | 5 +++++
fs/xfs/xfs_super.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bb753b3..fe74b91 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1163,7 +1163,6 @@ xfs_mod_ifree(
* a large batch count (1024) to minimise global counter updates except when
* we get near to ENOSPC and we have to be very accurate with our updates.
*/
-#define XFS_FDBLOCKS_BATCH 1024
int
xfs_mod_fdblocks(
struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b570984..d9520f4 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -206,6 +206,11 @@ typedef struct xfs_mount {
#define XFS_WSYNC_WRITEIO_LOG 14 /* 16k */

/*
+ * FD blocks batch size for per-cpu compare
+ */
+#define XFS_FDBLOCKS_BATCH 1024
+
+/*
* Allow large block sizes to be reported to userspace programs if the
* "largeio" mount option is used.
*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 59c9b7b..c0b4f79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1412,6 +1412,12 @@ xfs_reinit_percpu_counters(
percpu_counter_set(&mp->m_icount, mp->m_sb.sb_icount);
percpu_counter_set(&mp->m_ifree, mp->m_sb.sb_ifree);
percpu_counter_set(&mp->m_fdblocks, mp->m_sb.sb_fdblocks);
+
+ /*
+ * Use default batch size for m_ifree
+ */
+ percpu_counter_set_limit(&mp->m_ifree, 0);
+ percpu_counter_set_limit(&mp->m_fdblocks, 4 * XFS_FDBLOCKS_BATCH);
}

static void
--
1.7.1

2016-03-05 02:52:09

by Waiman Long

[permalink] [raw]
Subject: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system

Per-cpu counters are used in quite a number of places within
the kernel. On large system with a lot of CPUs, however, doing a
percpu_counter_sum() can be very expensive as nr_cpu cachelines will
need to be read. In __percpu_counter_compare(), the chance of calling
percpu_counter_sum() also increases with increasing number of CPUs
if the global counter value is relatively small.

On large system, using a global counter with lock may actually be
faster than doing a percpu_counter_sum() which can be frequently
called from __percpu_counter_compare().

This patch provides a mechanism to selectively degenerate per-cpu
counters to global counters at per-cpu counter initialization time. The
following new API is added:

percpu_counter_set_limit(struct percpu_counter *fbc,
u32 percpu_limit)

The function should be called after percpu_counter_set(). It will
compare the total limit (nr_cpu * percpu_limit) against the current
counter value. If the limit is not smaller, it will disable per-cpu
counter and use only the global counter instead. At run time, when
the counter value grows past the total limit, per-cpu counter will
be enabled again.

Runtime disabling of per-cpu counters, however, is not currently
supported as it will slow down the per-cpu fast path.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/percpu_counter.h | 10 +++++
lib/percpu_counter.c | 72 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 84a1094..04a3783 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,14 @@

#ifdef CONFIG_SMP

+/*
+ * The per-cpu counter will be degenerated into a global counter when limit
+ * is set at initialization time. It will change back to a real per-cpu
+ * counter once the count exceed the given limit.
+ */
struct percpu_counter {
raw_spinlock_t lock;
+ u32 limit;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
@@ -42,6 +48,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit);

static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
{
@@ -170,6 +177,9 @@ static inline int percpu_counter_initialized(struct percpu_counter *fbc)
return 1;
}

+static inline void percpu_counter_set_limit(struct percpu_counter *fbc,
+ u32 percpu_limit) { }
+
#endif /* CONFIG_SMP */

static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index f051d69..f101c06 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -75,11 +75,25 @@ EXPORT_SYMBOL(percpu_counter_set);
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
s64 count;
+ unsigned long flags;
+
+ if (fbc->limit) {
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (unlikely(!fbc->limit)) {
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ goto percpu_add;
+ }
+ fbc->count += amount;
+ if (abs(fbc->count) > fbc->limit)
+ fbc->limit = 0; /* Revert back to per-cpu counter */

+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+ return;
+ }
+percpu_add:
preempt_disable();
count = __this_cpu_read(*fbc->counters) + amount;
if (count >= batch || count <= -batch) {
- unsigned long flags;
raw_spin_lock_irqsave(&fbc->lock, flags);
fbc->count += count;
__this_cpu_sub(*fbc->counters, count - amount);
@@ -94,6 +108,8 @@ EXPORT_SYMBOL(__percpu_counter_add);
/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
+ *
+ * If a limit is set, the count can be returned directly without locking.
*/
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
@@ -101,6 +117,9 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
int cpu;
unsigned long flags;

+ if (READ_ONCE(fbc->limit))
+ return READ_ONCE(fbc->count);
+
raw_spin_lock_irqsave(&fbc->lock, flags);
ret = fbc->count;
for_each_online_cpu(cpu) {
@@ -120,6 +139,7 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
raw_spin_lock_init(&fbc->lock);
lockdep_set_class(&fbc->lock, key);
fbc->count = amount;
+ fbc->limit = 0;
fbc->counters = alloc_percpu_gfp(s32, gfp);
if (!fbc->counters)
return -ENOMEM;
@@ -202,6 +222,9 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
s64 count;

count = percpu_counter_read(fbc);
+ if (READ_ONCE(fbc->limit))
+ goto compare;
+
/* Check to see if rough count will be sufficient for comparison */
if (abs(count - rhs) > (batch * num_online_cpus())) {
if (count > rhs)
@@ -211,6 +234,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
/* Need to use precise count */
count = percpu_counter_sum(fbc);
+compare:
if (count > rhs)
return 1;
else if (count < rhs)
@@ -220,6 +244,52 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
}
EXPORT_SYMBOL(__percpu_counter_compare);

+/*
+ * Set the limit if the count is less than the given per-cpu limit * # of cpus.
+ *
+ * This function should only be called at initialization time right after
+ * percpu_counter_set(). Limit will only be set if there is more than
+ * 32 cpus in the system and the current counter value is not bigger than
+ * the limit. Once it is set, it can be cleared as soon as the counter
+ * value exceeds the given limit and real per-cpu counters are used again.
+ * However, switching from per-cpu counters back to global counter is not
+ * currently supported as that will slow down the per-cpu counter fastpath.
+ *
+ * The magic number 32 is chosen to be a compromise between the cost of
+ * reading all the per-cpu counters and that of locking. It can be changed
+ * if there is a better value.
+ */
+#define PERCPU_SET_LIMIT_CPU_THRESHOLD 32
+void percpu_counter_set_limit(struct percpu_counter *fbc, u32 percpu_limit)
+{
+ unsigned long flags;
+ int nrcpus = num_possible_cpus();
+ u32 limit;
+
+ if (nrcpus <= PERCPU_SET_LIMIT_CPU_THRESHOLD)
+ return;
+
+ if (!fbc->count) {
+ WARN(1, "percpu_counter_set_limit() called without an initial counter value!\n");
+ return;
+ }
+ /*
+ * Use default batch size if the given percpu limit is 0.
+ */
+ if (!percpu_limit)
+ percpu_limit = percpu_counter_batch;
+ limit = percpu_limit * nrcpus;
+
+ /*
+ * Limit will not be set if the count is large enough
+ */
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ if (abs(fbc->count) <= limit)
+ fbc->limit = limit;
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_set_limit);
+
static int __init percpu_counter_startup(void)
{
compute_batch_value();
--
1.7.1

2016-03-05 06:34:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> This patchset allows the degeneration of per-cpu counters back to
> global counters when:
>
> 1) The number of CPUs in the system is large, hence a high cost for
> calling percpu_counter_sum().
> 2) The initial count value is small so that it has a high chance of
> excessive percpu_counter_sum() calls.
>
> When the above 2 conditions are true, this patchset allows the user of
> per-cpu counters to selectively degenerate them into global counters
> with lock. This is done by calling the new percpu_counter_set_limit()
> API after percpu_counter_set(). Without this call, there is no change
> in the behavior of the per-cpu counters.
>
> Patch 1 implements the new percpu_counter_set_limit() API.
>
> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
> per-cpu counters.
>
> Waiman Long (2):
> percpu_counter: Allow falling back to global counter on large system
> xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

NACK.

This change to turns off per-counter free block counters for 32p for
the XFS free block counters. We proved 10 years ago that a global
lock for these counters was a massive scalability limitation for
concurrent buffered writes on 16p machines.

IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.

The behaviour you are seeing only occurs when the filesystem is near
to ENOSPC. As i asked you last time - if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.

IOWs, please stop trying to optimise a filesystem slow path that:

a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
computationally expensive as we close in on ENOSPC,
c) we start to execute blocking data flush operations that
slow everything down massively, and
d) is indicative that the workload is about to suffer
from a fatal, unrecoverable error (i.e. ENOSPC)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-07 17:42:33

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

On 03/05/2016 01:34 AM, Dave Chinner wrote:
> On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
>> This patchset allows the degeneration of per-cpu counters back to
>> global counters when:
>>
>> 1) The number of CPUs in the system is large, hence a high cost for
>> calling percpu_counter_sum().
>> 2) The initial count value is small so that it has a high chance of
>> excessive percpu_counter_sum() calls.
>>
>> When the above 2 conditions are true, this patchset allows the user of
>> per-cpu counters to selectively degenerate them into global counters
>> with lock. This is done by calling the new percpu_counter_set_limit()
>> API after percpu_counter_set(). Without this call, there is no change
>> in the behavior of the per-cpu counters.
>>
>> Patch 1 implements the new percpu_counter_set_limit() API.
>>
>> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
>> per-cpu counters.
>>
>> Waiman Long (2):
>> percpu_counter: Allow falling back to global counter on large system
>> xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
> NACK.
>
> This change to turns off per-counter free block counters for 32p for
> the XFS free block counters. We proved 10 years ago that a global
> lock for these counters was a massive scalability limitation for
> concurrent buffered writes on 16p machines.
>
> IOWs, this change is going to cause fast path concurrent sequential
> write regressions for just about everyone, even on empty
> filesystems.

That is not really the case here. The patch won't change anything if
there is enough free blocks available in the filesystem. It will turn on
global lock at mount time iff the number of free blocks available is
less than the given limit. In the case of XFS, it is 12MB per CPU. On
the 80-thread system that I used for testing, it will be a bit less than
1GB. Even if global lock is enabled at the beginning, it will be
transitioned back to percpu lock as soon as enough free blocks become
available.

I am aware that if there are enough threads pounding on the lock, it can
cause a scalability bottleneck. However, the qspinlock used in x86
should greatly alleviate the scalability impact compared with 10 years
ago when we used the ticket lock. BTW, what exactly was the
microbenchmark that you used to exercise concurrent sequential write? I
would like to try it out on the new hardware and kernel.

The AIM7 microbenchmark that I used was not able to generate more than
1% CPU time in spinlock contention for __percpu_counter_add() on my
80-thread test system. On the other hand, the overhead of doing
percpu_counter_sum() had consumed more than 18% of CPU time with the
same microbenchmark when the filesystem was small. If the number of
__percpu_counter_add() call is large enough to cause significant
spinlock contention, I think the time wasted in percpu_counter_sum()
will be even more for a small filesytem. In the borderline case when the
filesystem is small enough to trigger the use of global lock with my
patch, but not small enough to trigger excessive percpu_counter_sum()
call, then my patch will have caused a degradation in performance.

So I don't think this patch will cause any problem with the free block
count. The other percpu count m_ifree, however, is a problem in the
current code. It used the default batch size, which is my 80-thread
system, is 12800 (2*nr_cpus^2). However, the number of free inodes in
the in the various XFS filesystems were less than 2k. So
percpu_counter_sum() was called every time xfs_mod_ifree() was called.
This costed about 3%CPU time with my microbenchmark, which was also
eliminated by my patch.

> The behaviour you are seeing only occurs when the filesystem is near
> to ENOSPC. As i asked you last time - if you want to make this
> problem go away, please increase the size of the filesystem you are
> running your massively concurrent benchmarks on.
>
> IOWs, please stop trying to optimise a filesystem slow path that:
>
> a) 99.9% of production workloads never execute,
> b) where we expect performance to degrade as allocation gets
> computationally expensive as we close in on ENOSPC,
> c) we start to execute blocking data flush operations that
> slow everything down massively, and
> d) is indicative that the workload is about to suffer
> from a fatal, unrecoverable error (i.e. ENOSPC)
>

I totally agree. I am not trying to optimize a filesystem slowpath.
There are use cases, however, where we may want to create relatively
small filesystem. One example that I cited in patch 2 is the battery
backed NVDIMM that I have played with recently. They can be used for log
files or other small files. Each dimm is 8 GB. You can have a few of
those available. So the filesystem size could be 32GB or so. That can
come close to the the limit where excessive percpu_counter_sum() call
can happen. What I want to do here is to try to reduce the chance of
excessive percpu_counter_sum() calls causing a performance problem. For
a large filesystem that is nowhere near ENOSPC, my patch will have no
performance impact whatsoever.

Cheers,
Longman


Subject: Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system

On Fri, 4 Mar 2016, Waiman Long wrote:

> This patch provides a mechanism to selectively degenerate per-cpu
> counters to global counters at per-cpu counter initialization time. The
> following new API is added:
>
> percpu_counter_set_limit(struct percpu_counter *fbc,
> u32 percpu_limit)
>
> The function should be called after percpu_counter_set(). It will
> compare the total limit (nr_cpu * percpu_limit) against the current
> counter value. If the limit is not smaller, it will disable per-cpu
> counter and use only the global counter instead. At run time, when
> the counter value grows past the total limit, per-cpu counter will
> be enabled again.

Hmmm... That is requiring manual setting of a limit. Would it not be
possible to completely automatize the switch over? F.e. one could
keep a cpumask of processors that use the per cpu counters.

Then in the fastpath if the current cpu is a member increment the per cpu
counter. If not do the spinlock thing. If there is contention add the
cpu to the cpumask and use the per cpu counters. Thus automatically
scaling for the processors on which frequent increments are operating.

Then regularly (once per minute or so) degenerate the counter by folding
the per cpu diffs into the global count and zapping the cpumask.

If the cpumask is empty you can use the global count. Otherwise you just
need to add up the counters of the cpus set in the cpumask.

2016-03-07 19:50:36

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system

On 03/07/2016 01:24 PM, Christoph Lameter wrote:
> On Fri, 4 Mar 2016, Waiman Long wrote:
>
>> This patch provides a mechanism to selectively degenerate per-cpu
>> counters to global counters at per-cpu counter initialization time. The
>> following new API is added:
>>
>> percpu_counter_set_limit(struct percpu_counter *fbc,
>> u32 percpu_limit)
>>
>> The function should be called after percpu_counter_set(). It will
>> compare the total limit (nr_cpu * percpu_limit) against the current
>> counter value. If the limit is not smaller, it will disable per-cpu
>> counter and use only the global counter instead. At run time, when
>> the counter value grows past the total limit, per-cpu counter will
>> be enabled again.
> Hmmm... That is requiring manual setting of a limit. Would it not be
> possible to completely automatize the switch over? F.e. one could
> keep a cpumask of processors that use the per cpu counters.

The limit is usually the batch size used or a multiple of it.

> Then in the fastpath if the current cpu is a member increment the per cpu
> counter. If not do the spinlock thing. If there is contention add the
> cpu to the cpumask and use the per cpu counters. Thus automatically
> scaling for the processors on which frequent increments are operating.

That is an interesting idea. I will do some prototyping and see how it
goes. One of the downside that I see is the increase in the size of the
percpu_counter structure.

> Then regularly (once per minute or so) degenerate the counter by folding
> the per cpu diffs into the global count and zapping the cpumask.

Actually, I think we need 2 cpumasks - one for deciding to use global or
percpu count and another one for which percpu counts are used as it is
not safe to change a per-cpu count other than your own one.

> If the cpumask is empty you can use the global count. Otherwise you just
> need to add up the counters of the cpus set in the cpumask.

Cheers,
Longman


2016-03-07 21:34:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:
> On 03/05/2016 01:34 AM, Dave Chinner wrote:
> >On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> >>This patchset allows the degeneration of per-cpu counters back
> >>to global counters when:
> >>
> >> 1) The number of CPUs in the system is large, hence a high
> >> cost for calling percpu_counter_sum(). 2) The initial count
> >> value is small so that it has a high chance of excessive
> >> percpu_counter_sum() calls.
> >>
> >>When the above 2 conditions are true, this patchset allows the
> >>user of per-cpu counters to selectively degenerate them into
> >>global counters with lock. This is done by calling the new
> >>percpu_counter_set_limit() API after percpu_counter_set().
> >>Without this call, there is no change in the behavior of the
> >>per-cpu counters.
> >>
> >>Patch 1 implements the new percpu_counter_set_limit() API.
> >>
> >>Patch 2 modifies XFS to call the new API for the m_ifree and
> >>m_fdblocks per-cpu counters.
> >>
> >>Waiman Long (2): percpu_counter: Allow falling back to global
> >>counter on large system xfs: Allow degeneration of
> >>m_fdblocks/m_ifree to global counters
> >NACK.
> >
> >This change to turns off per-counter free block counters for 32p
> >for the XFS free block counters. We proved 10 years ago that a
> >global lock for these counters was a massive scalability
> >limitation for concurrent buffered writes on 16p machines.
> >
> >IOWs, this change is going to cause fast path concurrent
> >sequential write regressions for just about everyone, even on
> >empty filesystems.
>
> That is not really the case here. The patch won't change anything
> if there is enough free blocks available in the filesystem. It
> will turn on global lock at mount time iff the number of free
> blocks available is less than the given limit. In the case of XFS,
> it is 12MB per CPU. On the 80-thread system that I used for
> testing, it will be a bit less than 1GB. Even if global lock is
> enabled at the beginning, it will be transitioned back to percpu
> lock as soon as enough free blocks become available.

Again: How is this an optimisation that is generally useful? Nobody
runs their production 80-thread workloads on a filesystems with less
than 1GB of free space. This is a situation that most admins would
consider "impending doom".

> I am aware that if there are enough threads pounding on the lock,
> it can cause a scalability bottleneck. However, the qspinlock used
> in x86 should greatly alleviate the scalability impact compared
> with 10 years ago when we used the ticket lock.

Regardless of whether there is less contention, it still brings back
a global serialisation point and modified cacheline (the free block
counter) in the filesystem that, at some point, will limit
concurrency....

> BTW, what exactly
> was the microbenchmark that you used to exercise concurrent
> sequential write? I would like to try it out on the new hardware
> and kernel.

Just something that HPC apps have been known to do for more then 20
years: concurrent sequential write from every CPU in the system.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

> >near to ENOSPC. As i asked you last time - if you want to make
> >this problem go away, please increase the size of the filesystem
> >you are running your massively concurrent benchmarks on.
> >
> >IOWs, please stop trying to optimise a filesystem slow path that:
> >
> > a) 99.9% of production workloads never execute, b) where we
> > expect performance to degrade as allocation gets
> > computationally expensive as we close in on ENOSPC, c) we
> > start to execute blocking data flush operations that slow
> > everything down massively, and d) is indicative that the
> > workload is about to suffer from a fatal, unrecoverable
> > error (i.e. ENOSPC)
> >
>
> I totally agree. I am not trying to optimize a filesystem
> slowpath.

Where else in the kernel is there a requirement for 100%
accurate threshold detection on per-cpu counters? There isn't, is
there?

> There are use cases, however, where we may want to
> create relatively small filesystem. One example that I cited in
> patch 2 is the battery backed NVDIMM that I have played with
> recently. They can be used for log files or other small files.
> Each dimm is 8 GB. You can have a few of those available. So the
> filesystem size could be 32GB or so. That can come close to the
> the limit where excessive percpu_counter_sum() call can happen.
> What I want to do here is to try to reduce the chance of excessive
> percpu_counter_sum() calls causing a performance problem. For a
> large filesystem that is nowhere near ENOSPC, my patch will have
> no performance impact whatsoever.

Yet your patch won't have any effect on these "small" filesystems
because unless they have less free space than your threshold at
mount time (rare!) they won't ever have this global lock turned on.
Not to mention if space if freed in the fs, the global lock is
turned off, and will never get turned back on.

Further, anyone using XFS on nvdimms will be enabling DAX, which
goes through the direct IO path rather than the buffered IO path
that is generating all this block accounting pressure. Hence it will
behave differently, and so your solution doesn't obviously apply to
that workload space, either.

When we get production workloads hitting free block accounting
issues near ENOSPC, then we'll look at optimising the XFS accounting
code. Microbenchmarks are great when they have real-work relevance,
but this doesn't right now. Not to mention we've got bigger things
to worry about in XFS right now in terms of ENOSPC accounting (think
reverse mapping, shared blocks and breaking shares via COW right
next to ENOSPC) and getting these working *correctly* takes
precendence of optimisation of the accounting code.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-16 19:21:51

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system

On 03/07/2016 01:24 PM, Christoph Lameter wrote:
> On Fri, 4 Mar 2016, Waiman Long wrote:
>
>> This patch provides a mechanism to selectively degenerate per-cpu
>> counters to global counters at per-cpu counter initialization time. The
>> following new API is added:
>>
>> percpu_counter_set_limit(struct percpu_counter *fbc,
>> u32 percpu_limit)
>>
>> The function should be called after percpu_counter_set(). It will
>> compare the total limit (nr_cpu * percpu_limit) against the current
>> counter value. If the limit is not smaller, it will disable per-cpu
>> counter and use only the global counter instead. At run time, when
>> the counter value grows past the total limit, per-cpu counter will
>> be enabled again.
> Hmmm... That is requiring manual setting of a limit. Would it not be
> possible to completely automatize the switch over? F.e. one could
> keep a cpumask of processors that use the per cpu counters.
>
> Then in the fastpath if the current cpu is a member increment the per cpu
> counter. If not do the spinlock thing. If there is contention add the
> cpu to the cpumask and use the per cpu counters. Thus automatically
> scaling for the processors on which frequent increments are operating.
>
> Then regularly (once per minute or so) degenerate the counter by folding
> the per cpu diffs into the global count and zapping the cpumask.
>
> If the cpumask is empty you can use the global count. Otherwise you just
> need to add up the counters of the cpus set in the cpumask.
>

I have modified the patch to try that out. However, that doesn't yield
that much of improvement in term of performance and it slows down the
percpu fast path a bit. So I am going to focus on my existing patch
first and think about that later.

Cheers,
Longman

2016-03-16 20:13:40

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

On 03/07/2016 04:33 PM, Dave Chinner wrote:
> On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:
>> On 03/05/2016 01:34 AM, Dave Chinner wrote:
>>> On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
>>>> This patchset allows the degeneration of per-cpu counters back
>>>> to global counters when:
>>>>
>>>> 1) The number of CPUs in the system is large, hence a high
>>>> cost for calling percpu_counter_sum(). 2) The initial count
>>>> value is small so that it has a high chance of excessive
>>>> percpu_counter_sum() calls.
>>>>
>>>> When the above 2 conditions are true, this patchset allows the
>>>> user of per-cpu counters to selectively degenerate them into
>>>> global counters with lock. This is done by calling the new
>>>> percpu_counter_set_limit() API after percpu_counter_set().
>>>> Without this call, there is no change in the behavior of the
>>>> per-cpu counters.
>>>>
>>>> Patch 1 implements the new percpu_counter_set_limit() API.
>>>>
>>>> Patch 2 modifies XFS to call the new API for the m_ifree and
>>>> m_fdblocks per-cpu counters.
>>>>
>>>> Waiman Long (2): percpu_counter: Allow falling back to global
>>>> counter on large system xfs: Allow degeneration of
>>>> m_fdblocks/m_ifree to global counters
>>> NACK.
>>>
>>> This change to turns off per-counter free block counters for 32p
>>> for the XFS free block counters. We proved 10 years ago that a
>>> global lock for these counters was a massive scalability
>>> limitation for concurrent buffered writes on 16p machines.
>>>
>>> IOWs, this change is going to cause fast path concurrent
>>> sequential write regressions for just about everyone, even on
>>> empty filesystems.
>> That is not really the case here. The patch won't change anything
>> if there is enough free blocks available in the filesystem. It
>> will turn on global lock at mount time iff the number of free
>> blocks available is less than the given limit. In the case of XFS,
>> it is 12MB per CPU. On the 80-thread system that I used for
>> testing, it will be a bit less than 1GB. Even if global lock is
>> enabled at the beginning, it will be transitioned back to percpu
>> lock as soon as enough free blocks become available.
> Again: How is this an optimisation that is generally useful? Nobody
> runs their production 80-thread workloads on a filesystems with less
> than 1GB of free space. This is a situation that most admins would
> consider "impending doom".

In most cases, there will be enough free blocks in m_fdblocks that the
switching to global count will never happen. However, I found that
m_ifree is a different story. On the 80-cpu system that I used, the
percpu slowpath will be activated when there are less than 2*80^2 =
12800 free inodes available which is usually the case because the code
use the default batch size (which scale linearly with # of cpus). Here,
my patch can really help.

>
>> I am aware that if there are enough threads pounding on the lock,
>> it can cause a scalability bottleneck. However, the qspinlock used
>> in x86 should greatly alleviate the scalability impact compared
>> with 10 years ago when we used the ticket lock.
> Regardless of whether there is less contention, it still brings back
> a global serialisation point and modified cacheline (the free block
> counter) in the filesystem that, at some point, will limit
> concurrency....

Yes, that is true, but the alternative here is to access all the
cachelines of the percpu counters and evict quite a number of other
useful cachelines along the way. My patch activates the global counter
at mount time only when the current count is too small. It was proven in
my test case that accessing all those cachelines was worse that taken
the lock when there are large number of cpus.

Once the counter increase past the limit, it will disable the global
counter and fall back to the usual per-cpu mode. The global counter
won't be reactivated unless you unmount and remount the filesystem
again. So I don't this case will cause any performance bottleneck that
is worse than what the existing code is.

>> BTW, what exactly
>> was the microbenchmark that you used to exercise concurrent
>> sequential write? I would like to try it out on the new hardware
>> and kernel.
> Just something that HPC apps have been known to do for more then 20
> years: concurrent sequential write from every CPU in the system.
>
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Thanks.

>
>>> near to ENOSPC. As i asked you last time - if you want to make
>>> this problem go away, please increase the size of the filesystem
>>> you are running your massively concurrent benchmarks on.
>>>
>>> IOWs, please stop trying to optimise a filesystem slow path that:
>>>
>>> a) 99.9% of production workloads never execute, b) where we
>>> expect performance to degrade as allocation gets
>>> computationally expensive as we close in on ENOSPC, c) we
>>> start to execute blocking data flush operations that slow
>>> everything down massively, and d) is indicative that the
>>> workload is about to suffer from a fatal, unrecoverable
>>> error (i.e. ENOSPC)
>>>
>> I totally agree. I am not trying to optimize a filesystem
>> slowpath.
> Where else in the kernel is there a requirement for 100%
> accurate threshold detection on per-cpu counters? There isn't, is
> there?

I don't quite get what you are asking here. The goal of this patch is
not 100% accurate threshold detection. This is not what I am aiming for.
Instead, I am trying to reduce the performance impact due to excessive
per_counter_sum() calls. Accurate threshold detection is just a side
effect, not the main purpose for this patch.

>> There are use cases, however, where we may want to
>> create relatively small filesystem. One example that I cited in
>> patch 2 is the battery backed NVDIMM that I have played with
>> recently. They can be used for log files or other small files.
>> Each dimm is 8 GB. You can have a few of those available. So the
>> filesystem size could be 32GB or so. That can come close to the
>> the limit where excessive percpu_counter_sum() call can happen.
>> What I want to do here is to try to reduce the chance of excessive
>> percpu_counter_sum() calls causing a performance problem. For a
>> large filesystem that is nowhere near ENOSPC, my patch will have
>> no performance impact whatsoever.
> Yet your patch won't have any effect on these "small" filesystems
> because unless they have less free space than your threshold at
> mount time (rare!) they won't ever have this global lock turned on.
> Not to mention if space if freed in the fs, the global lock is
> turned off, and will never get turned back on.

Yes, as I said above, the m_fdblocks counter isn't an issue in almost
all the cases. However, the patch was found to be useful in reducing
performance overhead of the percpu_counter_sum() call on the m_ifree
counter on a moderately sized XFS partition. BTW, I made no change in
the XFS code other than activating the counter limits. If those limits
are not activated, there is absolutely no change to the current behavior.

Yes, the benchmark that I shown in the changelog isn't real-life
scenario. I will try to look for other benchmark results that are more
realistic.

Cheers,
Longman


> Further, anyone using XFS on nvdimms will be enabling DAX, which
> goes through the direct IO path rather than the buffered IO path
> that is generating all this block accounting pressure. Hence it will
> behave differently, and so your solution doesn't obviously apply to
> that workload space, either.
>
> When we get production workloads hitting free block accounting
> issues near ENOSPC, then we'll look at optimising the XFS accounting
> code. Microbenchmarks are great when they have real-work relevance,
> but this doesn't right now. Not to mention we've got bigger things
> to worry about in XFS right now in terms of ENOSPC accounting (think
> reverse mapping, shared blocks and breaking shares via COW right
> next to ENOSPC) and getting these working *correctly* takes
> precendence of optimisation of the accounting code.
>

Subject: Re: [RFC PATCH 1/2] percpu_counter: Allow falling back to global counter on large system

On Wed, 16 Mar 2016, Waiman Long wrote:

> > If the cpumask is empty you can use the global count. Otherwise you just
> > need to add up the counters of the cpus set in the cpumask.
> >
>
> I have modified the patch to try that out. However, that doesn't yield that
> much of improvement in term of performance and it slows down the percpu fast
> path a bit. So I am going to focus on my existing patch first and think about
> that later.

Hmmm... Maybe look at the cause of the slowdown first?