2014-04-04 17:00:55

by Daniel J Blueman

Subject: ext4 performance falloff

On a larger system (1728 cores, 4.5TB memory) running 3.13.9, I'm seeing very
low 600KB/s cached write performance to a local ext4 filesystem:

# mkfs.ext4 /dev/sda5
# mount /dev/sda5 /mnt
# dd if=/dev/zero of=/mnt/test bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 17.4307 s, 602 kB/s

Whereas on XFS, for example, performance is much more reasonable:

# mkfs.xfs /dev/sda5
# mount /dev/sda5 /mnt
# dd if=/dev/zero of=/mnt/test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.39329 s, 43.8 MB/s

Perf shows the time is spent in bitmask iteration:

 98.77%  dd  [kernel.kallsyms]  [k] find_next_bit
         |
         --- find_next_bit
            |
            |--99.92%-- __percpu_counter_sum
            |          ext4_has_free_clusters
            |          ext4_claim_free_clusters
            |          ext4_mb_new_blocks
            |          ext4_ext_map_blocks
            |          ext4_map_blocks
            |          _ext4_get_block
            |          ext4_get_block
            |          __block_write_begin
            |          ext4_write_begin
            |          ext4_da_write_begin
            |          generic_file_buffered_write
            |          __generic_file_aio_write
            |          generic_file_aio_write
            |          ext4_file_write
            |          do_sync_write
            |          vfs_write
            |          sys_write
            |          system_call_fastpath
            |          __write_nocancel
            |          0x0
             --0.08%-- [...]

Analysis shows that ext4 is reading from all cores' cpu-local data (thus
expensive off-NUMA-node access) for each block written:

        if (free_clusters - (nclusters + rsv + dirty_clusters) <
                                        EXT4_FREECLUSTERS_WATERMARK) {
                free_clusters = percpu_counter_sum_positive(fcc);
                dirty_clusters = percpu_counter_sum_positive(dcc);
        }

This threshold is defined as:

#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))

I can see why this may get overlooked on systems whose local storage is
commensurate with their core count, but some filesystems reasonably don't
need to scale with the number of cores. The filesystem I'm testing on, and
the rootfs (as it holds /tmp), are each 50GB.
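
For reference, the sum helper walks every online CPU's local delta under
the counter's spinlock; roughly the following (paraphrased from
lib/percpu_counter.c, not verbatim):

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
        s64 ret;
        int cpu;
        unsigned long flags;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        ret = fbc->count;
        /* One cacheline per online CPU, mostly NUMA-remote on this box */
        for_each_online_cpu(cpu) {
                s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                ret += *pcount;
        }
        raw_spin_unlock_irqrestore(&fbc->lock, flags);
        return ret;
}

for_each_online_cpu() iterates the online cpumask, which is where the
find_next_bit() hits in the profile above come from.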

There must be a good rationale for this being dependent on the number of
cores rather than just the ratio of used space, right?

Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale


2014-04-04 20:56:04

by Theodore Ts'o

Subject: Re: ext4 performance falloff

On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
> On a larger system (1728 cores, 4.5TB memory) running 3.13.9, I'm seeing
> very low 600KB/s cached write performance to a local ext4 filesystem:

Hi Daniel,

Thanks for the heads up. Most (all?) of the ext4 developers don't have
systems with thousands of cores, so these issues generally don't come up
for us, and so we're not likely (hell, very unlikely!) to notice potential
problems caused by these sorts of uber-large systems.

> Analysis shows that ext4 is reading from all cores' cpu-local data (thus
> expensive off-NUMA-node access) for each block written:
>
> if (free_clusters - (nclusters + rsv + dirty_clusters) <
> EXT4_FREECLUSTERS_WATERMARK) {
> free_clusters = percpu_counter_sum_positive(fcc);
> dirty_clusters = percpu_counter_sum_positive(dcc);
> }
>
> This threshold is defined as:
>
> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch *
> nr_cpu_ids))
>
> I can see why this may get overlooked for systems with commensurate local
> storage, but some filesystems reasonably don't need to scale with core
> count. The filesystem I'm testing on and the rootfs (as it has /tmp) are
> 50GB.

The problem we are trying to solve here is that when we do delayed
allocation, we're making an implicit promise that there will be space
available, even though we haven't allocated the space yet. The reason
why we are using percpu counters is precisely so that we don't have to
take a global lock in order to protect the free space counter for the
file system.

The problem is that when we start getting close to full, there is the
possibility that all of the cpus might simultaneously try to allocate
space at exactly the same time (and while that might sound unlikely,
Murphy's law will dictate that if the downside is that the user will
lose data, and curse the day the file system developers were born, it
*will* happen :-). So when the free space, minus the space we have
already promised, drops below EXT4_FREECLUSTERS_WATERMARK, we start
being super careful.

I've done the calculations, and 4 * 32 * 1728 = 221184 blocks,
or 864 megabytes. That would mean that the file system is over 98%
full, so that's actually pretty reasonable; most of the time there's
more free space than that.

It looks like the real problem is that we're using nr_cpu_ids, which
is the maximum possible number of cpu's that the system can support,
which is different from the number of cpu's that you currently have.
For normal kernels nr_cpu_ids is small, so that has never been a
problem, but I bet you have nr_cpu_ids set to something really large,
right?

If you change nr_cpu_ids to total_cpus in the definition of
EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
system?
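
I.e., something along these lines (untested, just to illustrate what I
mean):

/* Untested sketch of the change suggested above: scale the watermark by
 * the number of CPUs actually brought up (total_cpus) rather than the
 * maximum the kernel could support (nr_cpu_ids). */
#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * total_cpus))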

Thanks,

- Ted

2014-04-05 03:28:17

by Daniel J Blueman

Subject: Re: ext4 performance falloff

On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
>> On a larger system (1728 cores, 4.5TB memory) running 3.13.9, I'm seeing
>> very low 600KB/s cached write performance to a local ext4 filesystem:

> Thanks for the heads up. Most (all?) of the ext4 don't have systems
> with thousands of cores, so these issues generally don't come up for
> us, and so we're not likely (hell, very unlikely!) to notice potential
> problems cause by these sorts of uber-large systems.

Hehe. It's not every day we get access to these systems either.

>> Analysis shows that ext4 is reading from all cores' cpu-local data (thus
>> expensive off-NUMA-node access) for each block written:
>>
>> if (free_clusters - (nclusters + rsv + dirty_clusters) <
>> EXT4_FREECLUSTERS_WATERMARK) {
>> free_clusters = percpu_counter_sum_positive(fcc);
>> dirty_clusters = percpu_counter_sum_positive(dcc);
>> }
>>
>> This threshold is defined as:
>>
>> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch *
>> nr_cpu_ids))
...
> The problem we are trying to solve here is that when we do delayed
> allocation, we're making an implicit promise that there will be space
> available
>
> I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks,
> or 864 megabytes. That would mean that the file system is over 98%
> full, so that's actually pretty reasonable; most of the time there's
> more free space than that.

The filesystem is empty after the mkfs; the approach here may make sense
if we want to allow all cores to write to this FS, but here only one core
is writing.

Instrumenting shows that free_clusters=16464621, nclusters=1, rsv=842790,
dirty_clusters=0, percpu_counter_batch=3456 and nr_cpu_ids=1728; below 91GB
of free space, we'd hit this issue. It feels more sensible to start this
behaviour when the FS is, say, 98% full, irrespective of the number of
cores, but that's not why the behaviour is there.

Since these block devices are attached to a single NUMA node's IO link,
there is a scaling limitation there anyway, so perhaps there is a rationale
for limiting this to min(256, nr_cpu_ids)?
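
Untested, and purely for illustration (min_t() here is just the kernel's
typed min helper):

/* Untested sketch: cap the CPU factor in the watermark at 256 so very
 * large machines don't push the exact-sum threshold into tens of GB. */
#define EXT4_FREECLUSTERS_WATERMARK \
        (4 * (percpu_counter_batch * min_t(int, 256, nr_cpu_ids)))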

> It looks like the real problem is that we're using nr_cpu_ids, which
> is the maximum possible number of cpu's that the system can support,
> which is different from the number of cpu's that you currently have.
> For normal kernels nr_cpu_ids is small, so that has never been a
> problem, but I bet you have nr_cpu_ids set to something really large,
> right?
>
> If you change nr_cpu_ids to total_cpus in the definition of
> EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
> system?

I have reproduced this with CPU hotplug disabled, so nr_cpu_ids is
nicely at 1728.

Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale

2014-04-07 14:19:39

by Jan Kara

Subject: Re: ext4 performance falloff

On Sat 05-04-14 11:28:17, Daniel J Blueman wrote:
> On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> >On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
> >>On a larger system (1728 cores, 4.5TB memory) running 3.13.9, I'm seeing
> >>very low 600KB/s cached write performance to a local ext4 filesystem:
>
> > Thanks for the heads up. Most (all?) of the ext4 don't have systems
> > with thousands of cores, so these issues generally don't come up for
> > us, and so we're not likely (hell, very unlikely!) to notice potential
> > problems cause by these sorts of uber-large systems.
>
> Hehe. It's not every day we get access to these systems also.
>
> >>Analysis shows that ext4 is reading from all cores' cpu-local data (thus
> >>expensive off-NUMA-node access) for each block written:
> >>
> >>if (free_clusters - (nclusters + rsv + dirty_clusters) <
> >> EXT4_FREECLUSTERS_WATERMARK) {
> >> free_clusters = percpu_counter_sum_positive(fcc);
> >> dirty_clusters = percpu_counter_sum_positive(dcc);
> >>}
> >>
> >>This threshold is defined as:
> >>
> >>#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch *
> >>nr_cpu_ids))
> ...
> >The problem we are trying to solve here is that when we do delayed
> >allocation, we're making an implicit promise that there will be space
> >available
> >
> >I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks,
> >or 864 megabytes. That would mean that the file system is over 98%
> >full, so that's actually pretty reasonable; most of the time there's
> >more free space than that.
>
> The filesystem is empty after the mkfs; the approach here may make
> sense if we want to allow all cores to write to this FS, but here we
> have one.
>
> Instrumenting shows that free_clusters=16464621 nclusters=1
> rsv=842790 dirty_clusters=0 percpu_counter_batch=3456
> nr_cpu_ids=1728; below 91GB space, we'd hit this issue. It feels
> more sensible to start this behaviour when the FS is say 98% full,
> irrespective of the number of cores, but that's not why the
> behaviour is there.
Yeah, percpu_counter_batch = max(32, nr*2), so the value you observe is
correct and EXT4_FREECLUSTERS_WATERMARK is then 23887872 clusters ~= 95 GB.
Clearly we have to try to be more clever on these large systems.

> Since these block devices are attached to a single NUMA node's IO
> link, there is a scaling limitation there anyway, so there may be
> rationale in limiting this to use min(256,nr_cpu_ids) maybe?
Well, but when you get something "allocated" from the counter, we rely on
the space really being available in the filesystem (so that delayed
allocated blocks can be allocated and written out). With this limitation to
256, if there is more than 256*percpu_counter_batch accumulated in the
percpu part of the counter, we could promise allocating something we don't
really have space for. And I understand this is unlikely, but when we speak
about "your data is lost", even unlikely doesn't sound good to people. They
want "this can never happen" promises :)

What we really need is a counter where we can better estimate the counts
accumulated in the percpu part of it. As the counter approaches zero, its
CPU overhead will have to become that of a single locked variable, but when
the value of the counter is relatively high, we want it to be as fast as the
percpu one. Possibly, each CPU could "reserve" part of the value in the
counter (by just decrementing the total value; how large that part should
be really needs to depend on the total value of the counter and the number
of CPUs - in this regard we really differ from classical percpu counters)
and allocate/free using that part. If a CPU cannot reserve what it is asked
for anymore, it would go and steal from the parts other CPUs have
accumulated, returning them to the global pool until it can satisfy the
allocation.

But someone would need to try whether this really works out reasonably fast
:).
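
To make the idea concrete, here is a rough userspace sketch of the
allocation side only. The names, the slice heuristic and the locking are
made up for illustration; this is not an existing kernel API, and a real
implementation would use preemption-safe percpu accessors rather than
per-CPU mutexes:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS 4

struct resv_counter {
        pthread_mutex_t glock;        /* protects the global pool */
        int64_t global;               /* blocks not reserved by any CPU */
        pthread_mutex_t rlock[NCPUS]; /* per-CPU locks, normally uncontended */
        int64_t resv[NCPUS];          /* per-CPU private reservations */
};

/* Heuristic only: reserve a slice proportional to what is globally left. */
static int64_t resv_slice(int64_t global)
{
        return global / (4 * NCPUS);
}

static void resv_init(struct resv_counter *c, int64_t total)
{
        pthread_mutex_init(&c->glock, NULL);
        c->global = total;
        for (int i = 0; i < NCPUS; i++) {
                pthread_mutex_init(&c->rlock[i], NULL);
                c->resv[i] = 0;
        }
}

/* Claim nr blocks on behalf of 'cpu'; returns false on ENOSPC. */
static bool resv_alloc(struct resv_counter *c, int cpu, int64_t nr)
{
        bool ok;

        /* Fast path: only the cheap, normally uncontended per-CPU lock. */
        pthread_mutex_lock(&c->rlock[cpu]);
        if (c->resv[cpu] >= nr) {
                c->resv[cpu] -= nr;
                pthread_mutex_unlock(&c->rlock[cpu]);
                return true;
        }
        pthread_mutex_unlock(&c->rlock[cpu]);

        /* Slow path: refill from the global pool, stealing other CPUs'
         * reservations back if the pool alone cannot cover the request. */
        pthread_mutex_lock(&c->glock);
        pthread_mutex_lock(&c->rlock[cpu]);
        if (c->global + c->resv[cpu] < nr) {
                for (int i = 0; i < NCPUS; i++) {
                        if (i == cpu)
                                continue;
                        pthread_mutex_lock(&c->rlock[i]);
                        c->global += c->resv[i];
                        c->resv[i] = 0;
                        pthread_mutex_unlock(&c->rlock[i]);
                }
        }
        /* Top up: enough for this request plus a fresh slice, if available. */
        int64_t want = nr - c->resv[cpu];
        if (want < 0)
                want = 0;
        int64_t extra = want + resv_slice(c->global);
        if (extra > c->global)
                extra = c->global;
        c->global -= extra;
        c->resv[cpu] += extra;

        ok = c->resv[cpu] >= nr;
        if (ok)
                c->resv[cpu] -= nr;
        pthread_mutex_unlock(&c->rlock[cpu]);
        pthread_mutex_unlock(&c->glock);
        return ok;
}

int main(void)
{
        struct resv_counter c;

        resv_init(&c, 1000);
        printf("claim 600: %d\n", resv_alloc(&c, 0, 600)); /* succeeds */
        printf("claim 600: %d\n", resv_alloc(&c, 1, 600)); /* fails: 400 left */
        return 0;
}

The point is that the fast path only touches CPU-local state, and the
expensive steal-back only happens as the counter nears zero.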

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-04-07 16:40:28

by Andi Kleen

Subject: Re: ext4 performance falloff

Jan Kara <[email protected]> writes:
>
> What we really need is a counter where we can better estimate counts
> accumulated in the percpu part of it. As the counter approaches zero, it's
> CPU overhead will have to become that of a single locked variable but when
> the value of counter is relatively high, we want it to be fast as the
> percpu one. Possibly, each CPU could "reserve" part of the value in the
> counter (by just decrementing the total value; how large that part should
> be really needs to depend to the total value of the counter and number of
> CPUs - in this regard we really differ from classical percpu couters) and
> allocate/free using that part. If CPU cannot reserve what it is asked for
> anymore, it would go and steal from parts other CPUs have accumulated,
> returning them to global pool until it can satisfy the allocation.

That's a percpu_counter() isn't it? (or cookie jar)

The MM uses similar techniques.

-Andi

--
[email protected] -- Speaking for myself only

2014-04-07 20:08:30

by Jan Kara

Subject: Re: ext4 performance falloff

On Mon 07-04-14 09:40:28, Andi Kleen wrote:
> Jan Kara <[email protected]> writes:
> >
> > What we really need is a counter where we can better estimate counts
> > accumulated in the percpu part of it. As the counter approaches zero, it's
> > CPU overhead will have to become that of a single locked variable but when
> > the value of counter is relatively high, we want it to be fast as the
> > percpu one. Possibly, each CPU could "reserve" part of the value in the
> > counter (by just decrementing the total value; how large that part should
> > be really needs to depend to the total value of the counter and number of
> > CPUs - in this regard we really differ from classical percpu couters) and
> > allocate/free using that part. If CPU cannot reserve what it is asked for
> > anymore, it would go and steal from parts other CPUs have accumulated,
> > returning them to global pool until it can satisfy the allocation.
>
> That's a percpu_counter() isn't it? (or cookie jar)
Not quite. We could use __percpu_counter_add() to set the batch size for
each operation depending on the current counter value. But we still don't
want any cpu-local count to go negative (as then we cannot rely on the
global counter to give us a lower bound on the number of free blocks).
Also, stealing from a different cpu would need to be implemented...
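
The batch-size part could look something like this (untested sketch; the
helper name and the threshold choice are made up, and it doesn't address
the negative-local-count or stealing problems):

/* Untested sketch: shrink the batch as the counter gets low so that near
 * ENOSPC every update is folded into the accurate global count, while far
 * from ENOSPC we keep the cheap percpu behaviour. */
static inline void ext4_adaptive_counter_add(struct percpu_counter *fbc,
                                             s64 amount)
{
        s64 approx = percpu_counter_read_positive(fbc);
        s32 batch = percpu_counter_batch;

        if (approx < EXT4_FREECLUSTERS_WATERMARK)
                batch = 1;

        __percpu_counter_add(fbc, amount, batch);
}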

> The MM uses similar techniques.
Where exactly? I'd be happy to be inspired :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-04-08 10:30:27

by Dave Chinner

Subject: Re: ext4 performance falloff

On Mon, Apr 07, 2014 at 09:40:28AM -0700, Andi Kleen wrote:
> Jan Kara <[email protected]> writes:
> >
> > What we really need is a counter where we can better estimate counts
> > accumulated in the percpu part of it. As the counter approaches zero, it's
> > CPU overhead will have to become that of a single locked variable but when
> > the value of counter is relatively high, we want it to be fast as the
> > percpu one. Possibly, each CPU could "reserve" part of the value in the
> > counter (by just decrementing the total value; how large that part should
> > be really needs to depend to the total value of the counter and number of
> > CPUs - in this regard we really differ from classical percpu couters) and
> > allocate/free using that part. If CPU cannot reserve what it is asked for
> > anymore, it would go and steal from parts other CPUs have accumulated,
> > returning them to global pool until it can satisfy the allocation.

Yup, that's pretty much what the slow path/fast path breakdown of
the xfs_icsb_* (XFS In-Core Super Block) code in fs/xfs/xfs_mount.c
does. :)

It distributes free space across all the CPUs and
rebalances them when a per-CPU counter runs out. And to avoid lots
of rebalances when ENOSPC approaches (512 blocks per CPU, IIRC),
it disables the per-CPU counters completely and falls back to a
global counter protected by a mutex to avoid wasting hundreds of
CPUs spinning on a contended global lock. When the free space goes
back above that threshold, it returns to per-cpu mode (the fast
path code).

> That's a percpu_counter() isn't it? (or cookie jar)

No. percpu_counters do not guarantee accuracy, nor can the counters
be externally serialised for things like concurrent ENOSPC detection
that require a guarantee that the counter never, ever goes below
zero.

> The MM uses similar techniques.

I haven't seen anything else that uses similar techniques to the XFS
code - I wrote it back in 2005 before there was generic per-cpu
counter infrastructure, and I've been keeping an eye out as
to whether it could be replaced with generic code ever since....

Cheers,

Dave.
--
Dave Chinner
[email protected]