Back in 2018, I did a patchset [1] to rework the inode->i_version
counter handling to be much less expensive, particularly when no-one is
querying for it.
Testing at the time showed that the cost of enabling i_version on ext4
was close to 0 when nothing is querying it, but I stopped short of
trying to make it the default at the time (mostly out of an abundance of
caution). Since then, we still see a steady stream of cache-coherency
problems with NFSv4 on ext4 when this option is disabled (e.g. [2]).
Is it time to go ahead and make this option the default on ext4? I don't
see a real downside to doing so, though I'm unclear on how we should
approach this. Currently the option is twiddled using MS_I_VERSION flag,
and it's unclear to me how we can reverse the sense of such a flag.
Thoughts?
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b7fd7d34de5765dece2dd08060d2e1f7be3b39
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2107587
--
Jeff Layton <[email protected]>
On Tue, Jul 19, 2022 at 09:51:33AM -0400, Jeff Layton wrote:
> Back in 2018, I did a patchset [1] to rework the inode->i_version
> counter handling to be much less expensive, particularly when no-one is
> querying for it.
>
> Testing at the time showed that the cost of enabling i_version on ext4
> was close to 0 when nothing is querying it, but I stopped short of
> trying to make it the default at the time (mostly out of an abundance of
> caution). Since then, we still see a steady stream of cache-coherency
> problems with NFSv4 on ext4 when this option is disabled (e.g. [2]).
>
> Is it time to go ahead and make this option the default on ext4? I don't
> see a real downside to doing so, though I'm unclear on how we should
> approach this. Currently the option is twiddled using MS_I_VERSION flag,
> and it's unclear to me how we can reverse the sense of such a flag.
>
> Thoughts?
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b7fd7d34de5765dece2dd08060d2e1f7be3b39
> [2]: https://bugzilla.redhat.com/show_bug.cgi?id=2107587
Hi,
I don't have the results myself yet, but a quick look at how it is done
suggests that indeed the impact should be low. But not zero, at least
every time the inode is loaded from disk it is scheduled for i_version
update on the next attempted increment. Could that have an effect on
some particular common workload you can think of?
How would we approach making iversion a default? libmount is passing
this option to the kernel as just a MS_I_VERSION flag that is set when
-o iversion is used and left empty when the -o noiversion is used. This
means that while we could make it a default in ext4, we don't have any
way of knowing whether the user asked for -o noiversion. So that's not
really an option.
Updating the mke2fs/tune2fs to allow setting iversion as a default mount
option I think has the same problem.
So the only way I can see ATM would be to introduce another mountflag
for libmount to indicate -o noiversion. This way we can make iversion a
default on ext4 without losing the information about user-provided -o
noiversion option.
Is there a different way I am not seeing?
If we can do reasonably extensive testing that will indeed show
negligible impact when nothing is querying i_version, then I would be in
favor of the change.
Thanks!
-Lukas
>
> --
> Jeff Layton <[email protected]>
>
On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
> On Tue, Jul 19, 2022 at 09:51:33AM -0400, Jeff Layton wrote:
> > Back in 2018, I did a patchset [1] to rework the inode->i_version
> > counter handling to be much less expensive, particularly when no-one is
> > querying for it.
> >
> > Testing at the time showed that the cost of enabling i_version on ext4
> > was close to 0 when nothing is querying it, but I stopped short of
> > trying to make it the default at the time (mostly out of an abundance of
> > caution). Since then, we still see a steady stream of cache-coherency
> > problems with NFSv4 on ext4 when this option is disabled (e.g. [2]).
> >
> > Is it time to go ahead and make this option the default on ext4? I don't
> > see a real downside to doing so, though I'm unclear on how we should
> > approach this. Currently the option is twiddled using MS_I_VERSION flag,
> > and it's unclear to me how we can reverse the sense of such a flag.
> >
> > Thoughts?
> >
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4b7fd7d34de5765dece2dd08060d2e1f7be3b39
> > [2]: https://bugzilla.redhat.com/show_bug.cgi?id=2107587
>
> Hi,
>
> I don't have the results myself yet, but a quick look at how it is done
> suggests that indeed the impact should be low. But not zero, at least
> every time the inode is loaded from disk it is scheduled for i_version
> update on the next attempted increment. Could that have an effect on
> some particular common workload you can think of?
>
Yeah, it's not zero, but hopefully any performance hit would end up
amortized over the long-term use of the inode. In the common case where
i_version hasn't been queried, this just ends up being an extra atomic
fetch to check the counter and see that no update is needed.
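To illustrate the scheme (a simplified user-space model of what
include/linux/iversion.h does, not the kernel code itself): the low bit of
the counter records "someone has queried this since the last bump",
writers only pay for the atomic update when that bit is set, and readers
set the bit when they sample the value.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Low bit marks "queried since the last bump"; real increments add 2. */
#define IV_QUERIED	1ULL
#define IV_INCREMENT	2ULL

/* Writer side: skip the update entirely if nobody has looked. */
static bool maybe_inc_iversion(_Atomic uint64_t *iv, bool force)
{
	uint64_t cur = atomic_load(iv);

	do {
		if (!force && !(cur & IV_QUERIED))
			return false;	/* no observers: nothing to do */
	} while (!atomic_compare_exchange_weak(iv, &cur,
					(cur & ~IV_QUERIED) + IV_INCREMENT));
	return true;
}

/* Reader side: sample the counter and mark it as queried. */
static uint64_t query_iversion(_Atomic uint64_t *iv)
{
	uint64_t cur = atomic_load(iv);

	while (!(cur & IV_QUERIED) &&
	       !atomic_compare_exchange_weak(iv, &cur, cur | IV_QUERIED))
		;			/* retry until the flag is set */
	return cur >> 1;		/* value lives above the flag bit */
}

int main(void)
{
	_Atomic uint64_t iv = 0;

	maybe_inc_iversion(&iv, false);		/* no-op: never queried */
	printf("after unobserved write: %llu\n",
	       (unsigned long long)query_iversion(&iv));
	maybe_inc_iversion(&iv, false);		/* queried above, so this bumps */
	printf("after observed write:   %llu\n",
	       (unsigned long long)query_iversion(&iv));
	return 0;
}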
> How would we approach making iversion a default? libmount is passing
> this option to the kernel as just a MS_I_VERSION flag that is set when
> -o iversion is used and left empty when the -o noiversion is used. This
> means that while we could make it a default in ext4, we don't have any
> way of knowing whether the user asked for -o noiversion. So that's not
> really an option.
>
> Updating the mke2fs/tune2fs to allow setting iversion as a default mount
> option I think has the same problem.
>
> So the only way I can see ATM would be to introduce another mountflag
> for libmount to indicate -o noiversion. This way we can make iversion a
> default on ext4 without losing the information about user-provided -o
> noiversion option.
>
> Is there a different way I am not seeing?
>
Right, implementing this is the difficult bit actually since this uses a
MS_* flag. If we do make this the default, we'd definitely want to
continue allowing "-o noiversion" to disable it.
Could we just reverse the default in libmount? It might cause this to
suddenly be enabled in some deployments, but in most cases, people
wouldn't even notice and they could still specify -o noiversion to turn
it off.
Another idea would be to introduce new mount options for this, but
that's kind of nasty from a UI standpoint.
>
> If we can do reasonably extensive testing that will indeed show
> negligible impact when nothing is querying i_version, then I would be in
> favor of the change.
>
Excellent! I think that would be best if we can get away with it. A lot
of people are currently running ext4-backed nfs servers and aren't using
that mount option.
--
Jeff Layton <[email protected]>
On Wed, Jul 20, 2022 at 10:38:31AM -0400, Jeff Layton wrote:
> On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
--snip--
> > How would we approach making iversion a default? libmount is passing
> > this option to the kernel as just a MS_I_VERSION flag that is set when
> > -o iversion is used and left empty when the -o noiversion is used. This
> > means that while we could make it a default in ext4, we don't have any
> > way of knowing whether the user asked for -o noiversion. So that's not
> > really an option.
> >
> > Updating the mke2fs/tune2fs to allow setting iversion as a default mount
> > option I think has the same problem.
> >
> > So the only way I can see ATM would be to introduce another mountflag
> > for libmount to indicate -o noiversion. This way we can make iversion a
> default on ext4 without losing the information about user-provided -o
> > noiversion option.
> >
> > Is there a different way I am not seeing?
> >
>
> Right, implementing this is the difficult bit actually since this uses a
> MS_* flag. If we do make this the default, we'd definitely want to
> continue allowing "-o noiversion" to disable it.
>
> Could we just reverse the default in libmount? It might cause this to
> suddenly be enabled in some deployments, but in most cases, people
> wouldn't even notice and they could still specify -o noiversion to turn
> it off.
That can be done, but it would change the default for everyone, and I'm
not sure that's desirable. I can also imagine it being a bit confusing. I
still think the best approach would be to introduce another MS_ flag for
the noiversion case. I think there is a precedent in MS_STRICTATIME - not
exactly the same, but similar enough.
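Purely for illustration (MS_NOIVERSION does not exist; the name, bit
value, and helper below are made up), the filesystem side of such a
reversed flag could be as simple as:

/*
 * Hypothetical sketch only: neither MS_NOIVERSION nor this helper exist
 * today. It just shows how a reversed flag could let ext4 default to
 * SB_I_VERSION while still honoring an explicit -o noiversion.
 */
#define MS_NOIVERSION	(1 << 26)	/* made-up bit, for illustration */

static void ext4_apply_iversion_default(struct super_block *sb,
					unsigned long ms_flags)
{
	if (ms_flags & MS_NOIVERSION)
		sb->s_flags &= ~SB_I_VERSION;	/* user asked to turn it off */
	else
		sb->s_flags |= SB_I_VERSION;	/* new default: on */
}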
> Another idea would be to introduce new mount options for this, but
> that's kind of nasty from a UI standpoint.
>
> >
> > If we can do reasonably extensive testing that will indeed show
> > negligible impact when nothing is querying i_version, then I would be in
> > favor of the change.
> >
>
> Excellent! I think that would be best if we can get away with it. A lot
> of people are currently running ext4-backed nfs servers and aren't using
> that mount option.
Could you provide some performance numbers for iversion case?
Thanks!
-Lukas
> --
> Jeff Layton <[email protected]>
>
On 20 Jul 2022, at 10:38, Jeff Layton wrote:
> On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
>>
>> Is there a different way I am not seeing?
>>
>
> Right, implementing this is the difficult bit actually since this uses a
> MS_* flag. If we do make this the default, we'd definitely want to
> continue allowing "-o noiversion" to disable it.
>
> Could we just reverse the default in libmount? It might cause this to
> suddenly be enabled in some deployments, but in most cases, people
> wouldn't even notice and they could still specify -o noiversion to turn
> it off.
>
> Another idea would be to introduce new mount options for this, but
> that's kind of nasty from a UI standpoint.
Is it safe to set SB_I_VERSION at export time? If so, export_operations
could grow an ->enable_iversion().
Ben
On Wed, Jul 20, 2022 at 11:56:12AM -0400, Benjamin Coddington wrote:
> > Another idea would be to introduce new mount options for this, but
> > that's kind of nasty from a UI standpoint.
>
> Is it safe to set SB_I_VERSION at export time? If so, export_operations
> could grow an ->enable_iversion().
No, it's not. Think of unexporting, modifying things, and re-exporting:
now the client will see the same change counter for a modified file.
On Wed, 2022-07-20 at 11:56 -0400, Benjamin Coddington wrote:
> On 20 Jul 2022, at 10:38, Jeff Layton wrote:
> > On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
> > >
> > > Is there a different way I am not seeing?
> > >
> >
> > Right, implementing this is the difficult bit actually since this uses a
> > MS_* flag. If we do make this the default, we'd definitely want to
> > continue allowing "-o noiversion" to disable it.
> >
> > Could we just reverse the default in libmount? It might cause this to
> > suddenly be enabled in some deployments, but in most cases, people
> > wouldn't even notice and they could still specify -o noiversion to turn
> > it off.
> >
> > Another idea would be to introduce new mount options for this, but
> > that's kind of nasty from a UI standpoint.
>
> Is it safe to set SB_I_VERSION at export time? If so, export_operations
> could grow an ->enable_iversion().
>
That sounds like it might be problematic.
Consider the case where a NFSv4 client has cached file data and the
change attribute for the file. Server then reboots, but before the
export happens a local user makes a change to the file and it doesn't
update the i_version.
--
Jeff Layton <[email protected]>
On 20 Jul 2022, at 12:15, Jeff Layton wrote:
> On Wed, 2022-07-20 at 11:56 -0400, Benjamin Coddington wrote:
>> On 20 Jul 2022, at 10:38, Jeff Layton wrote:
>>> On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
>>>>
>>>> Is there a different way I am not seeing?
>>>>
>>>
>>> Right, implementing this is the difficult bit actually since this uses a
>>> MS_* flag. If we do make this the default, we'd definitely want to
>>> continue allowing "-o noiversion" to disable it.
>>>
>>> Could we just reverse the default in libmount? It might cause this to
>>> suddenly be enabled in some deployments, but in most cases, people
>>> wouldn't even notice and they could still specify -o noiversion to turn
>>> it off.
>>>
>>> Another idea would be to introduce new mount options for this, but
>>> that's kind of nasty from a UI standpoint.
>>
>> Is it safe to set SB_I_VERSION at export time? If so, export_operations
>> could grow an ->enable_iversion().
>>
>
> That sounds like it might be problematic.
>
> Consider the case where a NFSv4 client has cached file data and the
> change attribute for the file. Server then reboots, but before the
> export happens a local user makes a change to the file and it doesn't
> update the i_version.
Nfsd currently uses both ctime and i_version if it's available; I'd
expect that eliminates this case.
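(A sketch of that idea, not nfsd's exact code: the change attribute shown
to clients folds ctime in alongside i_version, so a change made while the
counter wasn't being maintained still moves the value once ctime does.)

#include <stdint.h>

/* Sketch only: fold ctime into the change attribute next to i_version. */
static uint64_t change_attr_sketch(int64_t ctime_sec, uint32_t ctime_nsec,
				   uint64_t i_version)
{
	uint64_t chattr = (uint64_t)ctime_sec;

	chattr <<= 30;			/* make room for the nanoseconds */
	chattr += ctime_nsec;
	chattr += i_version;		/* bumped on every explicit change */
	return chattr;
}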
On Wed, 2022-07-20 at 17:22 +0200, Lukas Czerner wrote:
> But not zero, at least
> every time the inode is loaded from disk it is scheduled for i_version
> update on the next attempted increment. Could that have an effect on
> some particular common workload you can think of?
>
FWIW, it's doubtful that you'd even notice this. You'd almost certainly
be updating the mtime or ctime on the next change anyway, so updating
the i_version in that case is basically free. You will probably need to
do a few extra atomic in-memory operations, but that's probably not
noticeable in something I/O-constrained.
>
> Could you provide some performance numbers for iversion case?
>
I'm writing to a LVM volume on a no-name-brand ssd I have sitting
around. fio jobfile is here:
[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=4k
direct=0
numjobs=1
time_based
runtime=300
[file1]
size=1G
ioengine=libaio
iodepth=16
iversion support disabled:
$ fio ./4k-write.fio
file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.27
Starting 1 process
file1: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=52.5MiB/s][w=13.4k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=10056: Wed Jul 20 12:28:21 2022
write: IOPS=96.3k, BW=376MiB/s (394MB/s)(110GiB/300001msec); 0 zone resets
slat (nsec): min=1112, max=5727.5k, avg=1917.70, stdev=1300.30
clat (nsec): min=1112, max=2146.5M, avg=156067.38, stdev=15568002.13
lat (usec): min=3, max=2146.5k, avg=158.03, stdev=15568.00
clat percentiles (usec):
| 1.00th=[ 36], 5.00th=[ 36], 10.00th=[ 37], 20.00th=[ 37],
| 30.00th=[ 38], 40.00th=[ 38], 50.00th=[ 38], 60.00th=[ 39],
| 70.00th=[ 39], 80.00th=[ 40], 90.00th=[ 42], 95.00th=[ 44],
| 99.00th=[ 52], 99.50th=[ 59], 99.90th=[ 77], 99.95th=[ 88],
| 99.99th=[ 169]
bw ( KiB/s): min=15664, max=1599456, per=100.00%, avg=897761.07, stdev=504329.17, samples=257
iops : min= 3916, max=399864, avg=224440.26, stdev=126082.33, samples=257
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=98.80%
lat (usec) : 100=1.18%, 250=0.02%, 500=0.01%
lat (msec) : 10=0.01%, 2000=0.01%, >=2000=0.01%
cpu : usr=5.45%, sys=23.92%, ctx=78418, majf=0, minf=14
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,28889786,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=110GiB (118GB), run=300001-300001msec
Disk stats (read/write):
dm-7: ios=0/22878, merge=0/0, ticks=0/373254, in_queue=373254, util=43.89%, aggrios=0/99746, aggrmerge=0/9246, aggrticks=0/1406831, aggrin_queue=1408420, aggrutil=73.56%
sda: ios=0/99746, merge=0/9246, ticks=0/1406831, in_queue=1408420, util=73.56%
mounted with -o iversion:
$ fio ./4k-write.fio
file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=10369: Wed Jul 20 12:33:57 2022
write: IOPS=96.2k, BW=376MiB/s (394MB/s)(110GiB/300001msec); 0 zone resets
slat (nsec): min=1112, max=1861.5k, avg=1994.58, stdev=890.78
clat (nsec): min=1392, max=2113.3M, avg=156252.71, stdev=15409487.99
lat (usec): min=3, max=2113.3k, avg=158.30, stdev=15409.49
clat percentiles (usec):
| 1.00th=[ 37], 5.00th=[ 38], 10.00th=[ 38], 20.00th=[ 38],
| 30.00th=[ 39], 40.00th=[ 39], 50.00th=[ 40], 60.00th=[ 40],
| 70.00th=[ 41], 80.00th=[ 42], 90.00th=[ 43], 95.00th=[ 45],
| 99.00th=[ 53], 99.50th=[ 60], 99.90th=[ 79], 99.95th=[ 90],
| 99.99th=[ 174]
bw ( KiB/s): min= 304, max=1540000, per=100.00%, avg=870727.42, stdev=499371.78, samples=265
iops : min= 76, max=385000, avg=217681.82, stdev=124842.94, samples=265
lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=98.49%
lat (usec) : 100=1.48%, 250=0.02%, 500=0.01%
lat (msec) : 2=0.01%, 2000=0.01%, >=2000=0.01%
cpu : usr=5.71%, sys=24.49%, ctx=52874, majf=0, minf=18
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,28856695,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
WRITE: bw=376MiB/s (394MB/s), 376MiB/s-376MiB/s (394MB/s-394MB/s), io=110GiB (118GB), run=300001-300001msec
Disk stats (read/write):
dm-7: ios=1/16758, merge=0/0, ticks=2/341817, in_queue=341819, util=47.93%, aggrios=1/98153, aggrmerge=0/5691, aggrticks=2/1399496, aggrin_queue=1400893, aggrutil=73.42%
sda: ios=1/98153, merge=0/5691, ticks=2/1399496, in_queue=1400893, util=73.42%
--
Jeff Layton <[email protected]>
On Wed, 2022-07-20 at 12:29 -0400, Benjamin Coddington wrote:
> On 20 Jul 2022, at 12:15, Jeff Layton wrote:
>
> > On Wed, 2022-07-20 at 11:56 -0400, Benjamin Coddington wrote:
> > > On 20 Jul 2022, at 10:38, Jeff Layton wrote:
> > > > On Wed, 2022-07-20 at 16:15 +0200, Lukas Czerner wrote:
> > > > >
> > > > > Is there a different way I am not seeing?
> > > > >
> > > >
> > > > Right, implementing this is the difficult bit actually since this uses a
> > > > MS_* flag. If we do make this the default, we'd definitely want to
> > > > continue allowing "-o noiversion" to disable it.
> > > >
> > > > Could we just reverse the default in libmount? It might cause this to
> > > > suddenly be enabled in some deployments, but in most cases, people
> > > > wouldn't even notice and they could still specify -o noiversion to turn
> > > > it off.
> > > >
> > > > Another idea would be to introduce new mount options for this, but
> > > > that's kind of nasty from a UI standpoint.
> > >
> > > Is it safe to set SB_I_VERSION at export time? If so, export_operations
> > > could grow an ->enable_iversion().
> > >
> >
> > That sounds like it might be problematic.
> >
> > Consider the case where a NFSv4 client has cached file data and the
> > change attribute for the file. Server then reboots, but before the
> > export happens a local user makes a change to the file and it doesn't
> > update the i_version.
>
> Nfsd currently uses both ctime and i_version if it's available; I'd
> expect that eliminates this case.
>
Good point, that probably would. Still, I'd rather we just enable this
wholesale if we can get away with it. There's still some interest in
exposing i_version to userland via statx or the like, so I'd rather not
assume that only nfsd will care about it.
--
Jeff Layton <[email protected]>
On Wed, Jul 20, 2022 at 12:42:11PM -0400, Jeff Layton wrote:
> On Wed, 2022-07-20 at 17:22 +0200, Lukas Czerner wrote:
>
> > But not zero, at least
> > every time the inode is loaded from disk it is scheduled for i_version
> > update on the next attempted increment. Could that have an effect on
> > some particular common workload you can think of?
> >
>
> FWIW, it's doubtful that you'd even notice this. You'd almost certainly
> be updating the mtime or ctime on the next change anyway, so updating
> the i_version in that case is basically free. You will probably need to
> do a few extra atomic in-memory operations, but that's probably not
> noticeable in something I/O-constrained.
>
> >
> > Could you provide some performance numbers for iversion case?
> >
>
> I'm writing to a LVM volume on a no-name-brand ssd I have sitting
> around. fio jobfile is here:
That's a very simplistic test, but fair enough. I've run 10 iterations of
xfstests with and without iversion and there is no significant
difference; in fact it's all well within run-to-run variation. That's
true in aggregate as well as for individual tests.
However, there are problems to solve before we attempt to make it a
default. With -o iversion, ext4/026 and generic/622 fail. ext4/026 seems
to be a real bug, and I'm not sure about the other one yet.
I'll look into it.
-Lukas
On Thu, 2022-07-21 at 16:06 +0200, Lukas Czerner wrote:
> On Wed, Jul 20, 2022 at 12:42:11PM -0400, Jeff Layton wrote:
> > On Wed, 2022-07-20 at 17:22 +0200, Lukas Czerner wrote:
> >
> > > But not zero, at least
> > > every time the inode is loaded from disk it is scheduled for i_version
> > > update on the next attempted increment. Could that have an effect on
> > > some particular common workload you can think of?
> > >
> >
> > FWIW, it's doubtful that you'd even notice this. You'd almost certainly
> > be updating the mtime or ctime on the next change anyway, so updating
> > the i_version in that case is basically free. You will probably need to
> > do a few extra atomic in-memory operations, but that's probably not
> > noticeable in something I/O-constrained.
> >
> > >
> > > Could you provide some performance numbers for iversion case?
> > >
> >
> > I'm writing to a LVM volume on a no-name-brand ssd I have sitting
> > around. fio jobfile is here:
>
> That's a very simplistic test, but fair enough. I've run 10 iterations
> of xfstests with and without iversion and there is no significant
> difference; in fact it's all well within run-to-run variation. That's
> true in aggregate as well as for individual tests.
>
Yeah. This change was most evident with small I/O sizes, so if there is
an effect, that's where it'll likely show up.
> However, there are problems to solve before we attempt to make it a
> default. With -o iversion, ext4/026 and generic/622 fail. ext4/026
> seems to be a real bug, and I'm not sure about the other one yet.
>
> I'll look into it.
>
Interesting, thanks. Lack of testing with that option enabled is
probably another good reason to go ahead and make it the default. Let me
know what you find.
> -Lukas
>
--
Jeff Layton <[email protected]>
On Tue, Jul 19, 2022 at 09:51:33AM -0400, Jeff Layton wrote:
> Back in 2018, I did a patchset [1] to rework the inode->i_version
> counter handling to be much less expensive, particularly when no-one is
> querying for it.
Yup, there's zero additional overhead for maintaining i_version in
XFS when nothing is monitoring it. Updating it comes for free in any
transaction that modifies the inode, so when writes
occur i_version gets bumped if timestamps change or allocation is
required.
And when something is monitoring it, the overhead is effectively a
single "timestamp" update for each peek at i_version the monitoring
agent makes. This is also largely noise....
> Testing at the time showed that the cost of enabling i_version on ext4
> was close to 0 when nothing is querying it, but I stopped short of
> trying to make it the default at the time (mostly out of an abundance of
> caution). Since then, we still see a steady stream of cache-coherency
> problems with NFSv4 on ext4 when this option is disabled (e.g. [2]).
>
> Is it time to go ahead and make this option the default on ext4? I don't
> see a real downside to doing so, though I'm unclear on how we should
> approach this. Currently the option is twiddled using MS_I_VERSION flag,
> and it's unclear to me how we can reverse the sense of such a flag.
XFS only enables SB_I_VERSION based on an on disk format flag - you
can't turn it on or off by mount options, so it completely ignores
MS_I_VERSION.
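(Paraphrased from memory, not verbatim: at mount time XFS sets the flag
whenever the on-disk format carries the per-inode change counter, i.e.
v5/CRC filesystems, and never consults the mount option.)

	/* rough shape of the check in xfs_fs_fill_super(), not verbatim */
	if (xfs_has_crc(mp))
		sb->s_flags |= SB_I_VERSION;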
> Thoughts?
My 2c is to behave like XFS: ignore the mount option and always turn
it on.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, 2022-07-22 at 08:32 +1000, Dave Chinner wrote:
> On Tue, Jul 19, 2022 at 09:51:33AM -0400, Jeff Layton wrote:
> > Back in 2018, I did a patchset [1] to rework the inode->i_version
> > counter handling to be much less expensive, particularly when no-one is
> > querying for it.
>
> Yup, there's zero additional overhead for maintaining i_version in
> XFS when nothing is monitoring it. Updating it comes for free in any
> transaction that modifies the inode, so when writes
> occur i_version gets bumped if timestamps change or allocation is
> required.
>
> And when something is monitoring it, the overhead is effectively a
> single "timestamp" update for each peek at i_version the monitoring
> agent makes. This is also largely noise....
>
> > Testing at the time showed that the cost of enabling i_version on ext4
> > was close to 0 when nothing is querying it, but I stopped short of
> > trying to make it the default at the time (mostly out of an abundance of
> > caution). Since then, we still see a steady stream of cache-coherency
> > problems with NFSv4 on ext4 when this option is disabled (e.g. [2]).
> >
> > Is it time to go ahead and make this option the default on ext4? I don't
> > see a real downside to doing so, though I'm unclear on how we should
> > approach this. Currently the option is twiddled using MS_I_VERSION flag,
> > and it's unclear to me how we can reverse the sense of such a flag.
>
> XFS only enables SB_I_VERSION based on an on disk format flag - you
> can't turn it on or off by mount options, so it completely ignores
> MS_I_VERSION.
>
> > Thoughts?
>
> My 2c is to behave like XFS: ignore the mount option and always turn
> it on.
I'd be fine with that, personally.
They could also couple that with a tune2fs flag or something, so you
could still disable it if it were a problem for some reason.
It's unlikely that anyone will really notice however, so turning it on
unconditionally may be the best place to start.
--
Jeff Layton <[email protected]>