2013-08-14 17:10:07

by Dave Hansen

Subject: page fault scalability (ext3, ext4, xfs)

We talked a little about this issue in this thread:

http://marc.info/?l=linux-mm&m=137573185419275&w=2

but I figured I'd follow up with a full comparison. ext4 is about 20%
slower in handling write page faults than ext3. xfs is about 30% slower
than ext3. I'm running on an 8-socket / 80-core / 160-thread system.
Test case is this:

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
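
(For reference, a minimal sketch of what that test's inner loop amounts to,
paraphrased from this thread rather than the actual will-it-scale source; the
file size and helper names are illustrative. Each process maps a file
MAP_SHARED and dirties every page, so each store goes through the
filesystem's ->page_mkwrite path:)

#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)	/* 128MB file, per the discussion below */

/* Sketch of a page_fault3-style worker: each store to a freshly mapped,
 * clean page takes a write page fault through the filesystem's
 * ->page_mkwrite path. */
static void fault_loop(int fd, unsigned long *iterations)
{
	long pgsize = sysconf(_SC_PAGESIZE);

	ftruncate(fd, MEMSIZE);
	for (;;) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		unsigned long i;

		for (i = 0; i < MEMSIZE; i += pgsize) {
			c[i] = 0;
			(*iterations)++;
		}
		munmap(c, MEMSIZE);
	}
}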

It's a little easier to look at the trends as you grow the number of
processes:

http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&3=xfs&hide=linear,threads,threads_idle,processes_idle&rollPeriod=16

I recorded and diff'd some perf data (I've still got the raw data if
anyone wants it), and the main culprit of the ext4/xfs delta looks to be
spinlock contention (or at least bouncing) in xfs_log_commit_cil().
This looks to be a known problem:

http://oss.sgi.com/archives/xfs/2013-07/msg00110.html

Here's a brief snippet of the ext4->xfs 'perf diff'. Note that things
like page_fault() go down in the profile because we are doing _fewer_ of
them, not because it got faster:

> #  Baseline    Delta  Shared Object          Symbol
> # ........  .......  .....................  ..............................................
> #
>    22.04%   -4.07%  [kernel.kallsyms]      [k] page_fault
>     2.93%  +12.49%  [kernel.kallsyms]      [k] _raw_spin_lock
>     8.21%   -0.58%  page_fault3_processes  [.] testcase
>     4.87%   -0.34%  [kernel.kallsyms]      [k] __set_page_dirty_buffers
>     4.07%   -0.58%  [kernel.kallsyms]      [k] mem_cgroup_update_page_stat
>     4.10%   -0.61%  [kernel.kallsyms]      [k] __block_write_begin
>     3.69%   -0.57%  [kernel.kallsyms]      [k] find_get_page

It's a bit of a bummer that things are so much less scalable on the
newer filesystems. I expected xfs to do a _lot_ better than it did.



2013-08-14 19:43:59

by Theodore Ts'o

Subject: Re: page fault scalability (ext3, ext4, xfs)

Thanks, Dave, for doing this comparison. Is there any chance you can
check whether lockstats shows anything interesting?

> Test case is this:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c

One interesting thing about the test case. It looks like the first
time through the while loop, the file will need to be extended (since
it is a new tempfile). But subsequent times through the loop the
blocks for the file will already be allocated. If the file is
prezero'ed ahead of time, so we're only measuring the cost of the
write page fault, and we take block allocation out of the comparison,
do we see the same scalability curve?
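
(For illustration, a minimal sketch of the prezeroing idea, with assumed
names: zero-fill the file once before the measurement loop so block
allocation happens outside the measured region.)

#include <string.h>
#include <unistd.h>

/* Sketch only: zero-fill the file up front so all blocks are allocated
 * and written before the fault loop starts, taking block allocation out
 * of the measured region. */
static void prezero_file(int fd, unsigned long size)
{
	char buf[4096];
	unsigned long off;

	memset(buf, 0, sizeof(buf));
	for (off = 0; off < size; off += sizeof(buf))
		write(fd, buf, sizeof(buf));
	fsync(fd);	/* allocation and writeout done before we measure */
}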


Thanks,

- Ted


2013-08-14 20:50:02

by Dave Hansen

Subject: Re: page fault scalability (ext3, ext4, xfs)

On 08/14/2013 12:43 PM, Theodore Ts'o wrote:
> Thanks, Dave, for doing this comparison. Is there any chance you can
> check whether lockstats shows anything interesting?
>
>> Test case is this:
>>
>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>
> One interesting thing about the test case. It looks like the first
> time through the while loop, the file will need to be extended (since
> it is a new tempfile). But subsequent times through the loop the
> blocks for the file will already be allocated. If the file is
> prezero'ed ahead of time, so we're only measuring the cost of the
> write page fault, and we take block allocation out of the comparison,
> do we see the same scalability curve?

Would a plain old fallocate() do the trick, or does it actually need
zeros written to it?


2013-08-14 23:06:48

by Theodore Ts'o

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 01:50:02PM -0700, Dave Hansen wrote:
>
> Would a plain old fallocate() do the trick, or does it actually need
> zeros written to it?

It would be better to write zeros to it, so we aren't measuring the
cost of the unwritten->written conversion.

We could do a different test where at the end of each while loop, we
truncate the file and then do an fallocate, at which point we could be
measuring the scalability of the unwritten->written conversion as well
as the write page fault. And that might be a useful thing to do at
some point.
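
(For concreteness, a sketch of that variant, assuming the glibc fallocate()
wrapper; names are illustrative. Dropping the blocks and preallocating them
again at the end of each pass leaves the extents unwritten, so each pass
measures the unwritten->written conversion along with the write fault:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch only: reset the file at the end of each pass so the next pass
 * writes into preallocated-but-unwritten extents and therefore also pays
 * the unwritten->written conversion cost. */
static void reset_to_unwritten(int fd, off_t size)
{
	ftruncate(fd, 0);		/* drop the existing blocks */
	fallocate(fd, 0, 0, size);	/* preallocate as unwritten extents */
}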

But I'd suggest focusing on just the write page fault first, and then
once we're sure we've improved the scalability of that micro-operation
as much as possible, we can expand our scalability testing to include
either writing into fallocated space, or doing extending writes.

Cheers,

- Ted


2013-08-14 23:38:12

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 4:06 PM, Theodore Ts'o <[email protected]> wrote:
> On Wed, Aug 14, 2013 at 01:50:02PM -0700, Dave Hansen wrote:
>>
>> Would a plain old fallocate() do the trick, or does it actually need
>> zeros written to it?
>
> It would be better to write zeros to it, so we aren't measuring the
> cost of the unwritten->written conversion.

At the risk of beating a dead horse, how hard would it be to defer
this part until writeback?

--Andy

2013-08-15 00:24:36

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 10:10:07AM -0700, Dave Hansen wrote:
> We talked a little about this issue in this thread:
>
> http://marc.info/?l=linux-mm&m=137573185419275&w=2
>
> but I figured I'd follow up with a full comparison. ext4 is about 20%
> slower in handling write page faults than ext3. xfs is about 30% slower
> than ext3. I'm running on an 8-socket / 80-core / 160-thread system.
> Test case is this:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c

So, it writes a 128MB file sequentially via mmap page faults. This
isn't a page fault benchmark, as such...

>
> It's a little easier to look at the trends as you grow the number of
> processes:
>
> http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&3=xfs&hide=linear,threads,threads_idle,processes_idle&rollPeriod=16
>
> I recorded and diff'd some perf data (I've still got the raw data if
> anyone wants it), and the main culprit of the ext4/xfs delta looks to be
> spinlock contention (or at least bouncing) in xfs_log_commit_cil().
> This looks to be a known problem:
>
> http://oss.sgi.com/archives/xfs/2013-07/msg00110.html

Yup, apparently they've been pulled into the xfsdev tree, but I
haven't seen it updated since they were pulled in, so the linux-next
builds aren't picking up the fixes yet.

> Here's a brief snippet of the ext4->xfs 'perf diff'. Note that things
> like page_fault() go down in the profile because we are doing _fewer_ of
> them, not because it got faster:
>
> > # Baseline Delta Shared Object Symbol
> > # ........ ....... ..................... ..............................................
> > #
> > 22.04% -4.07% [kernel.kallsyms] [k] page_fault
> > 2.93% +12.49% [kernel.kallsyms] [k] _raw_spin_lock
> > 8.21% -0.58% page_fault3_processes [.] testcase
> > 4.87% -0.34% [kernel.kallsyms] [k] __set_page_dirty_buffers
> > 4.07% -0.58% [kernel.kallsyms] [k] mem_cgroup_update_page_stat
> > 4.10% -0.61% [kernel.kallsyms] [k] __block_write_begin
> > 3.69% -0.57% [kernel.kallsyms] [k] find_get_page
>
> It's a bit of a bummer that things are so much less scalable on the
> newer filesystems.

Sorry, what? What filesystems are you comparing here? XFS is
anything but new...

> I expected xfs to do a _lot_ better than it did.

perf diff doesn't tell me anything about how you should expect the
workload to scale.

This workload appears to be a concurrent write workload using
mmap(), so performance is going to be determined by filesystem
configuration, storage capability and the CPU overhead of the
page_mkwrite() path through the filesystem. It's not a page fault
benchmark at all - it's simply a filesystem write bandwidth
benchmark.

So, perhaps you could describe the storage you are using, as that
would shed more light on your results. A good summary of what
information is useful to us is here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

And FWIW, it's no secret that XFS has more per-operation overhead
than ext4 through the write path when it comes to allocation, so
it's no surprise that on a workload that is highly dependent on
allocation overhead that ext4 is a bit faster....

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-08-15 01:11:01

by Theodore Ts'o

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > It would be better to write zeros to it, so we aren't measuring the
> > cost of the unwritten->written conversion.
>
> At the risk of beating a dead horse, how hard would it be to defer
> this part until writeback?

Part of the work has to be done at write time because we need to
update allocation statistics (i.e., so that we don't have ENOSPC
problems). The unwritten->written conversion does happen at writeback
(as does the actual block allocation if we are doing delayed
allocation).

The point is that if the goal is to measure page fault scalability, we
shouldn't have this other stuff happening at the same time as the page
fault workload.

- Ted

2013-08-15 02:10:28

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > > It would be better to write zeros to it, so we aren't measuring the
> > > cost of the unwritten->written conversion.
> >
> > At the risk of beating a dead horse, how hard would it be to defer
> > this part until writeback?
>
> Part of the work has to be done at write time because we need to
> update allocation statistics (i.e., so that we don't have ENOSPC
> problems). The unwritten->written conversion does happen at writeback
> (as does the actual block allocation if we are doing delayed
> allocation).
>
> The point is that if the goal is to measure page fault scalability, we
> shouldn't have this other stuff happening at the same time as the page
> fault workload.

Sure, but the real problem is not the block mapping or allocation
path - even if the test is changed to take that out of the picture,
we still have timestamp updates being done on every single page
fault. ext4, XFS and btrfs all do transactional timestamp updates
and have nanosecond granularity, so every page fault is resulting in
a transaction to update the timestamp of the file being modified.

That's why on XFS the log is showing up in the profiles.

So, even if we narrow the test down to just overwriting existing
blocks, we've still got a filesystem transaction per page fault
being done. IOWs, it's still just a filesystem overhead test....
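
(A small userspace illustration of that point, as a sketch under the
assumption stated here that ext4/XFS/btrfs update times from ->page_mkwrite:
the first store to a clean MAP_SHARED page bumps the file's mtime
immediately, long before any writeback. Error handling is omitted and the
file name is arbitrary; it is assumed to exist and be at least one page
long.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR);	/* assumed to exist, >= 4096 bytes */
	struct stat before, after;
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	fstat(fd, &before);
	p[0] = 1;	/* write fault -> ->page_mkwrite -> timestamp transaction */
	fstat(fd, &after);
	printf("mtime moved: %d\n",
	       before.st_mtim.tv_sec != after.st_mtim.tv_sec ||
	       before.st_mtim.tv_nsec != after.st_mtim.tv_nsec);
	return 0;
}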

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-08-15 02:24:01

by Andi Kleen

Subject: Re: page fault scalability (ext3, ext4, xfs)

> And FWIW, it's no secret that XFS has more per-operation overhead
> than ext4 through the write path when it comes to allocation, so
> it's no surprise that on a workload that is highly dependent on
> allocation overhead that ext4 is a bit faster....

This cannot explain a worse scaling curve though?

w-i-s is all about scaling.

-Andi


2013-08-15 04:29:30

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 07:24:01PM -0700, Andi Kleen wrote:
> > And FWIW, it's no secret that XFS has more per-operation overhead
> > than ext4 through the write path when it comes to allocation, so
> > it's no surprise that on a workload that is highly dependent on
> > allocation overhead that ext4 is a bit faster....
>
> This cannot explain a worse scaling curve though?

The scaling curve is pretty much identical. The difference in
performance will be the overhead of timestamp updates through
the transaction subsystems of the filesystems.

> w-i-s is all about scaling.

Sure, but scaling *what*? It's spending all its time in the
filesystem through the .page_mkwrite path. It's not a page fault
scaling test - it's a filesystem overwrite test that uses mmap.
Indeed, I bet if you replace the mmap() with a write(fd, buf, 4096)
loop, you'd get almost identical behaviour from the filesystems.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-08-15 04:32:13

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > > It would be better to write zeros to it, so we aren't measuring the
>> > > cost of the unwritten->written conversion.
>> >
>> > At the risk of beating a dead horse, how hard would it be to defer
>> > this part until writeback?
>>
>> Part of the work has to be done at write time because we need to
>> update allocation statistics (i.e., so that we don't have ENOSPC
>> problems). The unwritten->written conversion does happen at writeback
>> (as does the actual block allocation if we are doing delayed
>> allocation).
>>
>> The point is that if the goal is to measure page fault scalability, we
>> shouldn't have this other stuff happening at the same time as the page
>> fault workload.
>
> Sure, but the real problem is not the block mapping or allocation
> path - even if the test is changed to take that out of the picture,
> we still have timestamp updates being done on every single page
> fault. ext4, XFS and btrfs all do transactional timestamp updates
> and have nanosecond granularity, so every page fault is resulting in
> a transaction to update the timestamp of the file being modified.

I have (unmergeable) patches to fix this:

http://comments.gmane.org/gmane.linux.kernel.mm/92476

I'll dust them off. Getting something like that merged will allow me
to run an unmodified kernel.org kernel on my production system :) It
should be a latency improvement (file times are deferred), a
throughput improvement (one update per writepages call instead of one
per page), and a correctness improvement (the current semantics
violate SuS, IIRC, and are backwards from the point of view of
anything trying to detect changes to files).

--Andy

>
> That's why on XFS the log is showing up in the profiles.
>
> So, even if we narrow the test down to just overwriting existing
> blocks, we've still got a filesystem transaction per page fault
> being done. IOWs, it's still just a filesystem overhead test....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]



--
Andy Lutomirski
AMA Capital Management, LLC


2013-08-15 06:01:49

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > > cost of the unwritten->written conversion.
> >> >
> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > this part until writeback?
> >>
> >> Part of the work has to be done at write time because we need to
> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> problems). The unwritten->written conversion does happen at writeback
> >> (as does the actual block allocation if we are doing delayed
> >> allocation).
> >>
> >> The point is that if the goal is to measure page fault scalability, we
> >> shouldn't have this other stuff happening at the same time as the page
> >> fault workload.
> >
> > Sure, but the real problem is not the block mapping or allocation
> > path - even if the test is changed to take that out of the picture,
> > we still have timestamp updates being done on every single page
> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > and have nanosecond granularity, so every page fault is resulting in
> > a transaction to update the timestamp of the file being modified.
>
> I have (unmergeable) patches to fix this:
>
> http://comments.gmane.org/gmane.linux.kernel.mm/92476

The big problem with this approach is that not doing the
timestamp update on page faults is going to break the inode change
version counting because for ext4, btrfs and XFS it takes a
transaction to bump that counter. NFS needs to know the moment a
file is changed in memory, not when it is written to disk. Also, NFS
requires the change to the counter to be persistent over server
failures, so it needs to be changed as part of a transaction....

IOWs, fixing the "filesystems need a transaction on each page_mkwrite
call" problem isn't as simple as changing how timestamps are
updated.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-08-15 06:14:37

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
>> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> >> > > It would be better to write zeros to it, so we aren't measuring the
>> >> > > cost of the unwritten->written conversion.
>> >> >
>> >> > At the risk of beating a dead horse, how hard would it be to defer
>> >> > this part until writeback?
>> >>
>> >> Part of the work has to be done at write time because we need to
>> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> >> problems). The unwritten->written conversion does happen at writeback
>> >> (as does the actual block allocation if we are doing delayed
>> >> allocation).
>> >>
>> >> The point is that if the goal is to measure page fault scalability, we
>> >> shouldn't have this other stuff happening at the same time as the page
>> >> fault workload.
>> >
>> > Sure, but the real problem is not the block mapping or allocation
>> > path - even if the test is changed to take that out of the picture,
>> > we still have timestamp updates being done on every single page
>> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > and have nanosecond granularity, so every page fault is resulting in
>> > a transaction to update the timestamp of the file being modified.
>>
>> I have (unmergeable) patches to fix this:
>>
>> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>
> The big problem with this approach is that not doing the
> timestamp update on page faults is going to break the inode change
> version counting because for ext4, btrfs and XFS it takes a
> transaction to bump that counter. NFS needs to know the moment a
> file is changed in memory, not when it is written to disk. Also, NFS
> requires the change to the counter to be persistent over server
> failures, so it needs to be changed as part of a transaction....

I've been running a kernel that has the file_update_time call
commented out for over a year now, and the only problem I've seen is
that the timestamp doesn't get updated :)

I think I must be misunderstanding you (or vice versa). I'm currently
redoing the patches, and this time I'll do it for just the mm core and
ext4. The only change I'm proposing to ext4's page_mkwrite is to
remove the file_update_time call. Instead, ext4 will call
file_update_time on munmap, exit, MS_ASYNC, and at the end of
writepages. Unless I'm missing something, there's no need to
unconditionally start a transaction on page_mkwrite (and there had
better not be, because file_update_time won't start a transaction if
the time doesn't change).
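
(Hedged sketch of the shape of that deferral, not the actual patch set; the
flag and helper names below are hypothetical. The idea: note at fault time
that the mapping was written through mmap, and pay the timestamp/i_version
transaction once when the dirty data is flushed rather than once per fault.)

/* Sketch only -- AS_MMAP_WRITE_SEEN and fs_deferred_update_time() are
 * hypothetical names, and this is not the actual patch set: */
static int fs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;

	set_bit(AS_MMAP_WRITE_SEEN, &mapping->flags);	/* cheap: no transaction */

	/* ... prepare and dirty the page as the filesystem normally would ... */
	return VM_FAULT_LOCKED;
}

static int fs_writepages(struct address_space *mapping,
			 struct writeback_control *wbc)
{
	/* one timestamp/i_version transaction per flush, not per page fault */
	if (test_and_clear_bit(AS_MMAP_WRITE_SEEN, &mapping->flags))
		fs_deferred_update_time(mapping->host);

	/* ... normal writeback ... */
	return 0;
}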

NFS can do whatever it wants, although I suspect that even NFS can get
away with deferring cmtime updates.

--Andy

>
> IOWs, fixing the "filesystems need a transaction on each page_mkwrite
> call" problem isn't as simple as changing how timestamps are
> updated.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]



--
Andy Lutomirski
AMA Capital Management, LLC


2013-08-15 06:18:01

by David Lang

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, 14 Aug 2013, Andy Lutomirski wrote:

>> The big problem with this approach is that not doing the
>> timestamp update on page faults is going to break the inode change
>> version counting because for ext4, btrfs and XFS it takes a
>> transaction to bump that counter. NFS needs to know the moment a
>> file is changed in memory, not when it is written to disk. Also, NFS
>> requires the change to the counter to be persistent over server
>> failures, so it needs to be changed as part of a transaction....
>
> NFS can do whatever it wants, although I suspect that even NFS can get
> away with deferring cmtime updates.

NFS already has to do syncs to make sure the data is safe on disk. Add a flag
that NFS can use to make the ctime safe; everyone else can get the performance
improvement and NFS can have its slow-but-safe approach.

David Lang


2013-08-15 06:28:40

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 11:18 PM, David Lang <[email protected]> wrote:
> On Wed, 14 Aug 2013, Andy Lutomirski wrote:
>
>>> The big problem with this approach is that not doing the
>>> timestamp update on page faults is going to break the inode change
>>> version counting because for ext4, btrfs and XFS it takes a
>>> transaction to bump that counter. NFS needs to know the moment a
>>> file is changed in memory, not when it is written to disk. Also, NFS
>>> requires the change to the counter to be persistent over server
>>> failures, so it needs to be changed as part of a transaction....
>>
>>
>> NFS can do whatever it wants, although I suspect that even NFS can get
>> away with deferring cmtime updates.
>
>
> NFS already has to do syncs to make sure the data is safe on disk. Add a
> flag that NFS can use to make the ctime safe; everyone else can get the
> performance improvement and NFS can have its slow-but-safe approach.
>

I don't see the current code that updates times for NFS. I'm not
planning on making any changes that'll affect NFS at all (i.e. I don't
think any flag will be needed), but I'd be more confident if I
understood why it worked in the first place.

(For filesystems that provide page_mkwrite, there hasn't been a
file_update_time call in the core code for several kernel versions.)

--Andy


2013-08-15 07:11:42

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> >> > > cost of the unwritten->written conversion.
> >> >> >
> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> >> > this part until writeback?
> >> >>
> >> >> Part of the work has to be done at write time because we need to
> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> >> problems). The unwritten->written conversion does happen at writeback
> >> >> (as does the actual block allocation if we are doing delayed
> >> >> allocation).
> >> >>
> >> >> The point is that if the goal is to measure page fault scalability, we
> >> >> shouldn't have this other stuff happening at the same time as the page
> >> >> fault workload.
> >> >
> >> > Sure, but the real problem is not the block mapping or allocation
> >> > path - even if the test is changed to take that out of the picture,
> >> > we still have timestamp updates being done on every single page
> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > and have nanosecond granularity, so every page fault is resulting in
> >> > a transaction to update the timestamp of the file being modified.
> >>
> >> I have (unmergeable) patches to fix this:
> >>
> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >
> > The big problem with this approach is that not doing the
> > timestamp update on page faults is going to break the inode change
> > version counting because for ext4, btrfs and XFS it takes a
> > transaction to bump that counter. NFS needs to know the moment a
> > file is changed in memory, not when it is written to disk. Also, NFS
> > requires the change to the counter to be persistent over server
> > failures, so it needs to be changed as part of a transaction....
>
> I've been running a kernel that has the file_update_time call
> commented out for over a year now, and the only problem I've seen is
> that the timestamp doesn't get updated :)
>
> I think I must be misunderstanding you (or vice versa). I'm currently

Yup, you are.

> redoing the patches, and this time I'll do it for just the mm core and
> ext4. The only change I'm proposing to ext4's page_mkwrite is to
> remove the file_update_time call.

Right. Where does that end up? All the way down in
ext4_mark_iloc_dirty(), and that does:

        if (IS_I_VERSION(inode))
                inode_inc_iversion(inode);

The XFS transaction code is the same - deep inside it where an inode
is marked as dirty in the transaction, it bumps the same counter and
adds it to the transaction.

If a filesystem is providing an i_version value, then NFS uses it to
determine whether client side caches are still consistent with the
server state. If the filesystem does not provide an i_version, then
NFS falls back to checking c/mtime for changes. If files on the
server are being modified without either the timestamps or i_version
changing, then it's likely that there will be problems with client
side cache consistency....

> Instead, ext4 will call
> file_update_time on munmap, exit, MS_ASYNC, and at the end of
> writepages. Unless I'm missing something, there's no need to
> unconditionally start a transaction on page_mkwrite (and there had
> better not be, because file_update_time won't start a transaction if
> the time doesn't change).

Right, there's no unconditional need for a transaction except if the
filesystem is providing the inode version change feature for NFS.
ext4, btrfs and XFS all do this unconditionally, so those
filesystems need an inode change transaction on every page fault,
just like they do for every write(2) call.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-08-15 07:45:31

by Jan Kara

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> >> > > cost of the unwritten->written conversion.
> > >> >> >
> > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> >> > this part until writeback?
> > >> >>
> > >> >> Part of the work has to be done at write time because we need to
> > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> >> problems). The unwritten->written conversion does happen at writeback
> > >> >> (as does the actual block allocation if we are doing delayed
> > >> >> allocation).
> > >> >>
> > >> >> The point is that if the goal is to measure page fault scalability, we
> > >> >> shouldn't have this other stuff happening at the same time as the page
> > >> >> fault workload.
> > >> >
> > >> > Sure, but the real problem is not the block mapping or allocation
> > >> > path - even if the test is changed to take that out of the picture,
> > >> > we still have timestamp updates being done on every single page
> > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > >> > and have nanosecond granularity, so every page fault is resulting in
> > >> > a transaction to update the timestamp of the file being modified.
> > >>
> > >> I have (unmergeable) patches to fix this:
> > >>
> > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > >
> > > The big problem with this approach is that not doing the
> > > timestamp update on page faults is going to break the inode change
> > > version counting because for ext4, btrfs and XFS it takes a
> > > transaction to bump that counter. NFS needs to know the moment a
> > > file is changed in memory, not when it is written to disk. Also, NFS
> > > requires the change to the counter to be persistent over server
> > > failures, so it needs to be changed as part of a transaction....
> >
> > I've been running a kernel that has the file_update_time call
> > commented out for over a year now, and the only problem I've seen is
> > that the timestamp doesn't get updated :)
> >
> > I think I must be misunderstanding you (or vice versa). I'm currently
>
> Yup, you are.
>
> > redoing the patches, and this time I'll do it for just the mm core and
> > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> > remove the file_update_time call.
>
> Right. Where does that end up? All the way down in
> ext4_mark_iloc_dirty(), and that does:
>
>         if (IS_I_VERSION(inode))
>                 inode_inc_iversion(inode);
>
> The XFS transaction code is the same - deep inside it where an inode
> is marked as dirty in the transaction, it bumps the same counter and
> adds it to the transaction.
Yeah, I'd just add that ext4 maintains i_version only if it has been
mounted with i_version mount option. But then NFS server would depend on
c/mtime update so it won't help you much - you still should update at least
one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
exported, you could avoid this relatively expensive dance and defer things
as Andy suggests.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-15 15:05:06

by Theodore Ts'o

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Wed, Aug 14, 2013 at 10:10:07AM -0700, Dave Hansen wrote:
> We talked a little about this issue in this thread:
>
> http://marc.info/?l=linux-mm&m=137573185419275&w=2
>
> but I figured I'd follow up with a full comparison. ext4 is about 20%
> slower in handling write page faults than ext3.

Let's take a step back from the details of whether the benchmark is
measuring what it claims to be measuring, and address this a different
way --- what's the workload which might be run on an 8-socket, 80-core
system, which is heavily modifying mmap'ed pages in such a way that
all or most of the memory writes are to clean pages that require write
page fault handling?

We can talk about isolating the test so that we remove block
allocation, timestamp modifications, etc., but then are we still
measuring whatever motivated Dave's work in the first place?

IOW, if it really is about write page fault handling, the simplest
test to do is to mmap /dev/zero and then start dirtying pages. At
that point we will be measuring the VM level write page fault code.
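
(A sketch of that simplest test, assuming the usual behaviour that a
MAP_PRIVATE mapping of /dev/zero is anonymous memory, so the stores below
exercise only the VM write-fault path with no filesystem involvement:)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)

/* Sketch only: a MAP_PRIVATE mapping of /dev/zero is anonymous memory,
 * so these stores exercise just the VM write-fault path -- no
 * ->page_mkwrite, no timestamps, no block allocation. */
static void anon_fault_loop(unsigned long *iterations)
{
	int fd = open("/dev/zero", O_RDWR);
	long pgsize = sysconf(_SC_PAGESIZE);

	for (;;) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);
		unsigned long i;

		for (i = 0; i < MEMSIZE; i += pgsize) {
			c[i] = 1;
			(*iterations)++;
		}
		munmap(c, MEMSIZE);
	}
}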

If we start trying to add in file system specific behavior, then we
get into questions about block allocation vs. inode updates
vs. writeback code paths, depending on what we are trying to measure,
which then leads to the next logical question --- why are we trying to
measure this?

Is there a specific scalability problem that shows up in some real
world use case? Or is this a theoretical exercise? It's Ok if it's
just theoretical, since then we can try to figure out some kind of
useful scalability limitation which is of practical importance. But
if there was some original workload which was motivating this
exercise, it would be good if we kept this in mind....

Cheers,

- Ted

2013-08-15 15:09:26

by Dave Hansen

Subject: Re: page fault scalability (ext3, ext4, xfs)

On 08/14/2013 05:24 PM, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 10:10:07AM -0700, Dave Hansen wrote:
>> We talked a little about this issue in this thread:
>>
>> http://marc.info/?l=linux-mm&m=137573185419275&w=2
>>
>> but I figured I'd follow up with a full comparison. ext4 is about 20%
>> slower in handling write page faults than ext3. xfs is about 30% slower
>> than ext3. I'm running on an 8-socket / 80-core / 160-thread system.
>> Test case is this:
>>
>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>
> So, it writes a 128MB file sequentially via mmap page faults. This
> isn't a page fault benchmark, as such...

Call it what you will. :)

The other half of the benchmark (the threaded case) looks _completely_
different since it's dominated by per-mm VM structures while doing page
faults.

>>> # Baseline Delta Shared Object Symbol
>>> # ........ ....... ..................... ..............................................
>>> #
>>> 22.04% -4.07% [kernel.kallsyms] [k] page_fault
>>> 2.93% +12.49% [kernel.kallsyms] [k] _raw_spin_lock
>>> 8.21% -0.58% page_fault3_processes [.] testcase
>>> 4.87% -0.34% [kernel.kallsyms] [k] __set_page_dirty_buffers
>>> 4.07% -0.58% [kernel.kallsyms] [k] mem_cgroup_update_page_stat
>>> 4.10% -0.61% [kernel.kallsyms] [k] __block_write_begin
>>> 3.69% -0.57% [kernel.kallsyms] [k] find_get_page
>>
>> It's a bit of a bummer that things are so much less scalable on the
>> newer filesystems.
>
> Sorry, what? What filesystems are you comparing here? XFS is
> anything but new...

As I said in the first message:
> Here's a brief snippet of the ext4->xfs 'perf diff'. Note that things
> like page_fault() go down in the profile because we are doing _fewer_ of
> them, not because it got faster:

And, yes, I probably shouldn't be calling xfs "newer".

>> I expected xfs to do a _lot_ better than it did.
>
> perf diff doesn't tell me anything about how you should expect the
> workload to scale.

Click on the little "Linear scaling" checkbox. That's what I _want_ it
to do. It's completely unscientific, but I _expected_ xfs to do better
than ext4 here.

> This workload appears to be a concurrent write workload using
> mmap(), so performance is going to be determined by filesystem
> configuration, storage capability and the CPU overhead of the
> page_mkwrite() path through the filesystem. It's not a page fault
> benchmark at all - it's simply a filesystem write bandwidth
> benchmark.
>
> So, perhaps you could describe the storage you are using, as that
> would shed more light on your results.

The storage is a piddly little laptop disk. If I do this on a
ramfs-hosted loopback, things actually look the same (or even a wee
bit worse). The reason is that nobody is waiting on the disk to finish
any of the writeback (we're way below the dirty limits), so we're not
actually limited by the storage.

> And FWIW, it's no secret that XFS has more per-operation overhead
> than ext4 through the write path when it comes to allocation, so
> it's no surprise that on a workload that is highly dependent on
> allocation overhead that ext4 is a bit faster....

Oh, I didn't mean to be spilling secrets here or anything. I'm
obviously not a filesystem developer and I have zero deep understanding
of what the difference in overhead of the write paths is. It confused
me, so I reported it.

2013-08-15 15:14:21

by Dave Hansen

Subject: Re: page fault scalability (ext3, ext4, xfs)

On 08/14/2013 06:11 PM, Theodore Ts'o wrote:
> The point is that if the goal is to measure page fault scalability, we
> shouldn't have this other stuff happening at the same time as the page
> fault workload.

will-it-scale does several different tests probing at different parts of
the fault path:

https://www.sr71.net/~dave/intel/willitscale/systems/bigbox/3.11.0-rc2-dirty/foo.html

It does that both for process and threaded workloads which lets it get
pretty good coverage of different areas of code.

I only posted data from half of one of these tests here because it was
the only one I found that both had noticeable overhead in the
filesystem code and showed substantial, consistent, and measurable
deltas between the different filesystems.


2013-08-15 15:17:18

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <[email protected]> wrote:
> On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
>> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
>> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> >> >> > > cost of the unwritten->written conversion.
>> >> >> >
>> >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> >> >> > this part until writeback?
>> >> >>
>> >> >> Part of the work has to be done at write time because we need to
>> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> >> >> problems). The unwritten->written conversion does happen at writeback
>> >> >> (as does the actual block allocation if we are doing delayed
>> >> >> allocation).
>> >> >>
>> >> >> The point is that if the goal is to measure page fault scalability, we
>> >> >> shouldn't have this other stuff happening at the same time as the page
>> >> >> fault workload.
>> >> >
>> >> > Sure, but the real problem is not the block mapping or allocation
>> >> > path - even if the test is changed to take that out of the picture,
>> >> > we still have timestamp updates being done on every single page
>> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> >> > and have nanosecond granularity, so every page fault is resulting in
>> >> > a transaction to update the timestamp of the file being modified.
>> >>
>> >> I have (unmergeable) patches to fix this:
>> >>
>> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> >
>> > The big problem with this approach is that not doing the
>> > timestamp update on page faults is going to break the inode change
>> > version counting because for ext4, btrfs and XFS it takes a
>> > transaction to bump that counter. NFS needs to know the moment a
>> > file is changed in memory, not when it is written to disk. Also, NFS
>> > requires the change to the counter to be persistent over server
>> > failures, so it needs to be changed as part of a transaction....
>>
>> I've been running a kernel that has the file_update_time call
>> commented out for over a year now, and the only problem I've seen is
>> that the timestamp doesn't get updated :)
>>

[...]

> If a filesystem is providing an i_version value, then NFS uses it to
> determine whether client side caches are still consistent with the
> server state. If the filesystem does not provide an i_version, then
> NFS falls back to checking c/mtime for changes. If files on the
> server are being modified without either the timestamps or i_version
> changing, then it's likely that there will be problems with client
> side cache consistency....

I didn't think of that at all.

If userspace does:

ptr = mmap(...);
ptr[0] = 1;
sleep(1);
ptr[0] = 2;
sleep(1);
munmap();

Then current kernels will mark the inode changed on (only) the ptr[0]
= 1 line. My patches will instead mark the inode changed when munmap
is called (or after ptr[0] = 2 if writepages gets called for any
reason).

I'm not sure which is better. POSIX actually requires my behavior
(which is mostly irrelevant). My behavior also means that, if an NFS
client reads and caches the file between the two writes, then it will
eventually find out that the data is stale. The current behavior, on
the other hand, means that a single pass of mmapped writes through the
file will update the times much faster.

I could arrange for the first page fault to *also* update times when
the FS is exported or if a particular mount option is set. (The ext4
change to request the new behavior is all of four lines, and it's easy
to adjust.)

I'll send patches later today. I want to get msync(MS_ASYNC) working
and pound on them a bit first.

--Andy


2013-08-15 15:36:40

by Dave Hansen

Subject: Re: page fault scalability (ext3, ext4, xfs)

On 08/14/2013 09:29 PM, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 07:24:01PM -0700, Andi Kleen wrote:
>>> And FWIW, it's no secret that XFS has more per-operation overhead
>>> than ext4 through the write path when it comes to allocation, so
>>> it's no surprise that on a workload that is highly dependent on
>>> allocation overhead that ext4 is a bit faster....
>>
>> This cannot explain a worse scaling curve though?
>
> The scaling curve is pretty much identical. The difference in
> performance will be the overhead of timestamp updates through
> the transaction subsystems of the filesystems.

I guess how you read it is in the eye of the beholder. I see xfs being
slower than ext3 or ext4. Nobody sits and does this in a loop in real
life (it's a microbenchmark), but I'd be willing to bet that this is a
real *component* of real-life workloads. It's a component where I think
it's pretty clear xfs and ext4 lag behind ext3, and it _looks_ to me
like it gets worse on larger systems.

Maybe that's because of design decisions in the filesystem, or because
of the enhanced integrity guarantees that xfs/ext4 provide.

>> w-i-s is all about scaling.
>
> Sure, but scaling *what*? It's spending all it's time in the
> filesystem through the .page_mkwrite path. It's not a page fault
> scaling test - it's a filesystem overwrite test that uses mmap.

will-it-scale tests a bunch of different scenarios. This is just one of
at least 6 tests that we do which beat on the page fault path. It was
the only one of those 6 that showed any kind of bottleneck being in the
fs code.

> Indeed, I bet if you replace the mmap() with a write(fd, buf, 4096)
> loop, you'd get almost identical behaviour from the filesystems.

In a quick 60-second test: xfs went from ~70M writes/sec (doing faults)
to ~18M/sec (using write()). ext4 went down to 0.5M/sec. I didn't take
the mmap()/munmap() out:

	lseek(fd, 0, SEEK_SET);
	for (i = 0; i < MEMSIZE; i += pgsize) {
		write(fd, xxx, 4096);
		//c[i] = 0;
		(*iterations)++;
	}


2013-08-15 17:45:09

by Dave Hansen

Subject: Re: page fault scalability (ext3, ext4, xfs)

On 08/15/2013 08:05 AM, Theodore Ts'o wrote:
> IOW, if it really is about write page fault handling, the simplest
> test to do is to mmap /dev/zero and then start dirtying pages. At
> that point we will be measuring the VM level write page fault code.

As I mentioned in some of the other replies, this is only one of six
tests that look at page faults. It's the only one of the six that even
hinted at involvement by fs code.

> If we start trying to add in file system specific behavior, then we
> get into questions about block allocation vs. inode updates
> vs. writeback code paths, depending on what we are trying to measure,
> which then leads to the next logical question --- why are we trying to
> measure this?

At the risk of putting the cart before the horse, I ran the following:

http://sr71.net/~dave/intel/page-fault-exts/page_fault4.c.txt

It should do all of the block allocation during will-it-scale's warmup
period. I ran it for all 3 fs's with 160 processes. The numbers were
indistinguishable from the case where the blocks were not preallocated.

I _believe_ this is because the block allocation is occurring during the
warmup, even in those numbers I posted previously. will-it-scale forks
things off early and the tests spend most of their time in those while
loops. Each "page fault handled" (the y-axis) is a trip through the
while loop, *not* a call to testcase().

It looks something like this:

for_each_cpu(cpu)
	fork_off_stuff(testcase_func, &iterations[cpu]);
while (test_nr++) {
	if (test_nr < 5)
		printf("warmup...");
	sleep(1);
	sample_iterations_from_shmem();
}
kill_everything();

In other words, block allocation isn't (or shouldn't be) playing a role
here, at least in the faults-per-second numbers.

> Is there a specific scalability problem that is show up in some real
> world use case? Or is this a theoretical exercise? It's Ok if it's
> just theoretical, since then we can try to figure out some kind of
> useful scalability limitation which is of practical importance. But
> if there was some original workload which was motivating this
> exercise, it would be good if we kept this in mind....

It's definitely a theoretical exercise. I'm in no way saying that all
you lazy filesystem developers need to get off your butts and go fix
this! ;)

Here's the problem:

We've got a kernel which works *really* well, even on very large
systems. There are vanishingly few places to make performance
improvements, especially on more modestly-sized systems. To _find_
those smallish issues (which I believe this is), we run things on
ridiculously-sized systems to make them easier to identify and measure.

1. The test is doing something that is not out of the question for a
real workload to be doing (writing to an existing, medium-sized file
with mmap())
2. I noticed that it exercised some of the same code paths Andy
Lutomirski was trying to work around with his MADV_WILLWRITE patch
3. Dave Chinner _has_ patches which look to me like they could make an
impact (at least on the xfs_log_commit_cil() spinlock)
4. This is something that is measurable, and we can easily measure
improvements


2013-08-15 19:31:22

by Theodore Ts'o

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 10:45:09AM -0700, Dave Hansen wrote:
>
> I _believe_ this is because the block allocation is occurring during the
> warmup, even in those numbers I posted previously. will-it-scale forks
> things off early and the tests spend most of their time in those while
> loops. Each "page fault handled" (the y-axis) is a trip through the
> while loop, *not* a call to testcase().

Ah, OK. Sorry, I misinterpreted what was going on.

So basically, what we have going on in the test is (a) we're bumping
i_version and/or mtime, and (b) the munmap() implies an msync(), so
writeback is happening in the background concurrently with the write
page faults, and we may be (actually, almost certainly) seeing some
interference between the writeback and the page_mkwrite operations.

That implies that if you redid the test using a ramdisk, which will
significantly speed up the writeback and reduce the overhead caused by the
journal transactions for the metadata updates, the results might very
well be different.

Cheers,

- Ted


2013-08-15 21:28:26

by Dave Chinner

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > > >> >> > > cost of the unwritten->written conversion.
> > > >> >> >
> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > > >> >> > this part until writeback?
> > > >> >>
> > > >> >> Part of the work has to be done at write time because we need to
> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > > >> >> problems). The unwritten->written conversion does happen at writeback
> > > >> >> (as does the actual block allocation if we are doing delayed
> > > >> >> allocation).
> > > >> >>
> > > >> >> The point is that if the goal is to measure page fault scalability, we
> > > >> >> shouldn't have this other stuff happening at the same time as the page
> > > >> >> fault workload.
> > > >> >
> > > >> > Sure, but the real problem is not the block mapping or allocation
> > > >> > path - even if the test is changed to take that out of the picture,
> > > >> > we still have timestamp updates being done on every single page
> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > > >> > and have nanosecond granularity, so every page fault is resulting in
> > > >> > a transaction to update the timestamp of the file being modified.
> > > >>
> > > >> I have (unmergeable) patches to fix this:
> > > >>
> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > > >
> > > > The big problem with this approach is that not doing the
> > > > timestamp update on page faults is going to break the inode change
> > > > version counting because for ext4, btrfs and XFS it takes a
> > > > transaction to bump that counter. NFS needs to know the moment a
> > > > file is changed in memory, not when it is written to disk. Also, NFS
> > > > requires the change to the counter to be persistent over server
> > > > failures, so it needs to be changed as part of a transaction....
> > >
> > > I've been running a kernel that has the file_update_time call
> > > commented out for over a year now, and the only problem I've seen is
> > > that the timestamp doesn't get updated :)
> > >
> > > I think I must be misunderstanding you (or vice versa). I'm currently
> >
> > Yup, you are.
> >
> > > redoing the patches, and this time I'll do it for just the mm core and
> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> > > remove the file_update_time call.
> >
> > Right. Where does that end up? All the way down in
> > ext4_mark_iloc_dirty(), and that does:
> >
> >         if (IS_I_VERSION(inode))
> >                 inode_inc_iversion(inode);
> >
> > The XFS transaction code is the same - deep inside it where an inode
> > is marked as dirty in the transaction, it bumps the same counter and
> > adds it to the transaction.
> Yeah, I'd just add that ext4 maintains i_version only if it has been
> mounted with i_version mount option. But then NFS server would depend on
> c/mtime update so it won't help you much - you still should update at least
> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
> exported, you could avoid this relatively expensive dance and defer things
> as Andy suggests.

The problem with "not exported, don't update" is that files can be
modified on server startup (e.g. after a crash) or in short
maintenance periods when the NFS service is down. When the server is
started back up, the change number needs to indicate the file has
been modified so that clients reconnecting to the server see the
change.

IOWs, even if the NFS server is not up or the filesystem not
exported we still need to update change counts whenever a file
changes if we are going to tell the NFS server that we keep them...

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-08-15 21:31:14

by Andy Lutomirski

Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 2:28 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
>> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
>> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
>> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
>> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> > > >> >> > > cost of the unwritten->written conversion.
>> > > >> >> >
>> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> > > >> >> > this part until writeback?
>> > > >> >>
>> > > >> >> Part of the work has to be done at write time because we need to
>> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > > >> >> problems). The unwritten->written conversion does happen at writeback
>> > > >> >> (as does the actual block allocation if we are doing delayed
>> > > >> >> allocation).
>> > > >> >>
>> > > >> >> The point is that if the goal is to measure page fault scalability, we
>> > > >> >> shouldn't have this other stuff happening at the same time as the page
>> > > >> >> fault workload.
>> > > >> >
>> > > >> > Sure, but the real problem is not the block mapping or allocation
>> > > >> > path - even if the test is changed to take that out of the picture,
>> > > >> > we still have timestamp updates being done on every single page
>> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > > >> > and have nanosecond granularity, so every page fault is resulting in
>> > > >> > a transaction to update the timestamp of the file being modified.
>> > > >>
>> > > >> I have (unmergeable) patches to fix this:
>> > > >>
>> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> > > >
>> > > > The big problem with this approach is that not doing the
>> > > > timestamp update on page faults is going to break the inode change
>> > > > version counting because for ext4, btrfs and XFS it takes a
>> > > > transaction to bump that counter. NFS needs to know the moment a
>> > > > file is changed in memory, not when it is written to disk. Also, NFS
>> > > > requires the change to the counter to be persistent over server
>> > > > failures, so it needs to be changed as part of a transaction....
>> > >
>> > > I've been running a kernel that has the file_update_time call
>> > > commented out for over a year now, and the only problem I've seen is
>> > > that the timestamp doesn't get updated :)
>> > >
>> > > I think I must be misunderstanding you (or vice versa). I'm currently
>> >
>> > Yup, you are.
>> >
>> > > redoing the patches, and this time I'll do it for just the mm core and
>> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
>> > > remove the file_update_time call.
>> >
>> > Right. Where does that end up? All the way down in
>> > ext4_mark_iloc_dirty(), and that does:
>> >
>> > if (IS_I_VERSION(inode))
>> > inode_inc_iversion(inode);
>> >
>> > The XFS transaction code is the same - deep inside it where an inode
>> > is marked as dirty in the transaction, it bumps the same counter and
>> > adds it to the transaction.
>> Yeah, I'd just add that ext4 maintains i_version only if it has been
>> mounted with i_version mount option. But then NFS server would depend on
>> c/mtime update so it won't help you much - you still should update at least
>> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
>> exported, you could avoid this relatively expensive dance and defer things
>> as Andy suggests.
>
> The problem with "not exported, don't update" is that files can be
> modified on server startup (e.g. after a crash) or in short
> maintenance periods when the NFS service is down. When the server is
> started back up, the change number needs to indicate the file has
> been modified so that clients reconnecting to the server see the
> change.
>
> IOWs, even if the NFS server is not up or the filesystem not
> exported we still need to update change counts whenever a file
> changes if we are going to tell the NFS server that we keep them...

This will keep working as long as the clients are willing to wait for
writeback (or msync, munmap, or exit) on the server.

--Andy

2013-08-15 21:37:25

by Dave Chinner

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <[email protected]> wrote:
> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> >> >> > > cost of the unwritten->written conversion.
> >> >> >> >
> >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> >> >> > this part until writeback?
> >> >> >>
> >> >> >> Part of the work has to be done at write time because we need to
> >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> >> >> problems). The unwritten->written conversion does happen at writeback
> >> >> >> (as does the actual block allocation if we are doing delayed
> >> >> >> allocation).
> >> >> >>
> >> >> >> The point is that if the goal is to measure page fault scalability, we
> >> >> >> shouldn't have this other stuff happening at the same time as the page
> >> >> >> fault workload.
> >> >> >
> >> >> > Sure, but the real problem is not the block mapping or allocation
> >> >> > path - even if the test is changed to take that out of the picture,
> >> >> > we still have timestamp updates being done on every single page
> >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> >> > and have nanosecond granularity, so every page fault is resulting in
> >> >> > a transaction to update the timestamp of the file being modified.
> >> >>
> >> >> I have (unmergeable) patches to fix this:
> >> >>
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> >
> >> > The big problem with this approach is that not doing the
> >> > timestamp update on page faults is going to break the inode change
> >> > version counting because for ext4, btrfs and XFS it takes a
> >> > transaction to bump that counter. NFS needs to know the moment a
> >> > file is changed in memory, not when it is written to disk. Also, NFS
> >> > requires the change to the counter to be persistent over server
> >> > failures, so it needs to be changed as part of a transaction....
> >>
> >> I've been running a kernel that has the file_update_time call
> >> commented out for over a year now, and the only problem I've seen is
> >> that the timestamp doesn't get updated :)
> >>
>
> [...]
>
> > If a filesystem is providing an i_version value, then NFS uses it to
> > determine whether client side caches are still consistent with the
> > server state. If the filesystem does not provide an i_version, then
> > NFS falls back to checking c/mtime for changes. If files on the
> > server are being modified without either the timestamps or i_version
> > changing, then it's likely that there will be problems with client
> > side cache consistency....
>
> I didn't think of that at all.
>
> If userspace does:
>
> ptr = mmap(...);
> ptr[0] = 1;
> sleep(1);
> ptr[0] = 2;
> sleep(1);
> munmap();
>
> Then current kernels will mark the inode changed on (only) the ptr[0]
> = 1 line. My patches will instead mark the inode changed when munmap
> is called (or after ptr[0] = 2 if writepages gets called for any
> reason).
>
> I'm not sure which is better. POSIX actually requires my behavior
> (which is most irrelevant).

Not by my reading of it. Posix states that c/mtime needs to be
updated between the first access and the next msync() call. We
update mtime on the first access, and so therefore we conform to the
posix requirement....

> My behavior also means that, if an NFS
> client reads and caches the file between the two writes, then it will
> eventually find out that the data is stale.

"eventually" is very different behaviour to the current behaviour.

My understanding is that NFS v4 delegations require the underlying
filesystem to bump the version count on *any* modification made to
the file so that delegations can be recalled appropriately. So not
informing the filesystem that the file data has been changed is
going to cause problems.

> The current behavior, on
> the other hand, means that a single pass of mmapped writes through the
> file will update the times much faster.
>
> I could arrange for the first page fault to *also* update times when
> the FS is exported or if a particular mount option is set. (The ext4
> change to request the new behavior is all of four lines, and it's easy
> to adjust.)

What does "first page fault" mean?

Cheers,

Dave
--
Dave Chinner
[email protected]


2013-08-15 21:39:32

by Dave Chinner

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 02:31:14PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 2:28 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
> >> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> >> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> >> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> >> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > > >> >> > > cost of the unwritten->written conversion.
> >> > > >> >> >
> >> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > > >> >> > this part until writeback?
> >> > > >> >>
> >> > > >> >> Part of the work has to be done at write time because we need to
> >> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> > > >> >> problems). The unwritten->written conversion does happen at writeback
> >> > > >> >> (as does the actual block allocation if we are doing delayed
> >> > > >> >> allocation).
> >> > > >> >>
> >> > > >> >> The point is that if the goal is to measure page fault scalability, we
> >> > > >> >> shouldn't have this other stuff happening at the same time as the page
> >> > > >> >> fault workload.
> >> > > >> >
> >> > > >> > Sure, but the real problem is not the block mapping or allocation
> >> > > >> > path - even if the test is changed to take that out of the picture,
> >> > > >> > we still have timestamp updates being done on every single page
> >> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > > >> > and have nanosecond granularity, so every page fault is resulting in
> >> > > >> > a transaction to update the timestamp of the file being modified.
> >> > > >>
> >> > > >> I have (unmergeable) patches to fix this:
> >> > > >>
> >> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> > > >
> >> > > > The big problem with this approach is that not doing the
> >> > > > timestamp update on page faults is going to break the inode change
> >> > > > version counting because for ext4, btrfs and XFS it takes a
> >> > > > transaction to bump that counter. NFS needs to know the moment a
> >> > > > file is changed in memory, not when it is written to disk. Also, NFS
> >> > > > requires the change to the counter to be persistent over server
> >> > > > failures, so it needs to be changed as part of a transaction....
> >> > >
> >> > > I've been running a kernel that has the file_update_time call
> >> > > commented out for over a year now, and the only problem I've seen is
> >> > > that the timestamp doesn't get updated :)
> >> > >
> >> > > I think I must be misunderstanding you (or vice versa). I'm currently
> >> >
> >> > Yup, you are.
> >> >
> >> > > redoing the patches, and this time I'll do it for just the mm core and
> >> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> >> > > remove the file_update_time call.
> >> >
> >> > Right. Where does that end up? All the way down in
> >> > ext4_mark_iloc_dirty(), and that does:
> >> >
> >> > if (IS_I_VERSION(inode))
> >> > inode_inc_iversion(inode);
> >> >
> >> > The XFS transaction code is the same - deep inside it where an inode
> >> > is marked as dirty in the transaction, it bumps the same counter and
> >> > adds it to the transaction.
> >> Yeah, I'd just add that ext4 maintains i_version only if it has been
> >> mounted with i_version mount option. But then NFS server would depend on
> >> c/mtime update so it won't help you much - you still should update at least
> >> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
> >> exported, you could avoid this relatively expensive dance and defer things
> >> as Andy suggests.
> >
> > The problem with "not exported, don't update" is that files can be
> > modified on server startup (e.g. after a crash) or in short
> > maintenance periods when the NFS service is down. When the server is
> > started back up, the change number needs to indicate the file has
> > been modified so that clients reconnecting to the server see the
> > change.
> >
> > IOWs, even if the NFS server is not up or the filesystem not
> > exported we still need to update change counts whenever a file
> > changes if we are going to tell the NFS server that we keep them...
>
> This will keep working as long as the clients are willing to wait for
> writeback (or msync, munmap, or exit) on the server.

I don't follow you - what will keep working? If we don't record
changes while the filesystem is not exported, then NFS clients can't
determine if files have changed while the server was down for a
period....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-08-15 21:43:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> I didn't think of that at all.
>>
>> If userspace does:
>>
>> ptr = mmap(...);
>> ptr[0] = 1;
>> sleep(1);
>> ptr[0] = 2;
>> sleep(1);
>> munmap();
>>
>> Then current kernels will mark the inode changed on (only) the ptr[0]
>> = 1 line. My patches will instead mark the inode changed when munmap
>> is called (or after ptr[0] = 2 if writepages gets called for any
>> reason).
>>
>> I'm not sure which is better. POSIX actually requires my behavior
>> (which is most irrelevant).
>
> Not by my reading of it. Posix states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> posix requirement....

It says "between a write reference to the mapped region and the next
call to msync()." Most write references don't cause page faults.

>
>> My behavior also means that, if an NFS
>> client reads and caches the file between the two writes, then it will
>> eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately. So not
> informing the filesystem that the file data has been changed is
> going to cause problems.

We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification. See
below...

>
>> The current behavior, on
>> the other hand, means that a single pass of mmapped writes through the
>> file will update the times much faster.
>>
>> I could arrange for the first page fault to *also* update times when
>> the FS is exported or if a particular mount option is set. (The ext4
>> change to request the new behavior is all of four lines, and it's easy
>> to adjust.)
>
> What does "first page fault" mean?

The first write to the page triggers a page fault and marks the page
writable. The second write to the page (assuming no writeback happens
in the mean time) does not trigger a page fault or notify the kernel
in any way.
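
To make that concrete, here's a minimal sketch of the pattern (untested,
error handling omitted; assume "somefile" already exists and is at least
a page long):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("somefile", O_RDWR);
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);

    p[0] = 1;   /* write fault: page_mkwrite runs, the pte is made writable */
    p[0] = 2;   /* no fault: the kernel is not notified at all */

    munmap(p, 4096);   /* pte dirty bit is harvested here (or at writeback) */
    close(fd);
    return 0;
}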


In current kernels, this chain of events won't work:

- Server goes down
- Server comes up
- Userspace on server calls mmap and writes something
- Client reconnects and invalidates its cache
- Userspace on server writes something else *to the same page*

The client will never notice the second write, because it won't update
any inode state. With my patches, the client will as soon as the
server starts writeback.

So I think that there are cases where my changes make things better
and cases where they make things worse.

--Andy


2013-08-15 22:18:07

by Dave Chinner

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
> <[email protected]> wrote:
> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> My behavior also means that, if an NFS
> >> client reads and caches the file between the two writes, then it will
> >> eventually find out that the data is stale.
> >
> > "eventually" is very different behaviour to the current behaviour.
> >
> > My understanding is that NFS v4 delegations require the underlying
> > filesystem to bump the version count on *any* modification made to
> > the file so that delegations can be recalled appropriately. So not
> > informing the filesystem that the file data has been changed is
> > going to cause problems.
>
> We don't do that right now (and we can't without utterly destroying
> performance) because we don't trap on every modification. See
> below...

We don't trap every mmap modification. We trap every modification
that the filesystem is informed about. That includes a c/mtime
update on every write page fault. It's as fine grained as we can get
without introducing serious performance killing overhead.

And nobody has made any compelling argument that what we do now is
problematic - all we've got is a microbenchmark that doesn't quite scale
linearly because filesystem updates through a global filesystem
structure (the journal) don't scale linearly.

> >> The current behavior, on
> >> the other hand, means that a single pass of mmapped writes through the
> >> file will update the times much faster.
> >>
> >> I could arrange for the first page fault to *also* update times when
> >> the FS is exported or if a particular mount option is set. (The ext4
> >> change to request the new behavior is all of four lines, and it's easy
> >> to adjust.)
> >
> > What does "first page fault" mean?
>
> The first write to the page triggers a page fault and marks the page
> writable. The second write to the page (assuming no writeback happens
> in the mean time) does not trigger a page fault or notify the kernel
> in any way.

IIUC, you are saying that you'll maintain the current behaviour
(i.e. clean->dirty does a timestamp update) if the filesystem
requires it? So the default behaviour of any filesystem that
supports NFSv4 is going to behave as it does now?

If that's the case, why bother changing anything as nfsv4 is the
default version that the kernel uses? (I'm playing devil's advocate
here).

> In current kernels, this chain of events won't work:
>
> - Server goes down
> - Server comes up
> - Userspace on server calls mmap and writes something
> - Client reconnects and invalidates its cache
> - Userspace on server writes something else *to the same page*
>
> The client will never notice the second write, because it won't update
> any inode state.

That's wrong. The server wrote the dirty page before the client
reconnected, therefore it got marked clean. The second write to the
server page marks it dirty again, causing page_mkwrite to be
called, thereby updating the timestamp/i_version field. So, the NFS
client will notice the second change on the server, and it will
notice it immediately after the second access has occurred, not some
time later when:

> With my patches, the client will as soon as the
> server starts writeback.

Your patches introduce a 30+ second window where a file can be dirty
on the server but the NFS server doesn't know about it and can't
tell the clients about it because i_version doesn't get bumped until
writeback.....

> So I think that there are cases where my changes make things better
> and cases where they make things worse.

Right, and the issue is that there are important use cases that we
have to support in default configurations, and for those it makes
things worse.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2013-08-15 22:26:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> <[email protected]> wrote:
>> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> >> My behavior also means that, if an NFS
>> >> client reads and caches the file between the two writes, then it will
>> >> eventually find out that the data is stale.
>> >
>> > "eventually" is very different behaviour to the current behaviour.
>> >
>> > My understanding is that NFS v4 delegations require the underlying
>> > filesystem to bump the version count on *any* modification made to
>> > the file so that delegations can be recalled appropriately. So not
>> > informing the filesystem that the file data has been changed is
>> > going to cause problems.
>>
>> We don't do that right now (and we can't without utterly destroying
>> performance) because we don't trap on every modification. See
>> below...
>
> We don't trap every mmap modification. We trap every modification
> that the filesystem is informed about. That includes a c/mtime
> update on every write page fault. It's as fine grained as we can get
> without introducing serious performance killing overhead.
>
> And nobody has made any compelling argument that what we do now is
> problematic - all we've got is a microbenchmark that doesn't quite scale
> linearly because filesystem updates through a global filesystem
> structure (the journal) don't scale linearly.

I don't personally care about scaling. I care about sleeping in write
faults, and starting journal transactions sleeps, and this is an
absolute show-stopper for me. (It's a real-time latency problem, not
a throughput or scalability thing.)

>
>> >> The current behavior, on
>> >> the other hand, means that a single pass of mmapped writes through the
>> >> file will update the times much faster.
>> >>
>> >> I could arrange for the first page fault to *also* update times when
>> >> the FS is exported or if a particular mount option is set. (The ext4
>> >> change to request the new behavior is all of four lines, and it's easy
>> >> to adjust.)
>> >
>> > What does "first page fault" mean?
>>
>> The first write to the page triggers a page fault and marks the page
>> writable. The second write to the page (assuming no writeback happens
>> in the mean time) does not trigger a page fault or notify the kernel
>> in any way.
>
> IIUC, you are saying that you'll maintain the current behaviour
> (i.e. clean->dirty does a timestamp update) if the filesystem
> requires it? So the default behaviour of any filesystem that
> supports NFSv4 is going to behave as it does now?
>
> If that's the case, why bother changing anything as nfsv4 is the
> default version that the kernel uses? (I'm playing devil's advocate
> here).

Because the performance sucks right now. I'd like to fix it without
breaking things, and I think I can fix it while actually improving the
semantics.

>
>> In current kernels, this chain of events won't work:
>>
>> - Server goes down
>> - Server comes up
>> - Userspace on server calls mmap and writes something
>> - Client reconnects and invalidates its cache
>> - Userspace on server writes something else *to the same page*
>>
>> The client will never notice the second write, because it won't update
>> any inode state.
>
> That's wrong. The server wrote the dirty page before the client
> reconnected, therefore it got marked clean.

Why would it write the dirty page? Is the client's NFSv4 request
forcing the server to scan for dirty ptes or pages? If so, can you
point me to that code? I can probably make it work deterministically.

> The second write to the
> server page marks it dirty again, causing page_mkwrite to be
> called, thereby updating the timestamp/i_version field. So, the NFS
> client will notice the second change on the server, and it will
> notice it immediately after the second access has occurred, not some
> time later when:
>
>> With my patches, the client will as soon as the
>> server starts writeback.
>
> Your patches introduce a 30+ second window where a file can be dirty
> on the server but the NFS server doesn't know about it and can't
> tell the clients about it because i_version doesn't get bumped until
> writeback.....

I claim that there's an infinite window right now, and that 30 seconds
is therefore an improvement.

--Andy

2013-08-16 00:14:35

by Dave Chinner

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
> >> <[email protected]> wrote:
> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> >> My behavior also means that, if an NFS
> >> >> client reads and caches the file between the two writes, then it will
> >> >> eventually find out that the data is stale.
> >> >
> >> > "eventually" is very different behaviour to the current behaviour.
> >> >
> >> > My understanding is that NFS v4 delegations require the underlying
> >> > filesystem to bump the version count on *any* modification made to
> >> > the file so that delegations can be recalled appropriately. So not
> >> > informing the filesystem that the file data has been changed is
> >> > going to cause problems.
> >>
> >> We don't do that right now (and we can't without utterly destroying
> >> performance) because we don't trap on every modification. See
> >> below...
> >
> > We don't trap every mmap modification. We trap every modification
> > that the filesystem is informed about. That includes a c/mtime
> > update on every write page fault. It's as fine grained as we can get
> > without introducing serious performance killing overhead.
> >
> > And nobody has made any compelling argument that what we do now is
> > problematic - all we've got is a microbenchmark that doesn't quite scale
> > linearly because filesystem updates through a global filesystem
> > structure (the journal) don't scale linearly.
>
> I don't personally care about scaling. I care about sleeping in write
> faults, and starting journal transactions sleeps, and this is an
> absolute show-stopper for me. (It's a real-time latency problem, not
> a throughput or scalability thing.)

Different problem, then. And one that does actually have a solution
that is already implemented but not exposed to userspace -
O_NOCMTIME. i.e. we actually support turning off c/mtime updates on
a per file basis - the XFS open-by-handle interface sets this flag
by default on files opened that way.....

Expose that to open/fcntl and your problem is solved without
impacting anyone else or default behaviours of filesystems.
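
i.e. something like this from the application side (sketch only -
O_NOATIME is the existing analogue that works today; the O_NOCMTIME
part is hypothetical because that flag is currently kernel-internal):

#define _GNU_SOURCE             /* for O_NOATIME */
#include <fcntl.h>

/*
 * O_NOATIME already suppresses atime updates on this fd. The proposal
 * is an analogous O_NOCMTIME for c/mtime; today that flag is only set
 * internally (e.g. by the XFS open-by-handle interface), so the
 * commented-out flag below just shows the shape of the interface.
 */
int open_latency_sensitive(const char *path)
{
    return open(path, O_RDWR | O_NOATIME /* | O_NOCMTIME */);
}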

> >> In current kernels, this chain of events won't work:
> >>
> >> - Server goes down
> >> - Server comes up
> >> - Userspace on server calls mmap and writes something
> >> - Client reconnects and invalidates its cache
> >> - Userspace on server writes something else *to the same page*
> >>
> >> The client will never notice the second write, because it won't update
> >> any inode state.
> >
> > That's wrong. The server wrote the dirty page before the client
> > reconnected, therefore it got marked clean.
>
> Why would it write the dirty page?

Terminology mismatch - you said it "writes something", not "dirties
the page". So, it's easy to take that as "does writeback" as opposed
to "dirties memory".

As to what would write it? Memory pressure, a user running sync,
ENOSPC conditions, all sorts of things that you can't control. You
cannot rely on writeback only happening periodically and therefore
being predictable and deterministic.

> > The second write to the
> > server page marks it dirty again, causing page_mkwrite to be
> > called, thereby updating the timestamp/i_version field. So, the NFS
> > client will notice the second change on the server, and it will
> > notice it immediately after the second access has occurred, not some
> > time later when:
> >
> >> With my patches, the client will as soon as the
> >> server starts writeback.
> >
> > Your patches introduce a 30+ second window where a file can be dirty
> > on the server but the NFS server doesn't know about it and can't
> > tell the clients about it because i_version doesn't get bumped until
> > writeback.....
>
> I claim that there's an infinite window right now, and that 30 seconds
> is therefore an improvement.

You're talking about after the second change is made. I'm talking
about the difference in behaviour after the *initial change* is
made. Your changes will result in the client not doing an
invalidation because timestamps don't get changed for 30s with your
patches. That's the problem - the first change of a file needs to
bump the i_version immediately, not in 30s time.

That's why delaying timestamp updates doesn't fix the scalability
problem that was reported. It might fix a different problem, but it
doesn't void the *requirement* that filesystems need to do
transactional updates during page faults....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-08-16 00:21:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 5:14 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <[email protected]> wrote:
>> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> >> <[email protected]> wrote:
>> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>
>> >> In current kernels, this chain of events won't work:
>> >>
>> >> - Server goes down
>> >> - Server comes up
>> >> - Userspace on server calls mmap and writes something
>> >> - Client reconnects and invalidates its cache
>> >> - Userspace on server writes something else *to the same page*
>> >>
>> >> The client will never notice the second write, because it won't update
>> >> any inode state.
>> >
>> > That's wrong. The server wrote the dirty page before the client
>> > reconnected, therefore it got marked clean.
>>
>> Why would it write the dirty page?
>
> Terminology mismatch - you said it "writes something", not "dirties
> the page". So, it's easy to take that as "does writeback" as opposed
> to "dirties memory".

When I say "writes something" I mean literally performs a store to
memory. That is:

ptr[offset] = value;

In my example, the client will *never* catch up.

>
>> > The second write to the
>> > server page marks it dirty again, causing page_mkwrite to be
>> > called, thereby updating the timestamp/i_version field. So, the NFS
>> > client will notice the second change on the server, and it will
>> > notice it immediately after the second access has occurred, not some
>> > time later when:
>> >
>> >> With my patches, the client will as soon as the
>> >> server starts writeback.
>> >
>> > Your patches introduce a 30+ second window where a file can be dirty
>> > on the server but the NFS server doesn't know about it and can't
>> > tell the clients about it because i_version doesn't get bumped until
>> > writeback.....
>>
>> I claim that there's an infinite window right now, and that 30 seconds
>> is therefore an improvement.
>
> You're talking about after the second change is made. I'm talking
> about the difference in behaviour after the *initial change* is
> made. Your changes will result in the client not doing an
> invalidation because timestamps don't get changed for 30s with your
> patches. That's the problem - the first change of a file needs to
> bump the i_version immediately, not in 30s time.
>
> That's why delaying timestamp updates doesn't fix the scalability
> problem that was reported. It might fix a different problem, but it
> doesn't void the *requirement* that filesystems need to do
> transactional updates during page faults....
>

And this is why I'm unconvinced that your requirement is sensible.
It's attempting to make sure that every mmapped write results in some
kind of FS update, but it actually only results in an FS update
*before* the *first* mmapped write after writeback. It's racy as
hell.

My approach is slow but not racy.

--Andy

2013-08-16 22:02:04

by J. Bruce Fields

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <[email protected]> wrote:
> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> >> >> > > cost of the unwritten->written conversion.
> > >> >> >> >
> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> >> >> > this part until writeback?
> > >> >> >>
> > >> >> >> Part of the work has to be done at write time because we need to
> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> >> >> problems). The unwritten->written conversion does happen at writeback
> > >> >> >> (as does the actual block allocation if we are doing delayed
> > >> >> >> allocation).
> > >> >> >>
> > >> >> >> The point is that if the goal is to measure page fault scalability, we
> > >> >> >> shouldn't have this other stuff happening at the same time as the page
> > >> >> >> fault workload.
> > >> >> >
> > >> >> > Sure, but the real problem is not the block mapping or allocation
> > >> >> > path - even if the test is changed to take that out of the picture,
> > >> >> > we still have timestamp updates being done on every single page
> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > >> >> > and have nanosecond granularity, so every page fault is resulting in
> > >> >> > a transaction to update the timestamp of the file being modified.
> > >> >>
> > >> >> I have (unmergeable) patches to fix this:
> > >> >>
> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > >> >
> > >> > The big problem with this approach is that not doing the
> > >> > timestamp update on page faults is going to break the inode change
> > >> > version counting because for ext4, btrfs and XFS it takes a
> > >> > transaction to bump that counter. NFS needs to know the moment a
> > >> > file is changed in memory, not when it is written to disk. Also, NFS
> > >> > requires the change to the counter to be persistent over server
> > >> > failures, so it needs to be changed as part of a transaction....
> > >>
> > >> I've been running a kernel that has the file_update_time call
> > >> commented out for over a year now, and the only problem I've seen is
> > >> that the timestamp doesn't get updated :)
> > >>
> >
> > [...]
> >
> > > If a filesystem is providing an i_version value, then NFS uses it to
> > > determine whether client side caches are still consistent with the
> > > server state. If the filesystem does not provide an i_version, then
> > > NFS falls back to checking c/mtime for changes. If files on the
> > > server are being modified without either the timestamps or i_version
> > > changing, then it's likely that there will be problems with client
> > > side cache consistency....
> >
> > I didn't think of that at all.
> >
> > If userspace does:
> >
> > ptr = mmap(...);
> > ptr[0] = 1;
> > sleep(1);
> > ptr[0] = 2;
> > sleep(1);
> > munmap();
> >
> > Then current kernels will mark the inode changed on (only) the ptr[0]
> > = 1 line. My patches will instead mark the inode changed when munmap
> > is called (or after ptr[0] = 2 if writepages gets called for any
> > reason).
> >
> > I'm not sure which is better. POSIX actually requires my behavior
> > (which is most irrelevant).
>
> Not by my reading of it. Posix states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> posix requirement....
>
> > My behavior also means that, if an NFS
> > client reads and caches the file between the two writes, then it will
> > eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately.

Delegations at least shouldn't be an issue here: they're recalled on the
open.

--b.

> So not
> informing the filesystem that the file data has been changed is
> going to cause problems.


2013-08-16 23:18:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Fri, Aug 16, 2013 at 3:02 PM, J. Bruce Fields <[email protected]> wrote:
> On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
>> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <[email protected]> wrote:
>> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
>> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
>> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> > >> >> >> > > cost of the unwritten->written conversion.
>> > >> >> >> >
>> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> > >> >> >> > this part until writeback?
>> > >> >> >>
>> > >> >> >> Part of the work has to be done at write time because we need to
>> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > >> >> >> problems). The unwritten->written conversion does happen at writeback
>> > >> >> >> (as does the actual block allocation if we are doing delayed
>> > >> >> >> allocation).
>> > >> >> >>
>> > >> >> >> The point is that if the goal is to measure page fault scalability, we
>> > >> >> >> shouldn't have this other stuff happening at the same time as the page
>> > >> >> >> fault workload.
>> > >> >> >
>> > >> >> > Sure, but the real problem is not the block mapping or allocation
>> > >> >> > path - even if the test is changed to take that out of the picture,
>> > >> >> > we still have timestamp updates being done on every single page
>> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > >> >> > and have nanosecond granularity, so every page fault is resulting in
>> > >> >> > a transaction to update the timestamp of the file being modified.
>> > >> >>
>> > >> >> I have (unmergeable) patches to fix this:
>> > >> >>
>> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> > >> >
>> > >> > The big problem with this approach is that not doing the
>> > >> > timestamp update on page faults is going to break the inode change
>> > >> > version counting because for ext4, btrfs and XFS it takes a
>> > >> > transaction to bump that counter. NFS needs to know the moment a
>> > >> > file is changed in memory, not when it is written to disk. Also, NFS
>> > >> > requires the change to the counter to be persistent over server
>> > >> > failures, so it needs to be changed as part of a transaction....
>> > >>
>> > >> I've been running a kernel that has the file_update_time call
>> > >> commented out for over a year now, and the only problem I've seen is
>> > >> that the timestamp doesn't get updated :)
>> > >>
>> >
>> > [...]
>> >
>> > > If a filesystem is providing an i_version value, then NFS uses it to
>> > > determine whether client side caches are still consistent with the
>> > > server state. If the filesystem does not provide an i_version, then
>> > > NFS falls back to checking c/mtime for changes. If files on the
>> > > server are being modified without either the timestamps or i_version
>> > > changing, then it's likely that there will be problems with client
>> > > side cache consistency....
>> >
>> > I didn't think of that at all.
>> >
>> > If userspace does:
>> >
>> > ptr = mmap(...);
>> > ptr[0] = 1;
>> > sleep(1);
>> > ptr[0] = 2;
>> > sleep(1);
>> > munmap();
>> >
>> > Then current kernels will mark the inode changed on (only) the ptr[0]
>> > = 1 line. My patches will instead mark the inode changed when munmap
>> > is called (or after ptr[0] = 2 if writepages gets called for any
>> > reason).
>> >
>> > I'm not sure which is better. POSIX actually requires my behavior
>> > (which is most irrelevant).
>>
>> Not by my reading of it. Posix states that c/mtime needs to be
>> updated between the first access and the next msync() call. We
>> update mtime on the first access, and so therefore we conform to the
>> posix requirement....
>>
>> > My behavior also means that, if an NFS
>> > client reads and caches the file between the two writes, then it will
>> > eventually find out that the data is stale.
>>
>> "eventually" is very different behaviour to the current behaviour.
>>
>> My understanding is that NFS v4 delegations require the underlying
>> filesystem to bump the version count on *any* modification made to
>> the file so that delegations can be recalled appropriately.
>
> Delegations at least shouldn't be an issue here: they're recalled on the
> open.

Can you translate that into clueless-non-NFS-expert? :)

Anyway, I'm sending patches in a sec. Dave (Hansen), want to test? I
played with will-it-scale a bit, but I don't really know what I'm
doing.

--Andy


2013-08-18 20:17:06

by J. Bruce Fields

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Fri, Aug 16, 2013 at 04:18:33PM -0700, Andy Lutomirski wrote:
> On Fri, Aug 16, 2013 at 3:02 PM, J. Bruce Fields <[email protected]> wrote:
> > On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
> >> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <[email protected]> wrote:
> >> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <[email protected]> wrote:
> >> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> >> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > >> >> >> > > cost of the unwritten->written conversion.
> >> > >> >> >> >
> >> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > >> >> >> > this part until writeback?
> >> > >> >> >>
> >> > >> >> >> Part of the work has to be done at write time because we need to
> >> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> > >> >> >> problems). The unwritten->written conversion does happen at writeback
> >> > >> >> >> (as does the actual block allocation if we are doing delayed
> >> > >> >> >> allocation).
> >> > >> >> >>
> >> > >> >> >> The point is that if the goal is to measure page fault scalability, we
> >> > >> >> >> shouldn't have this other stuff happening at the same time as the page
> >> > >> >> >> fault workload.
> >> > >> >> >
> >> > >> >> > Sure, but the real problem is not the block mapping or allocation
> >> > >> >> > path - even if the test is changed to take that out of the picture,
> >> > >> >> > we still have timestamp updates being done on every single page
> >> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > >> >> > and have nanosecond granularity, so every page fault is resulting in
> >> > >> >> > a transaction to update the timestamp of the file being modified.
> >> > >> >>
> >> > >> >> I have (unmergeable) patches to fix this:
> >> > >> >>
> >> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> > >> >
> >> > >> > The big problem with this approach is that not doing the
> >> > >> > timestamp update on page faults is going to break the inode change
> >> > >> > version counting because for ext4, btrfs and XFS it takes a
> >> > >> > transaction to bump that counter. NFS needs to know the moment a
> >> > >> > file is changed in memory, not when it is written to disk. Also, NFS
> >> > >> > requires the change to the counter to be persistent over server
> >> > >> > failures, so it needs to be changed as part of a transaction....
> >> > >>
> >> > >> I've been running a kernel that has the file_update_time call
> >> > >> commented out for over a year now, and the only problem I've seen is
> >> > >> that the timestamp doesn't get updated :)
> >> > >>
> >> >
> >> > [...]
> >> >
> >> > > If a filesystem is providing an i_version value, then NFS uses it to
> >> > > determine whether client side caches are still consistent with the
> >> > > server state. If the filesystem does not provide an i_version, then
> >> > > NFS falls back to checking c/mtime for changes. If files on the
> >> > > server are being modified without either the timestamps or i_version
> >> > > changing, then it's likely that there will be problems with client
> >> > > side cache consistency....
> >> >
> >> > I didn't think of that at all.
> >> >
> >> > If userspace does:
> >> >
> >> > ptr = mmap(...);
> >> > ptr[0] = 1;
> >> > sleep(1);
> >> > ptr[0] = 2;
> >> > sleep(1);
> >> > munmap();
> >> >
> >> > Then current kernels will mark the inode changed on (only) the ptr[0]
> >> > = 1 line. My patches will instead mark the inode changed when munmap
> >> > is called (or after ptr[0] = 2 if writepages gets called for any
> >> > reason).
> >> >
> >> > I'm not sure which is better. POSIX actually requires my behavior
> >> > (which is most irrelevant).
> >>
> >> Not by my reading of it. Posix states that c/mtime needs to be
> >> updated between the first access and the next msync() call. We
> >> update mtime on the first access, and so therefore we conform to the
> >> posix requirement....
> >>
> >> > My behavior also means that, if an NFS
> >> > client reads and caches the file between the two writes, then it will
> >> > eventually find out that the data is stale.
> >>
> >> "eventually" is very different behaviour to the current behaviour.
> >>
> >> My understanding is that NFS v4 delegations require the underlying
> >> filesystem to bump the version count on *any* modification made to
> >> the file so that delegations can be recalled appropriately.
> >
> > Delegations at least shouldn't be an issue here: they're recalled on the
> > open.
>
> Can you translate that into clueless-non-NFS-expert? :)

An NFS "delegation" is roughly the same thing as what's called a "lease"
by the linux vfs or an "OpLock" in SMB. It's a lock that is recalled
from the holder on certain conflicting operations. (Basically a way to
tell a client "you're the only one using this file, feel free to cache
it until I tell you otherwise".)
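
The local analogue is the lease API - roughly this, as a sketch
(untested; the path is made up):

#define _GNU_SOURCE             /* for F_SETLEASE */
#include <fcntl.h>
#include <unistd.h>

/*
 * Take a read lease: if another process later opens the file in a
 * conflicting mode, we get SIGIO (by default) and have to give the
 * lease back within the lease-break time - much like a delegation
 * recall.
 */
int take_read_lease(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* release later with fcntl(fd, F_SETLEASE, F_UNLCK) */
}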

Delegations are recalled on conflicting opens, so by the time you get to
IO there shouldn't be any. I don't think they're really relevant to
this discussion.

--b.

>
> Anyway, I'm sending patches in a sec. Dave (Hansen), want to test? I
> played with will-it-scale a bit, but I don't really know what I'm
> doing.
>
> --Andy


2013-08-19 22:17:16

by J. Bruce Fields

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Thu, Aug 15, 2013 at 04:01:49PM +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
> > > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> > > cost of the unwritten->written conversion.
> > >> >
> > >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> > this part until writeback?
> > >>
> > >> Part of the work has to be done at write time because we need to
> > >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> problems). The unwritten->written conversion does happen at writeback
> > >> (as does the actual block allocation if we are doing delayed
> > >> allocation).
> > >>
> > >> The point is that if the goal is to measure page fault scalability, we
> > >> shouldn't have this other stuff happening at the same time as the page
> > >> fault workload.
> > >
> > > Sure, but the real problem is not the block mapping or allocation
> > > path - even if the test is changed to take that out of the picture,
> > > we still have timestamp updates being done on every single page
> > > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > > and have nanosecond granularity, so every page fault is resulting in
> > > a transaction to update the timestamp of the file being modified.
> >
> > I have (unmergeable) patches to fix this:
> >
> > http://comments.gmane.org/gmane.linux.kernel.mm/92476
>
> The big problem with this approach is that not doing the
> timestamp update on page faults is going to break the inode change
> version counting because for ext4, btrfs and XFS it takes a
> transaction to bump that counter. NFS needs to know the moment a
> file is changed in memory, not when it is written to disk.

I don't think the in-memory updates of the data and the version have to
be completely atomic, if that's what you mean.

> Also, NFS
> requires the change to the counter to be persistent over server
> failures, so it needs to be changed as part of a transaction....

I'm not sure those two updates have to be a single atomic transaction on
disk, either.

(Though the reboot cases are more complicated, I may not have thought it
through.)

(By the way, I wonder what happens if we reuse a change attribute value
after a crash? There's probably a (hard to hit) bug there.)

--b.

>
> IOWs, fixing the "filesystems need a transaction on each page_mkwrite
> call" problem isn't as simple as changing how timestamps are
> updated.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]


2013-08-19 22:29:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Mon, Aug 19, 2013 at 3:17 PM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 04:01:49PM +1000, Dave Chinner wrote:
>> On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <[email protected]> wrote:
>> > > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > >> > > It would be better to write zeros to it, so we aren't measuring the
>> > >> > > cost of the unwritten->written conversion.
>> > >> >
>> > >> > At the risk of beating a dead horse, how hard would it be to defer
>> > >> > this part until writeback?
>> > >>
>> > >> Part of the work has to be done at write time because we need to
>> > >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > >> problems). The unwritten->written conversion does happen at writeback
>> > >> (as does the actual block allocation if we are doing delayed
>> > >> allocation).
>> > >>
>> > >> The point is that if the goal is to measure page fault scalability, we
>> > >> shouldn't have this other stuff happening at the same time as the page
>> > >> fault workload.
>> > >
>> > > Sure, but the real problem is not the block mapping or allocation
>> > > path - even if the test is changed to take that out of the picture,
>> > > we still have timestamp updates being done on every single page
>> > > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > > and have nanosecond granularity, so every page fault is resulting in
>> > > a transaction to update the timestamp of the file being modified.
>> >
>> > I have (unmergeable) patches to fix this:
>> >
>> > http://comments.gmane.org/gmane.linux.kernel.mm/92476
>>
>> The big problem with this approach is that not doing the
>> timestamp update on page faults is going to break the inode change
>> version counting because for ext4, btrfs and XFS it takes a
>> transaction to bump that counter. NFS needs to know the moment a
>> file is changed in memory, not when it is written to disk.
>
> I don't think the in-memory updates of the data and the version have to
> be completely atomic, if that's what you mean.
>
>> Also, NFS
>> requires the change to the counter to be persistent over server
>> failures, so it needs to be changed as part of a transaction....
>
> I'm not sure those two updates have to be a single atomic transaction on
> disk, either.
>

I hope not, because they aren't currently in the same transaction, and
putting them in the same transaction would require starting a transaction on
page fault and doing the equivalent of writepages when the same
transaction is committed.

With my changes [1], they still aren't, but putting them in the same
transaction would probably be only a couple lines of code, and it
would actually improve performance. (I won't write those couple lines
of code because I don't know anything at all about jbd2.)

[1] https://lkml.org/lkml/2013/8/16/510

--Andy


2013-08-19 23:23:23

by David Lang

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Fri, 16 Aug 2013, Dave Chinner wrote:

> The problem with "not exported, don't update" is that files can be
> modified on server startup (e.g. after a crash) or in short
> maintenance periods when the NFS service is down. When the server is
> started back up, the change number needs to indicate the file has
> been modified so that clients reconnecting to the server see the
> change.
>
> IOWs, even if the NFS server is not up or the filesystem not
> exported we still need to update change counts whenever a file
> changes if we are going to tell the NFS server that we keep them...

This sounds like you need something more like relctime rather than noctime,
something that updates the time in RAM, but doesn't insist on flushing it to
disk immediately, updating when convenient or when the file is closed.

David Lang


2013-08-19 23:31:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: page fault scalability (ext3, ext4, xfs)

On Mon, Aug 19, 2013 at 4:23 PM, David Lang <[email protected]> wrote:
> On Fri, 16 Aug 2013, Dave Chinner wrote:
>
>> The problem with "not exported, don't update" is that files can be
>> modified on server startup (e.g. after a crash) or in short
>> maintenance periods when the NFS service is down. When the server is
>> started back up, the change number needs to indicate the file has
>> been modified so that clients reconnecting to the server see the
>> change.
>>
>> IOWs, even if the NFS server is not up or the filesystem not
>> exported we still need to update change counts whenever a file
>> changes if we are going to tell the NFS server that we keep them...
>
>
> This sounds like you need something more like relctime rather than noctime,
> something that updates the time in RAM, but doesn't insist on flushing it to
> disk immediately, updating when convenient or when the file is closed.
>
> David Lang

I guess my patches could be extended to do this. In their current
form, when a pte dirty bit is transferred to a page (via page_mkclean
or unmap), the address_space is marked as needing a cmtime update. I
could add a mode in which even the normal write syscall path sets that
bit instead of immediately updating the timestamp. This could be a
nice speedup to non-mmap writers.
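
Very roughly, the idea looks like this (sketch only - the names here are
illustrative, not what the actual patches use):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/bitops.h>

/* hypothetical addition to enum mapping_flags in pagemap.h */
#define AS_CMTIME_PENDING (__GFP_BITS_SHIFT + 6)

/*
 * Called when a pte dirty bit is transferred to the page
 * (page_mkclean/unmap) - or, in the extended mode, from the plain
 * write() path - instead of updating the timestamp immediately.
 */
static void note_cmtime_pending(struct address_space *mapping)
{
    set_bit(AS_CMTIME_PENDING, &mapping->flags);
}

/*
 * Called where a struct file is available (msync, munmap, fsync) to
 * flush the deferred c/mtime (and i_version) update.
 */
static void flush_cmtime(struct file *file)
{
    if (test_and_clear_bit(AS_CMTIME_PENDING, &file->f_mapping->flags))
        file_update_time(file);
}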

To avoid breaking things, things like fsync would need to force a
cmtime flush -- I doubt it would be okay for write; fsync; write;
fsync to leave the timestamp matching the first write.

I'd rather get comments on the current form of my patches and maybe
get them merged before looking at even more far-reaching extensions,
though.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC