2012-05-15 22:48:05

by Jan Kara

Subject: Hole punching and mmap races

Hello,

Hugh pointed me to ext4 hole punching code which is clearly missing some
locking. But looking at the code more deeply I realized I don't see
anything preventing the following race in XFS or ext4:

TASK1                                   TASK2
punch_hole(file, 0, 4096)
  filemap_write_and_wait()
  truncate_pagecache_range()
                                        addr = mmap(file);
                                        addr[0] = 1
                                          ^^ writeably fault a page
  remove file blocks

                FLUSHER
                write out file
                  ^^ interesting things can happen because we expect
                     blocks under the first page to be allocated /
                     reserved but they are not...

I'm pretty sure ext4 has this problem, I'm not completely sure whether
XFS has something to protect against such race but I don't see anything.

It's not easy to protect against these races. For truncate, i_size protects
us against similar races but for hole punching we don't have any such
mechanism. One way to avoid the race would be to hold mmap_sem while we are
invalidating the page cache and punching hole but that sounds a bit ugly.
Alternatively we could just have some special lock (rwsem?) held during
page_mkwrite() (for reading) and during whole hole punching (for writing)
to serialize these two operations.

Another alternative, which doesn't really look more appealing, is to go
page-by-page and always free corresponding blocks under page lock.

Any other ideas or thoughts?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2012-05-16 02:14:23

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> Hello,
>
> Hugh pointed me to ext4 hole punching code which is clearly missing some
> locking. But looking at the code more deeply I realized I don't see
> anything preventing the following race in XFS or ext4:
>
> TASK1                                 TASK2
> punch_hole(file, 0, 4096)
>   filemap_write_and_wait()
>   truncate_pagecache_range()
>                                       addr = mmap(file);
>                                       addr[0] = 1
>                                         ^^ writeably fault a page
>   remove file blocks
>
>               FLUSHER
>               write out file
>                 ^^ interesting things can happen because we expect
>                    blocks under the first page to be allocated /
>                    reserved but they are not...
>
> I'm pretty sure ext4 has this problem, I'm not completely sure whether
> XFS has something to protect against such race but I don't see anything.

No, it doesn't. It's a known problem due to not being able to take a
lock in .page_mkwrite() to serialise mmap() IO against truncation or
other IO such as direct IO. This has been known since, well, long
before we came up with page_mkwrite(). At the time page_mkwrite()
was introduced, locking was discussed to solve this problem but was
considered difficult on the VM side so it was ignored.

> It's not easy to protect against these races. For truncate, i_size protects
> us against similar races but for hole punching we don't have any such
> mechanism. One way to avoid the race would be to hold mmap_sem while we are
> invalidating the page cache and punching hole but that sounds a bit ugly.
> Alternatively we could just have some special lock (rwsem?) held during
> page_mkwrite() (for reading) and during whole hole punching (for writing)
> to serialize these two operations.

What really needs to happen is that .page_mkwrite() can be made to
fail with -EAGAIN and retry the entire page fault from the start an
arbitrary number of times instead of just once as the current code
does with VM_FAULT_RETRY. That would allow us to try to take the
filesystem lock that provides IO exclusion for all other types of IO
and fail with EAGAIN if we can't get it without blocking. For XFS,
that is the i_iolock rwsem, for others it is the i_mutex, and some
other filesystems might take other locks.
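
A hedged userspace model of that scheme (the VM_FAULT_* values and lock names below are stand-ins, not the real kernel definitions): .page_mkwrite() trylocks the IO-exclusion lock and backs out with a retry code when it would block, and the fault handler loops from scratch instead of giving up after one retry.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stand-ins, not the kernel's definitions */
#define VM_FAULT_OK     0
#define VM_FAULT_RETRY  1

static pthread_mutex_t i_iolock = PTHREAD_MUTEX_INITIALIZER;

static int page_mkwrite(void)
{
    /* can't block on the IO lock with mmap_sem held: bail and retry */
    if (pthread_mutex_trylock(&i_iolock) != 0)
        return VM_FAULT_RETRY;
    /* ... mark page writable, reserve blocks ... */
    pthread_mutex_unlock(&i_iolock);
    return VM_FAULT_OK;
}

/* fault handler retries an arbitrary number of times, not just once */
static int handle_fault(void)
{
    int tries = 0;

    while (page_mkwrite() == VM_FAULT_RETRY)
        tries++;   /* real code would drop and retake mmap_sem here */
    return tries;
}
```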

FWIW, I've been running a "use the IO lock in page_mkwrite" patch
for XFS for several months now, but I haven't posted it because
without the VM side being able to handle such locking failures
gracefully there's not much point in making the change. I did this
patch to reduce the incidence of mmap vs direct IO races that are
essentially identical in nature, to rule them out as the cause of
stray delalloc blocks in files that fsstress has been producing on
XFS. FYI, this race condition hasn't been responsible for any of the
problems I've found recently....

> Another alternative, which doesn't really look more appealing, is to go
> page-by-page and always free corresponding blocks under page lock.

Doesn't work for regions with no pages in memory over them.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-05-16 13:04:45

by Jan Kara

Subject: Re: Hole punching and mmap races

On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > Hello,
> >
> > Hugh pointed me to ext4 hole punching code which is clearly missing some
> > locking. But looking at the code more deeply I realized I don't see
> > anything preventing the following race in XFS or ext4:
> >
> > TASK1                               TASK2
> > punch_hole(file, 0, 4096)
> >   filemap_write_and_wait()
> >   truncate_pagecache_range()
> >                                     addr = mmap(file);
> >                                     addr[0] = 1
> >                                       ^^ writeably fault a page
> >   remove file blocks
> >
> >             FLUSHER
> >             write out file
> >               ^^ interesting things can happen because we expect
> >                  blocks under the first page to be allocated /
> >                  reserved but they are not...
> >
> > I'm pretty sure ext4 has this problem, I'm not completely sure whether
> > XFS has something to protect against such race but I don't see anything.
>
> No, it doesn't. It's a known problem due to not being able to take a
> lock in .page_mkwrite() to serialise mmap() IO against truncation or
> other IO such as direct IO. This has been known for, well, long
> before we came up with page_mkwrite(). At the time page_mkwrite()
> was introduced, locking was discussed to solve this problem but was
> considered difficult on the VM side so it was ignored.
I thought someone must have noticed before, since XFS has had hole punching
for a long time...

> > It's not easy to protect against these races. For truncate, i_size protects
> > us against similar races but for hole punching we don't have any such
> > mechanism. One way to avoid the race would be to hold mmap_sem while we are
> > invalidating the page cache and punching hole but that sounds a bit ugly.
> > Alternatively we could just have some special lock (rwsem?) held during
> > page_mkwrite() (for reading) and during whole hole punching (for writing)
> > to serialize these two operations.
>
> What really needs to happen is that .page_mkwrite() can be made to
> fail with -EAGAIN and retry the entire page fault from the start an
> arbitrary number of times instead of just once as the current code
> does with VM_FAULT_RETRY. That would allow us to try to take the
> filesystem lock that provides IO exclusion for all other types of IO
> and fail with EAGAIN if we can't get it without blocking. For XFS,
> that is the i_iolock rwsem, for others it is the i_mutex, and some
> other filesystems might take other locks.
Actually, I've been playing with VM_FAULT_RETRY recently (for freezing
patches) and it's completely unhandled for .page_mkwrite() callbacks. Also,
only x86 really tries to handle it; other architectures don't allow it at
all. And there's a ton of callers of things like get_user_pages() which
would need to handle VM_FAULT_RETRY, and for some of them it would
actually be non-trivial.

But in this particular case, I don't think VM_FAULT_RETRY is strictly
necessary. We can have a lock, which ranks below mmap_sem (and thus
i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
start, etc. Such lock could be taken in page_mkwrite() before taking page
lock, in truncate() and punch_hole() just after i_mutex, and direct IO
paths could be tweaked to use it as well I think.

> FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> for XFS for several months now, but I haven't posted it because
> without the VM side being able to handle such locking failures
> gracefully there's not much point in making the change. I did this
> patch to reduce the incidence of mmap vs direct IO races that are
> essentially identical in nature, to rule them out as the cause of
> stray delalloc blocks in files that fsstress has been producing on
> XFS. FYI, this race condition hasn't been responsible for any of the
> problems I've found recently....
Yeah, I've been trying to hit the race window for a while and I failed as
well...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-17 07:43:13

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > > It's not easy to protect against these races. For truncate, i_size protects
> > > us against similar races but for hole punching we don't have any such
> > > mechanism. One way to avoid the race would be to hold mmap_sem while we are
> > > invalidating the page cache and punching hole but that sounds a bit ugly.
> > > Alternatively we could just have some special lock (rwsem?) held during
> > > page_mkwrite() (for reading) and during whole hole punching (for writing)
> > > to serialize these two operations.
> >
> > What really needs to happen is that .page_mkwrite() can be made to
> > fail with -EAGAIN and retry the entire page fault from the start an
> > arbitrary number of times instead of just once as the current code
> > does with VM_FAULT_RETRY. That would allow us to try to take the
> > filesystem lock that provides IO exclusion for all other types of IO
> > and fail with EAGAIN if we can't get it without blocking. For XFS,
> > that is the i_iolock rwsem, for others it is the i_mutex, and some
> > other filesystems might take other locks.
> Actually, I've been playing with VM_FAULT_RETRY recently (for freezing
> patches) and it's completely unhandled for .page_mkwrite() callbacks.

Yeah, it's a mess.

> Also
> only x86 really tries to handle it at all. Other architectures just don't
> allow it at all. Also there's a ton of callers of things like
> get_user_pages() which would need to handle VM_FAULT_RETRY and for some of
> them it would be actually non-trivial.

Seems kind of silly to me to have a generic retry capability in the
page fault handler and then not implement it in a useful manner for
*anyone*.

> But in this particular case, I don't think VM_FAULT_RETRY is strictly
> necessary. We can have a lock, which ranks below mmap_sem (and thus
> i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
> start, etc. Such lock could be taken in page_mkwrite() before taking page
> lock, in truncate() and punch_hole() just after i_mutex, and direct IO
> paths could be tweaked to use it as well I think.

Which means we'd be adding another layer of mostly redundant locking
just to avoid i_mutex/mmap_sem inversion. But I don't see how it
solves the direct IO problem because we still need to grab the
mmap_sem inside the IO during get_user_pages_fast() while holding
i_mutex/i_iolock....

> > FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> > for XFS for several months now, but I haven't posted it because
> > without the VM side being able to handle such locking failures
> > gracefully there's not much point in making the change. I did this
> > patch to reduce the incidence of mmap vs direct IO races that are
> > essentially identical in nature, to rule them out as the cause of
> > stray delalloc blocks in files that fsstress has been producing on
> > XFS. FYI, this race condition hasn't been responsible for any of the
> > problems I've found recently....
> Yeah, I've been trying to hit the race window for a while and I failed as
> well...

IIRC, it's a rare case (that I consider insane, BTW): read from a
file into a buffer that is an mmap()d region of the same file
that has not been faulted in yet.....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-05-17 23:28:48

by Jan Kara

Subject: Re: Hole punching and mmap races

On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > On Wed, May 16, 2012 at 12:48:05AM +0200, Jan Kara wrote:
> > > > It's not easy to protect against these races. For truncate, i_size protects
> > > > us against similar races but for hole punching we don't have any such
> > > > mechanism. One way to avoid the race would be to hold mmap_sem while we are
> > > > invalidating the page cache and punching hole but that sounds a bit ugly.
> > > > Alternatively we could just have some special lock (rwsem?) held during
> > > > page_mkwrite() (for reading) and during whole hole punching (for writing)
> > > > to serialize these two operations.
> > >
> > > What really needs to happen is that .page_mkwrite() can be made to
> > > fail with -EAGAIN and retry the entire page fault from the start an
> > > arbitrary number of times instead of just once as the current code
> > > does with VM_FAULT_RETRY. That would allow us to try to take the
> > > filesystem lock that provides IO exclusion for all other types of IO
> > > and fail with EAGAIN if we can't get it without blocking. For XFS,
> > > that is the i_iolock rwsem, for others it is the i_mutex, and some
> > > other filesystems might take other locks.
> > Actually, I've been playing with VM_FAULT_RETRY recently (for freezing
> > patches) and it's completely unhandled for .page_mkwrite() callbacks.
>
> Yeah, it's a mess.
>
> > Also
> > only x86 really tries to handle it at all. Other architectures just don't
> > allow it at all. Also there's a ton of callers of things like
> > get_user_pages() which would need to handle VM_FAULT_RETRY and for some of
> > them it would be actually non-trivial.
>
> Seems kind of silly to me to have a generic retry capability in the
> page fault handler and then not implement it in a useful manner for
> *anyone*.
Yeah. It's only tailored for one specific use in filemap_fault() on
x86...

> > But in this particular case, I don't think VM_FAULT_RETRY is strictly
> > necessary. We can have a lock, which ranks below mmap_sem (and thus
> > i_mutex / i_iolock) and above i_mmap_mutex (thus page lock), transaction
> > start, etc. Such lock could be taken in page_mkwrite() before taking page
> > lock, in truncate() and punch_hold() just after i_mutex, and direct IO
> > paths could be tweaked to use it as well I think.
>
> Which means we'd be adding another layer of mostly redundant locking
> just to avoid i_mutex/mmap_sem inversion. But I don't see how it
> solves the direct IO problem because we still need to grab the
> mmap_sem inside the IO during get_user_pages_fast() while holding
> i_mutex/i_iolock....
Yeah, the direct IO case won't be trivial. We would have to take the lock
after dio_get_page() and release it before going for the next page. While
the lock was released someone could fault pages in over the area where the
direct IO is happening, so we would have to invalidate them while holding
the lock again. It seems it would work, but I agree it's probably too ugly
to live given how abstract the problem is...
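
A rough userspace model of that per-page cycling (names are illustrative stand-ins): the lock is taken around each page of the direct IO, and any page-cache pages that appeared while it was dropped are invalidated once it is re-taken.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative sketch of cycling the lock around each DIO page. */
static pthread_mutex_t mapping_lock = PTHREAD_MUTEX_INITIALIZER;
static int cached_pages;      /* stand-in for page cache over the range */
static int invalidations;

static void dio_one_page(void)
{
    pthread_mutex_lock(&mapping_lock);
    if (cached_pages) {            /* raced with a fault: flush it */
        cached_pages = 0;
        invalidations++;
    }
    /* ... submit IO for this page ... */
    pthread_mutex_unlock(&mapping_lock);
    /* lock dropped here: a fault may repopulate the cache */
}

static int run_dio(int npages)
{
    invalidations = 0;
    for (int i = 0; i < npages; i++) {
        if (i == 1)
            cached_pages = 1;  /* simulate a fault while unlocked */
        dio_one_page();
    }
    return invalidations;
}
```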

> > > FWIW, I've been running a "use the IO lock in page_mkwrite" patch
> > > for XFS for several months now, but I haven't posted it because
> > > without the VM side being able to handle such locking failures
> > > gracefully there's not much point in making the change. I did this
> > > patch to reduce the incidence of mmap vs direct IO races that are
> > > essentially identical in nature, to rule them out as the cause of
> > > stray delalloc blocks in files that fsstress has been producing on
> > > XFS. FYI, this race condition hasn't been responsible for any of the
> > > problems I've found recently....
> > Yeah, I've been trying to hit the race window for a while and I failed as
> > well...
>
> IIRC, it's a rare case (that I consider insane, BTW): read from a
> file into a buffer that is an mmap()d region of the same file
> that has not been faulted in yet.....
With punch hole, the race is less insane - just punching hole in the area
which is accessed via mmap could race in a bad way AFAICS.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-18 10:12:27

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > file into a buffer that is an mmap()d region of the same file
> > that has not been faulted in yet.....
> With punch hole, the race is less insane - just punching hole in the area
> which is accessed via mmap could race in a bad way AFAICS.

Seems the simple answer to me is to prevent page faults while hole
punching, then....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-05-18 13:33:11

by Jan Kara

Subject: Re: Hole punching and mmap races

On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > file into a buffer that is an mmap()d region of the same file
> > > that has not been faulted in yet.....
> > With punch hole, the race is less insane - just punching hole in the area
> > which is accessed via mmap could race in a bad way AFAICS.
>
> Seems the simple answer to me is to prevent page faults while hole
> punching, then....
Yes, that's what I was suggesting in the beginning :) And I was asking
whether people are OK with another lock in the page fault path (in
particular in ->page_mkwrite) or whether someone has a better idea (e.g.
taking mmap_sem in the hole punching path seems possible but I'm not sure
whether that would be considered acceptable abuse).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-19 01:40:24

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Fri, May 18, 2012 at 03:32:50PM +0200, Jan Kara wrote:
> On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> > On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > > file into a buffer that is an mmap()d region of the same file
> > > > that has not been faulted in yet.....
> > > With punch hole, the race is less insane - just punching hole in the area
> > > which is accessed via mmap could race in a bad way AFAICS.
> >
> > Seems the simple answer to me is to prevent page faults while hole
> > punching, then....
> Yes, that's what I was suggesting in the beginning :) And I was asking
> whether people are OK with another lock in the page fault path (in
> particular in ->page_mkwrite)

Right. I probably should have been clearer in what I said. We got
back here from considering another IO level lock and all the
complexity it adds to just solve the hole punch problem....

> or whether someone has a better idea (e.g.
> taking mmap_sem in the hole punching path seems possible but I'm not sure
> whether that would be considered acceptable abuse).

That's for the VM guys to answer, but it seems wrong to me to have
to treat hole punching differently to truncation....

The thing is, mmap IO is completely unlocked from an IO perspective,
and that means we cannot guarantee exclusion from IO without using
the IO exclusion lock. That's the simplest way we can make mmap
serialise sanely against direct IO and hole punching. Hole punching
is inherently a filesystem operation (just like truncation), and
mmap operations must stall while it is in progress. It's just that
we have the problem that we allow the mmap_sem to be taken inside
the IO exclusion locks...

So let's step back a moment and have a look at how we've got here.
The problem is that we've optimised ourselves into a corner with the
way we handle page cache truncation - we don't need mmap
serialisation because the combination of i_size and page locks
means we can detect truncated pages safely at page fault time. With
hole punching, we don't have that i_size safety blanket, and so we
need some other serialisation mechanism to safely detect whether a
page is valid or not at any given point in time.

Because it needs to serialise against IO operations, we need a
sleeping lock of some kind, and it can't be the existing IO lock.
And now we are looking at needing a new lock for hole punching, I'm
really wondering if the i_size/page lock truncation optimisation
should even continue to exist. i.e. replace it with a single
mechanism that works for both hole punching, truncation and other
functions that require exclusive access or exclusion against
modifications to the mapping tree.

But this is only one of the problems in this area. The way I see it
is that we have many kludges in the area of page invalidation w.r.t.
different types of IO, the page cache and mmap, especially when we
take into account direct IO. What we are seeing here is we need
some level of _mapping tree exclusion_ between:

1. mmap vs hole punch (broken)
2. mmap vs truncate (i_size/page lock)
3. mmap vs direct IO (non-existent)
4. mmap vs buffered IO (page lock)
5. writeback vs truncate (i_size/page lock)
6. writeback vs hole punch (page lock, possibly broken)
7. direct IO vs buffered IO (racy - flush cache before/after DIO)

#1, #2, #5 and #6 could be solved by a rw-lock for the operations -
read for mmap/writeback, exclusive for hole-punch and truncation.
That, however, doesn't work for #3 and #4 as the exclusion is
inverted - direct/buffered IO would require a shared mode lock and
mmap requires the exclusive lock. Similarly, #7 requires a shared
lock for direct IO, and a shared lock for buffered IO, but exclusion
between the two for overlapping ranges. But no one locking primitive
that currently exists can give us this set of semantics....
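
One way to picture such a primitive is a lock whose compatibility is a matrix over operation modes rather than plain shared/exclusive. The sketch below is purely illustrative (the mode set and matrix are my assumptions for demonstration, not a concrete proposal from this thread): mmap and buffered IO share with each other, direct IO shares only with itself, and hole punch excludes everything.

```c
#include <assert.h>
#include <pthread.h>

enum mode { MMAP, BUFFERED, DIRECT, PUNCH, NMODES };

/* compat[a][b]: can a holder in mode a coexist with a new holder in b? */
static const int compat[NMODES][NMODES] = {
    /*            MMAP BUF  DIO  PUNCH */
    /* MMAP  */ { 1,   1,   0,   0 },
    /* BUF   */ { 1,   1,   0,   0 },
    /* DIO   */ { 0,   0,   1,   0 },
    /* PUNCH */ { 0,   0,   0,   0 },
};

static pthread_mutex_t mt_mutex = PTHREAD_MUTEX_INITIALIZER;
static int holders[NMODES];

static int compatible(enum mode want)
{
    for (int i = 0; i < NMODES; i++)
        if (holders[i] && !compat[i][want])
            return 0;
    return 1;
}

static int trylock_mode(enum mode want)
{
    int ok;

    pthread_mutex_lock(&mt_mutex);
    ok = compatible(want);
    if (ok)
        holders[want]++;    /* granted: record the holder */
    pthread_mutex_unlock(&mt_mutex);
    return ok;
}

static void unlock_mode(enum mode mode)
{
    pthread_mutex_lock(&mt_mutex);
    holders[mode]--;
    pthread_mutex_unlock(&mt_mutex);
}
```

A blocking version would wait on a condition variable until `compatible()` holds; the trylock form keeps the sketch short.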

Right now we are talking about hacking in some solution to #1, while
ignoring the wider range of related but ignored/papered over
problems we also have. I don't have a magic bullet that solves all
of these problems, but I think it is worth recognising and
considering that this problem is much larger than just hole punching
and that these problems have been there for a *long time*.

To me the issue at hand is that we have no method of serialising
multi-page operations on the mapping tree between the filesystem and
the VM, and that seems to be the fundamental problem we face in this
whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
Hence it might be better to try to work out how to fix this entire
class of problems rather than just adding a complex kludge that just
papers over the current "hot" symptom....

Cheers,

Dave.
--
Dave Chinner
[email protected]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .

2012-05-24 12:35:38

by Jan Kara

Subject: Re: Hole punching and mmap races

On Sat 19-05-12 11:40:24, Dave Chinner wrote:
> On Fri, May 18, 2012 at 03:32:50PM +0200, Jan Kara wrote:
> > On Fri 18-05-12 20:12:10, Dave Chinner wrote:
> > > On Fri, May 18, 2012 at 01:28:29AM +0200, Jan Kara wrote:
> > > > On Thu 17-05-12 17:43:08, Dave Chinner wrote:
> > > > > On Wed, May 16, 2012 at 03:04:45PM +0200, Jan Kara wrote:
> > > > > > On Wed 16-05-12 12:14:23, Dave Chinner wrote:
> > > > > IIRC, it's a rare case (that I consider insane, BTW): read from a
> > > > > file into a buffer that is an mmap()d region of the same file
> > > > > that has not been faulted in yet.....
> > > > With punch hole, the race is less insane - just punching hole in the area
> > > > which is accessed via mmap could race in a bad way AFAICS.
> > >
> > > Seems the simple answer to me is to prevent page faults while hole
> > > punching, then....
> > Yes, that's what I was suggesting in the beginning :) And I was asking
> > whether people are OK with another lock in the page fault path (in
> > particular in ->page_mkwrite)
>
> Right. I probably should have been clearer in what I said. We got
> back here from considering another IO level lock and all the
> complexity it adds to just solve the hole punch problem....
>
> > or whether someone has a better idea (e.g.
> > taking mmap_sem in the hole punching path seems possible but I'm not sure
> > whether that would be considered acceptable abuse).
>
> That's for the VM guys to answer, but it seems wrong to me to have
> to treat hole punching differently to truncation....
>
> The thing is, mmap IO is completely unlocked from an IO perspective,
> and that means we cannot guarantee exclusion from IO without using
> the IO exclusion lock. That's the simplest way we can make mmap
> serialise sanely against direct IO and hole punching. Hole punching
> is inherently a filesystem operation (just like truncation), and
> mmap operations must stall while it is in progress. It's just that
> we have the problem that we allow the mmap_sem to be taken inside
> the IO exclusion locks...
>
> So let's step back a moment and have a look at how we've got here.
> The problem is that we've optimised ourselves into a corner with the
> way we handle page cache truncation - we don't need mmap
> serialisation because of the combination of i_size and page locks
> mean we can detect truncated pages safely at page fault time. With
> hole punching, we don't have that i_size safety blanket, and so we
> need some other serialisation mechanism to safely detect whether a
> page is valid or not at any given point in time.
>
> Because it needs to serialise against IO operations, we need a
> sleeping lock of some kind, and it can't be the existing IO lock.
> And now we are looking at needing a new lock for hole punching, I'm
> really wondering if the i_size/page lock truncation optimisation
> should even continue to exist. i.e. replace it with a single
> mechanism that works for both hole punching, truncation and other
> functions that require exclusive access or exclusion against
> modifications to the mapping tree.
>
> > But this is only one of the problems in this area. The way I see it
> is that we have many kludges in the area of page invalidation w.r.t.
> different types of IO, the page cache and mmap, especially when we
> take into account direct IO. What we are seeing here is we need
> some level of _mapping tree exclusion_ between:
>
> 1. mmap vs hole punch (broken)
> 2. mmap vs truncate (i_size/page lock)
> 3. mmap vs direct IO (non-existent)
> 4. mmap vs buffered IO (page lock)
> 5. writeback vs truncate (i_size/page lock)
> 6. writeback vs hole punch (page lock, possibly broken)
> 7. direct IO vs buffered IO (racy - flush cache before/after DIO)
Yes, this is a nice summary of the most interesting cases. For completeness,
here are the remaining cases:
8. mmap vs writeback (page lock)
9. writeback vs direct IO (as direct IO vs buffered IO)
10. writeback vs buffered IO (page lock)
11. direct IO vs truncate (dio_wait)
12. direct IO vs hole punch (dio_wait)
13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads)
14. buffered IO vs hole punch (fs dependent, broken for ext4)
15. truncate vs hole punch (fs dependent)
16. mmap vs mmap (page lock)
17. writeback vs writeback (page lock)
18. direct IO vs direct IO (i_mutex or fs dependent)
19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads)
20. truncate vs truncate (i_mutex)
21. punch hole vs punch hole (fs dependent)

> #1, #2, #5 and #6 could be solved by a rw-lock for the operations -
> read for mmap/writeback, exclusive for hole-punch and truncation.
> That, however, doesn't work for #3 and #4 as the exclusion is
> inverted - direct/buffered IO would require a shared mode lock and
> mmap requires the exclusive lock. Similarly, #7 requires a shared
> lock for direct IO, and a shared lock for buffered IO, but exclusion
> between the two for overlapping ranges. But no one locking primitive
> that currently exists can give us this set of semantics....
>
> Right now we are talking about hacking in some solution to #1, while
> ignoring the wider range of related but ignored/papered over
> problems we also have. I don't have a magic bullet that solves all
> of these problems, but I think it is worth recognising and
> considering that this problem is much larger than just hole punching
> and that these problems have been there for a *long time*.
>
> To me the issue at hand is that we have no method of serialising
> multi-page operations on the mapping tree between the filesystem and
> the VM, and that seems to be the fundamental problem we face in this
> whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> Hence it might be better to try to work out how to fix this entire
> class of problems rather than just adding a complex kludge that just
> papers over the current "hot" symptom....
Yes, looking at the above table, the amount of different synchronization
mechanisms is really striking. So probably we should look at some
possibility of unifying at least some cases.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-05 05:51:54

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> On Sat 19-05-12 11:40:24, Dave Chinner wrote:
> > So let's step back a moment and have a look at how we've got here.
> > The problem is that we've optimised ourselves into a corner with the
> > way we handle page cache truncation - we don't need mmap
> > serialisation because of the combination of i_size and page locks
> > mean we can detect truncated pages safely at page fault time. With
> > hole punching, we don't have that i_size safety blanket, and so we
> > need some other serialisation mechanism to safely detect whether a
> > page is valid or not at any given point in time.
> >
> > Because it needs to serialise against IO operations, we need a
> > sleeping lock of some kind, and it can't be the existing IO lock.
> > And now we are looking at needing a new lock for hole punching, I'm
> > really wondering if the i_size/page lock truncation optimisation
> > should even continue to exist. i.e. replace it with a single
> > mechanism that works for both hole punching, truncation and other
> > functions that require exclusive access or exclusion against
> > modifications to the mapping tree.
> >
> > But this is only one of the problems in this area. The way I see it
> > is that we have many kludges in the area of page invalidation w.r.t.
> > different types of IO, the page cache and mmap, especially when we
> > take into account direct IO. What we are seeing here is we need
> > some level of _mapping tree exclusion_ between:
> >
> > 1. mmap vs hole punch (broken)
> > 2. mmap vs truncate (i_size/page lock)
> > 3. mmap vs direct IO (non-existent)
> > 4. mmap vs buffered IO (page lock)
> > 5. writeback vs truncate (i_size/page lock)
> > 6. writeback vs hole punch (page lock, possibly broken)
> > 7. direct IO vs buffered IO (racy - flush cache before/after DIO)
> Yes, this is a nice summary of the most interesting cases. For completeness,
> here are the remaining cases:
> 8. mmap vs writeback (page lock)
> 9. writeback vs direct IO (as direct IO vs buffered IO)
> 10. writeback vs buffered IO (page lock)
> 11. direct IO vs truncate (dio_wait)
> 12. direct IO vs hole punch (dio_wait)
> 13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads)
> 14. buffered IO vs hole punch (fs dependent, broken for ext4)
> 15. truncate vs hole punch (fs dependent)
> 16. mmap vs mmap (page lock)
> 17. writeback vs writeback (page lock)
> 18. direct IO vs direct IO (i_mutex or fs dependent)
> 19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads)
> 20. truncate vs truncate (i_mutex)
> 21. punch hole vs punch hole (fs dependent)

A lot of them are the IO exclusion side of the problem - I
ignored them just to make my discussion short and
to the point. So thanks for documenting them for everyone. :)

....

> > To me the issue at hand is that we have no method of serialising
> > multi-page operations on the mapping tree between the filesystem and
> > the VM, and that seems to be the fundamental problem we face in this
> > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > Hence it might be better to try to work out how to fix this entire
> > class of problems rather than just adding a complex kludge that just
> > papers over the current "hot" symptom....
> Yes, looking at the above table, the amount of different synchronization
> mechanisms is really striking. So probably we should look at some
> possibility of unifying at least some cases.

It seems to me that we need something in between the fine grained
page lock and the entire-file IO exclusion lock. We need to maintain
fine grained locking for mmap scalability, but we also need to be
able to atomically lock ranges of pages.

I guess if we were to nest a fine grained multi-state lock
inside both the IO exclusion lock and the mmap_sem, we might be able
to kill all problems in one go.

Exclusive access on a range needs to be granted to:

- direct IO
- truncate
- hole punch

so they can be serialised against mmap based page faults, writeback
and concurrent buffered IO. Serialisation against themselves is an
IO/fs exclusion problem.

Shared access for traversal or modification needs to be granted to:

- buffered IO
- mmap page faults
- writeback

Each of these cases can rely on the existing page locks or IO
exclusion locks to provide safety for concurrent access to the same
ranges. This means that once we have access granted to a range we
can check truncate races once and ignore the problem until we drop
the access. And the case of taking a page fault within a buffered
IO won't deadlock because both take a shared lock....

We'd need some kind of efficient shared/exclusive range lock for
this sort of exclusion, and it's entirely possible that it would
have too much overhead to be acceptable in the page fault path. It's
the best I can think of right now.....

As it is, a range lock of this kind would be very handy for other
things, too (like the IO exclusion locks so we can do concurrent
buffered writes in XFS ;).
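To make the shape of such a shared/exclusive range lock concrete, here is a
minimal user-space sketch. All names (range_lock_tree, range_trylock, ...)
are made up for illustration; the holder list is a plain linked list where a
real implementation would use an interval tree, and conflicts fail instead of
sleeping on a waitqueue as a kernel version would:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>

/* One currently granted range, [start, end). */
struct held_range {
	off_t start, end;
	bool exclusive;
	struct held_range *next;
};

struct range_lock_tree {
	pthread_mutex_t lock;		/* protects the holder list */
	struct held_range *held;	/* granted ranges */
};

static bool ranges_overlap(const struct held_range *r, off_t start, off_t end)
{
	return r->start < end && start < r->end;
}

/* Grant the range unless a conflicting holder exists.  Two shared holders
 * may overlap; an exclusive request conflicts with any overlap.  Error
 * handling (malloc failure) is omitted for brevity. */
bool range_trylock(struct range_lock_tree *t, off_t start, off_t end,
		   bool exclusive)
{
	pthread_mutex_lock(&t->lock);
	for (struct held_range *r = t->held; r; r = r->next) {
		if (ranges_overlap(r, start, end) &&
		    (exclusive || r->exclusive)) {
			pthread_mutex_unlock(&t->lock);
			return false;
		}
	}
	struct held_range *nr = malloc(sizeof(*nr));
	nr->start = start;
	nr->end = end;
	nr->exclusive = exclusive;
	nr->next = t->held;
	t->held = nr;
	pthread_mutex_unlock(&t->lock);
	return true;
}

/* Drop one granted range matching [start, end). */
void range_unlock(struct range_lock_tree *t, off_t start, off_t end)
{
	pthread_mutex_lock(&t->lock);
	for (struct held_range **p = &t->held; *p; p = &(*p)->next) {
		if ((*p)->start == start && (*p)->end == end) {
			struct held_range *dead = *p;
			*p = dead->next;
			free(dead);
			break;
		}
	}
	pthread_mutex_unlock(&t->lock);
}
```

The conflict rule is the interesting part: non-overlapping requests never
serialise against each other, which is exactly what the fine-grained mmap
path needs.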

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-05 06:22:31

by Marco Stornelli

Subject: Re: Hole punching and mmap races

2012/6/5 Dave Chinner <[email protected]>:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
>> On Sat 19-05-12 11:40:24, Dave Chinner wrote:
>> > So let's step back a moment and have a look at how we've got here.
>> > The problem is that we've optimised ourselves into a corner with the
>> > way we handle page cache truncation - we don't need mmap
>> > serialisation because of the combination of i_size and page locks
>> > mean we can detect truncated pages safely at page fault time. With
>> > hole punching, we don't have that i_size safety blanket, and so we
>> > need some other serialisation mechanism to safely detect whether a
>> > page is valid or not at any given point in time.
>> >
>> > Because it needs to serialise against IO operations, we need a
>> > sleeping lock of some kind, and it can't be the existing IO lock.
>> > And now we are looking at needing a new lock for hole punching, I'm
>> > really wondering if the i_size/page lock truncation optimisation
>> > should even continue to exist. i.e. replace it with a single
>> > mechanism that works for both hole punching, truncation and other
>> > functions that require exclusive access or exclusion against
>> > modifications to the mapping tree.
>> >
>> > But this is only one of the problems in this area. The way I see it
>> > is that we have many kludges in the area of page invalidation w.r.t.
>> > different types of IO, the page cache and mmap, especially when we
>> > take into account direct IO. What we are seeing here is we need
>> > some level of _mapping tree exclusion_ between:
>> >
>> > 1. mmap vs hole punch (broken)
>> > 2. mmap vs truncate (i_size/page lock)
>> > 3. mmap vs direct IO (non-existent)
>> > 4. mmap vs buffered IO (page lock)
>> > 5. writeback vs truncate (i_size/page lock)
>> > 6. writeback vs hole punch (page lock, possibly broken)
>> > 7. direct IO vs buffered IO (racy - flush cache before/after DIO)
>> Yes, this is a nice summary of the most interesting cases. For completeness,
>> here are the remaining cases:
>>  8. mmap vs writeback (page lock)
>>  9. writeback vs direct IO (as direct IO vs buffered IO)
>> 10. writeback vs buffered IO (page lock)
>> 11. direct IO vs truncate (dio_wait)
>> 12. direct IO vs hole punch (dio_wait)
>> 13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads)
>> 14. buffered IO vs hole punch (fs dependent, broken for ext4)
>> 15. truncate vs hole punch (fs dependent)
>> 16. mmap vs mmap (page lock)
>> 17. writeback vs writeback (page lock)
>> 18. direct IO vs direct IO (i_mutex or fs dependent)
>> 19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads)
>> 20. truncate vs truncate (i_mutex)
>> 21. punch hole vs punch hole (fs dependent)
>

I think we even have the xip cases here.

Marco

2012-06-05 23:15:30

by Jan Kara

Subject: Re: Hole punching and mmap races

On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > To me the issue at hand is that we have no method of serialising
> > > multi-page operations on the mapping tree between the filesystem and
> > > the VM, and that seems to be the fundamental problem we face in this
> > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > Hence it might be better to try to work out how to fix this entire
> > > class of problems rather than just adding a complex kludge that just
> > > papers over the current "hot" symptom....
> > Yes, looking at the above table, the amount of different synchronization
> > mechanisms is really striking. So probably we should look at some
> > possibility of unifying at least some cases.
>
> It seems to me that we need something in between the fine grained
> page lock and the entire-file IO exclusion lock. We need to maintain
> fine grained locking for mmap scalability, but we also need to be
> able to atomically lock ranges of pages.
Yes, we also need to keep things fine grained to keep scalability of
direct IO and buffered reads...

> I guess if we were to nest a fine grained multi-state lock
> inside both the IO exclusion lock and the mmap_sem, we might be able
> to kill all problems in one go.
>
> Exclusive access on a range needs to be granted to:
>
> - direct IO
> - truncate
> - hole punch
>
> so they can be serialised against mmap based page faults, writeback
> and concurrent buffered IO. Serialisation against themselves is an
> IO/fs exclusion problem.
>
> Shared access for traversal or modification needs to be granted to:
>
> - buffered IO
> - mmap page faults
> - writeback
>
> Each of these cases can rely on the existing page locks or IO
> exclusion locks to provide safety for concurrent access to the same
> ranges. This means that once we have access granted to a range we
> can check truncate races once and ignore the problem until we drop
> the access. And the case of taking a page fault within a buffered
> IO won't deadlock because both take a shared lock....
You cannot just use a lock (not even a shared one) both above and under
mmap_sem. That is deadlockable in the presence of other requests for exclusive
locking... Luckily, with buffered writes the situation isn't that bad. You
need mmap_sem only before each page is processed (in
iov_iter_fault_in_readable()). Later on in the loop we use
iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can
just get our shared lock after iov_iter_fault_in_readable() (or simply
leave it for ->write_begin() if we want to give control over the locking to
filesystems).
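As a rough sketch of that ordering (the functions below are merely stand-ins
for the kernel helpers named above; only the order of operations is modelled,
via a trace string):

```c
#include <string.h>

static char trace[256];

static void step(const char *s)
{
	strcat(trace, s);
	strcat(trace, ";");
}

/* Stand-ins for the real helpers.  Faulting the page in is the only step
 * that may need mmap_sem; the atomic copy cannot fault. */
static void iov_iter_fault_in_readable_stub(void)      { step("fault_in"); }
static void range_lock_shared_stub(void)               { step("range_lock"); }
static void iov_iter_copy_from_user_atomic_stub(void)  { step("copy_atomic"); }
static void range_unlock_shared_stub(void)             { step("range_unlock"); }

/* One iteration of the buffered write loop with the proposed ordering:
 * fault the user page in first, THEN take the shared range lock, then do
 * the copy that cannot take mmap_sem. */
void buffered_write_one_page(void)
{
	iov_iter_fault_in_readable_stub();	/* may take mmap_sem */
	range_lock_shared_stub();		/* safe: mmap_sem not needed now */
	iov_iter_copy_from_user_atomic_stub();
	range_unlock_shared_stub();
}
```

The point is simply that the mmap_sem-taking step happens before the shared
range lock is acquired, so on this path the range lock never nests inside
mmap_sem.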

> We'd need some kind of efficient shared/exclusive range lock for
> this sort of exclusion, and it's entirely possible that it would
> have too much overhead to be acceptable in the page fault path. It's
> the best I can think of right now.....
>
> As it is, a range lock of this kind would be very handy for other
> things, too (like the IO exclusion locks so we can do concurrent
> buffered writes in XFS ;).
Yes, that's what I thought as well. In particular it should be pretty
efficient in locking a single page range because that's going to be
majority of calls. I'll try to write something and see how fast it can
be...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-06 00:06:41

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > To me the issue at hand is that we have no method of serialising
> > > > multi-page operations on the mapping tree between the filesystem and
> > > > the VM, and that seems to be the fundamental problem we face in this
> > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > Hence it might be better to try to work out how to fix this entire
> > > > class of problems rather than just adding a complex kludge that just
> > > > papers over the current "hot" symptom....
> > > Yes, looking at the above table, the amount of different synchronization
> > > mechanisms is really striking. So probably we should look at some
> > > possibility of unifying at least some cases.
> >
> > It seems to me that we need something in between the fine grained
> > page lock and the entire-file IO exclusion lock. We need to maintain
> > fine grained locking for mmap scalability, but we also need to be
> > able to atomically lock ranges of pages.
> Yes, we also need to keep things fine grained to keep scalability of
> direct IO and buffered reads...
>
> > I guess if we were to nest a fine grained multi-state lock
> > inside both the IO exclusion lock and the mmap_sem, we might be able
> > to kill all problems in one go.
> >
> > Exclusive access on a range needs to be granted to:
> >
> > - direct IO
> > - truncate
> > - hole punch
> >
> > so they can be serialised against mmap based page faults, writeback
> > and concurrent buffered IO. Serialisation against themselves is an
> > IO/fs exclusion problem.
> >
> > Shared access for traversal or modification needs to be granted to:
> >
> > - buffered IO
> > - mmap page faults
> > - writeback
> >
> > Each of these cases can rely on the existing page locks or IO
> > exclusion locks to provide safety for concurrent access to the same
> > ranges. This means that once we have access granted to a range we
> > can check truncate races once and ignore the problem until we drop
> > the access. And the case of taking a page fault within a buffered
> > IO won't deadlock because both take a shared lock....
> You cannot just use a lock (not even a shared one) both above and under
> mmap_sem. That is deadlockable in the presence of other requests for exclusive
> locking...

Well, that's assuming that exclusive lock requests form a barrier to
new shared requests. Remember that I'm talking about a range lock
here, which we can make up whatever semantics we'd need, including
having "shared lock if already locked shared" nested locking
semantics which avoids this page-fault-in-buffered-IO-copy-in/out
problem....
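The grant rule being proposed might be stated as a toy predicate (hypothetical
name, ignoring the actual range bookkeeping):

```c
#include <stdbool.h>

/* Toy model of "shared lock if already locked shared" semantics: queued
 * exclusive waiters normally bar new shared requests (so writers are not
 * starved), but a task that already holds the range shared may take it
 * shared again.  That is what lets a page fault taken inside a buffered
 * IO, which already holds the range shared, avoid deadlocking behind an
 * exclusive waiter. */
bool shared_grantable(int exclusive_waiters, bool caller_already_holds_shared)
{
	if (caller_already_holds_shared)
		return true;			/* nested shared bypasses the barrier */
	return exclusive_waiters == 0;		/* otherwise waiters form a barrier */
}
```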

It also allows writeback to work the same way it does right now when
we take a page fault on a page that is under writeback.

> Luckily, with buffered writes the situation isn't that bad. You
> need mmap_sem only before each page is processed (in
> iov_iter_fault_in_readable()). Later on in the loop we use
> iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can
> just get our shared lock after iov_iter_fault_in_readable() (or simply
> leave it for ->write_begin() if we want to give control over the locking to
> filesystems).

That would probably work as well, but it is much more likely that
people would get it wrong as opposed to special casing the nested
lock semantic in the page fault code...

> > We'd need some kind of efficient shared/exclusive range lock for
> > this sort of exclusion, and it's entirely possible that it would
> > have too much overhead to be acceptable in the page fault path. It's
> > the best I can think of right now.....
> >
> > As it is, a range lock of this kind would be very handy for other
> > things, too (like the IO exclusion locks so we can do concurrent
> > buffered writes in XFS ;).
> Yes, that's what I thought as well. In particular it should be pretty
> efficient in locking a single page range because that's going to be
> majority of calls. I'll try to write something and see how fast it can
> be...

Cool. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-06 09:58:27

by Jan Kara

Subject: Re: Hole punching and mmap races

On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > To me the issue at hand is that we have no method of serialising
> > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > Hence it might be better to try to work out how to fix this entire
> > > > > class of problems rather than just adding a complex kludge that just
> > > > > papers over the current "hot" symptom....
> > > > Yes, looking at the above table, the amount of different synchronization
> > > > mechanisms is really striking. So probably we should look at some
> > > > possibility of unifying at least some cases.
> > >
> > > It seems to me that we need something in between the fine grained
> > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > fine grained locking for mmap scalability, but we also need to be
> > > able to atomically lock ranges of pages.
> > Yes, we also need to keep things fine grained to keep scalability of
> > direct IO and buffered reads...
> >
> > > I guess if we were to nest a fine grained multi-state lock
> > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > to kill all problems in one go.
> > >
> > > Exclusive access on a range needs to be granted to:
> > >
> > > - direct IO
> > > - truncate
> > > - hole punch
> > >
> > > so they can be serialised against mmap based page faults, writeback
> > > and concurrent buffered IO. Serialisation against themselves is an
> > > IO/fs exclusion problem.
> > >
> > > Shared access for traversal or modification needs to be granted to:
> > >
> > > - buffered IO
> > > - mmap page faults
> > > - writeback
> > >
> > > Each of these cases can rely on the existing page locks or IO
> > > exclusion locks to provide safety for concurrent access to the same
> > > ranges. This means that once we have access granted to a range we
> > > can check truncate races once and ignore the problem until we drop
> > > the access. And the case of taking a page fault within a buffered
> > > IO won't deadlock because both take a shared lock....
> > You cannot just use a lock (not even a shared one) both above and under
> > mmap_sem. That is deadlockable in the presence of other requests for exclusive
> > locking...
>
> Well, that's assuming that exclusive lock requests form a barrier to
> new shared requests. Remember that I'm talking about a range lock
> here, which we can make up whatever semantics we'd need, including
> having "shared lock if already locked shared" nested locking
> semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> problem....
That's true. But if you have semantics like this, constant writing to
or reading from a file could starve e.g. truncate. So I'd prefer not to
open this can of worms and keep semantics of rw semaphores if possible.

Furthermore, with direct IO you have to set in stone the ordering of
mmap_sem and range lock anyway because there we need an exclusive lock.

> It also allows writeback to work the same way it does right now when
> we take a page fault on a page that is under writeback
I'm not sure what would be the difference regardless of which semantics
we choose...

> > Luckily, with buffered writes the situation isn't that bad. You
> > need mmap_sem only before each page is processed (in
> > iov_iter_fault_in_readable()). Later on in the loop we use
> > iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can
> > just get our shared lock after iov_iter_fault_in_readable() (or simply
> > leave it for ->write_begin() if we want to give control over the locking to
> > filesystems).
>
> That would probably work as well, but it is much more likely that
> people would get it wrong as opposed to special casing the nested
> lock semantic in the page fault code...
I suppose some helper functions could make this easier...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-06 13:36:16

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote:
> On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > > To me the issue at hand is that we have no method of serialising
> > > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > > Hence it might be better to try to work out how to fix this entire
> > > > > > class of problems rather than just adding a complex kludge that just
> > > > > > papers over the current "hot" symptom....
> > > > > Yes, looking at the above table, the amount of different synchronization
> > > > > mechanisms is really striking. So probably we should look at some
> > > > > possibility of unifying at least some cases.
> > > >
> > > > It seems to me that we need something in between the fine grained
> > > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > > fine grained locking for mmap scalability, but we also need to be
> > > > able to atomically lock ranges of pages.
> > > Yes, we also need to keep things fine grained to keep scalability of
> > > direct IO and buffered reads...
> > >
> > > > I guess if we were to nest a fine grained multi-state lock
> > > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > > to kill all problems in one go.
> > > >
> > > > Exclusive access on a range needs to be granted to:
> > > >
> > > > - direct IO
> > > > - truncate
> > > > - hole punch
> > > >
> > > > so they can be serialised against mmap based page faults, writeback
> > > > and concurrent buffered IO. Serialisation against themselves is an
> > > > IO/fs exclusion problem.
> > > >
> > > > Shared access for traversal or modification needs to be granted to:
> > > >
> > > > - buffered IO
> > > > - mmap page faults
> > > > - writeback
> > > >
> > > > Each of these cases can rely on the existing page locks or IO
> > > > exclusion locks to provide safety for concurrent access to the same
> > > > ranges. This means that once we have access granted to a range we
> > > > can check truncate races once and ignore the problem until we drop
> > > > the access. And the case of taking a page fault within a buffered
> > > > IO won't deadlock because both take a shared lock....
> > > You cannot just use a lock (not even a shared one) both above and under
> > > mmap_sem. That is deadlockable in the presence of other requests for exclusive
> > > locking...
> >
> > Well, that's assuming that exclusive lock requests form a barrier to
> > new shared requests. Remember that I'm talking about a range lock
> > here, which we can make up whatever semantics we'd need, including
> > having "shared lock if already locked shared" nested locking
> > semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> > problem....
> That's true. But if you have semantics like this, constant writing to
> or reading from a file could starve e.g. truncate. So I'd prefer not to
> open this can of worms and keep semantics of rw semaphores if possible.

Except truncate uses the i_mutex/i_iolock for exclusion, so it would
never get held off any more than it already does by buffered IO in
this case. i.e. the mapping tree range lock is inside the locks used
for truncate serialisation, so we don't have a situation where other
operations would be held off by such an IO pattern...

> Furthermore, with direct IO you have to set in stone the ordering of
> mmap_sem and range lock anyway because there we need an exclusive lock.

Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For
direct IO, we only need the mmap_sem for the get_user_pages() call
IIRC, so that requires exclusive range lock-> shared mmap_sem
ordering. Unless we can lift the range lock in the mmap path outside
the mmap_sem, we're still in the same boat....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-07 21:58:35

by Jan Kara

Subject: Re: Hole punching and mmap races

On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote:
> > On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > > > To me the issue at hand is that we have no method of serialising
> > > > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > > > Hence it might be better to try to work out how to fix this entire
> > > > > > > class of problems rather than just adding a complex kludge that just
> > > > > > > papers over the current "hot" symptom....
> > > > > > Yes, looking at the above table, the amount of different synchronization
> > > > > > mechanisms is really striking. So probably we should look at some
> > > > > > possibility of unifying at least some cases.
> > > > >
> > > > > It seems to me that we need something in between the fine grained
> > > > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > > > fine grained locking for mmap scalability, but we also need to be
> > > > > able to atomically lock ranges of pages.
> > > > Yes, we also need to keep things fine grained to keep scalability of
> > > > direct IO and buffered reads...
> > > >
> > > > > I guess if we were to nest a fine grained multi-state lock
> > > > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > > > to kill all problems in one go.
> > > > >
> > > > > Exclusive access on a range needs to be granted to:
> > > > >
> > > > > - direct IO
> > > > > - truncate
> > > > > - hole punch
> > > > >
> > > > > so they can be serialised against mmap based page faults, writeback
> > > > > and concurrent buffered IO. Serialisation against themselves is an
> > > > > IO/fs exclusion problem.
> > > > >
> > > > > Shared access for traversal or modification needs to be granted to:
> > > > >
> > > > > - buffered IO
> > > > > - mmap page faults
> > > > > - writeback
> > > > >
> > > > > Each of these cases can rely on the existing page locks or IO
> > > > > exclusion locks to provide safety for concurrent access to the same
> > > > > ranges. This means that once we have access granted to a range we
> > > > > can check truncate races once and ignore the problem until we drop
> > > > > the access. And the case of taking a page fault within a buffered
> > > > > IO won't deadlock because both take a shared lock....
> > > > You cannot just use a lock (not even a shared one) both above and under
> > > > mmap_sem. That is deadlockable in the presence of other requests for exclusive
> > > > locking...
> > >
> > > Well, that's assuming that exclusive lock requests form a barrier to
> > > new shared requests. Remember that I'm talking about a range lock
> > > here, which we can make up whatever semantics we'd need, including
> > > having "shared lock if already locked shared" nested locking
> > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> > > problem....
> > That's true. But if you have semantics like this, constant writing to
> > or reading from a file could starve e.g. truncate. So I'd prefer not to
> > open this can of worms and keep semantics of rw semaphores if possible.
>
> Except truncate uses the i_mutex/i_iolock for exclusion, so it would
> never get held off any more than it already does by buffered IO in
> this case. i.e. the mapping tree range lock is inside the locks used
> for truncate serialisation, so we don't have a situation where other
> operations would be held off by such an IO pattern...
True. I was just hoping that i_mutex won't be needed if we get our new
lock right.

> > Furthermore, with direct IO you have to set in stone the ordering of
> > mmap_sem and range lock anyway because there we need an exclusive lock.
>
> Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For
> direct IO, we only need the mmap_sem for the get_user_pages() call
> IIRC, so that requires exclusive range lock-> shared mmap_sem
> ordering. Unless we can lift the range lock in the mmap path outside
> the mmap_sem, we're still in the same boat....
Yes. Just as I said before it is not an unsolvable situation since for
direct IO we can grab our lock after the get_user_pages() call. I was thinking
about it for a while and I realized that if we have a range lock that we
take during page fault, buffered IO, direct IO, truncate, punch hole, we
actually don't need a shared version of it. Just the fact it would be a range
lock is enough to avoid serialization.

Also I was thinking that since lock ordering forces our new lock to be
relatively close to the page lock, and the page lock already serializes quite
a few IO operations anyway, it might be workable (and actually reasonably easy
to grasp for developers) if the range lock is coupled with page lock - i.e.
locking a range will be equivalent to locking each page in a range, except
that this way you can "lock" pages which are not present and thus forbid
their creation. Also we could implement the common case of locking a range
containing single page by just taking page lock so we save modification of
interval tree in the common case and generally make the tree smaller (of
course, at the cost of somewhat slowing down cases where we want to lock
larger ranges).
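A toy model of that coupling (hypothetical try_lock_range()/unlock_range(); the
per-page lock is reduced to a bitmap and the interval tree to a counter, just
to show which path each range size takes):

```c
#include <stdbool.h>

#define NPAGES 64

static bool page_locked[NPAGES];	/* stand-in for per-page PG_locked */
static int  tree_ranges;		/* stand-in for interval tree entries */

/* Lock a page range [first, last].  A single-page range is just the page
 * lock (no tree insertion); a multi-page range is additionally recorded
 * in the "tree", which is what lets absent pages be covered too and
 * forbids their creation while the range is held. */
bool try_lock_range(unsigned long first, unsigned long last)
{
	for (unsigned long i = first; i <= last; i++)
		if (page_locked[i])
			return false;	/* some page in the range is locked */
	for (unsigned long i = first; i <= last; i++)
		page_locked[i] = true;
	if (first != last)
		tree_ranges++;		/* only multi-page ranges touch the tree */
	return true;
}

void unlock_range(unsigned long first, unsigned long last)
{
	for (unsigned long i = first; i <= last; i++)
		page_locked[i] = false;
	if (first != last)
		tree_ranges--;
}
```

The common single-page case never modifies the tree, which is the fast path
argued for above; the cost is that multi-page ranges do per-page work on top
of the tree insertion.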

Using something like this in ext4 should be reasonably easy, we'd just need
to change some code to move page lock up in locking order to make locking
of ranges useful. Would it be also usable for XFS?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-08 00:57:00

by Dave Chinner

Subject: Re: Hole punching and mmap races

On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote:
> On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote:
> > > On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > > > > To me the issue at hand is that we have no method of serialising
> > > > > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > > > > Hence it might be better to try to work out how to fix this entire
> > > > > > > > class of problems rather than just adding a complex kludge that just
> > > > > > > > papers over the current "hot" symptom....
> > > > > > > Yes, looking at the above table, the amount of different synchronization
> > > > > > > mechanisms is really striking. So probably we should look at some
> > > > > > > possibility of unifying at least some cases.
> > > > > >
> > > > > > It seems to me that we need something in between the fine grained
> > > > > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > > > > fine grained locking for mmap scalability, but we also need to be
> > > > > > able to atomically lock ranges of pages.
> > > > > Yes, we also need to keep things fine grained to keep scalability of
> > > > > direct IO and buffered reads...
> > > > >
> > > > > > I guess if we were to nest a fine grained multi-state lock
> > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > > > > to kill all problems in one go.
> > > > > >
> > > > > > Exclusive access on a range needs to be granted to:
> > > > > >
> > > > > > - direct IO
> > > > > > - truncate
> > > > > > - hole punch
> > > > > >
> > > > > > so they can be serialised against mmap based page faults, writeback
> > > > > > and concurrent buffered IO. Serialisation against themselves is an
> > > > > > IO/fs exclusion problem.
> > > > > >
> > > > > > Shared access for traversal or modification needs to be granted to:
> > > > > >
> > > > > > - buffered IO
> > > > > > - mmap page faults
> > > > > > - writeback
> > > > > >
> > > > > > Each of these cases can rely on the existing page locks or IO
> > > > > > exclusion locks to provide safety for concurrent access to the same
> > > > > > ranges. This means that once we have access granted to a range we
> > > > > > can check truncate races once and ignore the problem until we drop
> > > > > > the access. And the case of taking a page fault within a buffered
> > > > > > IO won't deadlock because both take a shared lock....
> > > > > You cannot just use a lock (not even a shared one) both above and under
> > > > > mmap_sem. That is deadlockable in the presence of other requests for exclusive
> > > > > locking...
> > > >
> > > > Well, that's assuming that exclusive lock requests form a barrier to
> > > > new shared requests. Remember that I'm talking about a range lock
> > > > here, which we can make up whatever semantics we'd need, including
> > > > having "shared lock if already locked shared" nested locking
> > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> > > > problem....
> > > That's true. But if you have semantics like this, constant writing to
> > > or reading from a file could starve e.g. truncate. So I'd prefer not to
> > > open this can of worms and keep semantics of rw semaphores if possible.
> >
> > Except truncate uses the i_mutex/i_iolock for exclusion, so it would
> > never get held off any more than it already does by buffered IO in
> > this case. i.e. the mapping tree range lock is inside the locks used
> > for truncate serialisation, so we don't have a situation where other
> > operations would be held off by such an IO pattern...
> True. I was just hoping that i_mutex won't be needed if we get our new
> lock right.
>
> > > Furthermore, with direct IO you have to set in stone the ordering of
> > > mmap_sem and range lock anyway because there we need an exclusive lock.
> >
> > Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For
> > direct IO, we only need the mmap_sem for the get_user_pages() call
> > IIRC, so that requires exclusive range lock-> shared mmap_sem
> > ordering. Unless we can lift the range lock in the mmap path outside
> > the mmap_sem, we're still in the same boat....
> Yes. Just as I said before it is not an unsolvable situation since for
> direct IO we can grab our lock after get_user_pages() call. I was thinking
> about it for a while and I realized, that if we have range lock that we
> take during page fault, buffered IO, direct IO, truncate, punch hole, we
> actually don't need a shared version of it. Just the fact that it would be
> a range lock is enough to avoid serialization.

Well, we currently allow overlapping direct IO to the same range in
XFS. The order of completion is undefined, just like concurrent IOs
to the same sector of a disk, but it basically results in direct IO
on an XFS file behaving exactly the same way as concurrent IO to a raw
device would....

Now, that does cause some problems for naive users of direct IO
(just like those same naive users mix mmap, buffered and direct IO to
the same file), but if we are going to make everything else coherent
then I have no problems with dropping this functionality and only
allowing concurrency for non-overlapping requests.

The other thing is that concurrent overlapping buffered reads to the
same range will only serialise on page locks if the range can be
lock shared, so there would be no change in behaviour. Making the
range lock exclusive would serialise the overlapping reads at a much
higher level and will result in a change of cached read behaviour
and potentially a significant reduction in performance. This may be
a corner case no-one cares about, but exclusive range locking will
definitely impact such a workload....

> Also I was thinking that since lock ordering forces our new lock to be
> relatively close to page lock and page lock is serializing quite some of
> IO operations anyway,

That's the observation that has led me to call it a "mapping
tree" lock.

> it might be workable (and actually reasonably easy
> to grasp for developers) if the range lock is coupled with page lock - i.e.
> locking a range will be equivalent to locking each page in a range, except
> that this way you can "lock" pages which are not present and thus forbid
> their creation.

Nice idea, but I think that is introducing requirements and
potential complexity way beyond what we need right now. It's already
a complex problem, so lets not make it any harder to validate than
it already will be... :/

As it is, we can't "forbid" the creation of pages - we control
access to the mapping tree, which allows us to prevent insertion or
removal of pages from a given range. That's exactly how it will
serialise DIO against buffered/mmap IO, hole punch/truncate against
mmap, and so on. It's a mapping tree access control mechanism, not a
page serialisation mechanism...
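
To make that concrete, here is a minimal userspace sketch of such an
exclusive mapping-range lock. Everything below is illustrative: the names
(range_lock, lock_range, ...) and the list-based bookkeeping are invented
for this sketch, a real implementation would use an interval tree, and a
real lock would sleep on contention rather than fail:

```c
/* Sketch of an exclusive "mapping range" lock: an access-control gate
 * over page-index ranges, not a per-page serialisation mechanism.  A
 * naive linked list stands in for an interval tree, and try-lock
 * semantics stand in for blocking until the range becomes free. */
#include <stdbool.h>
#include <stdlib.h>

struct locked_range {
	unsigned long start, end;	/* inclusive page indices */
	struct locked_range *next;
};

struct range_lock {
	struct locked_range *held;	/* ranges currently locked */
};

static bool ranges_overlap(const struct locked_range *r,
			   unsigned long start, unsigned long end)
{
	return r->start <= end && start <= r->end;
}

/* True if any held range overlaps [start, end]. */
bool range_is_locked(const struct range_lock *rl,
		     unsigned long start, unsigned long end)
{
	const struct locked_range *r;

	for (r = rl->held; r; r = r->next)
		if (ranges_overlap(r, start, end))
			return true;
	return false;
}

/* Grab [start, end] exclusively; fails if an overlapping range is held.
 * A real implementation would sleep here instead of failing. */
bool lock_range(struct range_lock *rl, unsigned long start, unsigned long end)
{
	struct locked_range *r;

	if (range_is_locked(rl, start, end))
		return false;
	r = malloc(sizeof(*r));
	r->start = start;
	r->end = end;
	r->next = rl->held;
	rl->held = r;
	return true;
}

void unlock_range(struct range_lock *rl, unsigned long start, unsigned long end)
{
	struct locked_range **p;

	for (p = &rl->held; *p; p = &(*p)->next) {
		if ((*p)->start == start && (*p)->end == end) {
			struct locked_range *dead = *p;
			*p = dead->next;
			free(dead);
			return;
		}
	}
}
```

Disjoint ranges proceed concurrently; only overlapping requests exclude
each other, which is why a separate shared mode may not even be needed.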

> Also we could implement the common case of locking a range
> containing single page by just taking page lock so we save modification of
> interval tree in the common case and generally make the tree smaller (of
> course, at the cost of somewhat slowing down cases where we want to lock
> larger ranges).

That seems like premature optimisation to me, and all the cases I
think we need to care about are locking large ranges of the tree.
Let's measure what the overhead of tracking everything in a single
tree is first so we can then see what needs optimising...

Also, I don't think it can replace the page lock entirely, because
the page lock also serialises against other parts of the VM
(e.g. memory reclaim). I'd prefer not to complicate the issue by
trying to be fancy - it's going to be hard enough to validate that
it works correctly without having to validate that it does not
introduce races that were previously prevented by the page lock.

> Using something like this in ext4 should be reasonably easy, we'd just need
> to change some code to move page lock up in locking order to make locking
> of ranges useful. Would it be also usable for XFS?

I don't think we need to change the page locking at all - just
inserting the range locks in the right place should suffice.

Cheers,

Dave.
--
Dave Chinner
[email protected]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: [email protected]

2012-06-08 21:36:29

by Jan Kara

[permalink] [raw]
Subject: Re: Hole punching and mmap races

On Fri 08-06-12 10:57:00, Dave Chinner wrote:
> On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote:
> > On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> > > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote:
> > > > On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> > > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > > > > > To me the issue at hand is that we have no method of serialising
> > > > > > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > > > > > Hence it might be better to try to work out how to fix this entire
> > > > > > > > > class of problems rather than just adding a complex kuldge that just
> > > > > > > > > papers over the current "hot" symptom....
> > > > > > > > Yes, looking at the above table, the amount of different synchronization
> > > > > > > > mechanisms is really striking. So probably we should look at some
> > > > > > > > possibility of unifying at least some cases.
> > > > > > >
> > > > > > > It seems to me that we need some thing in between the fine grained
> > > > > > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > > > > > fine grained locking for mmap scalability, but we also need to be
> > > > > > > able to atomically lock ranges of pages.
> > > > > > Yes, we also need to keep things fine grained to keep scalability of
> > > > > > direct IO and buffered reads...
> > > > > >
> > > > > > > I guess if we were to nest a fine grained multi-state lock
> > > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > > > > > to kill all problems in one go.
> > > > > > >
> > > > > > > Exclusive access on a range needs to be granted to:
> > > > > > >
> > > > > > > - direct IO
> > > > > > > - truncate
> > > > > > > - hole punch
> > > > > > >
> > > > > > > so they can be serialised against mmap based page faults, writeback
> > > > > > > and concurrent buffered IO. Serialisation against themselves is an
> > > > > > > IO/fs exclusion problem.
> > > > > > >
> > > > > > > Shared access for traversal or modification needs to be granted to:
> > > > > > >
> > > > > > > - buffered IO
> > > > > > > - mmap page faults
> > > > > > > - writeback
> > > > > > >
> > > > > > > Each of these cases can rely on the existing page locks or IO
> > > > > > > exclusion locks to provide safety for concurrent access to the same
> > > > > > > ranges. This means that once we have access granted to a range we
> > > > > > > can check truncate races once and ignore the problem until we drop
> > > > > > > the access. And the case of taking a page fault within a buffered
> > > > > > > IO won't deadlock because both take a shared lock....
> > > > > > You cannot just use a lock (not even a shared one) both above and under
> > > > > > mmap_sem. That is deadlockable in the presence of other requests for exclusive
> > > > > > locking...
> > > > >
> > > > > Well, that's assuming that exclusive lock requests form a barrier to
> > > > > new shared requests. Remember that I'm talking about a range lock
> > > > > here, which we can make up whatever semantics we'd need, including
> > > > > having "shared lock if already locked shared" nested locking
> > > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> > > > > problem....
> > > > That's true. But if you have semantics like this, constant writing to
> > > > or reading from a file could starve e.g. truncate. So I'd prefer not to
> > > > open this can of worms and keep semantics of rw semaphores if possible.
> > >
> > > Except truncate uses the i_mutex/i_iolock for exclusion, so it would
> > > never get held off any more than it already does by buffered IO in
> > > this case. i.e. the mapping tree range lock is inside the locks used
> > > for truncate serialisation, so we don't have a situation where other
> > > operations would be held off by such an IO pattern...
> > True. I was just hoping that i_mutex won't be needed if we get our new
> > lock right.
> >
> > > > Furthermore, with direct IO you have to set in stone the ordering of
> > > > mmap_sem and range lock anyway because there we need an exclusive lock.
> > >
> > > Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For
> > > direct IO, we only need the mmap_sem for the get_user_pages() call
> > > IIRC, so that requires exclusive range lock-> shared mmap_sem
> > > ordering. Unless we can lift the range lock in the mmap path outside
> > > the mmap_sem, we're still in the same boat....
> > Yes. Just as I said before it is not an unsolvable situation since for
> > direct IO we can grab our lock after get_user_pages() call. I was thinking
> > about it for a while and I realized, that if we have range lock that we
> > take during page fault, buffered IO, direct IO, truncate, punch hole, we
> > actually don't need a shared version of it. Just the fact that it would be
> > a range lock is enough to avoid serialization.
>
> Well, we currently allow overlapping direct IO to the same range in
> XFS. The order of completion is undefined, just like concurrent IOs
> to the same sector of a disk, but it basically results in direct IO
> on an XFS file behaving exactly the same way as concurrent IO to a raw
> device would....
>
> Now, that does cause some problems for naive users of direct IO
> (just like those same naive users mix mmap, buffered and direct IO to
> the same file), but if we are going to make everything else coherent
> then I have no problems with dropping this functionality and only
> allowing concurrency for non-overlapping requests.
Yeah, overlapping direct IO is asking for trouble (except possibly two
direct IO reads).

> The other thing is that concurrent overlapping buffered reads to the
> same range will only serialise on page locks if the range can be
> lock shared, so there would be no change in behaviour. Making the
> range lock exclusive would serialise the overlapping reads at a much
> higher level and will result in a change of cached read behaviour
> and potentially a significant reduction in performance. This may be
> a corner case no-one cares about, but exclusive range locking will
> definitely impact such a workload....
For buffered reads we would lock page-by-page anyway (again due to
locking constraints with mmap_sem when copying what we've read) so there
shouldn't be any difference from the current level of concurrency.

> > Also I was thinking that since lock ordering forces our new lock to be
> > relatively close to page lock and page lock is serializing quite some of
> > IO operations anyway,
>
> That's the observation that has led me to call it a "mapping
> tree" lock.
I've internally called the locking function lock_mapping_range() ;).

> > it might be workable (and actually reasonably easy
> > to grasp for developers) if the range lock is coupled with page lock - i.e.
> > locking a range will be equivalent to locking each page in a range, except
> > that this way you can "lock" pages which are not present and thus forbid
> > their creation.
>
> Nice idea, but I think that is introducing requirements and
> potential complexity way beyond what we need right now. It's already
> a complex problem, so lets not make it any harder to validate than
> it already will be... :/
>
> As it is, we can't "forbid" the creation of pages - we control
> access to the mapping tree, which allows us to prevent insertion or
> removal of pages from a given range. That's exactly how it will
> serialise DIO against buffered/mmap IO, hole punch/truncate against
> mmap, and so on. It's a mapping tree access control mechanism, not a
> page serialisation mechanism...
I agree, I was imprecise here and it's good to realize exactly what we
guard.

> > Also we could implement the common case of locking a range
> > containing single page by just taking page lock so we save modification of
> > interval tree in the common case and generally make the tree smaller (of
> > course, at the cost of somewhat slowing down cases where we want to lock
> > larger ranges).
>
> That seems like premature optimisation to me, and all the cases I
> think we need to care about are locking large ranges of the tree.
> Let's measure what the overhead of tracking everything in a single
> tree is first so we can then see what needs optimising...
Umm, I agree that initially we probably want just to have the mapping
range lock ability, stick it somewhere in the IO path and make things work.
Then we can look into making it faster / merging with page lock.

However I disagree we care most about locking large ranges. For all
buffered IO and all page faults we need to lock a range containing just a
single page. We cannot lock more due to locking constraints with mmap_sem.
So the places that will lock larger ranges are: direct IO, truncate, punch
hole. Writeback actually doesn't seem to need any additional protection at
least as I've sketched out things so far.

So single-page ranges matter at least as much as longer ranges. That's why
I came up with that page lock optimisation and merging...

> Also, I don't think it can replace the page lock entirely, because
> the page lock also serialises against other parts of the VM
> (e.g. memory reclaim). I'd prefer not to complicate the issue by
> trying to be fancy - it's going to be hard enough to validate that
> it works correctly without having to validate that it does not
> introduce races that were previously prevented by the page lock.
So I never meant to completely replace page lock. I agree that will be
almost impossible. I rather meant to add a capability of locking a range of
pages. But lets leave that for now.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-08 23:06:16

by Dave Chinner

[permalink] [raw]
Subject: Re: Hole punching and mmap races

On Fri, Jun 08, 2012 at 11:36:29PM +0200, Jan Kara wrote:
> On Fri 08-06-12 10:57:00, Dave Chinner wrote:
> > On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote:
> > > On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> > > Also we could implement the common case of locking a range
> > > containing single page by just taking page lock so we save modification of
> > > interval tree in the common case and generally make the tree smaller (of
> > > course, at the cost of somewhat slowing down cases where we want to lock
> > > larger ranges).
> >
> > That seems like premature optimisation to me, and all the cases I
> > think we need to care about are locking large ranges of the tree.
> > Let's measure what the overhead of tracking everything in a single
> > tree is first so we can then see what needs optimising...
> Umm, I agree that initially we probably want just to have the mapping
> range lock ability, stick it somewhere in the IO path and make things work.
> Then we can look into making it faster / merging with page lock.
>
> However I disagree we care most about locking large ranges. For all
> buffered IO and all page faults we need to lock a range containing just a
> single page. We cannot lock more due to locking constraints with mmap_sem.

Not sure I understand what that constraint is - I have been thinking
that the buffered IO range lock would be across the entire IO, not
individual pages.

As it is, if we want to do multipage writes (and we do), we have to
be able to lock a range of the mapping in the buffered IO path at a
time...

> So the places that will lock larger ranges are: direct IO, truncate, punch
> hole. Writeback actually doesn't seem to need any additional protection at
> least as I've sketched out things so far.

AFAICT, writeback needs protection against punching holes, just like
mmap does, because they use the same "avoid truncated pages"
mechanism.

> So single-page ranges matter at least as much as longer ranges. That's why
> I came up with that page lock optimisation and merging...

I agree they are common, but lets measure the overhead first before
trying to optimise/special case certain behaviours....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-12 08:56:44

by Jan Kara

[permalink] [raw]
Subject: Re: Hole punching and mmap races

On Sat 09-06-12 09:06:16, Dave Chinner wrote:
> On Fri, Jun 08, 2012 at 11:36:29PM +0200, Jan Kara wrote:
> > On Fri 08-06-12 10:57:00, Dave Chinner wrote:
> > > On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote:
> > > > On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> > > > Also we could implement the common case of locking a range
> > > > containing single page by just taking page lock so we save modification of
> > > > interval tree in the common case and generally make the tree smaller (of
> > > > course, at the cost of somewhat slowing down cases where we want to lock
> > > > larger ranges).
> > >
> > > That seems like premature optimisation to me, and all the cases I
> > > think we need to care about are locking large ranges of the tree.
> > > Let's measure what the overhead of tracking everything in a single
> > > tree is first so we can then see what needs optimising...
> > Umm, I agree that initially we probably want just to have the mapping
> > range lock ability, stick it somewhere in the IO path and make things work.
> > Then we can look into making it faster / merging with page lock.
> >
> > However I disagree we care most about locking large ranges. For all
> > buffered IO and all page faults we need to lock a range containing just a
> > single page. We cannot lock more due to locking constraints with mmap_sem.
>
> Not sure I understand what that constraint is - I have been thinking
> that the buffered IO range lock would be across the entire IO, not
> individual pages.
>
> As it is, if we want to do multipage writes (and we do), we have to
> be able to lock a range of the mapping in the buffered IO path at a
> time...
The problem is that the buffered IO path does (e.g. in
generic_perform_write()):
  iov_iter_fault_in_readable() - faults in one page worth of buffers,
                                 takes mmap_sem
  ->write_begin()
  copy data - iov_iter_copy_from_user_atomic()
  ->write_end()

So we take mmap_sem before writing every page. We could fault in more,
but that increases the risk of iov_iter_copy_from_user_atomic() failing because
the page got reclaimed before we got to it. So the amount we fault in would
have to adapt to current memory pressure. That's certainly possible but not
related to the problem we are trying to solve now so I'd prefer to handle
it separately.
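
As a sketch of that ordering constraint (with stand-in functions that only
record the order of events - none of these are the real kernel helpers,
they merely mirror the sequence described above):

```c
/* Model of the lock-ordering constraint in generic_perform_write():
 * mmap_sem is taken (via the user-buffer fault) BEFORE the page lock,
 * so each iteration of the loop can only cover one page. */
#include <string.h>

static char trace[128];

static void record(const char *ev)
{
	strcat(trace, ev);
	strcat(trace, ";");
}

/* iov_iter_fault_in_readable(): faults in one page worth of user
 * buffer, which takes mmap_sem. */
static void fault_in_readable(void) { record("mmap_sem"); }

/* ->write_begin(): locks the pagecache page; a range lock covering
 * just this page would be taken at the same point. */
static void write_begin(void) { record("page_lock"); }

/* iov_iter_copy_from_user_atomic(): copies without faulting; it can
 * fail if the user page was reclaimed since fault_in_readable(). */
static void copy_from_user_atomic(void) { record("copy"); }

/* ->write_end(): unlocks the page. */
static void write_end(void) { record("page_unlock"); }

/* One iteration of the per-page loop. */
void perform_write_one_page(void)
{
	fault_in_readable();
	write_begin();
	copy_from_user_atomic();
	write_end();
}
```

Because mmap_sem sits above the page (or range) lock in this ordering,
locking a multi-page range for the whole write would invert it.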

> > So the places that will lock larger ranges are: direct IO, truncate, punch
> > hole. Writeback actually doesn't seem to need any additional protection at
> > least as I've sketched out things so far.
>
> AFAICT, writeback needs protection against punching holes, just like
> mmap does, because they use the same "avoid truncated pages"
> mechanism.
If punching a hole does:
  lock_mapping_range()
  evict all pages in the range
  punch blocks
  unlock_mapping_range()

Then we shouldn't race against writeback because there are no pages in
the mapping range we punch and they cannot be created there because we
hold the lock. I agree this might be an unnecessary optimization, but the nice
result is that we can clean dirty pages regardless of what others do with
the mapping. So in case there would be problems with taking mapping lock from
writeback, we could avoid that.
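
A toy model of this argument, with an invented boolean-array "page cache"
standing in for the mapping tree (all names are illustrative, and a real
fault would block on the locked range rather than fail):

```c
/* Hole punching evicts every page in the range while holding the
 * mapping-range lock, and no page can be instantiated there until it
 * unlocks, so writeback finds nothing to write in that range. */
#include <stdbool.h>

#define NPAGES 16

static bool present[NPAGES];		/* pages in the page cache */
static bool range_locked[NPAGES];	/* pages in a locked range */

static void lock_mapping_range(int start, int end)
{
	for (int i = start; i <= end; i++)
		range_locked[i] = true;
}

static void unlock_mapping_range(int start, int end)
{
	for (int i = start; i <= end; i++)
		range_locked[i] = false;
}

/* A page fault cannot instantiate a page in a locked range. */
bool fault_in_page(int idx)
{
	if (range_locked[idx])
		return false;		/* would sleep in reality */
	present[idx] = true;
	return true;
}

/* Hole punch under the mapping-range lock. */
void punch_hole(int start, int end)
{
	lock_mapping_range(start, end);
	for (int i = start; i <= end; i++)
		present[i] = false;	/* truncate_pagecache_range() */
	/* ... remove the file blocks; faults into [start, end] are held
	 * off, so writeback never sees pages over punched-out blocks. */
	unlock_mapping_range(start, end);
}
```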

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR