As discussed in linux-mm, mlocking a shared, writable vma currently causes
the corresponding pages to be marked as dirty and queued for writeback.
This seems rather unnecessary given that the pages are not being actually
modified during mlock. It is understood that for non-shared mappings
(file or anon) we want to use a write fault in order to break COW, but
there is just no such need for shared mappings.
The first two patches in this series do not introduce any behavior change.
The intent there is to make it obvious that dirtying file pages is
only done in the (writable, shared) case. I think this clarifies the
code, but I wouldn't mind dropping these two patches if there is no
consensus about them.
The last patch is where we actually avoid dirtying shared mappings
during mlock. Note that as a side effect of this, we won't call
page_mkwrite() for the mappings that define it, and won't be
pre-allocating data blocks at the FS level if the mapped file was
sparsely allocated. My understanding is that mlock does not need to
provide such guarantee, as evidenced by the fact that it never did for
the filesystems that don't define page_mkwrite() - including some
common ones like ext3. However, I would like to gather feedback on this
from filesystem people as a precaution. If this turns out to be a
showstopper, maybe block preallocation can be added back on using a
different interface.
Large shared mlocks are getting significantly (>2x) faster in my tests,
as the disk can be fully used for reading the file instead of having to
share between this and writeback.
Michel Lespinasse (3):
do_wp_page: remove the 'reuse' flag
do_wp_page: clarify dirty_page handling
mlock: avoid dirtying pages and triggering writeback
mm/memory.c | 90 ++++++++++++++++++++++++++++++++---------------------------
mm/mlock.c | 7 ++++-
2 files changed, 55 insertions(+), 42 deletions(-)
--
1.7.3.1
Reorganize the code to remove the 'reuse' flag. No behavior changes.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..810a75f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2110,7 +2110,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *old_page, *new_page;
pte_t entry;
- int reuse = 0, ret = 0;
+ int ret = 0;
int page_mkwrite = 0;
struct page *dirty_page = NULL;
@@ -2147,14 +2147,16 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
page_cache_release(old_page);
}
- reuse = reuse_swap_page(old_page);
- if (reuse)
+ if (reuse_swap_page(old_page)) {
/*
* The page is all ours. Move it to our anon_vma so
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
page_move_anon_rmap(old_page, vma, address);
+ unlock_page(old_page);
+ goto reuse;
+ }
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
@@ -2218,10 +2220,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
dirty_page = old_page;
get_page(dirty_page);
- reuse = 1;
- }
- if (reuse) {
reuse:
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
--
1.7.3.1
When faulting in pages for mlock(), we want to break COW for anonymous
or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
no need to write-fault into VM_SHARED vmas since shared file pages can
be mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 7 ++++++-
mm/mlock.c | 7 ++++++-
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index d4c0c2e..7f45085 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,7 +3291,12 @@ int make_pages_present(unsigned long addr, unsigned long end)
vma = find_vma(current->mm, addr);
if (!vma)
return -ENOMEM;
- write = (vma->vm_flags & VM_WRITE) != 0;
+ /*
+ * We want to touch writable mappings with a write fault in order
+ * to break COW, except for shared mappings because these don't COW
+ * and we would not want to dirty them for nothing.
+ */
+ write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
BUG_ON(addr >= end);
BUG_ON(end > vma->vm_end);
len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
diff --git a/mm/mlock.c b/mm/mlock.c
index b70919c..4f31864 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -171,7 +171,12 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
gup_flags = FOLL_TOUCH | FOLL_GET;
- if (vma->vm_flags & VM_WRITE)
+ /*
+ * We want to touch writable mappings with a write fault in order
+ * to break COW, except for shared mappings because these don't COW
+ * and we would not want to dirty them for nothing.
+ */
+ if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
gup_flags |= FOLL_WRITE;
/* We don't try to access the guard page of a stack vma */
--
1.7.3.1
Reorganize the code so that dirty pages are handled closer to the place
that makes them dirty (handling write fault into shared, writable VMAs).
No behavior changes.
Signed-off-by: Michel Lespinasse <[email protected]>
---
mm/memory.c | 72 +++++++++++++++++++++++++++++++---------------------------
1 files changed, 38 insertions(+), 34 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 810a75f..d4c0c2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2227,8 +2227,45 @@ reuse:
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, page_table);
+ pte_unmap_unlock(page_table, ptl);
ret |= VM_FAULT_WRITE;
- goto unlock;
+
+ if (!dirty_page)
+ return ret;
+
+ /*
+ * Yes, Virginia, this is actually required to prevent a race
+ * with clear_page_dirty_for_io() from clearing the page dirty
+ * bit after it clear all dirty ptes, but before a racing
+ * do_wp_page installs a dirty pte.
+ *
+ * do_no_page is protected similarly.
+ */
+ if (!page_mkwrite) {
+ wait_on_page_locked(dirty_page);
+ set_page_dirty_balance(dirty_page, page_mkwrite);
+ }
+ put_page(dirty_page);
+ if (page_mkwrite) {
+ struct address_space *mapping = dirty_page->mapping;
+
+ set_page_dirty(dirty_page);
+ unlock_page(dirty_page);
+ page_cache_release(dirty_page);
+ if (mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+ }
+
+ /* file_update_time outside page_lock */
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
+ return ret;
}
/*
@@ -2334,39 +2371,6 @@ gotten:
page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
- if (dirty_page) {
- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * do_no_page is protected similarly.
- */
- if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page, page_mkwrite);
- }
- put_page(dirty_page);
- if (page_mkwrite) {
- struct address_space *mapping = dirty_page->mapping;
-
- set_page_dirty(dirty_page);
- unlock_page(dirty_page);
- page_cache_release(dirty_page);
- if (mapping) {
- /*
- * Some device drivers do not set page.mapping
- * but still dirty their pages
- */
- balance_dirty_pages_ratelimited(mapping);
- }
- }
-
- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
- }
return ret;
oom_free_new:
page_cache_release(new_page);
--
1.7.3.1
On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote:
> When faulting in pages for mlock(), we want to break COW for anonymous
> or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
> no need to write-fault into VM_SHARED vmas since shared file pages can
> be mlocked first and dirtied later, when/if they actually get written to.
> Skipping the write fault is desirable, as we don't want to unnecessarily
> cause these pages to be dirtied and queued for writeback.
It's not just to break COW, but to do block allocation and such
(filesystem's page_mkwrite op). That needs to at least be explained
in the changelog.
Filesystem doesn't have a good way to fully pin required things
according to mlock, but page_mkwrite provides some reasonable things
(like block allocation / reservation).
On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote:
> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote:
> > When faulting in pages for mlock(), we want to break COW for anonymous
> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
> > no need to write-fault into VM_SHARED vmas since shared file pages can
> > be mlocked first and dirtied later, when/if they actually get written to.
> > Skipping the write fault is desirable, as we don't want to unnecessarily
> > cause these pages to be dirtied and queued for writeback.
>
> It's not just to break COW, but to do block allocation and such
> (filesystem's page_mkwrite op). That needs to at least be explained
> in the changelog.
Agreed, the 0/3 description actually does mention this.
> Filesystem doesn't have a good way to fully pin required things
> according to mlock, but page_mkwrite provides some reasonable things
> (like block allocation / reservation).
Right, but marking all pages dirty isn't really sane. I can imagine
making the reservation but not marking things dirty solution, although
it might be lots harder to implement, esp since some filesystems don't
actually have a page_mkwrite() implementation.
On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote:
>> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote:
>> > When faulting in pages for mlock(), we want to break COW for anonymous
>> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
>> > no need to write-fault into VM_SHARED vmas since shared file pages can
>> > be mlocked first and dirtied later, when/if they actually get written to.
>> > Skipping the write fault is desirable, as we don't want to unnecessarily
>> > cause these pages to be dirtied and queued for writeback.
>>
>> It's not just to break COW, but to do block allocation and such
>> (filesystem's page_mkwrite op). That needs to at least be explained
>> in the changelog.
>
> Agreed, the 0/3 description actually does mention this.
>
>> Filesystem doesn't have a good way to fully pin required things
>> according to mlock, but page_mkwrite provides some reasonable things
>> (like block allocation / reservation).
>
> Right, but marking all pages dirty isn't really sane. I can imagine
> making the reservation but not marking things dirty solution, although
> it might be lots harder to implement, esp since some filesystems don't
> actually have a page_mkwrite() implementation.
Really, my understanding is that not pre-allocating filesystem blocks
is just fine. This is, after all, what happens with ext3 and it's
never been reported as a bug (that I know of).
If filesystem people's feedback is that they really want mlock() to
continue pre-allocating blocks, maybe we can just do it using
fallocate() rather than page_mkwrite() callbacks ?
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Wed, 2010-11-17 at 14:05 -0800, Michel Lespinasse wrote:
>
> Really, my understanding is that not pre-allocating filesystem blocks
> is just fine. This is, after all, what happens with ext3 and it's
> never been reported as a bug (that I know of).
>
fwiw I'm perfectly fine with it
> If filesystem people's feedback is that they really want mlock() to
> continue pre-allocating blocks, maybe we can just do it using
> fallocate() rather than page_mkwrite() callbacks ?
Sounds sensible..
On Wed, Nov 17, 2010 at 02:05:30PM -0800, Michel Lespinasse wrote:
> On Wed, Nov 17, 2010 at 7:28 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote:
> >> On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote:
> >> > When faulting in pages for mlock(), we want to break COW for anonymous
> >> > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
> >> > no need to write-fault into VM_SHARED vmas since shared file pages can
> >> > be mlocked first and dirtied later, when/if they actually get written to.
> >> > Skipping the write fault is desirable, as we don't want to unnecessarily
> >> > cause these pages to be dirtied and queued for writeback.
> >>
> >> It's not just to break COW, but to do block allocation and such
> >> (filesystem's page_mkwrite op). That needs to at least be explained
> >> in the changelog.
> >
> > Agreed, the 0/3 description actually does mention this.
> >
> >> Filesystem doesn't have a good way to fully pin required things
> >> according to mlock, but page_mkwrite provides some reasonable things
> >> (like block allocation / reservation).
> >
> > Right, but marking all pages dirty isn't really sane. I can imagine
> > making the reservation but not marking things dirty solution, although
> > it might be lots harder to implement, esp since some filesystems don't
> > actually have a page_mkwrite() implementation.
>
> Really, my understanding is that not pre-allocating filesystem blocks
> is just fine. This is, after all, what happens with ext3 and it's
> never been reported as a bug (that I know of).
It's not ext3 you have to worry about - it's the filesystems that
need special state set up on their pages/buffers for ->writepage to
work correctly that are the problem. You need to call
->write_begin/->write_end to get the state set up properly.
If this state is not set up properly, silent data loss will occur
during mmap writes either by ENOSPC or failing to set up writes into
unwritten extents correctly (i.e. we'll be back to where we were in
2.6.15).
I don't think ->page_mkwrite can be worked around - we need that to
be called on the first write fault of any mmap()d page to ensure it
is set up correctly for writeback. If we don't get write faults
after the page is mlock()d, then we need the ->page_mkwrite() call
during the mlock() call.
> If filesystem people's feedback is that they really want mlock() to
> continue pre-allocating blocks, maybe we can just do it using
> fallocate() rather than page_mkwrite() callbacks ?
Fallocate is not good enough to avoid ->page_mkwrite callbacks.
Indeed, XFS (at least) requires the ->page_mkwrite() callout even
on preallocated space to correctly mark the buffers as unwritten so
extent conversion in ->writepage is triggered correctly (see test
#166 in xfstests).
Hence I think that avoiding ->page_mkwrite callouts is likely to
break some filesystems in subtle, undetected ways. IMO, regardless
of what is done, it would be really good to start by writing a new
regression test to exercise and encode the expected the mlock
behaviour so we can detect regressions later on....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <[email protected]> wrote:
>> Really, my understanding is that not pre-allocating filesystem blocks
>> is just fine. This is, after all, what happens with ext3 and it's
>> never been reported as a bug (that I know of).
>
> It's not ext3 you have to worry about - it's the filesystems that
> need special state set up on their pages/buffers for ->writepage to
> work correctly that are the problem. You need to call
> ->write_begin/->write_end to get the state set up properly.
>
> If this state is not set up properly, silent data loss will occur
> during mmap writes either by ENOSPC or failing to set up writes into
> unwritten extents correctly (i.e. we'll be back to where we were in
> 2.6.15).
>
> I don't think ->page_mkwrite can be worked around - we need that to
> be called on the first write fault of any mmap()d page to ensure it
> is set up correctly for writeback. ?If we don't get write faults
> after the page is mlock()d, then we need the ->page_mkwrite() call
> during the mlock() call.
Just to be clear - I'm proposing to skip the entire do_wp_page() call
by doing a read fault rather than a write fault. If the page wasn't
dirty already, it will stay clean and with a non-writable PTE until it
gets actually written to, at which point we'll get a write fault and
do_wp_page will be invoked as usual.
I am not proposing to skip the page_mkwrite() while upgrading the PTE
permissions, which I think is what you were arguing against ?
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> I don't think ->page_mkwrite can be worked around - we need that to
> be called on the first write fault of any mmap()d page to ensure it
> is set up correctly for writeback. If we don't get write faults
> after the page is mlock()d, then we need the ->page_mkwrite() call
> during the mlock() call.
OK, so I'm not an mm hacker, so maybe I'm missing something. Could
part of this be fixed by simply sending the write faults for
mlock()'ed pages, so page_mkwrite() gets called when the page is
dirtied. Seems like a real waste to have the file system pre-allocate
all of the blocks for a mlock()'ed region. Why does mlock() have to
result in the write faults getting suppressed when the page is
actually dirtied?
- Ted
On Wed, 17 Nov 2010 18:52:30 -0500
"Ted Ts'o" <[email protected]> wrote:
> On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > I don't think ->page_mkwrite can be worked around - we need that to
> > be called on the first write fault of any mmap()d page to ensure it
> > is set up correctly for writeback. If we don't get write faults
> > after the page is mlock()d, then we need the ->page_mkwrite() call
> > during the mlock() call.
>
> OK, so I'm not an mm hacker, so maybe I'm missing something. Could
> part of this be fixed by simply sending the write faults for
> mlock()'ed pages, so page_mkwrite() gets called when the page is
> dirtied. Seems like a real waste to have the file system pre-allocate
> all of the blocks for a mlock()'ed region. Why does mlock() have to
> result in the write faults getting suppressed when the page is
> actually dirtied?
Yup, I don't think it would be too bad to take a minor fault each time
an mlocked page transitions from clean->dirty.
In fact we should already be doing that, after the mlocked page gets
written back by kupdate? Hope so!
On Wed, Nov 17, 2010 at 04:28:54PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 23:57 +1100, Nick Piggin wrote:
> > On Wed, Nov 17, 2010 at 04:23:58AM -0800, Michel Lespinasse wrote:
> > > When faulting in pages for mlock(), we want to break COW for anonymous
> > > or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
> > > no need to write-fault into VM_SHARED vmas since shared file pages can
> > > be mlocked first and dirtied later, when/if they actually get written to.
> > > Skipping the write fault is desirable, as we don't want to unnecessarily
> > > cause these pages to be dirtied and queued for writeback.
> >
> > It's not just to break COW, but to do block allocation and such
> > (filesystem's page_mkwrite op). That needs to at least be explained
> > in the changelog.
>
> Agreed, the 0/3 description actually does mention this.
Oh, missed that, but yes the changelog needs to.
> > Filesystem doesn't have a good way to fully pin required things
> > according to mlock, but page_mkwrite provides some reasonable things
> > (like block allocation / reservation).
>
> Right, but marking all pages dirty isn't really sane. I can imagine
> making the reservation but not marking things dirty solution, although
> it might be lots harder to implement, esp since some filesystems don't
> actually have a page_mkwrite() implementation.
I think it is sane enough. Going back to previous behaviour would be
a regression, wouldn't it?
The right way to fix this would not be to introduce the new regression
but either/both: a specific syscall to mlock-for-read which does not do
any reservations, fix filesystem hook to allow reservation without
implying dirtying. A simple flag to page_mkwrite will be enough (plus
the logic to call it from VM).
If an app has called mlock, presumably it doesn't want to SIGBUS from
out of space, if it can possibly help it. If it isn't going to write
to it, then PROT_READ would be appropriate.
If not, then a note to man page maintainer, and an idea of performance
improvement in an actual use case would be nice.
On Nov 18, 2010, at 12:46 AM, Nick Piggin wrote:
> The right way to fix this would not be to introduce the new regression
> but either/both: a specific syscall to mlock-for-read which does not do
> any reservations, fix filesystem hook to allow reservation without
> implying dirtying. A simple flag to page_mkwrite will be enough (plus
> the logic to call it from VM).
Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that.
Are there really programs that assume that mlock() == fallocate()?!?
-- Ted
On Wed, Nov 17, 2010 at 04:53:09PM -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 18:52:30 -0500
> "Ted Ts'o" <[email protected]> wrote:
>
> > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > > I don't think ->page_mkwrite can be worked around - we need that to
> > > be called on the first write fault of any mmap()d page to ensure it
> > > is set up correctly for writeback. If we don't get write faults
> > > after the page is mlock()d, then we need the ->page_mkwrite() call
> > > during the mlock() call.
> >
> > OK, so I'm not an mm hacker, so maybe I'm missing something. Could
> > part of this be fixed by simply sending the write faults for
> > mlock()'ed pages, so page_mkwrite() gets called when the page is
> > dirtied. Seems like a real waste to have the file system pre-allocate
> > all of the blocks for a mlock()'ed region. Why does mlock() have to
> > result in the write faults getting suppressed when the page is
> > actually dirtied?
This is actually what the patch does - by having mlock() use a read fault,
pages are loaded in memory and mlocked, but the ptes are not marked as
writable so that a later write access will be caught as a write fault at
that time (with all the usual dirtying and page_mkwrite() callbacks).
> Yup, I don't think it would be too bad to take a minor fault each time
> an mlocked page transitions from clean->dirty.
>
> In fact we should already be doing that, after the mlocked page gets
> written back by kupdate? Hope so!
Yes, handle_mm_fault() is careful to never create writable ptes pointing
to clean file pages, so that a later write fault will correctly dirty
the corresponding page.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> Hence I think that avoiding ->page_mkwrite callouts is likely to
> break some filesystems in subtle, undetected ways. IMO, regardless
> of what is done, it would be really good to start by writing a new
> regression test to exercise and encode the expected the mlock
> behaviour so we can detect regressions later on....
I think it would help if we could drink a bit of the test driven design
coolaid here. Michel, can you write some testcases where pages on a
shared mapping are mlocked, then dirtied and then munlocked, and then
written out using msync/fsync. Anything that fails this test on
btrfs/ext4/gfs/xfs/etc obviously doesn't work.
On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote:
> Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that.
>
> Are there really programs that assume that mlock() == fallocate()?!?
If there are programs that do they can't predate linux 2.6.15, and only
work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather
unlikely to me.
On Thu, 18 Nov 2010, Christoph Hellwig wrote:
> On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > Hence I think that avoiding ->page_mkwrite callouts is likely to
> > break some filesystems in subtle, undetected ways. IMO, regardless
> > of what is done, it would be really good to start by writing a new
> > regression test to exercise and encode the expected the mlock
> > behaviour so we can detect regressions later on....
>
> I think it would help if we could drink a bit of the test driven design
> coolaid here. Michel, can you write some testcases where pages on a
> shared mapping are mlocked, then dirtied and then munlocked, and then
> written out using msync/fsync. Anything that fails this test on
> btrfs/ext4/gfs/xfs/etc obviously doesn't work.
Whilst it's hard to argue against a request for testing, Dave's worries
just sprang from a misunderstanding of all the talk about "avoiding ->
page_mkwrite". There's nothing strange or risky about Michel's patch,
it does not avoid ->page_mkwrite when there is a write: it just stops
pretending that there was a write when locking down the shared area.
Hugh
On Thu, 18 Nov 2010, Christoph Hellwig wrote:
> On Thu, Nov 18, 2010 at 05:43:06AM -0500, Theodore Tso wrote:
> > Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that.
> >
> > Are there really programs that assume that mlock() == fallocate()?!?
>
> If there are programs that do they can't predate linux 2.6.15, and only
> work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather
> unlikely to me.
Yes, almost. I'm very much on this side, that mlocking should not dirty
all those pages; but better admit one argument for the opposition - it's
possible that we'd find a case somewhere, which has always (i.e. even pre-
page_mkwrite) relied upon mlock of an entirely sparse file to result in
a nicely ordered allocation of blocks to the file (as would often have
happened with pdflush, I think), to give good sequential read patterns
ever after; but with this patch would now get much more random block
ordering, according to where the real writes actually fall.
It would be possible for a filesystem's ->fault(vma, &vmf) to observe
that it's being called on a VM_LOCKED|VM_SHARED vma, and make sure that
the page has backing in that case, to reproduce the old allocation behaviour
without all the unnecessary writing. But that would be extra work in every
filesystem that cares.
Hugh
On Wed, Nov 17, 2010 at 03:31:37PM -0800, Michel Lespinasse wrote:
> On Wed, Nov 17, 2010 at 3:11 PM, Dave Chinner <[email protected]> wrote:
> >> Really, my understanding is that not pre-allocating filesystem blocks
> >> is just fine. This is, after all, what happens with ext3 and it's
> >> never been reported as a bug (that I know of).
> >
> > It's not ext3 you have to worry about - it's the filesystems that
> > need special state set up on their pages/buffers for ->writepage to
> > work correctly that are the problem. You need to call
> > ->write_begin/->write_end to get the state set up properly.
> >
> > If this state is not set up properly, silent data loss will occur
> > during mmap writes either by ENOSPC or failing to set up writes into
> > unwritten extents correctly (i.e. we'll be back to where we were in
> > 2.6.15).
> >
> > I don't think ->page_mkwrite can be worked around - we need that to
> > be called on the first write fault of any mmap()d page to ensure it
> > is set up correctly for writeback. ?If we don't get write faults
> > after the page is mlock()d, then we need the ->page_mkwrite() call
> > during the mlock() call.
>
> Just to be clear - I'm proposing to skip the entire do_wp_page() call
> by doing a read fault rather than a write fault. If the page wasn't
> dirty already, it will stay clean and with a non-writable PTE until it
> gets actually written to, at which point we'll get a write fault and
> do_wp_page will be invoked as usual.
I have no problem with that - I'm surprised that mlock didn't work
that way in the first place.
> I am not proposing to skip the page_mkwrite() while upgrading the PTE
> permissions, which I think is what you were arguing against ?
I wasn't arguing against anything, merely pointing out that the
->page_mkwrite call is aboslutely necessary. You've made clarified
that it still occurs, so I'm happy...
FWIW, what I was responding to was the assumption that "this is
alright for ext3, so it must be OK" extrapolation about
->page_mkwrite behaviour. Especially as ext3 does not even implement
the ->page_mkwrite operation (which means mmap writes into holes
can't detect ENOSPC correctly)...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote:
> On Thu, 18 Nov 2010, Christoph Hellwig wrote:
> > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > > Hence I think that avoiding ->page_mkwrite callouts is likely to
> > > break some filesystems in subtle, undetected ways. IMO, regardless
> > > of what is done, it would be really good to start by writing a new
> > > regression test to exercise and encode the expected the mlock
> > > behaviour so we can detect regressions later on....
> >
> > I think it would help if we could drink a bit of the test driven design
> > coolaid here. Michel, can you write some testcases where pages on a
> > shared mapping are mlocked, then dirtied and then munlocked, and then
> > written out using msync/fsync. Anything that fails this test on
> > btrfs/ext4/gfs/xfs/etc obviously doesn't work.
I think it's still under debate what's an acceptable result for this test
(i.e. what's supposed to happen during mlock of a shared mapping of
a sparsely allocated file - is a fallocate equivalent supposed to happen ?)
But I agree discussing based on test results will make things more concrete.
> Whilst it's hard to argue against a request for testing, Dave's worries
> just sprang from a misunderstanding of all the talk about "avoiding ->
> page_mkwrite". There's nothing strange or risky about Michel's patch,
> it does not avoid ->page_mkwrite when there is a write: it just stops
> pretending that there was a write when locking down the shared area.
So, I decided to test this using memtoy. /data is a separate partition
where I had just 10GB free space, and /data/hole20G was created using
dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0.
memtoy>file /data/hole20G shared
memtoy>map hole20G
At this point the file is mapped using a writable, shared VMA.
memtoy>touch hole20G
memtoy: touched 5242880 pages in 30.595 secs
At this point memtoy's address space is populated with zeroed
pages. The pages are distinct (meminfo does show 20G of allocated pages),
are classified as file pages, not anon, and are associated with the
struct address_space for /data/hole20G. That file still does not have
blocks allocated, as can be seen with du /data/hole20G.
memtoy>lock hole20G
memtoy tries to mlock the hole20G VMA.
This is where things get interesting.
Using ext2, without my patch (ext3 should be similar):
- first, mlock does fast progress going though file pages, marking them
as dirty. Eventually, it hits the dirty limit and gets throttled.
- then, mlock does slow progress as it needs to wait for writeback.
writeback occurs and allocates blocks for the /data/hole20G.
Eventually, the /data partition gets full.
- then, mlock does no progress as it's at the dirty limit and nothing
gets written back.
- mlock never terminates.
Using ext2, with my patch (ext3 should be similar):
- mlock goes through all pages in ~5 seconds, marking them as mlocked
(but still not dirty)
- mlock completes. /data/hole20G still does not have blocks allocated.
- if memtoy is then instructed to dirty all the pages
(using 'touch hole20G write'):
- first, memtoy does fast progress faulting through file pages, marking
them as dirty. Eventually, it hits the dirty limit and gets throttled.
- then, memtoy does slow progress as it needs to wait for writeback.
writeback occurs and allocates blocks for the /data/hole20G.
Eventually, the /data partition gets full.
- then, memtoy does no progress as it's at the dirty limit and nothing
gets written back. It gets stuck into a write fault that never
completes.
- i.e. this is essentially the same lockup as without my patch, except that
it occurs when the application tries to dirty the shared file rather than
during mlock itself.
Using ext4, without my patch:
- first, mlock does fast progress going though file pages, marking them
as dirty. Eventually, it hits the dirty limit and gets throttled.
- then, mlock does slow progress as it needs to wait for writeback.
writeback occurs and allocates blocks for the /data/hole20G.
Eventually, the /data partition gets full.
- then, mlock returns an error.
Using ext4, with my patch:
- mlock goes through all pages in ~5 seconds, marking them as mlocked
(but still not dirty)
- mlock completes. /data/hole20G still does not have blocks allocated.
- if memtoy is then instructed to dirty all the pages
(using 'touch hole20G write'):
- first, memtoy does fast progress faulting through file pages, marking
them as dirty. Eventually, it hits the dirty limit and gets throttled.
- then, memtoy does slow progress as it needs to wait for writeback.
writeback occurs and allocates blocks for the /data/hole20G.
Eventually, the /data partition gets full.
- at that point, memtoy dies of SIGBUS.
- i.e. for filesystems that define the page_mkwrite callback, the mlock
behavior when running out of space writing to sparse files is clearly
nicer without my patch than with it.
Not 100% sure what to make of these results.
Approaching the problem the other way - would there be any objection to
adding code to do an fallocate() equivalent at the start of mlock ?
This would be a no-op when the file is fully allocated on disk, and would
allow mlock to return an error if the file can't get fully allocated
(no idea what errno should be for such case, though).
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Fri, Nov 19, 2010 at 2:23 AM, Michel Lespinasse <[email protected]> wrote:
>
> Approaching the problem the other way - would there be any objection to
> adding code to do an fallocate() equivalent at the start of mlock ?
> This would be a no-op when the file is fully allocated on disk, and would
> allow mlock to return an error if the file can't get fully allocated
> (no idea what errno should be for such case, though).
My vote would be against. ? If you if you mmap a sparse file and then
try writing to it willy-nilly, bad things will happen. ?This is true without
a mlock(). ? Where is it written that mlock() has anything to do with
improving this situation?
If userspace wants to call fallocate() before it calls mlock(), it should
do that. ?And in fact, in most cases, userspace should probably be
encouraged to do that. ? But having mlock() call fallocate() and
then return ENOSPC if there's no room? Isn't it confusing that mlock()
call ENOSPC? Doesn't that give you cognitive dissonance? It should
because fundamentally mlock() has nothing to do with block allocation!!
Read the API spec!
Look, it was an accident / bug of the implementation that mlock()
magically dirtied all these pages. ?It might have made some situations
better, but I very much doubt applications depended upon it, and I'd
really rather not perpetuate this particular magic side effect of the
previously buggy implementation of mlock().
-- Ted
On Fri, Nov 19, 2010 at 08:42:05AM -0500, Theodore Tso wrote:
> My vote would be against. ? If you if you mmap a sparse file and then
> try writing to it willy-nilly, bad things will happen. ?This is true without
> a mlock(). ? Where is it written that mlock() has anything to do with
> improving this situation?
Exactly. Allocating space has been a side-effect on a handfull
filesystem for about 20 kernel releases.
> If userspace wants to call fallocate() before it calls mlock(), it should
> do that. ?And in fact, in most cases, userspace should probably be
> encouraged to do that. ? But having mlock() call fallocate() and
> then return ENOSPC if there's no room? Isn't it confusing that mlock()
> call ENOSPC? Doesn't that give you cognitive dissonance? It should
> because fundamentally mlock() has nothing to do with block allocation!!
> Read the API spec!
Indeed. There is no need to make mlock + flag a parallel-API to
fallocate.
On Thu, 18 Nov 2010 23:23:16 -0800
Michel Lespinasse <[email protected]> wrote:
> On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote:
> > On Thu, 18 Nov 2010, Christoph Hellwig wrote:
> > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > > > Hence I think that avoiding ->page_mkwrite callouts is likely to
> > > > break some filesystems in subtle, undetected ways. IMO, regardless
> > > > of what is done, it would be really good to start by writing a new
> > > > regression test to exercise and encode the expected the mlock
> > > > behaviour so we can detect regressions later on....
> > >
> > > I think it would help if we could drink a bit of the test driven design
> > > coolaid here. Michel, can you write some testcases where pages on a
> > > shared mapping are mlocked, then dirtied and then munlocked, and then
> > > written out using msync/fsync. Anything that fails this test on
> > > btrfs/ext4/gfs/xfs/etc obviously doesn't work.
>
> I think it's still under debate what's an acceptable result for this test
> (i.e. what's supposed to happen during mlock of a shared mapping of
> a sparsely allocated file - is a fallocate equivalent supposed to happen ?)
> But I agree discussing based on test results will make things more concrete.
>
> > Whilst it's hard to argue against a request for testing, Dave's worries
> > just sprang from a misunderstanding of all the talk about "avoiding ->
> > page_mkwrite". There's nothing strange or risky about Michel's patch,
> > it does not avoid ->page_mkwrite when there is a write: it just stops
> > pretending that there was a write when locking down the shared area.
>
> So, I decided to test this using memtoy.
Wait. You *tested* the kernel?
I dunno, kids these days...
> /data is a separate partition
> where I had just 10GB free space, and /data/hole20G was created using
> dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0.
>
> memtoy>file /data/hole20G shared
> memtoy>map hole20G
>
> At this point the file is mapped using a writable, shared VMA.
>
> memtoy>touch hole20G
> memtoy: touched 5242880 pages in 30.595 secs
>
> At this point memtoy's address space is populated with zeroed
> pages. The pages are distinct (meminfo does show 20G of allocated pages),
> are classified as file pages, not anon, and are associated with the
> struct address_space for /data/hole20G. That file still does not have
> blocks allocated, as can be seen with du /data/hole20G.
>
> memtoy>lock hole20G
>
> memtoy tries to mlock the hole20G VMA.
> This is where things get interesting.
>
> Using ext2, without my patch (ext3 should be similar):
> - first, mlock does fast progress going though file pages, marking them
> as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, mlock does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - then, mlock does no progress as it's at the dirty limit and nothing
> gets written back.
> - mlock never terminates.
>
> Using ext2, with my patch (ext3 should be similar):
> - mlock goes through all pages in ~5 seconds, marking them as mlocked
> (but still not dirty)
> - mlock completes. /data/hole20G still does not have blocks allocated.
> - if memtoy is then instructed to dirty all the pages
> (using 'touch hole20G write'):
> - first, memtoy does fast progress faulting through file pages, marking
> them as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, memtoy does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - then, memtoy does no progress as it's at the dirty limit and nothing
> gets written back. It gets stuck into a write fault that never
> completes.
> - i.e. this is essentially the same lockup as without my patch, except that
> it occurs when the application tries to dirty the shared file rather than
> during mlock itself.
Seems to me that this bug is the first thing we should be looking at.
> Using ext4, without my patch:
> - first, mlock does fast progress going though file pages, marking them
> as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, mlock does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - then, mlock returns an error.
>
> Using ext4, with my patch:
> - mlock goes through all pages in ~5 seconds, marking them as mlocked
> (but still not dirty)
> - mlock completes. /data/hole20G still does not have blocks allocated.
> - if memtoy is then instructed to dirty all the pages
> (using 'touch hole20G write'):
> - first, memtoy does fast progress faulting through file pages, marking
> them as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, memtoy does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - at that point, memtoy dies of SIGBUS.
> - i.e. for filesystems that define the page_mkwrite callback, the mlock
> behavior when running out of space writing to sparse files is clearly
> nicer without my patch than with it.
>
>
> Not 100% sure what to make of these results.
>
> Approaching the problem the other way - would there be any objection to
> adding code to do an fallocate() equivalent at the start of mlock ?
> This would be a no-op when the file is fully allocated on disk, and would
> allow mlock to return an error if the file can't get fully allocated
> (no idea what errno should be for such case, though).
Dirtying all that memory at mlock() time is pretty obnoxious.
I'm inclined to agree that your patch implements the desirable
behaviour: don't dirty the page, don't do block allocation. Take a
fault at first-dirtying and do it then. This does degrade mlock a bit:
the user will find that the first touch of an mlocked page can cause
synchronous physical I/O, which isn't mlocky behaviour *at all*. But
we have to be able to do this anyway - whenever the kupdate function
writes back the dirty pages it has to mark them read-only again so the
kernel knows when they get redirtied.
I do agree that this will result in worse file layout for some
reasonable userspace code patterns. But it was always that way, except
for the eleven-release window where we kinda accidentally fixed that up
in-kernel. Hopefully most apps which care are already ensuring that
the file is well laid-out.
So all that leaves me thinking that we merge your patches as-is. Then
work out why users can fairly trivially use mlock to hang the kernel on
ext2 and ext3 (and others?)
On Fri, Nov 19, 2010 at 02:54:42PM -0800, Andrew Morton wrote:
>
> Dirtying all that memory at mlock() time is pretty obnoxious.
> ...
> So all that leaves me thinking that we merge your patches as-is. Then
> work out why users can fairly trivially use mlock to hang the kernel on
> ext2 and ext3 (and others?)
So at least on RHEL 4 and 5 systems, pam_limits was configured so that
unprivileged processes could only mlock() at most 16k. This was
deemed enough so that programs could protect crypto keys. The
thinking when we added the mlock() ulimit setting was that
unprivileged users could very easily make a nuisance of themselves,
and grab way too much system resources, by using mlock() in obnoxious
ways.
I was just checking to see if my memory was correct, and to my
surprise, I've just found that Ubuntu deliberately sets the memlock
ulimit to be unlimited. Which means that Ubuntu systems are
completely wide open for this particular DOS attack. So if you
administer an Ubuntu-based server, it might be a good idea to make a
tiny little change to /etc/security/limits.conf....
- Ted
On Fri, Nov 19, 2010 at 2:54 PM, Andrew Morton
<[email protected]> wrote:
> On Thu, 18 Nov 2010 23:23:16 -0800
> Michel Lespinasse <[email protected]> wrote:
>
>> On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote:
>> > On Thu, 18 Nov 2010, Christoph Hellwig wrote:
>> > > I think it would help if we could drink a bit of the test driven design
>> > > coolaid here. Michel, can you write some testcases where pages on a
>> > > shared mapping are mlocked, then dirtied and then munlocked, and then
>> > > written out using msync/fsync. ?Anything that fails this test on
>> > > btrfs/ext4/gfs/xfs/etc obviously doesn't work.
>> > Whilst it's hard to argue against a request for testing, Dave's worries
>> > just sprang from a misunderstanding of all the talk about "avoiding ->
>> > page_mkwrite". ?There's nothing strange or risky about Michel's patch,
>> > it does not avoid ->page_mkwrite when there is a write: it just stops
>> > pretending that there was a write when locking down the shared area.
>>
>> So, I decided to test this using memtoy.
>
> Wait. ?You *tested* the kernel?
>
> I dunno, kids these days...
Not guilty - I mean, Christoph made me do it !
> Dirtying all that memory at mlock() time is pretty obnoxious.
>
> I'm inclined to agree that your patch implements the desirable
> behaviour: don't dirty the page, don't do block allocation. ?Take a
> fault at first-dirtying and do it then. ?This does degrade mlock a bit:
> the user will find that the first touch of an mlocked page can cause
> synchronous physical I/O, which isn't mlocky behaviour *at all*. ?But
> we have to be able to do this anyway - whenever the kupdate function
> writes back the dirty pages it has to mark them read-only again so the
> kernel knows when they get redirtied.
Glad to see that we seem to be coming to an agreement here.
> So all that leaves me thinking that we merge your patches as-is. ?Then
> work out why users can fairly trivially use mlock to hang the kernel on
> ext2 and ext3 (and others?)
I would say the hang is not even mlock related - you see without it
also. All you need is mmap a large file with holes and write fault
pages until you run out of disk space. At that point additional write
faults wait for a writeback that can never complete. Sysadmin can
however kill -9 such processes and/or free some space, though.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
On Thu, Nov 18, 2010 at 11:23:16PM -0800, Michel Lespinasse wrote:
> On Thu, Nov 18, 2010 at 09:41:22AM -0800, Hugh Dickins wrote:
> > On Thu, 18 Nov 2010, Christoph Hellwig wrote:
> > > On Thu, Nov 18, 2010 at 10:11:43AM +1100, Dave Chinner wrote:
> > > > Hence I think that avoiding ->page_mkwrite callouts is likely to
> > > > break some filesystems in subtle, undetected ways. IMO, regardless
> > > > of what is done, it would be really good to start by writing a new
> > > > regression test to exercise and encode the expected the mlock
> > > > behaviour so we can detect regressions later on....
> > >
> > > I think it would help if we could drink a bit of the test driven design
> > > coolaid here. Michel, can you write some testcases where pages on a
> > > shared mapping are mlocked, then dirtied and then munlocked, and then
> > > written out using msync/fsync. Anything that fails this test on
> > > btrfs/ext4/gfs/xfs/etc obviously doesn't work.
>
> I think it's still under debate what's an acceptable result for this test
> (i.e. what's supposed to happen during mlock of a shared mapping of
> a sparsely allocated file - is a fallocate equivalent supposed to happen ?)
> But I agree discussing based on test results will make things more concrete.
>
> > Whilst it's hard to argue against a request for testing, Dave's worries
> > just sprang from a misunderstanding of all the talk about "avoiding ->
> > page_mkwrite". There's nothing strange or risky about Michel's patch,
> > it does not avoid ->page_mkwrite when there is a write: it just stops
> > pretending that there was a write when locking down the shared area.
>
> So, I decided to test this using memtoy. /data is a separate partition
> where I had just 10GB free space, and /data/hole20G was created using
> dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0.
>
> memtoy>file /data/hole20G shared
> memtoy>map hole20G
>
> At this point the file is mapped using a writable, shared VMA.
>
> memtoy>touch hole20G
> memtoy: touched 5242880 pages in 30.595 secs
>
> At this point memtoy's address space is populated with zeroed
> pages. The pages are distinct (meminfo does show 20G of allocated pages),
> are classified as file pages, not anon, and are associated with the
> struct address_space for /data/hole20G. That file still does not have
> blocks allocated, as can be seen with du /data/hole20G.
>
> memtoy>lock hole20G
>
> memtoy tries to mlock the hole20G VMA.
> This is where things get interesting.
>
> Using ext2, without my patch (ext3 should be similar):
> - first, mlock does fast progress going though file pages, marking them
> as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, mlock does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - then, mlock does no progress as it's at the dirty limit and nothing
> gets written back.
> - mlock never terminates.
That's the typical result of ENOSPC during mmap writes into holes on
filesystems that don't implement ->page_mkwrite() - this effectively
overcommits the filesystem free space with more dirty pages than can
fit in the free space, and then the system will get stuck on ENOSPC
errors when trying to allocate in the writeback path, leaving you
with dirty pages that can't be cleaned...
> Using ext2, with my patch (ext3 should be similar):
....
> - i.e. this is essentially the same lockup as without my patch, except that
> it occurs when the application tries to dirty the shared file rather than
> during mlock itself.
Same thing - your patch just moves the trigger.
> Using ext4, without my patch:
> - first, mlock does fast progress going though file pages, marking them
> as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, mlock does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - then, mlock returns an error.
Yup, ext4 implements page_mkwrite(), so can detect ENOSPC during the
write page fault.
> Using ext4, with my patch:
> - mlock goes through all pages in ~5 seconds, marking them as mlocked
> (but still not dirty)
> - mlock completes. /data/hole20G still does not have blocks allocated.
> - if memtoy is then instructed to dirty all the pages
> (using 'touch hole20G write'):
> - first, memtoy does fast progress faulting through file pages, marking
> them as dirty. Eventually, it hits the dirty limit and gets throttled.
> - then, memtoy does slow progress as it needs to wait for writeback.
> writeback occurs and allocates blocks for the /data/hole20G.
> Eventually, the /data partition gets full.
> - at that point, memtoy dies of SIGBUS.
> - i.e. for filesystems that define the page_mkwrite callback, the mlock
> behavior when running out of space writing to sparse files is clearly
> nicer without my patch than with it.
Which is how all filesystems _should_ behave. Any filesystem that does
not implement ->page_mkwrite will break under this test, regardless
of your patch. It is exercising the exact problem that page_mkwrite
was introduced to solve....
> Not 100% sure what to make of these results.
>
> Approaching the problem the other way - would there be any objection to
> adding code to do an fallocate() equivalent at the start of mlock ?
IMO, you do not need fallocate or any form of preallocation at all
during mlock(). As I've already pointed out, this still requires
filesystems to implement ->page_mkwrite to ensure mmap writes into
preallocated regions work correctly.
Fundamentally, filesystems need to implement ->page-mkwrite() to do
the right thing w.r.t. mmap writes and ENOSPC, and any filesystem
that doesn't is plain broken. You don't need to try to fix this
problem by making mlock() jump through hoops - it'll be fixed by
people implementing a working ->page_mkwrite function for their
filesystem.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, 2010-11-19 at 18:22 -0500, Ted Ts'o wrote:
> On Fri, Nov 19, 2010 at 02:54:42PM -0800, Andrew Morton wrote:
> >
> > Dirtying all that memory at mlock() time is pretty obnoxious.
> > ...
> > So all that leaves me thinking that we merge your patches as-is. Then
> > work out why users can fairly trivially use mlock to hang the kernel on
> > ext2 and ext3 (and others?)
>
> So at least on RHEL 4 and 5 systems, pam_limits was configured so that
> unprivileged processes could only mlock() at most 16k. This was
> deemed enough so that programs could protect crypto keys. The
> thinking when we added the mlock() ulimit setting was that
> unprivileged users could very easily make a nuisance of themselves,
> and grab way too much system resources, by using mlock() in obnoxious
> ways.
>
> I was just checking to see if my memory was correct, and to my
> surprise, I've just found that Ubuntu deliberately sets the memlock
> ulimit to be unlimited. Which means that Ubuntu systems are
> completely wide open for this particular DOS attack. So if you
> administer an Ubuntu-based server, it might be a good idea to make a
> tiny little change to /etc/security/limits.conf....
>
> - Ted
Kees,
Copying you into this thread, in case you'd like to respond from the
Ubuntu side. Thanks for the heads-up, Ted.
--
:-Dustin
Dustin Kirkland
Canonical, LTD
[email protected]
GPG: 1024D/83A61194