2013-09-16 08:48:15

by Fengguang Wu

Subject: [munlock] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067

Greetings,

I got the below dmesg and the first bad commit is

commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9
Author: Vlastimil Babka <[email protected]>
Date: Wed Sep 11 14:22:35 2013 -0700

mm: munlock: manual pte walk in fast path instead of follow_page_mask()

Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
each individual struct page. This entails repeated full page table
translations and page table lock taken for each page separately.

This patch avoids the costly follow_page_mask() where possible, by
iterating over ptes within single pmd under single page table lock. The
first pte is obtained by get_locked_pte() for non-THP page acquired by the
initial follow_page_mask(). The rest of the on-stack pagevec for munlock
is filled up using pte_walk as long as pte_present() and vm_normal_page()
are sufficient to obtain the struct page.

After this patch, a 14% speedup was measured for munlocking a 56GB large
memory area with THP disabled.

Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Jörn Engel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>


[ 56.020577] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067
[ 56.022578] addr:08800000 vm_flags:00100073 anon_vma:7f5f6f00 mapping: (null) index:8800
[ 56.025276] CPU: 0 PID: 101 Comm: killall5 Not tainted 3.11.0-09272-g666a584 #52

git bisect start 666a584d3a765a914642f80deef7a33fb309df5d v3.11 --
git bisect good a09e9a7a4b907f2dfa9bdb2b98a1828ab4b340b2 # 22:15 1080+ Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect good 8e73e367f7dc50f1d1bc22a63e5764bb4eea9b48 # 22:43 1080+ Merge tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 64c353864e3f7ccba0ade1bd6f562f9a3bc7e68d # 00:14 1080+ Merge branch 'for-v3.12' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
git bisect good 640414171818c6293c23e74a28d1c69b2a1a7fe5 # 00:23 1080+ Merge tag 'late-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good fa1586a7e43760f0e25e72b2e3f97ee18b2be967 # 01:08 1080+ Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
git bisect good bc4b4448dba660afc8df3790564320302d9709a1 # 01:43 1080+ mm: move pgtable related functions to right place
git bisect bad 325c4ef5c4b17372c3222d896040d7848e67fbdb # 02:08 133- mm/madvise.c:madvise_hwpoison(): remove local `ret'
git bisect good e76b63f80d938a1319eb5fb0ae7ea69bddfbae38 # 02:27 1340+ memblock, numa: binary search node id
git bisect bad 762216ab4e175f49d17bc7ad778c57b9028184e6 # 02:51 708- mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area
git bisect good 586a32ac1d33ce7a7548a27e4087e98842c3a06f # 03:40 3517+ mm: munlock: remove unnecessary call to lru_add_drain()
git bisect good 5b40998ae35cf64561868370e6c9f3d3e94b6bf7 # 04:23 3517+ mm: munlock: remove redundant get_page/put_page pair on the fast path
git bisect bad 6e543d5780e36ff5ee56c44d7e2e30db3457a7ed # 04:53 148- mm: vmscan: fix do_try_to_free_pages() livelock
git bisect bad 7a8010cd36273ff5f6fea5201ef9232f30cebbd9 # 05:03 69- mm: munlock: manual pte walk in fast path instead of follow_page_mask()
git bisect good 5b40998ae35cf64561868370e6c9f3d3e94b6bf7 # 09:42 10000+ mm: munlock: remove redundant get_page/put_page pair on the fast path
git bisect bad d5d04bb48f0eb89c14e76779bb46212494de0bec # 10:08 128- Bye, bye, WfW flag
git bisect good 14f83d4c02fa126fd699570429a0bb888e12ddf7 # 16:20 10000+ Revert "mm: munlock: manual pte walk in fast path instead of follow_page_mask()"
git bisect bad d8efd82eece89f8a5790b0febf17522affe9e1f1 # 16:34 45- Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus

Thanks,
Fengguang


Attachments:
dmesg-quantal-ant-14:20130912234102:3.11.0-09272-g666a584:52 (48.09 kB)
bisect-666a584d3a765a914642f80deef7a33fb309df5d-i386-randconfig-i002-0912-BUG:-Bad-page-map-in-process-22004.log (28.52 kB)
config-3.11.0-09272-g666a584 (68.20 kB)

2013-09-16 15:27:10

by Vlastimil Babka

Subject: Re: [munlock] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067

On 09/16/2013 10:47 AM, Fengguang Wu wrote:
> Greetings,
>
> I got the below dmesg and the first bad commit is
>
> commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9
> Author: Vlastimil Babka <[email protected]>
> Date: Wed Sep 11 14:22:35 2013 -0700
>
> mm: munlock: manual pte walk in fast path instead of follow_page_mask()
>

>
> [ 56.020577] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067
> [ 56.022578] addr:08800000 vm_flags:00100073 anon_vma:7f5f6f00 mapping: (null) index:8800
> [ 56.025276] CPU: 0 PID: 101 Comm: killall5 Not tainted 3.11.0-09272-g666a584 #52
>

Hello,

The stacktrace points clearly to the code added by the patch (the function __munlock_pagevec_fill());
no question about that. However, the addresses reported by print_bad_pte() in the logs
(08800000 and 0a000000) both lie on a page table boundary (note this is x86_32 without PAE)
and should never appear inside the while loop of the function (nor be passed to vm_normal_page()).
This could only happen if pmd_addr_end() failed to prevent crossing the page table boundary,
and I just cannot see how that could occur without some variables being corrupted :/
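
For illustration, a quick userspace check (just a sketch; 4 MB is the usual
PGDIR_SIZE for 2-level x86_32 without PAE) confirms that both reported
addresses are the first address of a new page table:

/*
 * Sketch, not kernel code: on x86_32 without PAE the page tables are
 * two-level, so a single page table maps PGDIR_SIZE = 4 MB.  Both
 * addresses reported by print_bad_pte() are exact multiples of that.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long pgdir_size = 1UL << 22;	/* 4 MB */
	const unsigned long addrs[] = { 0x08800000UL, 0x0a000000UL };
	int i;

	for (i = 0; i < 2; i++)
		printf("%#010lx %% %#lx = %lu\n",
		       addrs[i], pgdir_size, addrs[i] % pgdir_size);
	/* prints 0 for both, i.e. both sit on a page table boundary */
	return 0;
}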

Also, some of the failures during the bisect were not due to this bug, but to a WARNING about
list_add corruption which hopefully is not related to munlock. While it is probably a stretch,
some kind of memory corruption could also lead to the erroneous behavior of the munlock code.

Can you therefore please retest with the bisected patch reverted (patch below) to see if the other
WARNING still occurs and can be dealt with separately, so there are not potentially two bugs to
be chased at the same time?

Thanks,
Vlastimil


-----8<-----
From 979cbdeaaed76e25a9e08c7ccadba5baf5e7c619 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Mon, 16 Sep 2013 17:06:12 +0200
Subject: [PATCH] Revert "mm: munlock: manual pte walk in fast path instead of
follow_page_mask()"

This reverts commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9 for testing.
---
include/linux/mm.h | 12 +++---
mm/mlock.c | 110 +++++++++++++++--------------------------------------
2 files changed, 37 insertions(+), 85 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..e9bab9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -630,12 +630,12 @@ static inline enum zone_type page_zonenum(const struct page *page)
#endif

/*
- * The identification function is mainly used by the buddy allocator for
- * determining if two pages could be buddies. We are not really identifying
- * the zone since we could be using the section number id if we do not have
- * node id available in page flags.
- * We only guarantee that it will return the same value for two combinable
- * pages in a zone.
+ * The identification function is only used by the buddy allocator for
+ * determining if two pages could be buddies. We are not really
+ * identifying a zone since we could be using a the section number
+ * id if we have not node id available in page flags.
+ * We guarantee only that it will return the same value for two
+ * combinable pages in a zone.
*/
static inline int page_zone_id(struct page *page)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index d638026..19a934d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -280,7 +280,8 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
* The second phase finishes the munlock only for pages where isolation
* succeeded.
*
- * Note that the pagevec may be modified during the process.
+ * Note that pvec is modified during the process. Before returning
+ * pagevec_reinit() is called on it.
*/
static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
{
@@ -355,60 +356,8 @@ skip_munlock:
*/
if (pagevec_count(&pvec_putback))
__putback_lru_fast(&pvec_putback, pgrescued);
-}
-
-/*
- * Fill up pagevec for __munlock_pagevec using pte walk
- *
- * The function expects that the struct page corresponding to @start address is
- * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
- *
- * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
- * pages also get pinned.
- *
- * Returns the address of the next page that should be scanned. This equals
- * @start + PAGE_SIZE when no page could be added by the pte walk.
- */
-static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
- struct vm_area_struct *vma, int zoneid, unsigned long start,
- unsigned long end)
-{
- pte_t *pte;
- spinlock_t *ptl;
-
- /*
- * Initialize pte walk starting at the already pinned page where we
- * are sure that there is a pte.
- */
- pte = get_locked_pte(vma->vm_mm, start, &ptl);
- end = min(end, pmd_addr_end(start, end));
-
- /* The page next to the pinned page is the first we will try to get */
- start += PAGE_SIZE;
- while (start < end) {
- struct page *page = NULL;
- pte++;
- if (pte_present(*pte))
- page = vm_normal_page(vma, start, *pte);
- /*
- * Break if page could not be obtained or the page's node+zone does not
- * match
- */
- if (!page || page_zone_id(page) != zoneid)
- break;

- get_page(page);
- /*
- * Increase the address that will be returned *before* the
- * eventual break due to pvec becoming full by adding the page
- */
- start += PAGE_SIZE;
- if (pagevec_add(pvec, page) == 0)
- break;
- }
- pte_unmap_unlock(pte, ptl);
- return start;
+ pagevec_reinit(pvec);
}

/*
@@ -432,16 +381,17 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
+ struct pagevec pvec;
+ struct zone *zone = NULL;
+
+ pagevec_init(&pvec, 0);
vma->vm_flags &= ~VM_LOCKED;

while (start < end) {
- struct page *page = NULL;
+ struct page *page;
unsigned int page_mask, page_increm;
- struct pagevec pvec;
- struct zone *zone;
- int zoneid;
+ struct zone *pagezone;

- pagevec_init(&pvec, 0);
/*
* Although FOLL_DUMP is intended for get_dump_page(),
* it just so happens that its special treatment of the
@@ -450,10 +400,22 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
* has sneaked into the range, we won't oops here: great).
*/
page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
- &page_mask);
-
+ &page_mask);
if (page && !IS_ERR(page)) {
+ pagezone = page_zone(page);
+ /* The whole pagevec must be in the same zone */
+ if (pagezone != zone) {
+ if (pagevec_count(&pvec))
+ __munlock_pagevec(&pvec, zone);
+ zone = pagezone;
+ }
if (PageTransHuge(page)) {
+ /*
+ * THP pages are not handled by pagevec due
+ * to their possible split (see below).
+ */
+ if (pagevec_count(&pvec))
+ __munlock_pagevec(&pvec, zone);
lock_page(page);
/*
* Any THP page found by follow_page_mask() may
@@ -466,31 +428,21 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
put_page(page); /* follow_page_mask() */
} else {
/*
- * Non-huge pages are handled in batches via
- * pagevec. The pin from follow_page_mask()
- * prevents them from collapsing by THP.
- */
- pagevec_add(&pvec, page);
- zone = page_zone(page);
- zoneid = page_zone_id(page);
-
- /*
- * Try to fill the rest of pagevec using fast
- * pte walk. This will also update start to
- * the next page to process. Then munlock the
- * pagevec.
+ * Non-huge pages are handled in batches
+ * via pagevec. The pin from
+ * follow_page_mask() prevents them from
+ * collapsing by THP.
*/
- start = __munlock_pagevec_fill(&pvec, vma,
- zoneid, start, end);
- __munlock_pagevec(&pvec, zone);
- goto next;
+ if (pagevec_add(&pvec, page) == 0)
+ __munlock_pagevec(&pvec, zone);
}
}
page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
start += page_increm * PAGE_SIZE;
-next:
cond_resched();
}
+ if (pagevec_count(&pvec))
+ __munlock_pagevec(&pvec, zone);
}

/*
--
1.8.1.4

2013-09-17 13:29:41

by Fengguang Wu

Subject: Re: [munlock] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067

Hi Vlastimil,

On Mon, Sep 16, 2013 at 05:27:05PM +0200, Vlastimil Babka wrote:
> On 09/16/2013 10:47 AM, Fengguang Wu wrote:
> > Greetings,
> >
> > I got the below dmesg and the first bad commit is
> >
> > commit 7a8010cd36273ff5f6fea5201ef9232f30cebbd9
> > Author: Vlastimil Babka <[email protected]>
> > Date: Wed Sep 11 14:22:35 2013 -0700
> >
> > mm: munlock: manual pte walk in fast path instead of follow_page_mask()
> >
>
> >
> > [ 56.020577] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067
> > [ 56.022578] addr:08800000 vm_flags:00100073 anon_vma:7f5f6f00 mapping: (null) index:8800
> > [ 56.025276] CPU: 0 PID: 101 Comm: killall5 Not tainted 3.11.0-09272-g666a584 #52
> >
>
> Hello,
>
> The stacktrace points clearly to the code added by the patch (the function __munlock_pagevec_fill());
> no question about that. However, the addresses reported by print_bad_pte() in the logs
> (08800000 and 0a000000) both lie on a page table boundary (note this is x86_32 without PAE)
> and should never appear inside the while loop of the function (nor be passed to vm_normal_page()).
> This could only happen if pmd_addr_end() failed to prevent crossing the page table boundary,
> and I just cannot see how that could occur without some variables being corrupted :/
>
> Also, some of the failures during the bisect were not due to this bug, but to a WARNING about
> list_add corruption which hopefully is not related to munlock. While it is probably a stretch,
> some kind of memory corruption could also lead to the erroneous behavior of the munlock code.
>
> Can you therefore please retest with the bisected patch reverted (patch below) to see if the other
> WARNING still occurs and can be dealt with separately, so there are not potentially two bugs to
> be chased at the same time?

Yes, there seems to be one more bug; the attached dmesg is for the
kernel with your patch reverted. I'm trying to bisect the other bug
now.

Thanks,
Fengguang


Attachments:
dmesg-yocto-xian-9:20130916101717:3.11.0-rc3-00230-g5bbd533:8 (35.31 kB)

2013-09-17 13:34:24

by Vlastimil Babka

Subject: Re: [munlock] BUG: Bad page map in process killall5 pte:53425553 pmd:075f4067

On 09/17/2013 03:29 PM, Fengguang Wu wrote:
> Hi Vlastimil,
>
>>
>> Also, some of the failures during the bisect were not due to this bug, but to a WARNING about
>> list_add corruption which hopefully is not related to munlock. While it is probably a stretch,
>> some kind of memory corruption could also lead to the erroneous behavior of the munlock code.
>>
>> Can you therefore please retest with the bisected patch reverted (patch below) to see if the other
>> WARNING still occurs and can be dealt with separately, so there are not potentially two bugs to
>> be chased at the same time?
>
> Yes, there seems to be one more bug; the attached dmesg is for the
> kernel with your patch reverted. I'm trying to bisect the other bug
> now.

Thanks. Meanwhile I was able to reproduce the bug in my patch in a VM
with x86_32 without PAE. As it turns out, on such a configuration with a
folded pmd, pmd_addr_end() does not bound the address to the page table
boundary at all; it is effectively a no-op. Working on a fix.
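
Roughly, the two variants look like this (paraphrased from the asm-generic
headers, not quoted verbatim):

/* Generic version: clamp 'end' to the next pmd boundary. */
#define pmd_addr_end(addr, end)						\
({	unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})

/*
 * Folded-pmd version (e.g. x86_32 without PAE): effectively a no-op,
 * because in a full nested page walk the level above has already
 * clamped 'end'.  __munlock_pagevec_fill() calls it without that
 * nesting, so nothing bounds the pte walk to the page table.
 */
#define pmd_addr_end(addr, end)		(end)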

Vlastimil

> Thanks,
> Fengguang
>

2013-09-17 14:22:50

by Vlastimil Babka

Subject: [RFC PATCH RESEND] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration

The function __munlock_pagevec_fill() introduced in commit 7a8010cd3
("mm: munlock: manual pte walk in fast path instead of follow_page_mask()")
uses pmd_addr_end() to restrict its operation to the current page table.
This is insufficient on architectures/configurations where the pmd is folded
and pmd_addr_end() just returns the end of the full range to be walked. In
that case it allows pte++ to walk off the end of a page table, resulting in
unpredictable behaviour.

This patch fixes the function by applying pgd_addr_end() and pud_addr_end()
before pmd_addr_end(), which yields the correct page table boundary on all
configurations. This is similar to what existing page walkers do when walking
each level of the page table.

Additionally, the patch clarifies the comment for the get_locked_pte() call in
the function.

Reported-by: Fengguang Wu <[email protected]>
Cc: Jörn Engel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mlock.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index d638026..758c0fc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,

/*
* Initialize pte walk starting at the already pinned page where we
- * are sure that there is a pte.
+ * are sure that there is a pte, as it was pinned under the same
+ * mmap_sem write op.
*/
pte = get_locked_pte(vma->vm_mm, start, &ptl);
- end = min(end, pmd_addr_end(start, end));
+ /* Make sure we do not cross the page table boundary */
+ end = pgd_addr_end(start, end);
+ end = pud_addr_end(start, end);
+ end = pmd_addr_end(start, end);

/* The page next to the pinned page is the first we will try to get */
start += PAGE_SIZE;
--
1.8.1.4

2013-09-18 01:19:10

by Bob Liu

Subject: Re: [RFC PATCH RESEND] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration

On 09/17/2013 10:22 PM, Vlastimil Babka wrote:
> The function __munlock_pagevec_fill() introduced in commit 7a8010cd3
> ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()")
> uses pmd_addr_end() to restrict its operation to the current page table.
> This is insufficient on architectures/configurations where the pmd is folded
> and pmd_addr_end() just returns the end of the full range to be walked. In
> that case it allows pte++ to walk off the end of a page table, resulting in
> unpredictable behaviour.
>
> This patch fixes the function by applying pgd_addr_end() and pud_addr_end()
> before pmd_addr_end(), which yields the correct page table boundary on all
> configurations. This is similar to what existing page walkers do when walking
> each level of the page table.
>
> Additionally, the patch clarifies the comment for the get_locked_pte() call in
> the function.
>
> Reported-by: Fengguang Wu <[email protected]>
> Cc: Jörn Engel <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/mlock.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index d638026..758c0fc 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>
> /*
> * Initialize pte walk starting at the already pinned page where we
> - * are sure that there is a pte.
> + * are sure that there is a pte, as it was pinned under the same
> + * mmap_sem write op.
> */
> pte = get_locked_pte(vma->vm_mm, start, &ptl);
> - end = min(end, pmd_addr_end(start, end));
> + /* Make sure we do not cross the page table boundary */
> + end = pgd_addr_end(start, end);
> + end = pud_addr_end(start, end);
> + end = pmd_addr_end(start, end);
>

Nitpick, how about unfolding pmd_addr_end(start, end) directly? Like:

--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -376,13 +376,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
{
pte_t *pte;
spinlock_t *ptl;
+ unsigned long pmd_end = (start + PMD_SIZE) & PMD_MASK;
+ end = (pmd_end - 1 < end - 1) ? pmd_end : end;

/*
* Initialize pte walk starting at the already pinned page where we
* are sure that there is a pte.
*/
pte = get_locked_pte(vma->vm_mm, start, &ptl);
- end = min(end, pmd_addr_end(start, end));

/* The page next to the pinned page is the first we will try to get */
start += PAGE_SIZE;

Anyway,
Reviewed-by: Bob Liu <[email protected]>

--
Regards,
-Bob

2013-09-20 09:04:55

by Vlastimil Babka

Subject: Re: [RFC PATCH RESEND] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration

On 09/18/2013 03:17 AM, Bob Liu wrote:
> On 09/17/2013 10:22 PM, Vlastimil Babka wrote:

>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
>>
>> /*
>> * Initialize pte walk starting at the already pinned page where we
>> - * are sure that there is a pte.
>> + * are sure that there is a pte, as it was pinned under the same
>> + * mmap_sem write op.
>> */
>> pte = get_locked_pte(vma->vm_mm, start, &ptl);
>> - end = min(end, pmd_addr_end(start, end));
>> + /* Make sure we do not cross the page table boundary */
>> + end = pgd_addr_end(start, end);
>> + end = pud_addr_end(start, end);
>> + end = pmd_addr_end(start, end);
>>
>
> Nitpick, how about unfolding pmd_addr_end(start, end) directly? Like:
>
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -376,13 +376,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> {
> pte_t *pte;
> spinlock_t *ptl;
> + unsigned long pmd_end = (start + PMD_SIZE) & PMD_MASK;
> + end = (pmd_end - 1 < end - 1) ? pmd_end : end;
>
> /*
> * Initialize pte walk starting at the already pinned page where we
> * are sure that there is a pte.
> */
> pte = get_locked_pte(vma->vm_mm, start, &ptl);
> - end = min(end, pmd_addr_end(start, end));
>
> /* The page next to the pinned page is the first we will try to get */
> start += PAGE_SIZE;
>

That should also work, but for maintainability reasons I would rather not
special-case it like this; I prefer using the standard functions as they
would be used in a full page walk.
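
For reference, a minimal sketch of that standard nested pattern (loosely
modeled on the walkers in mm/memory.c; walk_pud_range() is a hypothetical
helper here, not an existing kernel function):

static void walk_range(struct mm_struct *mm, unsigned long addr,
		       unsigned long end)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	unsigned long next;

	do {
		/* Each level clamps 'end' with its own *_addr_end()... */
		next = pgd_addr_end(addr, end);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		/*
		 * ...so by the time a folded pmd_addr_end() returns 'end'
		 * unchanged, the levels above have already bounded it to
		 * the current page table.
		 */
		walk_pud_range(pgd, addr, next);	/* hypothetical */
	} while (pgd++, addr = next, addr != end);
}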

> Anyway,
> Reviewed-by: Bob Liu <[email protected]>

Thanks.

Vlastimil

2013-09-20 09:17:00

by Vlastimil Babka

Subject: [PATCH] mm: munlock: Prevent walking off the end of a pagetable in no-pmd configuration

The function __munlock_pagevec_fill() introduced in commit 7a8010cd3
("mm: munlock: manual pte walk in fast path instead of follow_page_mask()")
uses pmd_addr_end() to restrict its operation to the current page table.
This is insufficient on architectures/configurations where the pmd is folded
and pmd_addr_end() just returns the end of the full range to be walked. In
that case it allows pte++ to walk off the end of a page table, resulting in
unpredictable behaviour.

This patch fixes the function by applying pgd_addr_end() and pud_addr_end()
before pmd_addr_end(), which yields the correct page table boundary on all
configurations. This is similar to what existing page walkers do when walking
each level of the page table.

Additionally, the patch clarifies the comment for the get_locked_pte() call in
the function.

Reported-by: Fengguang Wu <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Cc: Jörn Engel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mlock.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index d638026..758c0fc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -379,10 +379,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,

/*
* Initialize pte walk starting at the already pinned page where we
- * are sure that there is a pte.
+ * are sure that there is a pte, as it was pinned under the same
+ * mmap_sem write op.
*/
pte = get_locked_pte(vma->vm_mm, start, &ptl);
- end = min(end, pmd_addr_end(start, end));
+ /* Make sure we do not cross the page table boundary */
+ end = pgd_addr_end(start, end);
+ end = pud_addr_end(start, end);
+ end = pmd_addr_end(start, end);

/* The page next to the pinned page is the first we will try to get */
start += PAGE_SIZE;
--
1.8.1.4