2013-07-29 05:28:26

by Joonsoo Kim

Subject: [PATCH v3 0/9] mm, hugetlb: clean-up and possible bug fix

The first five patches are almost trivial clean-up patches.

The others fix three bugs.
These problems are probably minor, because this code has been used
for a long time and there have been no bug reports about them.

These patches are based on v3.10.0 and
pass the libhugetlbfs test suite.

Changes from v2:
There are no code changes to the remaining patches since v2.
Omitted patch 2 ('clean-up alloc_huge_page()'), which tried to remove one
goto label.
Added more explanation to the changelogs, as Michal suggested.
Added Reviewed-by and Acked-by tags.

Changes from v1:
Split patch 1 into two patches to clarify its purpose.
Removed useless indentation changes in 'clean-up alloc_huge_page()'.
Fixed a bug in the new iteration code.
Added Reviewed-by and Acked-by tags.

Joonsoo Kim (9):
mm, hugetlb: move up the code which check availability of free huge
page
mm, hugetlb: trivial commenting fix
mm, hugetlb: clean-up alloc_huge_page()
mm, hugetlb: fix and clean-up node iteration code to alloc or free
mm, hugetlb: remove redundant list_empty check in
gather_surplus_pages()
mm, hugetlb: do not use a page in page cache for cow optimization
mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()
mm, hugetlb: remove decrement_hugepage_resv_vma()
mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

mm/hugetlb.c | 243 ++++++++++++++++++++++++++--------------------------------
1 file changed, 110 insertions(+), 133 deletions(-)

--
1.7.9.5


2013-07-29 05:28:30

by Joonsoo Kim

Subject: [PATCH v3 9/9] mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

If a vma with VM_NORESERVE allocates a new page for the page cache, we
should check whether this area is reserved or not. If the address is
already reserved by another process (the chg == 0 case), we should
decrement the reserve count, because this allocated page will go into the
page cache and, when the inode is released, there is currently no way to
know whether the page came from the reserved pool or not. This can
over-count the reserve count. With the following example code, you can
easily reproduce this situation.

Assume a 2MB huge page size and nr_hugepages = 100.

size = 20 * MB;
flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}

flag = MAP_SHARED | MAP_NORESERVE;
q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (q == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
}
q[0] = 'c';

After the program finishes, run 'cat /proc/meminfo'.
You can see the result below.

HugePages_Free: 100
HugePages_Rsvd: 1
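
For reference, a self-contained version of this reproducer could look like
the following sketch. It assumes fd comes from libhugetlbfs's
hugetlbfs_unlinked_fd() (any hugetlbfs-backed fd would do), defines MB
locally, and is built with -lhugetlbfs.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <hugetlbfs.h>

#define MB (1024UL * 1024UL)

int main(void)
{
        size_t size = 20 * MB;
        int flag;
        char *p, *q;
        int fd = hugetlbfs_unlinked_fd();

        if (fd < 0)
                return 1;

        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }

        flag = MAP_SHARED | MAP_NORESERVE;
        q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (q == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }

        /* Fault in one page through the VM_NORESERVE mapping. */
        q[0] = 'c';

        munmap(q, size);
        munmap(p, size);
        close(fd);
        return 0;
}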

To fix this, we should check the mapping type and the tracked region.
If the mapping is VM_NORESERVE and VM_MAYSHARE and chg is 0, this implies
that the currently allocated page will go into the page cache, which is in
a region that was already reserved when the mapping was created. In this
case, we should decrease the reserve count. This patch implements that and
solves the problem.

Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4b1b043..b3b8252 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -443,10 +443,23 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
}

/* Returns true if the VMA has associated reserve pages */
-static int vma_has_reserves(struct vm_area_struct *vma)
+static int vma_has_reserves(struct vm_area_struct *vma, long chg)
{
- if (vma->vm_flags & VM_NORESERVE)
- return 0;
+ if (vma->vm_flags & VM_NORESERVE) {
+ /*
+ * This address is already reserved by another process (chg == 0),
+ * so we should decrement the reserve count. Without
+ * decrementing it, the reserve count remains after releasing the
+ * inode, because this allocated page will go into the page cache
+ * and is regarded as coming from the reserved pool in the release
+ * step. Currently, we don't have any other solution to deal
+ * with this situation properly, so add a work-around here.
+ */
+ if (vma->vm_flags & VM_MAYSHARE && chg == 0)
+ return 1;
+ else
+ return 0;
+ }

/* Shared mappings always use reserves */
if (vma->vm_flags & VM_MAYSHARE)
@@ -520,7 +533,8 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)

static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve)
+ unsigned long address, int avoid_reserve,
+ long chg)
{
struct page *page = NULL;
struct mempolicy *mpol;
@@ -535,7 +549,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
- if (!vma_has_reserves(vma) &&
+ if (!vma_has_reserves(vma, chg) &&
h->free_huge_pages - h->resv_huge_pages == 0)
goto err;

@@ -553,8 +567,12 @@ retry_cpuset:
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
page = dequeue_huge_page_node(h, zone_to_nid(zone));
if (page) {
- if (!avoid_reserve && vma_has_reserves(vma))
- h->resv_huge_pages--;
+ if (avoid_reserve)
+ break;
+ if (!vma_has_reserves(vma, chg))
+ break;
+
+ h->resv_huge_pages--;
break;
}
}
@@ -1138,7 +1156,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
+ page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
--
1.7.9.5

2013-07-29 05:28:27

by Joonsoo Kim

Subject: [PATCH v3 1/9] mm, hugetlb: move up the code which check availability of free huge page

At this point we are holding hugetlb_lock, so the hstate values can't
be changed. If we don't have any usable free huge page at this point,
there is no need to proceed with the processing. So move this code up.

Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e2bfbf7..fc4988c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -539,10 +539,6 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
struct zoneref *z;
unsigned int cpuset_mems_cookie;

-retry_cpuset:
- cpuset_mems_cookie = get_mems_allowed();
- zonelist = huge_zonelist(vma, address,
- htlb_alloc_mask, &mpol, &nodemask);
/*
* A child process with MAP_PRIVATE mappings created by their parent
* have no page reserves. This check ensures that reservations are
@@ -556,6 +552,11 @@ retry_cpuset:
if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
goto err;

+retry_cpuset:
+ cpuset_mems_cookie = get_mems_allowed();
+ zonelist = huge_zonelist(vma, address,
+ htlb_alloc_mask, &mpol, &nodemask);
+
for_each_zone_zonelist_nodemask(zone, z, zonelist,
MAX_NR_ZONES - 1, nodemask) {
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
@@ -574,7 +575,6 @@ retry_cpuset:
return page;

err:
- mpol_cond_put(mpol);
return NULL;
}

--
1.7.9.5

2013-07-29 05:29:00

by Joonsoo Kim

Subject: [PATCH v3 8/9] mm, hugetlb: remove decrement_hugepage_resv_vma()

Now the checking conditions of decrement_hugepage_resv_vma() and
vma_has_reserves() are the same, so we can clean up this function using
vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only
one call site, so we can remove the function and embed it into
dequeue_huge_page_vma() directly. This patch implements that.

Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ca15854..4b1b043 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -434,25 +434,6 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
return (get_vma_private_data(vma) & flag) != 0;
}

-/* Decrement the reserved pages in the hugepage pool by one */
-static void decrement_hugepage_resv_vma(struct hstate *h,
- struct vm_area_struct *vma)
-{
- if (vma->vm_flags & VM_NORESERVE)
- return;
-
- if (vma->vm_flags & VM_MAYSHARE) {
- /* Shared mappings always use reserves */
- h->resv_huge_pages--;
- } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- /*
- * Only the process that called mmap() has reserves for
- * private mappings.
- */
- h->resv_huge_pages--;
- }
-}
-
/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
{
@@ -466,10 +447,18 @@ static int vma_has_reserves(struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_NORESERVE)
return 0;
+
+ /* Shared mappings always use reserves */
if (vma->vm_flags & VM_MAYSHARE)
return 1;
+
+ /*
+ * Only the process that called mmap() has reserves for
+ * private mappings.
+ */
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return 1;
+
return 0;
}

@@ -564,8 +553,8 @@ retry_cpuset:
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
page = dequeue_huge_page_node(h, zone_to_nid(zone));
if (page) {
- if (!avoid_reserve)
- decrement_hugepage_resv_vma(h, vma);
+ if (!avoid_reserve && vma_has_reserves(vma))
+ h->resv_huge_pages--;
break;
}
}
--
1.7.9.5

2013-07-29 05:29:20

by Joonsoo Kim

Subject: [PATCH v3 7/9] mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()

If we map a region with MAP_NORESERVE and MAP_SHARED, we skip the
reserve counting check, and eventually we are not guaranteed to be able
to allocate a huge page at fault time.
With the following example code, you can easily reproduce this situation.

Assume a 2MB huge page size and nr_hugepages = 100.

fd = hugetlbfs_unlinked_fd();
if (fd < 0)
return 1;

size = 200 * MB;
flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}

size = 2 * MB;
flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
}
p[0] = '0';
sleep(10);

While the program is sleeping in sleep(10), run 'cat /proc/meminfo' from another process.

HugePages_Free: 99
HugePages_Rsvd: 100

The number of free pages should be greater than or equal to the number of
reserved pages, but here it is not. This shows that a non-reserved shared
mapping has stolen a reserved page. A non-reserved shared mapping should
not eat into reserve space.
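
For reference, a complete version of this test program could look like the
following sketch. It assumes 2MB huge pages, takes hugetlbfs_unlinked_fd()
from libhugetlbfs (build with -lhugetlbfs), defines MB locally, and defines
_GNU_SOURCE so that MAP_ANONYMOUS and MAP_HUGETLB are visible.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <hugetlbfs.h>

#define MB (1024UL * 1024UL)

int main(void)
{
        size_t size;
        int flag;
        char *p;
        int fd = hugetlbfs_unlinked_fd();

        if (fd < 0)
                return 1;

        /* Reserve all 100 huge pages through a normal shared mapping. */
        size = 200 * MB;
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }

        /* Fault one page through a MAP_NORESERVE anonymous hugetlb mapping. */
        size = 2 * MB;
        flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
        if (p == MAP_FAILED) {
                fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
                return -1;
        }
        p[0] = '0';

        sleep(10);      /* check /proc/meminfo from another shell now */
        return 0;
}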

If we consider VM_NORESERVE in vma_has_reserves() and return 0, which
means that we don't have reserved pages, then we check that we have enough
free pages in dequeue_huge_page_vma(). This prevents stealing
a reserved page.

With this change, the above test generates a SIGBUS, which is correct,
because all free pages are reserved and a non-reserved shared mapping
can't get a free page.

Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f6b3a6..ca15854 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -464,6 +464,8 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
/* Returns true if the VMA has associated reserve pages */
static int vma_has_reserves(struct vm_area_struct *vma)
{
+ if (vma->vm_flags & VM_NORESERVE)
+ return 0;
if (vma->vm_flags & VM_MAYSHARE)
return 1;
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
--
1.7.9.5

2013-07-29 05:31:22

by Joonsoo Kim

Subject: [PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization

Currently, we use a page in the page cache with map count 1 for the cow
optimization. If we find this condition, we don't allocate a new
page and copy the contents; instead, we map this page directly.
This may introduce a problem where writing to a private mapping
overwrites the hugetlb file directly. You can reproduce this situation
with the following code.

size = 20 * MB;
flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}
p[0] = 's';
fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
munmap(p, size);

flag = MAP_PRIVATE;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
}
p[0] = 'c';
munmap(p, size);

flag = MAP_SHARED;
p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
if (p == MAP_FAILED) {
fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
return -1;
}
fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
munmap(p, size);

We can see "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL
PRIVATE WRITE: s". If we turn off this optimization for a page
in the page cache, the problem disappears.
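
For reference, a self-contained version of this reproducer could look like
the following sketch (fd is assumed to come from libhugetlbfs's
hugetlbfs_unlinked_fd(), MB is defined locally; build with -lhugetlbfs).

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <hugetlbfs.h>

#define MB (1024UL * 1024UL)

int main(void)
{
        size_t size = 20 * MB;
        int flag;
        char *p;
        int fd = hugetlbfs_unlinked_fd();

        if (fd < 0)
                return 1;

        /* Write 's' to the file through a shared mapping. */
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED)
                return -1;
        p[0] = 's';
        fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
        munmap(p, size);

        /* A private write should be COWed and must not touch the file. */
        flag = MAP_PRIVATE;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED)
                return -1;
        p[0] = 'c';
        munmap(p, size);

        /* Re-read through a shared mapping: 's' is expected, 'c' shows the bug. */
        flag = MAP_SHARED;
        p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
        if (p == MAP_FAILED)
                return -1;
        fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
        munmap(p, size);

        close(fd);
        return 0;
}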

So, change the trigger condition of the optimization: if the page is not
an anonymous page (PageAnon), don't do the optimization. This turns the
optimization off for page cache pages.

Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2e52afea..1f6b3a6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2511,7 +2511,6 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
- int avoidcopy;
int outside_reserve = 0;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
@@ -2521,10 +2520,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
retry_avoidcopy:
/* If no-one else is actually using this page, avoid the copy
* and just make the page writable */
- avoidcopy = (page_mapcount(old_page) == 1);
- if (avoidcopy) {
- if (PageAnon(old_page))
- page_move_anon_rmap(old_page, vma, address);
+ if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
+ page_move_anon_rmap(old_page, vma, address);
set_huge_ptep_writable(vma, address, ptep);
return 0;
}
--
1.7.9.5

2013-07-29 05:31:40

by Joonsoo Kim

Subject: [PATCH v3 4/9] mm, hugetlb: fix and clean-up node iteration code to alloc or free

The current node iteration code has a minor problem: it rotates one extra
node when the allocation fails. For example, if we start allocating at
node 0 and every node fails, we stop iterating at node 0, but the next
allocation then starts at node 1.

This introduces new macros, "for_each_node_mask_to_[alloc|free]", and
fixes and cleans up the node iteration code for allocating and freeing.
This makes the code more understandable.
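
For illustration, the new iteration pattern behaves like the following
user-space sketch (not the kernel code: next_node() stands in for
hstate_next_node_to_alloc(), which returns the saved "next" node and
then advances it).

#include <stdio.h>

#define NR_NODES 4

static int next_to_alloc;

/* Return the current "next" node and advance it, wrapping around. */
static int next_node(void)
{
        int node = next_to_alloc;

        next_to_alloc = (next_to_alloc + 1) % NR_NODES;
        return node;
}

/* Same shape as for_each_node_mask_to_alloc(): visit each node at most
 * once per pass, resuming where the previous pass stopped. */
#define for_each_node_to_alloc(nr, node)                                \
        for (nr = NR_NODES; nr > 0 && ((node = next_node()) || 1); nr--)

int main(void)
{
        int nr, node;

        for_each_node_to_alloc(nr, node)
                printf("try node %d\n", node);  /* tries 0 1 2 3 */

        /* A second full (failed) pass starts at node 0 again, so a
         * completely failed pass no longer rotates the start node. */
        for_each_node_to_alloc(nr, node)
                printf("try node %d\n", node);  /* tries 0 1 2 3 */

        return 0;
}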

Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31d78c5..87d7637 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -755,33 +755,6 @@ static int hstate_next_node_to_alloc(struct hstate *h,
return nid;
}

-static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
-{
- struct page *page;
- int start_nid;
- int next_nid;
- int ret = 0;
-
- start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
- next_nid = start_nid;
-
- do {
- page = alloc_fresh_huge_page_node(h, next_nid);
- if (page) {
- ret = 1;
- break;
- }
- next_nid = hstate_next_node_to_alloc(h, nodes_allowed);
- } while (next_nid != start_nid);
-
- if (ret)
- count_vm_event(HTLB_BUDDY_PGALLOC);
- else
- count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
-
- return ret;
-}
-
/*
* helper for free_pool_huge_page() - return the previously saved
* node ["this node"] from which to free a huge page. Advance the
@@ -800,6 +773,40 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
return nid;
}

+#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
+ for (nr_nodes = nodes_weight(*mask); \
+ nr_nodes > 0 && \
+ ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
+ nr_nodes--)
+
+#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
+ for (nr_nodes = nodes_weight(*mask); \
+ nr_nodes > 0 && \
+ ((node = hstate_next_node_to_free(hs, mask)) || 1); \
+ nr_nodes--)
+
+static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+{
+ struct page *page;
+ int nr_nodes, node;
+ int ret = 0;
+
+ for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ page = alloc_fresh_huge_page_node(h, node);
+ if (page) {
+ ret = 1;
+ break;
+ }
+ }
+
+ if (ret)
+ count_vm_event(HTLB_BUDDY_PGALLOC);
+ else
+ count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+
+ return ret;
+}
+
/*
* Free huge page from pool from next node to free.
* Attempt to keep persistent huge pages more or less
@@ -809,36 +816,31 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
bool acct_surplus)
{
- int start_nid;
- int next_nid;
+ int nr_nodes, node;
int ret = 0;

- start_nid = hstate_next_node_to_free(h, nodes_allowed);
- next_nid = start_nid;
-
- do {
+ for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
* If we're returning unused surplus pages, only examine
* nodes with surplus pages.
*/
- if ((!acct_surplus || h->surplus_huge_pages_node[next_nid]) &&
- !list_empty(&h->hugepage_freelists[next_nid])) {
+ if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
+ !list_empty(&h->hugepage_freelists[node])) {
struct page *page =
- list_entry(h->hugepage_freelists[next_nid].next,
+ list_entry(h->hugepage_freelists[node].next,
struct page, lru);
list_del(&page->lru);
h->free_huge_pages--;
- h->free_huge_pages_node[next_nid]--;
+ h->free_huge_pages_node[node]--;
if (acct_surplus) {
h->surplus_huge_pages--;
- h->surplus_huge_pages_node[next_nid]--;
+ h->surplus_huge_pages_node[node]--;
}
update_and_free_page(h, page);
ret = 1;
break;
}
- next_nid = hstate_next_node_to_free(h, nodes_allowed);
- } while (next_nid != start_nid);
+ }

return ret;
}
@@ -1175,14 +1177,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
int __weak alloc_bootmem_huge_page(struct hstate *h)
{
struct huge_bootmem_page *m;
- int nr_nodes = nodes_weight(node_states[N_MEMORY]);
+ int nr_nodes, node;

- while (nr_nodes) {
+ for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
void *addr;

- addr = __alloc_bootmem_node_nopanic(
- NODE_DATA(hstate_next_node_to_alloc(h,
- &node_states[N_MEMORY])),
+ addr = __alloc_bootmem_node_nopanic(NODE_DATA(node),
huge_page_size(h), huge_page_size(h), 0);

if (addr) {
@@ -1194,7 +1194,6 @@ int __weak alloc_bootmem_huge_page(struct hstate *h)
m = addr;
goto found;
}
- nr_nodes--;
}
return 0;

@@ -1333,48 +1332,28 @@ static inline void try_to_free_low(struct hstate *h, unsigned long count,
static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
int delta)
{
- int start_nid, next_nid;
- int ret = 0;
+ int nr_nodes, node;

VM_BUG_ON(delta != -1 && delta != 1);

- if (delta < 0)
- start_nid = hstate_next_node_to_alloc(h, nodes_allowed);
- else
- start_nid = hstate_next_node_to_free(h, nodes_allowed);
- next_nid = start_nid;
-
- do {
- int nid = next_nid;
- if (delta < 0) {
- /*
- * To shrink on this node, there must be a surplus page
- */
- if (!h->surplus_huge_pages_node[nid]) {
- next_nid = hstate_next_node_to_alloc(h,
- nodes_allowed);
- continue;
- }
+ if (delta < 0) {
+ for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ if (h->surplus_huge_pages_node[node])
+ goto found;
}
- if (delta > 0) {
- /*
- * Surplus cannot exceed the total number of pages
- */
- if (h->surplus_huge_pages_node[nid] >=
- h->nr_huge_pages_node[nid]) {
- next_nid = hstate_next_node_to_free(h,
- nodes_allowed);
- continue;
- }
+ } else {
+ for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+ if (h->surplus_huge_pages_node[node] <
+ h->nr_huge_pages_node[node])
+ goto found;
}
+ }
+ return 0;

- h->surplus_huge_pages += delta;
- h->surplus_huge_pages_node[nid] += delta;
- ret = 1;
- break;
- } while (next_nid != start_nid);
-
- return ret;
+found:
+ h->surplus_huge_pages += delta;
+ h->surplus_huge_pages_node[node] += delta;
+ return 1;
}

#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
--
1.7.9.5

2013-07-29 05:31:38

by Joonsoo Kim

Subject: [PATCH v3 5/9] mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()

If the list is empty, list_for_each_entry_safe() doesn't do anything,
so this check is redundant. Remove it.

Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87d7637..2e52afea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1020,11 +1020,8 @@ free:
spin_unlock(&hugetlb_lock);

/* Free unnecessary surplus pages to the buddy allocator */
- if (!list_empty(&surplus_list)) {
- list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
- put_page(page);
- }
- }
+ list_for_each_entry_safe(page, tmp, &surplus_list, lru)
+ put_page(page);
spin_lock(&hugetlb_lock);

return ret;
--
1.7.9.5

2013-07-29 05:32:17

by Joonsoo Kim

Subject: [PATCH v3 3/9] mm, hugetlb: clean-up alloc_huge_page()

This patch unifies successful allocation paths to make the code more
readable. There are no functional changes.

Acked-by: Michal Hocko <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51564a8..31d78c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1149,12 +1149,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
}
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
- if (page) {
- /* update page cgroup details */
- hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h),
- h_cg, page);
- spin_unlock(&hugetlb_lock);
- } else {
+ if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
@@ -1165,11 +1160,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h),
- h_cg, page);
list_move(&page->lru, &h->hugepage_activelist);
- spin_unlock(&hugetlb_lock);
+ /* Fall through */
}
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
+ spin_unlock(&hugetlb_lock);

set_page_private(page, (unsigned long)spool);

--
1.7.9.5

2013-07-29 05:37:31

by Joonsoo Kim

Subject: [PATCH v3 2/9] mm, hugetlb: trivial commenting fix

The name of the mutex written in the comment is wrong.
Fix it.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fc4988c..51564a8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -135,9 +135,9 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
* across the pages in a mapping.
*
* The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantion_mutex. To access or modify a region the caller
+ * and the hugetlb_instantiation_mutex. To access or modify a region the caller
* must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
+ * the hugetlb_instantiation_mutex:
*
* down_write(&mm->mmap_sem);
* or
--
1.7.9.5

2013-07-29 09:17:53

by Hillf Danton

Subject: Re: [PATCH v3 1/9] mm, hugetlb: move up the code which check availability of free huge page

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:20:41

by Hillf Danton

Subject: Re: [PATCH v3 3/9] mm, hugetlb: clean-up alloc_huge_page()

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:30:20

by Hillf Danton

Subject: Re: [PATCH v3 5/9] mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:33:30

by Hillf Danton

Subject: Re: [PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:34:55

by Hillf Danton

Subject: Re: [PATCH v3 7/9] mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:39:08

by Hillf Danton

Subject: Re: [PATCH v3 8/9] mm, hugetlb: remove decrement_hugepage_resv_vma()

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 09:43:28

by Hillf Danton

Subject: Re: [PATCH v3 9/9] mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache

On Mon, Jul 29, 2013 at 1:28 PM, Joonsoo Kim <[email protected]> wrote:
[...]

Acked-by: Hillf Danton <[email protected]>

2013-07-29 22:43:16

by David Gibson

Subject: Re: [PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization

On Mon, Jul 29, 2013 at 02:28:18PM +0900, Joonsoo Kim wrote:
> Currently, we use a page in the page cache with map count 1 for the cow
> optimization. If we find this condition, we don't allocate a new
> page and copy the contents; instead, we map this page directly.
> This may introduce a problem where writing to a private mapping
> overwrites the hugetlb file directly. You can reproduce this situation
> with the following code.
>
> size = 20 * MB;
> flag = MAP_SHARED;
> p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> if (p == MAP_FAILED) {
> fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> return -1;
> }
> p[0] = 's';
> fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
> munmap(p, size);
>
> flag = MAP_PRIVATE;
> p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> if (p == MAP_FAILED) {
> fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> }
> p[0] = 'c';
> munmap(p, size);
>
> flag = MAP_SHARED;
> p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
> if (p == MAP_FAILED) {
> fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
> return -1;
> }
> fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
> munmap(p, size);
>
> We can see "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL
> PRIVATE WRITE: s". If we turn off this optimization for a page
> in the page cache, the problem disappears.

Please add this testcase to libhugetlbfs as well.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2013-07-31 05:41:56

by Joonsoo Kim

Subject: Re: [PATCH v3 6/9] mm, hugetlb: do not use a page in page cache for cow optimization

On Tue, Jul 30, 2013 at 08:37:08AM +1000, David Gibson wrote:
> On Mon, Jul 29, 2013 at 02:28:18PM +0900, Joonsoo Kim wrote:
> [...]
>
> Please add this testcase to libhugetlbfs as well.

Okay!

Thanks.

>
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson