Hi,
This patchset implements a cgroup resource controller for HugeTLB
pages. The controller allows limiting HugeTLB usage per control
group and enforces the limit at page fault time. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault
time implies that the application will get a SIGBUS signal if it
tries to access HugeTLB pages beyond its limit. This requires the
application to know beforehand how many HugeTLB pages it will
need.
The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be seen as an extension of the existing quota
interface, which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job schedulers require jobs to specify their resource
requirements in the job file; once those requirements can be met, a
scheduler such as SLURM will run the job. We need to make sure that
jobs don't consume more resources than requested. If they do, we
should either error out or kill the application.
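The enforcement semantics can be sketched with a small userspace model (plain C; all names here are invented for illustration, this is not the kernel code): a charge that would push usage past the limit fails, and in the kernel that failure surfaces as SIGBUS at fault time instead of reclaim.

```c
#include <stddef.h>

/* Hypothetical per-cgroup counter mirroring the controller's semantics. */
struct hugetlb_limit {
	unsigned long usage;	/* bytes currently charged */
	unsigned long limit;	/* hard limit in bytes */
};

/* Charge nr_bytes against the group; 0 on success, -1 when over limit.
 * In the kernel the failing case means the faulting task gets SIGBUS. */
int charge(struct hugetlb_limit *l, unsigned long nr_bytes)
{
	if (l->usage + nr_bytes > l->limit)
		return -1;
	l->usage += nr_bytes;
	return 0;
}

/* Return a previous charge, e.g. when the huge page is freed. */
void uncharge(struct hugetlb_limit *l, unsigned long nr_bytes)
{
	l->usage -= nr_bytes;
}
```

A job that preflights its requirement would size the limit above its peak usage; any charge beyond that fails rather than being reclaimed.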
Patches are on top of 731a7378b81c2f5fa88ca1ae20b83d548d5613dc
Changes from V6:
* Implement the controller as a separate HugeTLB cgroup.
* Folded fixup patches in -mm to the original patches
Changes from V5:
* Address review feedback.
Changes from V4:
* Add support for charge/uncharge during page migration
* Drop the usage of page->lru in unmap_hugepage_range.
Changes from V3:
* Address review feedback.
* Fix a bug in cgroup removal related parent charging with use_hierarchy set
Changes from V2:
* Changed the implementation to limit HugeTLB usage at page
fault time. This simplifies the extension and keeps it closer to
the memcg design. It also makes supporting cgroup removal less
complex. The only caveat is that the application must ensure its
HugeTLB usage doesn't cross the cgroup limit.
Changes from V1:
* Changed the implementation as a memcg extension. We still use
the same logic to track the cgroup and range.
Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes
-aneesh
From: "Aneesh Kumar K.V" <[email protected]>
The i_mmap_mutex lock was added to unmap_single_vma by 502717f4e ("hugetlb:
fix linked list corruption in unmap_hugepage_range()"), but we no longer
use page->lru in unmap_hugepage_range. The lock is also taken higher up
in the stack in some code paths, which would result in a deadlock:
unmap_mapping_range (i_mmap_mutex)
-> unmap_mapping_range_tree
-> unmap_mapping_range_vma
-> zap_page_range_single
-> unmap_single_vma
-> unmap_hugepage_range (i_mmap_mutex)
For shared page table support for huge pages, page table pages are
reference counted, so we don't need any lock during huge_pmd_unshare. We
do take i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in
the mapping (39dde65c9940c97f ("shared page table for hugetlb page")).
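The problem is the classic non-recursive-mutex one: the same lock taken again down the same call chain. A userspace pthread sketch (illustrative only; a pthread mutex stands in for the kernel's i_mmap_mutex) shows why the inner acquisition cannot succeed — a blocking lock here would hang forever:

```c
#include <pthread.h>
#include <errno.h>

/* The outer lock models unmap_mapping_range() holding i_mmap_mutex; the
 * inner trylock models unmap_hugepage_range() trying to take it again.
 * With a blocking lock the inner call would deadlock; trylock lets us
 * observe the refusal (EBUSY) instead of hanging. */
int double_acquire(pthread_mutex_t *m)
{
	int ret;

	pthread_mutex_lock(m);			/* outer: unmap_mapping_range */
	ret = pthread_mutex_trylock(m);		/* inner: unmap_hugepage_range */
	if (ret == 0)				/* not reached for a normal mutex */
		pthread_mutex_unlock(m);
	pthread_mutex_unlock(m);
	return ret;				/* EBUSY */
}
```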
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
mm/memory.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 545e18a..f6bc04f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file) {
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ if (vma->vm_file)
__unmap_hugepage_range(tlb, vma, start, end, NULL);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This patch implements a new controller that allows us to control HugeTLB
allocations. It allows limiting HugeTLB usage per control group and
enforces the limit at page fault time. Since HugeTLB doesn't support
page reclaim, enforcing the limit at page fault time implies that the
application will get a SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how
many HugeTLB pages it will need.
The charge/uncharge calls will be added to the HugeTLB code in a later
patch. Support for cgroup removal will be added in later patches.
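The intended calling convention — charge before allocation, then commit on success or give the charge back on failure — can be sketched in userspace C (types and names below are illustrative stand-ins, not the kernel API):

```c
#include <stddef.h>

/* Toy stand-ins for struct hugetlb_cgroup and struct page. */
struct h_cgroup { long usage, limit; };
struct h_page   { struct h_cgroup *owner; };

/* Phase 1: reserve against the counter before the page exists. */
int charge_page(struct h_cgroup *cg, long sz)
{
	if (cg->usage + sz > cg->limit)
		return -1;		/* caller turns this into SIGBUS */
	cg->usage += sz;
	return 0;
}

/* Phase 2a: allocation succeeded, bind the cgroup to the page. */
void commit_charge(struct h_cgroup *cg, struct h_page *pg)
{
	pg->owner = cg;
}

/* Phase 2b: allocation failed, return the reservation to the cgroup. */
void uncharge_cgroup(struct h_cgroup *cg, long sz)
{
	cg->usage -= sz;
}
```

Splitting charge from commit keeps the counter update outside hugetlb_lock while still recording ownership on the page once allocation succeeds.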
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/cgroup_subsys.h | 6 +
include/linux/hugetlb_cgroup.h | 79 ++++++++++++
init/Kconfig | 14 ++
mm/Makefile | 1 +
mm/hugetlb_cgroup.c | 280 ++++++++++++++++++++++++++++++++++++++++
mm/page_cgroup.c | 5 +-
6 files changed, 383 insertions(+), 2 deletions(-)
create mode 100644 include/linux/hugetlb_cgroup.h
create mode 100644 mm/hugetlb_cgroup.c
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..895923a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,9 @@ SUBSYS(net_prio)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+SUBSYS(hugetlb)
+#endif
+
+/* */
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
new file mode 100644
index 0000000..5794be4
--- /dev/null
+++ b/include/linux/hugetlb_cgroup.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_HUGETLB_CGROUP_H
+#define _LINUX_HUGETLB_CGROUP_H
+
+#include <linux/res_counter.h>
+
+struct hugetlb_cgroup {
+ struct cgroup_subsys_state css;
+ /*
+ * the counter to account for hugepages from hugetlb.
+ */
+ struct res_counter hugepage[HUGE_MAX_HSTATE];
+};
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ if (hugetlb_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr);
+extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg);
+#else
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline int
+hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline void
+hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ return;
+}
+#endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 1363203..73b14b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -714,6 +714,20 @@ config CGROUP_MEM_RES_CTLR
This config option also selects MM_OWNER config option, which
could in turn add some fork/exit overhead.
+config CGROUP_HUGETLB_RES_CTLR
+ bool "HugeTLB Resource Controller for Control Groups"
+ depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
+ select PAGE_CGROUP
+ default n
+ help
+	  Provides a simple cgroup Resource Controller for HugeTLB pages.
+	  When you enable this, you can put a per-cgroup limit on HugeTLB
+	  usage. The limit is enforced at page fault time. Since HugeTLB
+	  doesn't support page reclaim, enforcing the limit at page fault time
+	  implies that the application will get a SIGBUS signal if it tries to
+	  access HugeTLB pages beyond its limit. This requires the application
+	  to know beforehand how many HugeTLB pages it will need.
+
config CGROUP_MEM_RES_CTLR_SWAP
bool "Memory Resource Controller Swap Extension"
depends on CGROUP_MEM_RES_CTLR && SWAP
diff --git a/mm/Makefile b/mm/Makefile
index a70f9a9..bed4944 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
obj-$(CONFIG_PAGE_CGROUP) += page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
new file mode 100644
index 0000000..3a288f7
--- /dev/null
+++ b/mm/hugetlb_cgroup.c
@@ -0,0 +1,280 @@
+/*
+ *
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/page_cgroup.h>
+#include <linux/hugetlb_cgroup.h>
+
+struct cgroup_subsys hugetlb_subsys __read_mostly;
+struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
+{
+ return container_of(s, struct hugetlb_cgroup, css);
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
+{
+ if (!cgroup)
+ return NULL;
+ return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
+ hugetlb_subsys_id));
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
+{
+ return hugetlb_cgroup_from_css(task_subsys_state(task,
+ hugetlb_subsys_id));
+}
+
+static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
+{
+ return (h_cg == root_h_cgroup);
+}
+
+static struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
+{
+ if (!cg->parent)
+ return NULL;
+ return hugetlb_cgroup_from_cgroup(cg->parent);
+}
+
+static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
+{
+ int idx;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
+
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
+ if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+ return 1;
+ }
+ return 0;
+}
+
+static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
+{
+ int idx;
+ struct cgroup *parent_cgroup;
+ struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
+
+ h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+ if (!h_cgroup)
+ return ERR_PTR(-ENOMEM);
+
+ parent_cgroup = cgroup->parent;
+ if (parent_cgroup) {
+ parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx],
+ &parent_h_cgroup->hugepage[idx]);
+ } else {
+ root_h_cgroup = h_cgroup;
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx], NULL);
+ }
+ return &h_cgroup->css;
+}
+
+static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
+ struct page *page)
+{
+ int csize, ret = 0;
+ struct page_cgroup *pc;
+ struct res_counter *counter;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+ struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
+
+ if (!get_page_unless_zero(page))
+ goto out;
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ if (!PageCgroupUsed(pc) || pc->cgroup != cgroup)
+ goto err_out;
+
+ csize = PAGE_SIZE << compound_order(page);
+ /* If use_hierarchy == 0, we need to charge root */
+ if (!parent) {
+ parent = root_h_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);
+ }
+ counter = &h_cg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
+ pc->cgroup = cgroup->parent;
+err_out:
+ unlock_page_cgroup(pc);
+ put_page(page);
+out:
+ return ret;
+}
+
+/*
+ * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
+ * the parent cgroup.
+ */
+static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
+{
+ struct hstate *h;
+ struct page *page;
+ int ret = 0, idx = 0;
+
+ do {
+ if (cgroup_task_count(cgroup) ||
+ !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /*
+ * If the task doing the cgroup_rmdir got a signal
+ * we don't really need to loop till the hugetlb resource
+	 * usage becomes zero.
+ */
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ for_each_hstate(h) {
+ spin_lock(&hugetlb_lock);
+ list_for_each_entry(page, &h->hugepage_activelist, lru) {
+ ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
+ if (ret) {
+ spin_unlock(&hugetlb_lock);
+ goto out;
+ }
+ }
+ spin_unlock(&hugetlb_lock);
+ idx++;
+ }
+ cond_resched();
+ } while (hugetlb_cgroup_have_usage(cgroup));
+out:
+ return ret;
+}
+
+static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
+{
+ struct hugetlb_cgroup *h_cgroup;
+
+ h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
+ kfree(h_cgroup);
+}
+
+int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ int ret = 0;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *h_cg = NULL;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ goto done;
+again:
+ rcu_read_lock();
+ h_cg = hugetlb_cgroup_from_task(current);
+ if (!h_cg)
+ h_cg = root_h_cgroup;
+
+ if (!css_tryget(&h_cg->css)) {
+ rcu_read_unlock();
+ goto again;
+ }
+ rcu_read_unlock();
+
+ ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+ css_put(&h_cg->css);
+done:
+ *ptr = h_cg;
+ return ret;
+}
+
+void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ struct page_cgroup *pc;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+ if (unlikely(PageCgroupUsed(pc))) {
+ unlock_page_cgroup(pc);
+ hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
+ return;
+ }
+ pc->cgroup = h_cg->css.cgroup;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+ return;
+}
+
+void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page)
+{
+ struct page_cgroup *pc;
+ struct hugetlb_cgroup *h_cg;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!PageCgroupUsed(pc)))
+ return;
+
+ lock_page_cgroup(pc);
+ if (!PageCgroupUsed(pc)) {
+ unlock_page_cgroup(pc);
+ return;
+ }
+ h_cg = hugetlb_cgroup_from_cgroup(pc->cgroup);
+ pc->cgroup = root_h_cgroup->css.cgroup;
+ ClearPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
+void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
+struct cgroup_subsys hugetlb_subsys = {
+ .name = "hugetlb",
+ .create = hugetlb_cgroup_create,
+ .pre_destroy = hugetlb_cgroup_pre_destroy,
+ .destroy = hugetlb_cgroup_destroy,
+ .subsys_id = hugetlb_subsys_id,
+};
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 1ccbd71..26271b7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -10,6 +10,7 @@
#include <linux/cgroup.h>
#include <linux/swapops.h>
#include <linux/kmemleak.h>
+#include <linux/hugetlb_cgroup.h>
static unsigned long total_usage;
@@ -68,7 +69,7 @@ void __init page_cgroup_init_flatmem(void)
int nid, fail;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && hugetlb_cgroup_disabled())
return;
for_each_online_node(nid) {
@@ -268,7 +269,7 @@ void __init page_cgroup_init(void)
unsigned long pfn;
int nid;
- if (mem_cgroup_disabled())
+ if (mem_cgroup_disabled() && hugetlb_cgroup_disabled())
return;
for_each_node_state(nid, N_HIGH_MEMORY) {
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This adds the necessary charge/uncharge calls to the HugeTLB code. We
charge the hugetlb cgroup at page allocation and uncharge in the compound
page destructor. We also need to ignore HugeTLB pages in
__mem_cgroup_uncharge_common because it gets called from
delete_from_page_cache.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
mm/hugetlb.c | 16 +++++++++++++++-
mm/memcontrol.c | 5 +++++
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6330de2..cad7a4d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -625,6 +625,8 @@ static void free_huge_page(struct page *page)
BUG_ON(page_count(page));
BUG_ON(page_mapcount(page));
+ hugetlb_cgroup_uncharge_page(hstate_index(h),
+ pages_per_huge_page(h), page);
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
/* remove the page from active list */
@@ -1112,7 +1114,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *page;
long chg;
+ int ret, idx;
+ struct hugetlb_cgroup *h_cg;
+ idx = hstate_index(h);
/*
* Processes that did not create the mapping will have no
* reserves and will not have accounted against subpool
@@ -1128,6 +1133,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (hugepage_subpool_get_pages(spool, chg))
return ERR_PTR(-ENOSPC);
+ ret = hugetlb_cgroup_charge_page(idx, pages_per_huge_page(h), &h_cg);
+ if (ret) {
+ hugepage_subpool_put_pages(spool, chg);
+ return ERR_PTR(-ENOSPC);
+ }
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
spin_unlock(&hugetlb_lock);
@@ -1135,6 +1145,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
+ hugetlb_cgroup_uncharge_cgroup(idx,
+ pages_per_huge_page(h),
+ h_cg);
hugepage_subpool_put_pages(spool, chg);
return ERR_PTR(-ENOSPC);
}
@@ -1143,7 +1156,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long)spool);
vma_commit_reservation(h, vma, addr);
-
+ /* update page cgroup details */
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
return page;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6df019b..a52780b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2931,6 +2931,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
if (PageSwapCache(page))
return NULL;
+ /*
+	 * HugeTLB page uncharge happens in the HugeTLB compound page destructor
+ */
+ if (PageHuge(page))
+ return NULL;
if (PageTransHuge(page)) {
nr_pages <<= compound_order(page);
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
that VM_FAULT_* values do not exceed MAX_ERRNO. Decouple the VM_FAULT_*
values from MAX_ERRNO.
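The constraint comes from how errors are packed into pointers. A userspace re-implementation of the idiom (helpers and constants copied by hand for illustration; the real ones live in include/linux/err.h and include/linux/mm.h) shows why only values up to MAX_ERRNO survive the ERR_PTR round-trip, and how callers translate plain errno values back into fault codes:

```c
#include <errno.h>

#define MAX_ERRNO	4095
#define VM_FAULT_OOM	0x0001	/* illustrative copies of the kernel values */
#define VM_FAULT_SIGBUS	0x0002

/* ERR_PTR packs a small negative errno into the top of the address
 * space; IS_ERR recognizes anything in the last MAX_ERRNO bytes. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Caller-side translation, as done in the fault paths after this patch. */
int fault_code(const void *page)
{
	return (PTR_ERR(page) == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
}
```

Returning -ENOMEM/-ENOSPC from alloc_huge_page() and translating at the call site means the VM_FAULT_* values never need to fit inside MAX_ERRNO.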
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
mm/hugetlb.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e07d4cd..8ded02d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
*/
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
- return ERR_PTR(-VM_FAULT_OOM);
+ return ERR_PTR(-ENOMEM);
if (chg)
if (hugepage_subpool_get_pages(spool, chg))
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
hugepage_subpool_put_pages(spool, chg);
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);
}
}
@@ -2496,6 +2496,7 @@ retry_avoidcopy:
new_page = alloc_huge_page(vma, address, outside_reserve);
if (IS_ERR(new_page)) {
+ int err = PTR_ERR(new_page);
page_cache_release(old_page);
/*
@@ -2524,7 +2525,10 @@ retry_avoidcopy:
/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
- return -PTR_ERR(new_page);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ else
+ return VM_FAULT_SIGBUS;
}
/*
@@ -2642,7 +2646,11 @@ retry:
goto out;
page = alloc_huge_page(vma, address, 0);
if (IS_ERR(page)) {
- ret = -PTR_ERR(page);
+ ret = PTR_ERR(page);
+ if (ret == -ENOMEM)
+ ret = VM_FAULT_OOM;
+ else
+ ret = VM_FAULT_SIGBUS;
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
Documentation/cgroups/hugetlb.txt | 45 +++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
create mode 100644 Documentation/cgroups/hugetlb.txt
diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
new file mode 100644
index 0000000..a9faaca
--- /dev/null
+++ b/Documentation/cgroups/hugetlb.txt
@@ -0,0 +1,45 @@
+HugeTLB Controller
+-------------------
+
+The HugeTLB controller allows limiting HugeTLB usage per control group
+and enforces the limit during page fault. Since HugeTLB doesn't support
+page reclaim, enforcing the limit at page fault time implies that the
+application will get a SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how
+many HugeTLB pages it will need.
+
+The HugeTLB controller can be used by first mounting the cgroup filesystem.
+
+# mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup.
+
+# cd /sys/fs/cgroup
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files
+
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt			# show the number of allocation failures due to the HugeTLB limit
+
+For a system supporting two hugepage sizes (16MB and 16GB), the control
+files include:
+
+hugetlb.16GB.limit_in_bytes
+hugetlb.16GB.max_usage_in_bytes
+hugetlb.16GB.usage_in_bytes
+hugetlb.16GB.failcnt
+hugetlb.16MB.limit_in_bytes
+hugetlb.16MB.max_usage_in_bytes
+hugetlb.16MB.usage_in_bytes
+hugetlb.16MB.failcnt
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Add the control files for the hugetlb controller.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 ++
include/linux/hugetlb_cgroup.h | 6 ++
mm/hugetlb.c | 2 +
mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 143 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index dcd55c7..92f75a5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
struct ctl_table;
struct user_struct;
@@ -221,6 +222,10 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+ /* cgroup control files */
+ struct cftype cgroup_files[5];
+#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 5794be4..fbf8c5f 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -42,6 +42,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page);
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
+extern int hugetlb_cgroup_file_init(int idx) __init;
#else
static inline bool hugetlb_cgroup_disabled(void)
{
@@ -75,5 +76,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
{
return;
}
+
+static inline int __init hugetlb_cgroup_file_init(int idx)
+{
+ return 0;
+}
#endif /* CONFIG_CGROUP_HUGETLB_RES_CTLR */
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 53840dd..6330de2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
#include <linux/io.h>
#include <linux/hugetlb.h>
#include <linux/node.h>
+#include <linux/hugetlb_cgroup.h>
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1912,6 +1913,7 @@ void __init hugetlb_add_hstate(unsigned order)
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
+ hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
parsed_hstate = h;
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 3a288f7..49a3f20 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -19,6 +19,11 @@
#include <linux/page_cgroup.h>
#include <linux/hugetlb_cgroup.h>
+/* lifted from memcontrol.c */
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
struct cgroup_subsys hugetlb_subsys __read_mostly;
struct hugetlb_cgroup *root_h_cgroup __read_mostly;
@@ -271,6 +276,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}
+static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ u64 val;
+ char str[64];
+ int idx, name, len;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ val = res_counter_read_u64(&h_cg->hugepage[idx], name);
+ len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
+ return simple_read_from_buffer(buf, nbytes, ppos, str, len);
+}
+
+static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+ const char *buffer)
+{
+ int idx, name, ret;
+ unsigned long long val;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ switch (name) {
+ case RES_LIMIT:
+ if (hugetlb_cgroup_is_root(h_cg)) {
+ /* Can't set limit on root */
+ ret = -EINVAL;
+ break;
+ }
+		/* This function does all the necessary parsing; reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
+{
+ int idx, name, ret = 0;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(event);
+ name = MEMFILE_ATTR(event);
+
+ switch (name) {
+ case RES_MAX_USAGE:
+ res_counter_reset_max(&h_cg->hugepage[idx]);
+ break;
+ case RES_FAILCNT:
+ res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static char *mem_fmt(char *buf, int size, unsigned long hsize)
+{
+ if (hsize >= (1UL << 30))
+ snprintf(buf, size, "%luGB", hsize >> 30);
+ else if (hsize >= (1UL << 20))
+ snprintf(buf, size, "%luMB", hsize >> 20);
+ else
+ snprintf(buf, size, "%luKB", hsize >> 10);
+ return buf;
+}
+
+int __init hugetlb_cgroup_file_init(int idx)
+{
+ char buf[32];
+ struct cftype *cft;
+ struct hstate *h = &hstates[idx];
+
+ /* format the size */
+ mem_fmt(buf, 32, huge_page_size(h));
+
+ /* Add the limit file */
+ cft = &h->cgroup_files[0];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->read = hugetlb_cgroup_read;
+ cft->write_string = hugetlb_cgroup_write;
+
+ /* Add the usage file */
+ cft = &h->cgroup_files[1];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->read = hugetlb_cgroup_read;
+
+ /* Add the MAX usage file */
+ cft = &h->cgroup_files[2];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+	/* Add the failcnt file */
+ cft = &h->cgroup_files[3];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+ /* NULL terminate the last cft */
+ cft = &h->cgroup_files[4];
+ memset(cft, 0, sizeof(*cft));
+
+ WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
+
+ return 0;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
--
1.7.10
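The cftype->private encoding and the size formatting in this patch are self-contained enough to exercise in userspace; the macros and mem_fmt() below are copied from the patch (the surrounding harness is ours):

```c
#include <stdio.h>
#include <string.h>

/* hstate index in the high 16 bits, resource attribute in the low 16. */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

/* Format a huge page size with a GB, MB or KB suffix, as in the patch. */
static char *mem_fmt(char *buf, int size, unsigned long hsize)
{
	if (hsize >= (1UL << 30))
		snprintf(buf, size, "%luGB", hsize >> 30);
	else if (hsize >= (1UL << 20))
		snprintf(buf, size, "%luMB", hsize >> 20);
	else
		snprintf(buf, size, "%luKB", hsize >> 10);
	return buf;
}
```

Packing both the hstate index and the resource attribute into one integer lets a single read/write/trigger handler serve every per-size control file.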
From: "Aneesh Kumar K.V" <[email protected]>
We will use it later to make page_cgroup track the hugetlb cgroup information.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/mmzone.h | 2 +-
include/linux/page_cgroup.h | 8 ++++----
init/Kconfig | 4 ++++
mm/Makefile | 3 ++-
mm/memcontrol.c | 42 +++++++++++++++++++++++++-----------------
5 files changed, 36 insertions(+), 23 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..2483cc5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1052,7 +1052,7 @@ struct mem_section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_CGROUP
/*
* If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
* section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index a88cdba..7bbfe37 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -12,7 +12,7 @@ enum {
#ifndef __GENERATING_BOUNDS_H
#include <generated/bounds.h>
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_PAGE_CGROUP
#include <linux/bit_spinlock.h>
/*
@@ -24,7 +24,7 @@ enum {
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
+ struct cgroup *cgroup;
};
void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -82,7 +82,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_PAGE_CGROUP */
struct page_cgroup;
static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -102,7 +102,7 @@ static inline void __init page_cgroup_init_flatmem(void)
{
}
-#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+#endif /* CONFIG_PAGE_CGROUP */
#include <linux/swap.h>
diff --git a/init/Kconfig b/init/Kconfig
index 81816b8..1363203 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -687,10 +687,14 @@ config RESOURCE_COUNTERS
This option enables controller independent resource accounting
infrastructure that works with cgroups.
+config PAGE_CGROUP
+ bool
+
config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS
select MM_OWNER
+ select PAGE_CGROUP
help
Provides a memory resource controller that manages both anonymous
memory and page cache. (See Documentation/cgroups/memory.txt)
diff --git a/mm/Makefile b/mm/Makefile
index a156285..a70f9a9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,8 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_PAGE_CGROUP) += page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac35bcc..6df019b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -864,6 +864,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
+ if (!cont)
+ return NULL;
return container_of(cgroup_subsys_state(cont,
mem_cgroup_subsys_id), struct mem_cgroup,
css);
@@ -1097,7 +1099,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
return &zone->lruvec;
pc = lookup_page_cgroup(page);
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
/*
* Surreptitiously switch any uncharged offlist page to root:
@@ -1108,8 +1110,10 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
* under page_cgroup lock: between them, they make all uses
* of pc->mem_cgroup safe.
*/
- if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup)
- pc->mem_cgroup = memcg = root_mem_cgroup;
+ if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup) {
+ memcg = root_mem_cgroup;
+ pc->cgroup = memcg->css.cgroup;
+ }
mz = page_cgroup_zoneinfo(memcg, page);
return &mz->lruvec;
@@ -1889,12 +1893,14 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
void __mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
+ struct cgroup *cgroup;
struct mem_cgroup *memcg;
struct page_cgroup *pc;
pc = lookup_page_cgroup(page);
again:
- memcg = pc->mem_cgroup;
+ cgroup = pc->cgroup;
+ memcg = mem_cgroup_from_cont(cgroup);
if (unlikely(!memcg || !PageCgroupUsed(pc)))
return;
/*
@@ -1907,7 +1913,7 @@ again:
return;
move_lock_mem_cgroup(memcg, flags);
- if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+ if (cgroup != pc->cgroup || !PageCgroupUsed(pc)) {
move_unlock_mem_cgroup(memcg, flags);
goto again;
}
@@ -1923,7 +1929,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
* lock is held because a routine modifies pc->mem_cgroup
* should take move_lock_page_cgroup().
*/
- move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+ move_unlock_mem_cgroup(mem_cgroup_from_cont(pc->cgroup), flags);
}
void mem_cgroup_update_page_stat(struct page *page,
@@ -1936,7 +1942,7 @@ void mem_cgroup_update_page_stat(struct page *page,
if (mem_cgroup_disabled())
return;
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
if (unlikely(!memcg || !PageCgroupUsed(pc)))
return;
@@ -2444,7 +2450,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
if (memcg && !css_tryget(&memcg->css))
memcg = NULL;
} else if (PageSwapCache(page)) {
@@ -2491,14 +2497,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page)) {
- lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+ lruvec = mem_cgroup_zone_lruvec(zone,
+ mem_cgroup_from_cont(pc->cgroup));
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
was_on_lru = true;
}
}
- pc->mem_cgroup = memcg;
+ pc->cgroup = memcg->css.cgroup;
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
* Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -2511,7 +2518,8 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
if (lrucare) {
if (was_on_lru) {
- lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+ lruvec = mem_cgroup_zone_lruvec(zone,
+ mem_cgroup_from_cont(pc->cgroup));
VM_BUG_ON(PageLRU(page));
SetPageLRU(page);
add_page_to_lru_list(page, lruvec, page_lru(page));
@@ -2601,7 +2609,7 @@ static int mem_cgroup_move_account(struct page *page,
lock_page_cgroup(pc);
ret = -EINVAL;
- if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
+ if (!PageCgroupUsed(pc) || pc->cgroup != from->css.cgroup)
goto unlock;
move_lock_mem_cgroup(from, &flags);
@@ -2616,7 +2624,7 @@ static int mem_cgroup_move_account(struct page *page,
mem_cgroup_charge_statistics(from, anon, -nr_pages);
/* caller should have done css_get */
- pc->mem_cgroup = to;
+ pc->cgroup = to->css.cgroup;
mem_cgroup_charge_statistics(to, anon, nr_pages);
/*
* We charges against "to" which may not have any tasks. Then, "to"
@@ -2937,7 +2945,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
lock_page_cgroup(pc);
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
if (!PageCgroupUsed(pc))
goto unlock_out;
@@ -3183,7 +3191,7 @@ int mem_cgroup_prepare_migration(struct page *page,
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
css_get(&memcg->css);
/*
* At migrating an anonymous page, its mapcount goes down
@@ -3328,7 +3336,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
/* fix accounting on old pages */
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- memcg = pc->mem_cgroup;
+ memcg = mem_cgroup_from_cont(pc->cgroup);
mem_cgroup_charge_statistics(memcg, false, -1);
ClearPageCgroupUsed(pc);
}
@@ -5135,7 +5143,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
* mem_cgroup_move_account() checks the pc is valid or not under
* the lock.
*/
- if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+ if (PageCgroupUsed(pc) && pc->cgroup == mc.from->css.cgroup) {
ret = MC_TARGET_PAGE;
if (target)
target->page = page;
--
1.7.10
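For readers unfamiliar with the indirection this patch introduces: page_cgroup now stores a generic struct cgroup pointer, and each consumer recovers its own controller state via mem_cgroup_from_cont(), which is built on container_of(). Below is a minimal userspace sketch of that recovery pattern; the struct layouts are simplified stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Same trick the kernel's container_of() uses: subtract the member's
 * offset from the member's address to get the enclosing object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cgroup_subsys_state { int refcnt; };

/* stand-in for struct mem_cgroup with its embedded css */
struct mem_cgroup {
	long usage;
	struct cgroup_subsys_state css;
};

/* Recover the owning mem_cgroup from its embedded css; returns NULL
 * for a NULL css, mirroring the NULL check the patch adds to
 * mem_cgroup_from_cont(). */
static struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
{
	if (!css)
		return NULL;
	return container_of(css, struct mem_cgroup, css);
}
```

The NULL check matters because pc->cgroup can legitimately be NULL for a page that was never charged.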
From: "Aneesh Kumar K.V" <[email protected]>
With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
destructor. Since we are holding a hugepage reference, we can be sure the old
page won't get uncharged until the last put_page(). On successful migration, we
can move the hugetlb cgroup information to the new page's page_cgroup and mark
the old page's page_cgroup unused.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
include/linux/hugetlb_cgroup.h | 8 ++++++++
mm/hugetlb_cgroup.c | 30 ++++++++++++++++++++++++++++++
mm/migrate.c | 5 +++++
3 files changed, 43 insertions(+)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index fbf8c5f..387cbd6 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -43,6 +43,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
extern int hugetlb_cgroup_file_init(int idx) __init;
+extern void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage);
#else
static inline bool hugetlb_cgroup_disabled(void)
{
@@ -81,5 +83,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
{
return 0;
}
+
+static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage)
+{
+ return;
+}
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 49a3f20..f99007b 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -401,6 +401,36 @@ int __init hugetlb_cgroup_file_init(int idx)
return 0;
}
+void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
+{
+ struct cgroup *cg;
+ struct page_cgroup *pc;
+ struct hugetlb_cgroup *h_cg;
+
+ VM_BUG_ON(!PageHuge(oldhpage));
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ pc = lookup_page_cgroup(oldhpage);
+ lock_page_cgroup(pc);
+ cg = pc->cgroup;
+ h_cg = hugetlb_cgroup_from_cgroup(cg);
+ pc->cgroup = root_h_cgroup->css.cgroup;
+ ClearPageCgroupUsed(pc);
+ cgroup_exclude_rmdir(&h_cg->css);
+ unlock_page_cgroup(pc);
+
+ /* move the h_cg details to new cgroup */
+ pc = lookup_page_cgroup(newhpage);
+ lock_page_cgroup(pc);
+ pc->cgroup = cg;
+ SetPageCgroupUsed(pc);
+ unlock_page_cgroup(pc);
+ cgroup_release_and_wakeup_rmdir(&h_cg->css);
+ return;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
diff --git a/mm/migrate.c b/mm/migrate.c
index 927254c..22f414f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -33,6 +33,7 @@
#include <linux/memcontrol.h>
#include <linux/syscalls.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
#include <asm/tlbflush.h>
@@ -928,6 +929,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (anon_vma)
put_anon_vma(anon_vma);
+
+ if (!rc)
+ hugetlb_cgroup_migrate(hpage, new_hpage);
+
unlock_page(hpage);
out:
put_page(new_hpage);
--
1.7.10
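The charge move in hugetlb_cgroup_migrate() above follows a simple pattern: under the page_cgroup lock, detach the cgroup from the old page (re-pointing it at root so it stays valid) and mark it unused, then attach the same cgroup to the new page and mark it used. A userspace sketch of that hand-off, with simplified stand-in types and no locking:

```c
#include <assert.h>
#include <stdbool.h>

struct cgroup { int id; };

/* stand-in for struct page_cgroup: a cgroup pointer plus the Used flag */
struct page_cgroup {
	struct cgroup *cgroup;
	bool used;
};

/* Transfer the charge from old_pc to new_pc; old_pc falls back to the
 * root cgroup, as hugetlb_cgroup_migrate() does in the patch. */
static void move_charge(struct page_cgroup *old_pc,
			struct page_cgroup *new_pc, struct cgroup *root)
{
	struct cgroup *cg = old_pc->cgroup;

	/* old page: detach, point at root, clear Used */
	old_pc->cgroup = root;
	old_pc->used = false;

	/* new page: inherit the charge, set Used */
	new_pc->cgroup = cg;
	new_pc->used = true;
}
```

In the kernel version, cgroup_exclude_rmdir()/cgroup_release_and_wakeup_rmdir() additionally keep the source cgroup from being removed while the move is in flight.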
From: "Aneesh Kumar K.V" <[email protected]>
Export hugetlb_lock, hugetlb_max_hstate and the for_each_hstate() helper; we
will use them later in hugetlb_cgroup.c.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 7 ++-----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c4353ea..dcd55c7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -21,6 +21,11 @@ struct hugepage_subpool {
long max_hpages, used_hpages;
};
+extern spinlock_t hugetlb_lock;
+extern int hugetlb_max_hstate;
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
+
struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
void hugepage_put_subpool(struct hugepage_subpool *spool);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0f38728..53840dd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_max_hstate;
+int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
@@ -46,13 +46,10 @@ static struct hstate * __initdata parsed_hstate;
static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
-#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
-
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
-static DEFINE_SPINLOCK(hugetlb_lock);
+DEFINE_SPINLOCK(hugetlb_lock);
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
--
1.7.10
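The newly exported for_each_hstate() just walks the global hstates[] array up to hugetlb_max_hstate. A self-contained sketch of how a caller (such as hugetlb_cgroup.c) would iterate it; the add_hstate_sketch() helper is a simplified stand-in for hugetlb_add_hstate():

```c
#include <assert.h>

struct hstate { unsigned int order; };

#define HUGE_MAX_HSTATE 4
static struct hstate hstates[HUGE_MAX_HSTATE];
static int hugetlb_max_hstate;

/* iterate only the hstates registered so far */
#define for_each_hstate(h) \
	for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)

/* simplified stand-in for hugetlb_add_hstate() registration */
static struct hstate *add_hstate_sketch(unsigned int order)
{
	struct hstate *h = &hstates[hugetlb_max_hstate++];

	h->order = order;
	return h;
}

/* example consumer: visit every registered hstate */
static unsigned int sum_orders(void)
{
	struct hstate *h;
	unsigned int sum = 0;

	for_each_hstate(h)
		sum += h->order;
	return sum;
}
```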
From: "Aneesh Kumar K.V" <[email protected]>
Use an mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range. This also allows us to get rid of
the i_mmap_mutex in unmap_hugepage_range in the following patch.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
include/linux/hugetlb.h | 22 ++++++++++++++----
mm/hugetlb.c | 59 ++++++++++++++++++++++++++++-------------------
mm/memory.c | 7 ++++--
4 files changed, 59 insertions(+), 33 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cc9281b..ff233e4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
else
v_offset = 0;
- __unmap_hugepage_range(vma,
- vma->vm_start + v_offset, vma->vm_end, NULL);
+ unmap_hugepage_range(vma, vma->vm_start + v_offset,
+ vma->vm_end, NULL);
}
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 217f528..c21e136 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -7,6 +7,7 @@
struct ctl_table;
struct user_struct;
+struct mmu_gather;
#ifdef CONFIG_HUGETLB_PAGE
@@ -40,9 +41,10 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
struct page **, struct vm_area_struct **,
unsigned long *, int *, int, unsigned int flags);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vms,
+ unsigned long start, unsigned long end,
+ struct page *ref_page);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(int, char *);
@@ -98,7 +100,6 @@ static inline unsigned long hugetlb_total_pages(void)
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end, page) BUG()
static inline void hugetlb_report_meminfo(struct seq_file *m)
{
}
@@ -112,13 +113,24 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
#define huge_pte_offset(mm, address) 0
-#define dequeue_hwpoisoned_huge_page(page) 0
+static inline int dequeue_hwpoisoned_huge_page(struct page *page)
+{
+ return 0;
+}
+
static inline void copy_huge_page(struct page *dst, struct page *src)
{
}
#define hugetlb_change_protection(vma, address, end, newprot)
+static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct page *ref_page)
+{
+ BUG();
+}
+
#endif /* !CONFIG_HUGETLB_PAGE */
#define HUGETLB_ANON_FILE "anon_hugepage"
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b97a5c..704a269 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,8 +24,9 @@
#include <asm/page.h>
#include <asm/pgtable.h>
-#include <linux/io.h>
+#include <asm/tlb.h>
+#include <linux/io.h>
#include <linux/hugetlb.h>
#include <linux/node.h>
#include "internal.h"
@@ -2310,30 +2311,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
return 0;
}
-void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page)
{
+ int force_flush = 0;
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
pte_t *ptep;
pte_t pte;
struct page *page;
- struct page *tmp;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- /*
- * A page gathering list, protected by per file i_mmap_mutex. The
- * lock is used to avoid list corruption from multiple unmapping
- * of the same page since we are using page->lru.
- */
- LIST_HEAD(page_list);
-
WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));
+ tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, start, end);
+again:
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2372,30 +2369,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
}
pte = huge_ptep_get_and_clear(mm, address, ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address);
if (pte_dirty(pte))
set_page_dirty(page);
- list_add(&page->lru, &page_list);
+ page_remove_rmap(page);
+ force_flush = !__tlb_remove_page(tlb, page);
+ if (force_flush)
+ break;
/* Bail out after unmapping reference page if supplied */
if (ref_page)
break;
}
- flush_tlb_range(vma, start, end);
spin_unlock(&mm->page_table_lock);
- mmu_notifier_invalidate_range_end(mm, start, end);
- list_for_each_entry_safe(page, tmp, &page_list, lru) {
- page_remove_rmap(page);
- list_del(&page->lru);
- put_page(page);
+ /*
+ * mmu_gather ran out of room to batch pages, we break out of
+ * the PTE lock to avoid doing the potential expensive TLB invalidate
+ * and page-free while holding it.
+ */
+ if (force_flush) {
+ force_flush = 0;
+ tlb_flush_mmu(tlb);
+ if (address < end && !ref_page)
+ goto again;
}
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_end_vma(tlb, vma);
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page)
{
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- __unmap_hugepage_range(vma, start, end, ref_page);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ struct mm_struct *mm;
+ struct mmu_gather tlb;
+
+ mm = vma->vm_mm;
+
+ tlb_gather_mmu(&tlb, mm, 0);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ tlb_finish_mmu(&tlb, start, end);
}
/*
@@ -2440,9 +2452,8 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* from the time of fork. This would look like data corruption
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
- __unmap_hugepage_range(iter_vma,
- address, address + huge_page_size(h),
- page);
+ unmap_hugepage_range(iter_vma, address,
+ address + huge_page_size(h), page);
}
mutex_unlock(&mapping->i_mmap_mutex);
diff --git a/mm/memory.c b/mm/memory.c
index 1b7dc66..545e18a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file)
- unmap_hugepage_range(vma, start, end, NULL);
+ if (vma->vm_file) {
+ mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ __unmap_hugepage_range(tlb, vma, start, end, NULL);
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10
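The mmu_gather contract this patch adopts is worth spelling out: __tlb_remove_page() queues a page and reports when the batch is full; the caller then drops the page-table lock, flushes (TLB invalidate plus page frees), and resumes from where it left off. A minimal userspace sketch of that batch-and-flush loop, with stand-in types rather than the kernel mmu_gather:

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_MAX 8

struct mmu_gather_sketch {
	int nr;			/* pages queued in the current batch */
	int flushed;		/* total pages released by flushes */
};

/* queue one page; returns true while there is still room in the batch
 * (mirrors __tlb_remove_page()'s return convention) */
static bool tlb_remove_page_sketch(struct mmu_gather_sketch *tlb)
{
	tlb->nr++;
	return tlb->nr < BATCH_MAX;
}

/* flush the batch: in the kernel this invalidates the TLB and frees
 * the queued pages */
static void tlb_flush_mmu_sketch(struct mmu_gather_sketch *tlb)
{
	tlb->flushed += tlb->nr;
	tlb->nr = 0;
}

/* unmap `count` pages, flushing whenever the batch fills up */
static int unmap_range_sketch(struct mmu_gather_sketch *tlb, int count)
{
	int i, force_flush = 0;

	for (i = 0; i < count; i++) {
		force_flush = !tlb_remove_page_sketch(tlb);
		if (force_flush) {
			/* the kernel drops page_table_lock here before
			 * doing the expensive flush, then relocks */
			tlb_flush_mmu_sketch(tlb);
			force_flush = 0;
		}
	}
	tlb_flush_mmu_sketch(tlb);	/* final partial batch */
	return tlb->flushed;
}
```

This is why the patch can drop the temporary page_list and its use of page->lru: the gather structure, not the page, carries the batching state.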
From: "Aneesh Kumar K.V" <[email protected]>
Add an inline hstate_index() helper and use it in place of the open-coded
'h - hstates' pointer arithmetic.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 20 +++++++++++---------
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5d6bbe..217f528 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -302,6 +302,11 @@ static inline unsigned hstate_index_to_shift(unsigned index)
return hstates[index].order + PAGE_SHIFT;
}
+static inline int hstate_index(struct hstate *h)
+{
+ return h - hstates;
+}
+
#else
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -320,6 +325,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
return 1;
}
#define hstate_index_to_shift(index) 0
+#define hstate_index(h) 0
#endif
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ded02d..9b97a5c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1646,7 +1646,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct attribute_group *hstate_attr_group)
{
int retval;
- int hi = h - hstates;
+ int hi = hstate_index(h);
hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
if (!hstate_kobjs[hi])
@@ -1741,11 +1741,13 @@ void hugetlb_unregister_node(struct node *node)
if (!nhs->hugepages_kobj)
return; /* no hstate attributes */
- for_each_hstate(h)
- if (nhs->hstate_kobjs[h - hstates]) {
- kobject_put(nhs->hstate_kobjs[h - hstates]);
- nhs->hstate_kobjs[h - hstates] = NULL;
+ for_each_hstate(h) {
+ int idx = hstate_index(h);
+ if (nhs->hstate_kobjs[idx]) {
+ kobject_put(nhs->hstate_kobjs[idx]);
+ nhs->hstate_kobjs[idx] = NULL;
}
+ }
kobject_put(nhs->hugepages_kobj);
nhs->hugepages_kobj = NULL;
@@ -1848,7 +1850,7 @@ static void __exit hugetlb_exit(void)
hugetlb_unregister_all_nodes();
for_each_hstate(h) {
- kobject_put(hstate_kobjs[h - hstates]);
+ kobject_put(hstate_kobjs[hstate_index(h)]);
}
kobject_put(hugepages_kobj);
@@ -1869,7 +1871,7 @@ static int __init hugetlb_init(void)
if (!size_to_hstate(default_hstate_size))
hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
}
- default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
if (default_hstate_max_huge_pages)
default_hstate.max_huge_pages = default_hstate_max_huge_pages;
@@ -2687,7 +2689,7 @@ retry:
*/
if (unlikely(PageHWPoison(page))) {
ret = VM_FAULT_HWPOISON |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto backout_unlocked;
}
}
@@ -2760,7 +2762,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
}
ptep = huge_pte_alloc(mm, address, huge_page_size(h));
--
1.7.10
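The helper works because C pointer subtraction between elements of the same array yields an element count, not a byte count. A tiny sketch of the arithmetic, with a simplified hstate:

```c
#include <assert.h>

struct hstate { unsigned int order; };

#define HUGE_MAX_HSTATE 3
static struct hstate hstates[HUGE_MAX_HSTATE];

/* index of h within the global hstates[] array; the subtraction is
 * automatically scaled by sizeof(struct hstate) */
static int hstate_index(struct hstate *h)
{
	return h - hstates;
}
```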
From: "Aneesh Kumar K.V" <[email protected]>
hugepage_activelist will be used to track currently in-use HugeTLB pages.
We need to find the in-use HugeTLB pages to support cgroup removal; on
cgroup removal we update each such page's cgroup to point to the parent
cgroup.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 12 +++++++-----
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c21e136..c4353ea 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -211,6 +211,7 @@ struct hstate {
unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_activelist;
struct list_head hugepage_freelists[MAX_NUMNODES];
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 704a269..0f38728 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -510,7 +510,7 @@ void copy_huge_page(struct page *dst, struct page *src)
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &h->hugepage_freelists[nid]);
+ list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
}
@@ -522,7 +522,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
if (list_empty(&h->hugepage_freelists[nid]))
return NULL;
page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
- list_del(&page->lru);
+ list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
@@ -626,10 +626,11 @@ static void free_huge_page(struct page *page)
page->mapping = NULL;
BUG_ON(page_count(page));
BUG_ON(page_mapcount(page));
- INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+ /* remove the page from active list */
+ list_del(&page->lru);
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -642,6 +643,7 @@ static void free_huge_page(struct page *page)
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
+ INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
@@ -890,6 +892,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
spin_lock(&hugetlb_lock);
if (page) {
+ INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
/*
@@ -994,7 +997,6 @@ retry:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
if ((--needed) < 0)
break;
- list_del(&page->lru);
/*
* This page is now managed by the hugetlb allocator and has
* no users -- drop the buddy allocator's reference.
@@ -1009,7 +1011,6 @@ free:
/* Free unnecessary surplus pages to the buddy allocator */
if (!list_empty(&surplus_list)) {
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
- list_del(&page->lru);
put_page(page);
}
}
@@ -1909,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_activelist);
h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
--
1.7.10
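The switch from list_add()/list_del() to list_move() in this patch relies on an invariant the activelist introduces: a hugepage now always sits on exactly one list (free list or active list), so moving it is a single splice with no separate init/delete. A minimal userspace sketch of the kernel-style intrusive list operations involved; this is a simplified reimplementation, not the kernel's list.h:

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *l)
{
	l->next = l->prev = l;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* list_move = unlink from the current list, relink onto another; valid
 * only because the entry is guaranteed to be on some list already */
static void list_move(struct list_head *entry, struct list_head *head)
{
	list_del(entry);
	list_add(entry, head);
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}
```

This is also why free_huge_page() no longer needs INIT_LIST_HEAD(&page->lru): the page arrives still linked on the activelist and is either list_del()'d for freeing or list_move()'d back to a free list.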
From: "Aneesh Kumar K.V" <[email protected]>
Since we migrate only one hugepage at a time, don't use a linked list to
pass the page around. Instead, pass the page that needs to be migrated
directly as an argument. This also removes the use of page->lru in the
migrate path.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
include/linux/migrate.h | 4 +--
mm/memory-failure.c | 13 ++--------
mm/migrate.c | 65 +++++++++++++++--------------------------------
3 files changed, 25 insertions(+), 57 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 855c337..ce7e667 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
-extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+extern int migrate_huge_page(struct page *, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
@@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
-static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..53a1495 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
int ret;
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);
- LIST_HEAD(pagelist);
ret = get_any_page(page, pfn, flags);
if (ret < 0)
@@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
}
/* Keep page count to indicate a given hugepage is isolated. */
-
- list_add(&hpage->lru, &pagelist);
- ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
- true);
+ ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
+ put_page(hpage);
if (ret) {
- struct page *page1, *page2;
- list_for_each_entry_safe(page1, page2, &pagelist, lru)
- put_page(page1);
-
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
- if (ret > 0)
- ret = -EIO;
return ret;
}
done:
diff --git a/mm/migrate.c b/mm/migrate.c
index ab81d48..927254c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -929,15 +929,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (anon_vma)
put_anon_vma(anon_vma);
unlock_page(hpage);
-
out:
- if (rc != -EAGAIN) {
- list_del(&hpage->lru);
- put_page(hpage);
- }
-
put_page(new_hpage);
-
if (result) {
if (rc)
*result = rc;
@@ -1013,48 +1006,32 @@ out:
return nr_failed + retry;
}
-int migrate_huge_pages(struct list_head *from,
- new_page_t get_new_page, unsigned long private, bool offlining,
- enum migrate_mode mode)
+int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
+ unsigned long private, bool offlining,
+ enum migrate_mode mode)
{
- int retry = 1;
- int nr_failed = 0;
- int pass = 0;
- struct page *page;
- struct page *page2;
- int rc;
-
- for (pass = 0; pass < 10 && retry; pass++) {
- retry = 0;
-
- list_for_each_entry_safe(page, page2, from, lru) {
+ int pass, rc;
+
+ for (pass = 0; pass < 10; pass++) {
+ rc = unmap_and_move_huge_page(get_new_page,
+ private, hpage, pass > 2, offlining,
+ mode);
+ switch (rc) {
+ case -ENOMEM:
+ goto out;
+ case -EAGAIN:
+ /* try again */
cond_resched();
-
- rc = unmap_and_move_huge_page(get_new_page,
- private, page, pass > 2, offlining,
- mode);
-
- switch(rc) {
- case -ENOMEM:
- goto out;
- case -EAGAIN:
- retry++;
- break;
- case 0:
- break;
- default:
- /* Permanent failure */
- nr_failed++;
- break;
- }
+ break;
+ case 0:
+ goto out;
+ default:
+ rc = -EIO;
+ goto out;
}
}
- rc = 0;
out:
- if (rc)
- return rc;
-
- return nr_failed + retry;
+ return rc;
}
#ifdef CONFIG_NUMA
--
1.7.10
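The simplified retry loop in migrate_huge_page() has three distinct exits: -ENOMEM and success stop immediately, -EAGAIN retries for up to 10 passes, and any other error is collapsed to a permanent -EIO. A userspace sketch of that control flow; attempt() is a stand-in for unmap_and_move_huge_page(), and succeed_on_third() is a hypothetical callback for illustration:

```c
#include <assert.h>

#define EAGAIN 11
#define ENOMEM 12
#define EIO	 5

/* retry a single-page migration attempt, mirroring the patch's loop */
static int migrate_one(int (*attempt)(int pass, void *ctx), void *ctx)
{
	int pass, rc = -EAGAIN;

	for (pass = 0; pass < 10; pass++) {
		rc = attempt(pass, ctx);
		switch (rc) {
		case -ENOMEM:
			goto out;	/* no point retrying */
		case -EAGAIN:
			continue;	/* transient: try again */
		case 0:
			goto out;	/* migrated */
		default:
			rc = -EIO;	/* permanent failure */
			goto out;
		}
	}
out:
	return rc;
}

static int calls;

/* example callback: transient failure twice, then success */
static int succeed_on_third(int pass, void *ctx)
{
	(void)pass;
	(void)ctx;
	return ++calls < 3 ? -EAGAIN : 0;
}
```

Collapsing permanent failures to -EIO is what lets soft_offline_huge_page() drop its "if (ret > 0) ret = -EIO" fixup.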
From: "Aneesh Kumar K.V" <[email protected]>
Rename max_hstate to hugetlb_max_hstate. We will be using this from other
subsystems, like the hugetlb controller, in later patches.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
mm/hugetlb.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 285a81e..e07d4cd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int max_hstate;
+static int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
@@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
return;
}
- BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
- h = &hstates[max_hstate++];
+ h = &hstates[hugetlb_max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
h->nr_huge_pages = 0;
@@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
static unsigned long *last_mhp;
/*
- * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+ * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
* so this hugepages= parameter goes to the "default hstate".
*/
- if (!max_hstate)
+ if (!hugetlb_max_hstate)
mhp = &default_hstate_max_huge_pages;
else
mhp = &parsed_hstate->max_huge_pages;
@@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
* But we need to allocate >= MAX_ORDER hstates here early to still
* use the bootmem allocator.
*/
- if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+ if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
hugetlb_hstate_alloc_pages(parsed_hstate);
last_mhp = mhp;
--
1.7.10
On Wed, May 30, 2012 at 08:08:46PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Rename max_hstate to hugetlb_max_hstate. We will be using this from other
> subsystems like hugetlb controller in later patches.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Hillf Danton <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Johannes Weiner <[email protected]>
Your SOB needs to be the last thing.
> ---
> mm/hugetlb.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 285a81e..e07d4cd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
>
> -static int max_hstate;
> +static int hugetlb_max_hstate;
> unsigned int default_hstate_idx;
> struct hstate hstates[HUGE_MAX_HSTATE];
>
> @@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
> static unsigned long __initdata default_hstate_size;
>
> #define for_each_hstate(h) \
> - for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
> + for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> @@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
> printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
> return;
> }
> - BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> + BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
> BUG_ON(order == 0);
> - h = &hstates[max_hstate++];
> + h = &hstates[hugetlb_max_hstate++];
> h->order = order;
> h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> h->nr_huge_pages = 0;
> @@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
> static unsigned long *last_mhp;
>
> /*
> - * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
> + * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
> * so this hugepages= parameter goes to the "default hstate".
> */
> - if (!max_hstate)
> + if (!hugetlb_max_hstate)
> mhp = &default_hstate_max_huge_pages;
> else
> mhp = &parsed_hstate->max_huge_pages;
> @@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
> * But we need to allocate >= MAX_ORDER hstates here early to still
> * use the bootmem allocator.
> */
> - if (max_hstate && parsed_hstate->order >= MAX_ORDER)
> + if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
> hugetlb_hstate_alloc_pages(parsed_hstate);
>
> last_mhp = mhp;
> --
> 1.7.10
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
On Wed, May 30, 2012 at 08:08:47PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> VM_FAULT_* values from MAX_ERRNO.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Hillf Danton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> ---
> mm/hugetlb.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e07d4cd..8ded02d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> */
> chg = vma_needs_reservation(h, vma, addr);
> if (chg < 0)
> - return ERR_PTR(-VM_FAULT_OOM);
> + return ERR_PTR(-ENOMEM);
> if (chg)
> if (hugepage_subpool_get_pages(spool, chg))
> - return ERR_PTR(-VM_FAULT_SIGBUS);
> + return ERR_PTR(-ENOSPC);
Not enough space? Why not just pass what 'hugepage_subpool_get_pages'
returns?
>
> spin_lock(&hugetlb_lock);
> page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> @@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> if (!page) {
> hugepage_subpool_put_pages(spool, chg);
> - return ERR_PTR(-VM_FAULT_SIGBUS);
> + return ERR_PTR(-ENOSPC);
-ENOMEM seems more appropriate?
> }
> }
>
> @@ -2496,6 +2496,7 @@ retry_avoidcopy:
> new_page = alloc_huge_page(vma, address, outside_reserve);
>
> if (IS_ERR(new_page)) {
> + int err = PTR_ERR(new_page);
> page_cache_release(old_page);
>
> /*
> @@ -2524,7 +2525,10 @@ retry_avoidcopy:
>
> /* Caller expects lock to be held */
> spin_lock(&mm->page_table_lock);
> - return -PTR_ERR(new_page);
> + if (err == -ENOMEM)
> + return VM_FAULT_OOM;
> + else
> + return VM_FAULT_SIGBUS;
Ah, you are doing it to translate it.
Perhaps you should return -EFAULT for the really bad case
where you need to do OOM and then for all the other cases
return SIGBUS? Or maybe the other way around? ENOSPC doesn't
seem like the right error.
> }
>
> /*
> @@ -2642,7 +2646,11 @@ retry:
> goto out;
> page = alloc_huge_page(vma, address, 0);
> if (IS_ERR(page)) {
> - ret = -PTR_ERR(page);
> + ret = PTR_ERR(page);
> + if (ret == -ENOMEM)
> + ret = VM_FAULT_OOM;
> + else
> + ret = VM_FAULT_SIGBUS;
> goto out;
> }
> clear_huge_page(page, address, pages_per_huge_page(h));
> --
> 1.7.10
>
On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Rename max_hstate to hugetlb_max_hstate. We will be using this from other
> subsystems like hugetlb controller in later patches.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Hillf Danton <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> VM_FAULT_* values from MAX_ERRNO.
>
Yeah, but is there a reason for using VM_FAULT_HWPOISON_LARGE_MASK since
that's the only VM_FAULT_* value that is greater than MAX_ERRNO? The rest
of your patch set doesn't require this, so I think this change should just
be dropped. (And PTR_ERR() still returns long, this wasn't fixed from my
original review.)
On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add an inline helper and use it in the code.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Hillf Danton <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
> +static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
> +{
> + int idx;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
> +
> + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
> + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> + return 1;
return true;
> + }
> + return 0;
And return false here
> +}
> +
> +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
> +{
> + int idx;
> + struct cgroup *parent_cgroup;
> + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
> +
> + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> + if (!h_cgroup)
> + return ERR_PTR(-ENOMEM);
> +
No need to check cgroup for NULL?
> + parent_cgroup = cgroup->parent;
> + if (parent_cgroup) {
> + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
> + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> + res_counter_init(&h_cgroup->hugepage[idx],
> + &parent_h_cgroup->hugepage[idx]);
> + } else {
> + root_h_cgroup = h_cgroup;
> + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> + res_counter_init(&h_cgroup->hugepage[idx], NULL);
> + }
> + return &h_cgroup->css;
> +}
> +
> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> + struct page *page)
> +{
> + int csize, ret = 0;
> + struct page_cgroup *pc;
> + struct res_counter *counter;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> +
> + if (!get_page_unless_zero(page))
> + goto out;
Hmm, so it goes to out, and does return ret. ret is zero. Is
that correct? Should ret be set to -EBUSY or such?
> +
> + pc = lookup_page_cgroup(page);
What if pc is NULL? Or is it guaranteed that it will
never happen?
> + lock_page_cgroup(pc);
> + if (!PageCgroupUsed(pc) || pc->cgroup != cgroup)
> + goto err_out;
ret is still zero here. Is that OK? Should it be -EINVAL
or such?
> +
> + csize = PAGE_SIZE << compound_order(page);
> + /* If use_hierarchy == 0, we need to charge root */
> + if (!parent) {
> + parent = root_h_cgroup;
> + /* root has no limit */
> + res_counter_charge_nofail(&parent->hugepage[idx],
> + csize, &fail_res);
> + }
> + counter = &h_cg->hugepage[idx];
> + res_counter_uncharge_until(counter, counter->parent, csize);
> +
> + pc->cgroup = cgroup->parent;
> +err_out:
> + unlock_page_cgroup(pc);
> + put_page(page);
> +out:
> + return ret;
> +}
> +
> +/*
> + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
> + * the parent cgroup.
> + */
> +static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> +{
> + struct hstate *h;
> + struct page *page;
> + int ret = 0, idx = 0;
> +
> + do {
> + if (cgroup_task_count(cgroup) ||
> + !list_empty(&cgroup->children)) {
> + ret = -EBUSY;
> + goto out;
> + }
> + /*
> + * If the task doing the cgroup_rmdir got a signal
> + * we don't really need to loop till the hugetlb resource
> + * usage become zero.
Why don't we need to loop? Is somebody else (and if so can you
say who) doing the deletion?
> + */
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + goto out;
> + }
> + for_each_hstate(h) {
> + spin_lock(&hugetlb_lock);
> + list_for_each_entry(page, &h->hugepage_activelist, lru) {
> + ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
> + if (ret) {
> + spin_unlock(&hugetlb_lock);
> + goto out;
> + }
> + }
> + spin_unlock(&hugetlb_lock);
> + idx++;
> + }
> + cond_resched();
> + } while (hugetlb_cgroup_have_usage(cgroup));
> +out:
> + return ret;
> +}
> +
> +static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> +{
> + struct hugetlb_cgroup *h_cgroup;
> +
> + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
> + kfree(h_cgroup);
> +}
> +
> +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup **ptr)
> +{
> + int ret = 0;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *h_cg = NULL;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled())
> + goto done;
> +again:
> + rcu_read_lock();
> + h_cg = hugetlb_cgroup_from_task(current);
> + if (!h_cg)
> + h_cg = root_h_cgroup;
> +
> + if (!css_tryget(&h_cg->css)) {
> + rcu_read_unlock();
> + goto again;
You don't want some form of limit on how many times you can
loop around?
> + }
> + rcu_read_unlock();
> +
> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> + css_put(&h_cg->css);
> +done:
> + *ptr = h_cg;
> + return ret;
> +}
> +
On Wed, May 30, 2012 at 08:08:56PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the control files for hugetlb controller
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/hugetlb.h | 5 ++
> include/linux/hugetlb_cgroup.h | 6 ++
> mm/hugetlb.c | 2 +
> mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 143 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index dcd55c7..92f75a5 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -4,6 +4,7 @@
> #include <linux/mm_types.h>
> #include <linux/fs.h>
> #include <linux/hugetlb_inline.h>
> +#include <linux/cgroup.h>
>
> struct ctl_table;
> struct user_struct;
> @@ -221,6 +222,10 @@ struct hstate {
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> unsigned int free_huge_pages_node[MAX_NUMNODES];
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> + /* cgroup control files */
> + struct cftype cgroup_files[5];
Why five? Should there be a #define for this magic value?
> +#endif
> char name[HSTATE_NAME_LEN];
> };
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..fbf8c5f 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -42,6 +42,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> struct page *page);
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> +extern int hugetlb_cgroup_file_init(int idx) __init;
> #else
> static inline bool hugetlb_cgroup_disabled(void)
> {
> @@ -75,5 +76,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> {
> return;
> }
> +
> +static inline int __init hugetlb_cgroup_file_init(int idx)
> +{
> + return 0;
> +}
> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 53840dd..6330de2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
> #include <linux/io.h>
> #include <linux/hugetlb.h>
> #include <linux/node.h>
> +#include <linux/hugetlb_cgroup.h>
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1912,6 +1913,7 @@ void __init hugetlb_add_hstate(unsigned order)
> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>
> parsed_hstate = h;
> }
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 3a288f7..49a3f20 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -19,6 +19,11 @@
> #include <linux/page_cgroup.h>
> #include <linux/hugetlb_cgroup.h>
>
> +/* lifted from mem control */
Might also include the comment from said file explaining the
purpose of these #defines.
> +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> +#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val) ((val) & 0xffff)
> +
> struct cgroup_subsys hugetlb_subsys __read_mostly;
> struct hugetlb_cgroup *root_h_cgroup __read_mostly;
>
> @@ -271,6 +276,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> return;
> }
>
> +static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
> + struct file *file, char __user *buf,
> + size_t nbytes, loff_t *ppos)
> +{
> + u64 val;
> + char str[64];
I would think there would be a define for this somewhere?
> + int idx, name, len;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + val = res_counter_read_u64(&h_cg->hugepage[idx], name);
> + len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> + return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> +}
> +
> +static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
> + const char *buffer)
> +{
> + int idx, name, ret;
> + unsigned long long val;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + switch (name) {
> + case RES_LIMIT:
> + if (hugetlb_cgroup_is_root(h_cg)) {
> + /* Can't set limit on root */
> + ret = -EINVAL;
> + break;
> + }
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + break;
> + ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
> +{
> + int idx, name, ret = 0;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(event);
> + name = MEMFILE_ATTR(event);
> +
> + switch (name) {
> + case RES_MAX_USAGE:
> + res_counter_reset_max(&h_cg->hugepage[idx]);
> + break;
> + case RES_FAILCNT:
> + res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static char *mem_fmt(char *buf, int size, unsigned long hsize)
> +{
> + if (hsize >= (1UL << 30))
> + snprintf(buf, size, "%luGB", hsize >> 30);
> + else if (hsize >= (1UL << 20))
> + snprintf(buf, size, "%luMB", hsize >> 20);
> + else
> + snprintf(buf, size, "%luKB", hsize >> 10);
> + return buf;
> +}
> +
> +int __init hugetlb_cgroup_file_init(int idx)
> +{
> + char buf[32];
#define pls.
> + struct cftype *cft;
> + struct hstate *h = &hstates[idx];
> +
> + /* format the size */
> + mem_fmt(buf, 32, huge_page_size(h));
> +
> + /* Add the limit file */
> + cft = &h->cgroup_files[0];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> + cft->read = hugetlb_cgroup_read;
> + cft->write_string = hugetlb_cgroup_write;
> +
> + /* Add the usage file */
> + cft = &h->cgroup_files[1];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the MAX usage file */
> + cft = &h->cgroup_files[2];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the failcntfile */
> + cft = &h->cgroup_files[3];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* NULL terminate the last cft */
> + cft = &h->cgroup_files[4];
> + memset(cft, 0, sizeof(*cft));
> +
> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> +
Wouldn't doing:
return cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files);
be more appropriate?
> + return 0;
> +}
> +
> struct cgroup_subsys hugetlb_subsys = {
> .name = "hugetlb",
> .create = hugetlb_cgroup_create,
> --
> 1.7.10
>
On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index cc9281b..ff233e4 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
> else
> v_offset = 0;
>
> - __unmap_hugepage_range(vma,
> - vma->vm_start + v_offset, vma->vm_end, NULL);
> + unmap_hugepage_range(vma, vma->vm_start + v_offset,
> + vma->vm_end, NULL);
> }
> }
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 217f528..c21e136 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -7,6 +7,7 @@
>
> struct ctl_table;
> struct user_struct;
> +struct mmu_gather;
>
> #ifdef CONFIG_HUGETLB_PAGE
>
> @@ -40,9 +41,10 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> struct page **, struct vm_area_struct **,
> unsigned long *, int *, int, unsigned int flags);
> void unmap_hugepage_range(struct vm_area_struct *,
> - unsigned long, unsigned long, struct page *);
> -void __unmap_hugepage_range(struct vm_area_struct *,
> - unsigned long, unsigned long, struct page *);
> + unsigned long, unsigned long, struct page *);
> +void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vms,
s/vms/vma/
> + unsigned long start, unsigned long end,
> + struct page *ref_page);
> int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
> void hugetlb_report_meminfo(struct seq_file *);
> int hugetlb_report_node_meminfo(int, char *);
> @@ -98,7 +100,6 @@ static inline unsigned long hugetlb_total_pages(void)
> #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
> #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
> #define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
> -#define unmap_hugepage_range(vma, start, end, page) BUG()
> static inline void hugetlb_report_meminfo(struct seq_file *m)
> {
> }
Why?
> @@ -112,13 +113,24 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
> #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
> #define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
> #define huge_pte_offset(mm, address) 0
> -#define dequeue_hwpoisoned_huge_page(page) 0
> +static inline int dequeue_hwpoisoned_huge_page(struct page *page)
> +{
> + return 0;
> +}
> +
Unrelated to this patchset.
> static inline void copy_huge_page(struct page *dst, struct page *src)
> {
> }
>
> #define hugetlb_change_protection(vma, address, end, newprot)
>
> +static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long start,
> + unsigned long end, struct page *ref_page)
> +{
> + BUG();
> +}
> +
I think this should be done under the unmap_hugepage_range() definition
you removed (and change it to be a static inline function as well).
> #endif /* !CONFIG_HUGETLB_PAGE */
>
> #define HUGETLB_ANON_FILE "anon_hugepage"
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9b97a5c..704a269 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -24,8 +24,9 @@
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> -#include <linux/io.h>
> +#include <asm/tlb.h>
>
> +#include <linux/io.h>
> #include <linux/hugetlb.h>
> #include <linux/node.h>
> #include "internal.h"
> @@ -2310,30 +2311,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
> return 0;
> }
>
> -void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> - unsigned long end, struct page *ref_page)
> +void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end,
> + struct page *ref_page)
> {
> + int force_flush = 0;
Can this be bool?
> struct mm_struct *mm = vma->vm_mm;
> unsigned long address;
> pte_t *ptep;
> pte_t pte;
> struct page *page;
> - struct page *tmp;
> struct hstate *h = hstate_vma(vma);
> unsigned long sz = huge_page_size(h);
>
> - /*
> - * A page gathering list, protected by per file i_mmap_mutex. The
> - * lock is used to avoid list corruption from multiple unmapping
> - * of the same page since we are using page->lru.
> - */
> - LIST_HEAD(page_list);
> -
> WARN_ON(!is_vm_hugetlb_page(vma));
> BUG_ON(start & ~huge_page_mask(h));
> BUG_ON(end & ~huge_page_mask(h));
>
> + tlb_start_vma(tlb, vma);
> mmu_notifier_invalidate_range_start(mm, start, end);
> +again:
> spin_lock(&mm->page_table_lock);
> for (address = start; address < end; address += sz) {
> ptep = huge_pte_offset(mm, address);
> @@ -2372,30 +2369,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> }
>
> pte = huge_ptep_get_and_clear(mm, address, ptep);
> + tlb_remove_tlb_entry(tlb, ptep, address);
> if (pte_dirty(pte))
> set_page_dirty(page);
> - list_add(&page->lru, &page_list);
>
> + page_remove_rmap(page);
> + force_flush = !__tlb_remove_page(tlb, page);
> + if (force_flush)
> + break;
> /* Bail out after unmapping reference page if supplied */
> if (ref_page)
> break;
> }
> - flush_tlb_range(vma, start, end);
> spin_unlock(&mm->page_table_lock);
> - mmu_notifier_invalidate_range_end(mm, start, end);
> - list_for_each_entry_safe(page, tmp, &page_list, lru) {
> - page_remove_rmap(page);
> - list_del(&page->lru);
> - put_page(page);
> + /*
> + * mmu_gather ran out of room to batch pages, we break out of
> + * the PTE lock to avoid doing the potential expensive TLB invalidate
> + * and page-free while holding it.
> + */
> + if (force_flush) {
> + force_flush = 0;
> + tlb_flush_mmu(tlb);
> + if (address < end && !ref_page)
> + goto again;
Shouldn't you copy "start" at the beginning of this function and then
update that copy here, using it as the loop initialization?
> }
> + mmu_notifier_invalidate_range_end(mm, start, end);
> + tlb_end_vma(tlb, vma);
> }
>
> void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, struct page *ref_page)
> {
> - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> - __unmap_hugepage_range(vma, start, end, ref_page);
> - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> + struct mm_struct *mm;
> + struct mmu_gather tlb;
> +
> + mm = vma->vm_mm;
> +
> + tlb_gather_mmu(&tlb, mm, 0);
> + __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
> + tlb_finish_mmu(&tlb, start, end);
> }
>
> /*
On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> fix linked list corruption in unmap_hugepage_range()") but we don't use
> page->lru in unmap_hugepage_range any more. Also the lock was taken
> higher up in the stack in some code path. That would result in deadlock.
>
> unmap_mapping_range (i_mmap_mutex)
> -> unmap_mapping_range_tree
> -> unmap_mapping_range_vma
> -> zap_page_range_single
> -> unmap_single_vma
> -> unmap_hugepage_range (i_mmap_mutex)
>
You should be able to show this with lockdep?
> For shared pagetable support for huge pages, since pagetable pages are ref
> counted we don't need any lock during huge_pmd_unshare. We do take
> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> (39dde65c9940c97f ("shared page table for hugetlb page")).
>
I think this should be folded into patch 4, the code you're removing here
is just added in that function unnecessarily.
On Wed, May 30, 2012 at 06:57:47PM -0700, David Rientjes wrote:
> On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
>
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> > fix linked list corruption in unmap_hugepage_range()") but we don't use
> > page->lru in unmap_hugepage_range any more. Also the lock was taken
> > higher up in the stack in some code path. That would result in deadlock.
> >
> > unmap_mapping_range (i_mmap_mutex)
> > -> unmap_mapping_range_tree
> > -> unmap_mapping_range_vma
> > -> zap_page_range_single
> > -> unmap_single_vma
> > -> unmap_hugepage_range (i_mmap_mutex)
> >
>
> You should be able to show this with lockdep?
I was not able to get a lockdep report.
>
> > For shared pagetable support for huge pages, since pagetable pages are ref
> > counted we don't need any lock during huge_pmd_unshare. We do take
> > i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> > (39dde65c9940c97f ("shared page table for hugetlb page")).
> >
>
> I think this should be folded into patch 4, the code you're removing here
> is just added in that function unnecessarily.
>
I am removing i_mmap_mutex in this patch. That is not added in patch 4.
-aneesh
On Wed, May 30, 2012 at 06:56:36PM -0700, David Rientjes wrote:
> On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
>
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index cc9281b..ff233e4 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
> > else
> > v_offset = 0;
> >
> > - __unmap_hugepage_range(vma,
> > - vma->vm_start + v_offset, vma->vm_end, NULL);
> > + unmap_hugepage_range(vma, vma->vm_start + v_offset,
> > + vma->vm_end, NULL);
> > }
> > }
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 217f528..c21e136 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -7,6 +7,7 @@
> >
> > struct ctl_table;
> > struct user_struct;
> > +struct mmu_gather;
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> >
> > @@ -40,9 +41,10 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> > struct page **, struct vm_area_struct **,
> > unsigned long *, int *, int, unsigned int flags);
> > void unmap_hugepage_range(struct vm_area_struct *,
> > - unsigned long, unsigned long, struct page *);
> > -void __unmap_hugepage_range(struct vm_area_struct *,
> > - unsigned long, unsigned long, struct page *);
> > + unsigned long, unsigned long, struct page *);
> > +void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vms,
>
> s/vms/vma/
done
>
> > + unsigned long start, unsigned long end,
> > + struct page *ref_page);
> > int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
> > void hugetlb_report_meminfo(struct seq_file *);
> > int hugetlb_report_node_meminfo(int, char *);
> > @@ -98,7 +100,6 @@ static inline unsigned long hugetlb_total_pages(void)
> > #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
> > #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
> > #define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
> > -#define unmap_hugepage_range(vma, start, end, page) BUG()
> > static inline void hugetlb_report_meminfo(struct seq_file *m)
> > {
> > }
>
> Why?
unmap_hugepage_range() is no longer used when CONFIG_HUGETLB_PAGE is not set.
>
> > @@ -112,13 +113,24 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
> > #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
> > #define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
> > #define huge_pte_offset(mm, address) 0
> > -#define dequeue_hwpoisoned_huge_page(page) 0
> > +static inline int dequeue_hwpoisoned_huge_page(struct page *page)
> > +{
> > + return 0;
> > +}
> > +
>
> Unrelated from this patchset.
It throws a warning. Yes, it could be a separate patch, but I was not
sure whether to move that one-line change out of this series.
>
> > static inline void copy_huge_page(struct page *dst, struct page *src)
> > {
> > }
> >
> > #define hugetlb_change_protection(vma, address, end, newprot)
> >
> > +static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
> > + struct vm_area_struct *vma, unsigned long start,
> > + unsigned long end, struct page *ref_page)
> > +{
> > + BUG();
> > +}
> > +
>
> I think this should be done under the unmap_hugepage_range() definition
> you removed (and change it to be a static inline function as well).
Below is what unmap_hugepage_range() looks like after all the changes. It doesn't get
used if CONFIG_HUGETLB_PAGE is not enabled, but we do call __unmap_hugepage_range() from
common code. If we get called with hugetlb not enabled, that implies a BUG().
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
			  unsigned long end, struct page *ref_page)
{
	struct mm_struct *mm;
	struct mmu_gather tlb;

	mm = vma->vm_mm;
	tlb_gather_mmu(&tlb, mm, 0);
	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
	tlb_finish_mmu(&tlb, start, end);
}
>
> > #endif /* !CONFIG_HUGETLB_PAGE */
> >
> > #define HUGETLB_ANON_FILE "anon_hugepage"
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 9b97a5c..704a269 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -24,8 +24,9 @@
> >
> > #include <asm/page.h>
> > #include <asm/pgtable.h>
> > -#include <linux/io.h>
> > +#include <asm/tlb.h>
> >
> > +#include <linux/io.h>
> > #include <linux/hugetlb.h>
> > #include <linux/node.h>
> > #include "internal.h"
> > @@ -2310,30 +2311,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
> > return 0;
> > }
> >
> > -void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > - unsigned long end, struct page *ref_page)
> > +void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > + unsigned long start, unsigned long end,
> > + struct page *ref_page)
> > {
> > + int force_flush = 0;
>
> Can this be bool?
>
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long address;
> > pte_t *ptep;
> > pte_t pte;
> > struct page *page;
> > - struct page *tmp;
> > struct hstate *h = hstate_vma(vma);
> > unsigned long sz = huge_page_size(h);
> >
> > - /*
> > - * A page gathering list, protected by per file i_mmap_mutex. The
> > - * lock is used to avoid list corruption from multiple unmapping
> > - * of the same page since we are using page->lru.
> > - */
> > - LIST_HEAD(page_list);
> > -
> > WARN_ON(!is_vm_hugetlb_page(vma));
> > BUG_ON(start & ~huge_page_mask(h));
> > BUG_ON(end & ~huge_page_mask(h));
> >
> > + tlb_start_vma(tlb, vma);
> > mmu_notifier_invalidate_range_start(mm, start, end);
> > +again:
> > spin_lock(&mm->page_table_lock);
> > for (address = start; address < end; address += sz) {
> > ptep = huge_pte_offset(mm, address);
> > @@ -2372,30 +2369,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > }
> >
> > pte = huge_ptep_get_and_clear(mm, address, ptep);
> > + tlb_remove_tlb_entry(tlb, ptep, address);
> > if (pte_dirty(pte))
> > set_page_dirty(page);
> > - list_add(&page->lru, &page_list);
> >
> > + page_remove_rmap(page);
> > + force_flush = !__tlb_remove_page(tlb, page);
> > + if (force_flush)
> > + break;
> > /* Bail out after unmapping reference page if supplied */
> > if (ref_page)
> > break;
> > }
> > - flush_tlb_range(vma, start, end);
> > spin_unlock(&mm->page_table_lock);
> > - mmu_notifier_invalidate_range_end(mm, start, end);
> > - list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > - page_remove_rmap(page);
> > - list_del(&page->lru);
> > - put_page(page);
> > + /*
> > + * mmu_gather ran out of room to batch pages, we break out of
> > + * the PTE lock to avoid doing the potential expensive TLB invalidate
> > + * and page-free while holding it.
> > + */
> > + if (force_flush) {
> > + force_flush = 0;
> > + tlb_flush_mmu(tlb);
> > + if (address < end && !ref_page)
> > + goto again;
>
> Shouldn't be copying "start" at the beginning of this function and then
> updating that copy now and use it as the loop initialization?
>
I didn't want to make larger changes here. My goal was to switch to the mmu_gather API
without changing the rest of the loop logic.
> > }
> > + mmu_notifier_invalidate_range_end(mm, start, end);
> > + tlb_end_vma(tlb, vma);
> > }
> >
> > void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
> > unsigned long end, struct page *ref_page)
> > {
> > - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > - __unmap_hugepage_range(vma, start, end, ref_page);
> > - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > + struct mm_struct *mm;
> > + struct mmu_gather tlb;
> > +
> > + mm = vma->vm_mm;
> > +
> > + tlb_gather_mmu(&tlb, mm, 0);
> > + __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
> > + tlb_finish_mmu(&tlb, start, end);
> > }
> >
> > /*
>
-aneesh
On Wed, May 30, 2012 at 09:32:25PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, May 30, 2012 at 08:08:56PM +0530, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > Add the control files for hugetlb controller
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > include/linux/hugetlb.h | 5 ++
> > include/linux/hugetlb_cgroup.h | 6 ++
> > mm/hugetlb.c | 2 +
> > mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 143 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index dcd55c7..92f75a5 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -4,6 +4,7 @@
> > #include <linux/mm_types.h>
> > #include <linux/fs.h>
> > #include <linux/hugetlb_inline.h>
> > +#include <linux/cgroup.h>
> >
> > struct ctl_table;
> > struct user_struct;
> > @@ -221,6 +222,10 @@ struct hstate {
> > unsigned int nr_huge_pages_node[MAX_NUMNODES];
> > unsigned int free_huge_pages_node[MAX_NUMNODES];
> > unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> > + /* cgroup control files */
> > + struct cftype cgroup_files[5];
>
> Why five? Should there be a #define for this magic value?
>
Because we have four control files plus a NULL-terminating entry. I was not sure whether
that should be a #define, because we are not going to use it anywhere else. The same
patch indexes them in hugetlb_file_init, which is the only place the value is used.
> > +#endif
> > char name[HSTATE_NAME_LEN];
> > };
> >
> > diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> > index 5794be4..fbf8c5f 100644
> > --- a/include/linux/hugetlb_cgroup.h
> > +++ b/include/linux/hugetlb_cgroup.h
> > @@ -42,6 +42,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> > struct page *page);
> > extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> > struct hugetlb_cgroup *h_cg);
> > +extern int hugetlb_cgroup_file_init(int idx) __init;
> > #else
> > static inline bool hugetlb_cgroup_disabled(void)
> > {
> > @@ -75,5 +76,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> > {
> > return;
> > }
> > +
> > +static inline int __init hugetlb_cgroup_file_init(int idx)
> > +{
> > + return 0;
> > +}
> > #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> > #endif
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 53840dd..6330de2 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -29,6 +29,7 @@
> > #include <linux/io.h>
> > #include <linux/hugetlb.h>
> > #include <linux/node.h>
> > +#include <linux/hugetlb_cgroup.h>
> > #include "internal.h"
> >
> > const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> > @@ -1912,6 +1913,7 @@ void __init hugetlb_add_hstate(unsigned order)
> > h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> > huge_page_size(h)/1024);
> > + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
> >
> > parsed_hstate = h;
> > }
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 3a288f7..49a3f20 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -19,6 +19,11 @@
> > #include <linux/page_cgroup.h>
> > #include <linux/hugetlb_cgroup.h>
> >
> > +/* lifted from mem control */
>
> Might also include the comment from said file explaining the
> purpose of these #define.
>
> > +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> > +#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
> > +#define MEMFILE_ATTR(val) ((val) & 0xffff)
> > +
> > struct cgroup_subsys hugetlb_subsys __read_mostly;
> > struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> >
> > @@ -271,6 +276,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> > return;
> > }
> >
> > +static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
> > + struct file *file, char __user *buf,
> > + size_t nbytes, loff_t *ppos)
> > +{
> > + u64 val;
> > + char str[64];
>
> I would think there would be a define for this somewhere?
Lifted from mem_cgroup_read(). The number is big enough to hold the formatted return
value for usage_in_bytes, limit_in_bytes, etc.
>
> > + int idx, name, len;
> > + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> > +
> > + idx = MEMFILE_IDX(cft->private);
> > + name = MEMFILE_ATTR(cft->private);
> > +
> > + val = res_counter_read_u64(&h_cg->hugepage[idx], name);
> > + len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> > + return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> > +}
> > +
......
......
> > + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> > + cft->read = hugetlb_cgroup_read;
> > +
> > + /* Add the MAX usage file */
> > + cft = &h->cgroup_files[2];
> > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> > + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> > + cft->trigger = hugetlb_cgroup_reset;
> > + cft->read = hugetlb_cgroup_read;
> > +
> > + /* Add the failcntfile */
> > + cft = &h->cgroup_files[3];
> > + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> > + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> > + cft->trigger = hugetlb_cgroup_reset;
> > + cft->read = hugetlb_cgroup_read;
> > +
> > + /* NULL terminate the last cft */
> > + cft = &h->cgroup_files[4];
> > + memset(cft, 0, sizeof(*cft));
> > +
> > + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> > +
>
> Wouldn't doing:
> return cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files);
>
> be more appropiate?
cgroup wanted a WARN_ON around that, IIUC. I guess we can drop all of these later.
> > + return 0;
> > +}
> > +
> > struct cgroup_subsys hugetlb_subsys = {
> > .name = "hugetlb",
> > .create = hugetlb_cgroup_create,
> > --
> > 1.7.10
> >
-aneesh
On Wed, May 30, 2012 at 09:19:54PM -0400, Konrad Rzeszutek Wilk wrote:
> > +static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
> > +{
> > + int idx;
> > + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
> > +
> > + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
> > + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> > + return 1;
>
> return true;
> > + }
> > + return 0;
>
> And return false here
> > +}
> > +
> > +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
> > +{
> > + int idx;
> > + struct cgroup *parent_cgroup;
> > + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
> > +
> > + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> > + if (!h_cgroup)
> > + return ERR_PTR(-ENOMEM);
> > +
>
> No need to check cgroup for NULL?
Other cgroups (memcg) don't do that. Can we really get a NULL cgroup there?
>
> > + parent_cgroup = cgroup->parent;
> > + if (parent_cgroup) {
> > + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
> > + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> > + res_counter_init(&h_cgroup->hugepage[idx],
> > + &parent_h_cgroup->hugepage[idx]);
> > + } else {
> > + root_h_cgroup = h_cgroup;
> > + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> > + res_counter_init(&h_cgroup->hugepage[idx], NULL);
> > + }
> > + return &h_cgroup->css;
> > +}
> > +
> > +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> > + struct page *page)
> > +{
> > + int csize, ret = 0;
> > + struct page_cgroup *pc;
> > + struct res_counter *counter;
> > + struct res_counter *fail_res;
> > + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> > + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> > +
> > + if (!get_page_unless_zero(page))
> > + goto out;
>
> Hmm, so it goes to out, and does return ret. ret is zero. Is
> that correct? Should ret be set to -EBUSY or such?
>
Fixed
> > +
> > + pc = lookup_page_cgroup(page);
>
> What if pc is NULL? Or is it guaranteed that it will
> never happen so?
>
> > + lock_page_cgroup(pc);
> > + if (!PageCgroupUsed(pc) || pc->cgroup != cgroup)
> > + goto err_out;
>
> ret is still set to zero. Is that OK? Should it be -EINVAL
> or such?
>
Fixed
> > +
> > + csize = PAGE_SIZE << compound_order(page);
> > + /* If use_hierarchy == 0, we need to charge root */
> > + if (!parent) {
> > + parent = root_h_cgroup;
> > + /* root has no limit */
> > + res_counter_charge_nofail(&parent->hugepage[idx],
> > + csize, &fail_res);
> > + }
> > + counter = &h_cg->hugepage[idx];
> > + res_counter_uncharge_until(counter, counter->parent, csize);
> > +
> > + pc->cgroup = cgroup->parent;
> > +err_out:
> > + unlock_page_cgroup(pc);
> > + put_page(page);
> > +out:
> > + return ret;
> > +}
> > +
> > +/*
> > + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
> > + * the parent cgroup.
> > + */
> > +static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> > +{
> > + struct hstate *h;
> > + struct page *page;
> > + int ret = 0, idx = 0;
> > +
> > + do {
> > + if (cgroup_task_count(cgroup) ||
> > + !list_empty(&cgroup->children)) {
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > + /*
> > + * If the task doing the cgroup_rmdir got a signal
> > + * we don't really need to loop till the hugetlb resource
> > + * usage become zero.
>
> Why don't we need to loop? Is somebody else (and if so can you
> say who) doing the deletion?
>
No, we just bail out without completing the deletion and handle the signal.
> > + */
> > + if (signal_pending(current)) {
> > + ret = -EINTR;
> > + goto out;
> > + }
> > + for_each_hstate(h) {
> > + spin_lock(&hugetlb_lock);
> > + list_for_each_entry(page, &h->hugepage_activelist, lru) {
> > + ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
> > + if (ret) {
> > + spin_unlock(&hugetlb_lock);
> > + goto out;
> > + }
> > + }
> > + spin_unlock(&hugetlb_lock);
> > + idx++;
> > + }
> > + cond_resched();
> > + } while (hugetlb_cgroup_have_usage(cgroup));
> > +out:
> > + return ret;
> > +}
> > +
> > +static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> > +{
> > + struct hugetlb_cgroup *h_cgroup;
> > +
> > + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
> > + kfree(h_cgroup);
> > +}
> > +
> > +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> > + struct hugetlb_cgroup **ptr)
> > +{
> > + int ret = 0;
> > + struct res_counter *fail_res;
> > + struct hugetlb_cgroup *h_cg = NULL;
> > + unsigned long csize = nr_pages * PAGE_SIZE;
> > +
> > + if (hugetlb_cgroup_disabled())
> > + goto done;
> > +again:
> > + rcu_read_lock();
> > + h_cg = hugetlb_cgroup_from_task(current);
> > + if (!h_cg)
> > + h_cg = root_h_cgroup;
> > +
> > + if (!css_tryget(&h_cg->css)) {
> > + rcu_read_unlock();
> > + goto again;
>
> You don't want some form of limit on how many times you can
> loop around?
>
You mean fail the allocation after some number of retries? memcg doesn't do that either.
> > + }
> > + rcu_read_unlock();
> > +
> > + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> > + css_put(&h_cg->css);
> > +done:
> > + *ptr = h_cg;
> > + return ret;
> > +}
> > +
>
-aneesh
On Wed, May 30, 2012 at 06:02:59PM -0700, David Rientjes wrote:
> On Wed, 30 May 2012, Aneesh Kumar K.V wrote:
>
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> > VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> > VM_FAULT_* values from MAX_ERRNO.
> >
>
> Yeah, but is there a reason for using VM_FAULT_HWPOISON_LARGE_MASK since
> that's the only VM_FAULT_* value that is greater than MAX_ERRNO? The rest
> of your patch set doesn't require this, so I think this change should just
> be dropped. (And PTR_ERR() still returns long, this wasn't fixed from my
> original review.)
>
The change was made per Andrew's request, so that we don't have such hidden
dependencies on the values of VM_FAULT_*. Yes, it can be a separate patch from
the patchset. I have changed int to long as per your review.
-aneesh
On Wed, May 30, 2012 at 08:48:40PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, May 30, 2012 at 08:08:46PM +0530, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > Rename max_hstate to hugetlb_max_hstate. We will be using this from other
> > subsystems like hugetlb controller in later patches.
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> > Acked-by: Hillf Danton <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
>
> Your SOB needs to be the last thing.
I started with the patches in -next, because Andrew had a few fixups on top of the last
patch series, so I ended up with this format. I have fixed them locally now.
-aneesh
On Thu, 31 May 2012, Aneesh Kumar K.V wrote:
> > Yeah, but is there a reason for using VM_FAULT_HWPOISON_LARGE_MASK since
> > that's the only VM_FAULT_* value that is greater than MAX_ERRNO? The rest
> > of your patch set doesn't require this, so I think this change should just
> > be dropped. (And PTR_ERR() still returns long, this wasn't fixed from my
> > original review.)
> >
>
> The change was made per Andrew's request, so that we don't have such hidden
> dependencies on the values of VM_FAULT_*. Yes, it can be a separate patch from
> the patchset. I have changed int to long as per your review.
>
I think it obfuscates the code. Can't we just add something like
BUILD_BUG_ON() to ensure that PTR_ERR() never uses values outside
the bounds of MAX_ERRNO, so we catch these at compile time if
mm/hugetlb.c or anything else is ever extended to use such values?
On Thu 31-05-12 11:13:16, Aneesh Kumar K.V wrote:
> On Wed, May 30, 2012 at 09:19:54PM -0400, Konrad Rzeszutek Wilk wrote:
[...]
> > > +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
> > > +{
> > > + int idx;
> > > + struct cgroup *parent_cgroup;
> > > + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
> > > +
> > > + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> > > + if (!h_cgroup)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> >
> > No need to check cgroup for NULL?
>
> Other cgroups (memcg) doesn't do that. Can we really get NULL cgroup tere ?
No, we cannot. See cfa449461e67b60df986170eecb089831fa9e49a.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
On Wed 30-05-12 20:08:55, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patch implements a new controller that allows us to control HugeTLB
> allocations. The extension allows to limit the HugeTLB usage per control
> group and enforces the controller limit during page fault. Since HugeTLB
> doesn't support page reclaim, enforcing the limit at page fault time implies
> that, the application will get SIGBUS signal if it tries to access HugeTLB
> pages beyond its limit. This requires the application to know beforehand
> how much HugeTLB pages it would require for its use.
You forgot to mention that the tracking is based on page_cgroup, which
is essential IMO. This also means that shadow pages are allocated for
_every_ single page in the system, even though only preallocated huge
pages (their heads, to be precise) use them. Please mention that in the
Kconfig help text as well. Users should be aware of it.
The overhead is huge, but this might change in the future because there is
a tendency to merge page_cgroup with struct page.
I would also appreciate if you describe the motivation why is this a
separate controller here in the description.
You are also changing the behavior of cgroup_disable slightly. Many users of
distribution kernels are used to disabling the memory controller (which is
compiled in by default), primarily because of its memory footprint, so
they use the cgroup_disable=memory boot parameter. Things change with
this patch because that won't be enough: they have to learn about the
hugetlb controller, which has to be disabled as well (and distributions will
have to compile it in as well).
As I already mentioned earlier, I do not see any of these as a show
stopper. If people feel strongly that this should be separate, because
they need only hugetlb page tracking without memcg, then why not.
It is definitely much better than the range tracking proposed at the
beginning.
--
Michal Hocko
SUSE Labs
(2012/05/30 23:38), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> We will use it later to make page_cgroup track the hugetlb cgroup information.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
> ---
> include/linux/mmzone.h | 2 +-
> include/linux/page_cgroup.h | 8 ++++----
> init/Kconfig | 4 ++++
> mm/Makefile | 3 ++-
> mm/memcontrol.c | 42 +++++++++++++++++++++++++-----------------
> 5 files changed, 36 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..2483cc5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1052,7 +1052,7 @@ struct mem_section {
>
> /* See declaration of similar field in struct zone */
> unsigned long *pageblock_flags;
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_PAGE_CGROUP
> /*
> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
> * section. (see memcontrol.h/page_cgroup.h about this.)
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index a88cdba..7bbfe37 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -12,7 +12,7 @@ enum {
> #ifndef __GENERATING_BOUNDS_H
> #include<generated/bounds.h>
>
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +#ifdef CONFIG_PAGE_CGROUP
> #include<linux/bit_spinlock.h>
>
> /*
> @@ -24,7 +24,7 @@ enum {
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> + struct cgroup *cgroup;
> };
>
This patch seems very bad.
- What is the performance impact on memcg? Doesn't this add extra overhead
to the memcg lookup?
- Hugetlb requires a much smaller amount of tracking information than
memcg does. I guess you can record the information in page->private
if you want.
- This may interfere with the work on reducing the size of page_cgroup.
So, strong Nack to this. I guess you can use page->private or some entries in
struct page; you have many pages per accounting unit. Please make an effort
to avoid using page_cgroup.
Thanks,
-Kame
> void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> @@ -82,7 +82,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK,&pc->flags);
> }
>
> -#else /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#else /* CONFIG_PAGE_CGROUP */
> struct page_cgroup;
>
> static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> @@ -102,7 +102,7 @@ static inline void __init page_cgroup_init_flatmem(void)
> {
> }
>
> -#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
> +#endif /* CONFIG_PAGE_CGROUP */
>
> #include<linux/swap.h>
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 81816b8..1363203 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -687,10 +687,14 @@ config RESOURCE_COUNTERS
> This option enables controller independent resource accounting
> infrastructure that works with cgroups.
>
> +config PAGE_CGROUP
> + bool
> +
> config CGROUP_MEM_RES_CTLR
> bool "Memory Resource Controller for Control Groups"
> depends on RESOURCE_COUNTERS
> select MM_OWNER
> + select PAGE_CGROUP
> help
> Provides a memory resource controller that manages both anonymous
> memory and page cache. (See Documentation/cgroups/memory.txt)
> diff --git a/mm/Makefile b/mm/Makefile
> index a156285..a70f9a9 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -47,7 +47,8 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> -obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_PAGE_CGROUP) += page_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ac35bcc..6df019b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -864,6 +864,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
>
> struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
> {
> + if (!cont)
> + return NULL;
> return container_of(cgroup_subsys_state(cont,
> mem_cgroup_subsys_id), struct mem_cgroup,
> css);
> @@ -1097,7 +1099,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> return&zone->lruvec;
>
> pc = lookup_page_cgroup(page);
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
>
> /*
> * Surreptitiously switch any uncharged offlist page to root:
> @@ -1108,8 +1110,10 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> * under page_cgroup lock: between them, they make all uses
> * of pc->mem_cgroup safe.
> */
> - if (!PageLRU(page)&& !PageCgroupUsed(pc)&& memcg != root_mem_cgroup)
> - pc->mem_cgroup = memcg = root_mem_cgroup;
> + if (!PageLRU(page)&& !PageCgroupUsed(pc)&& memcg != root_mem_cgroup) {
> + memcg = root_mem_cgroup;
> + pc->cgroup = memcg->css.cgroup;
> + }
>
> mz = page_cgroup_zoneinfo(memcg, page);
> return&mz->lruvec;
> @@ -1889,12 +1893,14 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> void __mem_cgroup_begin_update_page_stat(struct page *page,
> bool *locked, unsigned long *flags)
> {
> + struct cgroup *cgroup;
> struct mem_cgroup *memcg;
> struct page_cgroup *pc;
>
> pc = lookup_page_cgroup(page);
> again:
> - memcg = pc->mem_cgroup;
> + cgroup = pc->cgroup;
> + memcg = mem_cgroup_from_cont(cgroup);
> if (unlikely(!memcg || !PageCgroupUsed(pc)))
> return;
> /*
> @@ -1907,7 +1913,7 @@ again:
> return;
>
> move_lock_mem_cgroup(memcg, flags);
> - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> + if (cgroup != pc->cgroup || !PageCgroupUsed(pc)) {
> move_unlock_mem_cgroup(memcg, flags);
> goto again;
> }
> @@ -1923,7 +1929,7 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> * lock is held because a routine modifies pc->mem_cgroup
> * should take move_lock_page_cgroup().
> */
> - move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> + move_unlock_mem_cgroup(mem_cgroup_from_cont(pc->cgroup), flags);
> }
>
> void mem_cgroup_update_page_stat(struct page *page,
> @@ -1936,7 +1942,7 @@ void mem_cgroup_update_page_stat(struct page *page,
> if (mem_cgroup_disabled())
> return;
>
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
> if (unlikely(!memcg || !PageCgroupUsed(pc)))
> return;
>
> @@ -2444,7 +2450,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
> if (memcg&& !css_tryget(&memcg->css))
> memcg = NULL;
> } else if (PageSwapCache(page)) {
> @@ -2491,14 +2497,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
> zone = page_zone(page);
> spin_lock_irq(&zone->lru_lock);
> if (PageLRU(page)) {
> - lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> + lruvec = mem_cgroup_zone_lruvec(zone,
> + mem_cgroup_from_cont(pc->cgroup));
> ClearPageLRU(page);
> del_page_from_lru_list(page, lruvec, page_lru(page));
> was_on_lru = true;
> }
> }
>
> - pc->mem_cgroup = memcg;
> + pc->cgroup = memcg->css.cgroup;
> /*
> * We access a page_cgroup asynchronously without lock_page_cgroup().
> * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
> @@ -2511,7 +2518,8 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>
> if (lrucare) {
> if (was_on_lru) {
> - lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> + lruvec = mem_cgroup_zone_lruvec(zone,
> + mem_cgroup_from_cont(pc->cgroup));
> VM_BUG_ON(PageLRU(page));
> SetPageLRU(page);
> add_page_to_lru_list(page, lruvec, page_lru(page));
> @@ -2601,7 +2609,7 @@ static int mem_cgroup_move_account(struct page *page,
> lock_page_cgroup(pc);
>
> ret = -EINVAL;
> - if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
> + if (!PageCgroupUsed(pc) || pc->cgroup != from->css.cgroup)
> goto unlock;
>
> move_lock_mem_cgroup(from,&flags);
> @@ -2616,7 +2624,7 @@ static int mem_cgroup_move_account(struct page *page,
> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
> /* caller should have done css_get */
> - pc->mem_cgroup = to;
> + pc->cgroup = to->css.cgroup;
> mem_cgroup_charge_statistics(to, anon, nr_pages);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> @@ -2937,7 +2945,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>
> lock_page_cgroup(pc);
>
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
>
> if (!PageCgroupUsed(pc))
> goto unlock_out;
> @@ -3183,7 +3191,7 @@ int mem_cgroup_prepare_migration(struct page *page,
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
> css_get(&memcg->css);
> /*
> * At migrating an anonymous page, its mapcount goes down
> @@ -3328,7 +3336,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
> /* fix accounting on old pages */
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> - memcg = pc->mem_cgroup;
> + memcg = mem_cgroup_from_cont(pc->cgroup);
> mem_cgroup_charge_statistics(memcg, false, -1);
> ClearPageCgroupUsed(pc);
> }
> @@ -5135,7 +5143,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> * mem_cgroup_move_account() checks the pc is valid or not under
> * the lock.
> */
> - if (PageCgroupUsed(pc)&& pc->mem_cgroup == mc.from) {
> + if (PageCgroupUsed(pc)&& pc->cgroup == mc.from->css.cgroup) {
> ret = MC_TARGET_PAGE;
> if (target)
> target->page = page;
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/05/30 23:38), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> We will use it later to make page_cgroup track the hugetlb cgroup information.
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>> ---
>> include/linux/mmzone.h | 2 +-
>> include/linux/page_cgroup.h | 8 ++++----
>> init/Kconfig | 4 ++++
>> mm/Makefile | 3 ++-
>> mm/memcontrol.c | 42 +++++++++++++++++++++++++-----------------
>> 5 files changed, 36 insertions(+), 23 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 2427706..2483cc5 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1052,7 +1052,7 @@ struct mem_section {
>>
>> /* See declaration of similar field in struct zone */
>> unsigned long *pageblock_flags;
>> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> +#ifdef CONFIG_PAGE_CGROUP
>> /*
>> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
>> * section. (see memcontrol.h/page_cgroup.h about this.)
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index a88cdba..7bbfe37 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -12,7 +12,7 @@ enum {
>> #ifndef __GENERATING_BOUNDS_H
>> #include<generated/bounds.h>
>>
>> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> +#ifdef CONFIG_PAGE_CGROUP
>> #include<linux/bit_spinlock.h>
>>
>> /*
>> @@ -24,7 +24,7 @@ enum {
>> */
>> struct page_cgroup {
>> unsigned long flags;
>> - struct mem_cgroup *mem_cgroup;
>> + struct cgroup *cgroup;
>> };
>>
>
> This patch seems very bad.
I had to change that to

struct page_cgroup {
	unsigned long flags;
	struct cgroup_subsys_state *css;
};
to get memcg to work. We end up changing css.cgroup on cgroupfs mount/umount.
>
> - What is the performance impact to memcg ? Doesn't this add extra overheads
> to memcg lookup ?
Considering that we are stashing the cgroup_subsys_state, recovering the
memcg should be a simple pointer addition. I haven't measured the exact
numbers. Do you have any suggestions on tests I can run?
> - Hugetlb requires a much smaller amount of tracking information than
> memcg does. I guess you can record the information in page->private
> if you want.
So if we end up tracking the page cgroup in struct page, all this extra
overhead will go away. And in most cases we would have both memcg and
hugetlb enabled by default.
> - This may prevent us from doing the 'reducing size of page_cgroup' work
>
By reducing, do you mean moving the struct page_cgroup info into struct
page itself? If so, this should not have any impact, right? Most of
hugetlb's requirements should be similar to memcg's.
> So, strong Nack to this. I guess you can use page->private or some entries
> in struct page; you have many pages per accounting unit. Please make an
> effort to avoid using page_cgroup.
>
HugeTLB already uses page->private of the compound page head to track
the subpool pointer, so we won't be able to use page->private.
-aneesh
(2012/06/05 11:53), Aneesh Kumar K.V wrote:
> Kamezawa Hiroyuki<[email protected]> writes:
>
>> (2012/05/30 23:38), Aneesh Kumar K.V wrote:
>>> From: "Aneesh Kumar K.V" <[email protected]>
>>>
>>> We will use it later to make page_cgroup track the hugetlb cgroup information.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>>> ---
>>> include/linux/mmzone.h | 2 +-
>>> include/linux/page_cgroup.h | 8 ++++----
>>> init/Kconfig | 4 ++++
>>> mm/Makefile | 3 ++-
>>> mm/memcontrol.c | 42 +++++++++++++++++++++++++-----------------
>>> 5 files changed, 36 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 2427706..2483cc5 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -1052,7 +1052,7 @@ struct mem_section {
>>>
>>> /* See declaration of similar field in struct zone */
>>> unsigned long *pageblock_flags;
>>> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>>> +#ifdef CONFIG_PAGE_CGROUP
>>> /*
>>> * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
>>> * section. (see memcontrol.h/page_cgroup.h about this.)
>>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>>> index a88cdba..7bbfe37 100644
>>> --- a/include/linux/page_cgroup.h
>>> +++ b/include/linux/page_cgroup.h
>>> @@ -12,7 +12,7 @@ enum {
>>> #ifndef __GENERATING_BOUNDS_H
>>> #include <generated/bounds.h>
>>>
>>> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>>> +#ifdef CONFIG_PAGE_CGROUP
>>> #include <linux/bit_spinlock.h>
>>>
>>> /*
>>> @@ -24,7 +24,7 @@ enum {
>>> */
>>> struct page_cgroup {
>>> unsigned long flags;
>>> - struct mem_cgroup *mem_cgroup;
>>> + struct cgroup *cgroup;
>>> };
>>>
>>
>> This patch seems very bad.
>
> I had to change that to
>
> struct page_cgroup {
> 	unsigned long flags;
> 	struct cgroup_subsys_state *css;
> };
>
> to get memcg to work. We end up changing css.cgroup on cgroupfs mount/umount.
>
Hmm, then the pointer to the memcg can be calculated from this *css.
OK to this.
>>
>> - What is the performance impact to memcg ? Doesn't this add extra overheads
>> to memcg lookup ?
>
> Considering that we are stashing the cgroup_subsys_state, recovering the
> memcg should be a simple pointer addition. I haven't measured the exact
> numbers. Do you have any suggestions on tests I can run?
>
Copy-on-write, parallel page faults, file creation/deletion, etc.
>> - Hugetlb requires a much smaller amount of tracking information than
>> memcg does. I guess you can record the information in page->private
>> if you want.
>
> So if we end up tracking the page cgroup in struct page, all this extra
> overhead will go away. And in most cases we would have both memcg and
> hugetlb enabled by default.
>
>> - This may prevent us from doing the 'reducing size of page_cgroup' work
>>
>
> By reducing, do you mean moving the struct page_cgroup info into struct
> page itself? If so, this should not have any impact, right?
I'm not sure, but doesn't this change affect the rules around
(un)lock_page_cgroup() and the pc->memcg overwriting algorithm?
Let me think... but maybe discussing this without a patch was wrong. Sorry.
> Most of hugetlb's requirements should be similar to memcg's.
>
Yes and no. hugetlb requires only 1/HUGEPAGE_SIZE of the tracking
information. So, as Michal pointed out, if the user _really_ wants to
avoid the overheads of memcg, the effect of cgroup_disable=memory should
be preserved. If you use page_cgroup, you cannot save that memory via the
boot option. This makes the point of 'creating a hugetlb-only subsys to
avoid memcg overheads' unclear. You don't need tracking information per
page, and it can be allocated dynamically. Or please use range-tracking
as Michal proposed.
>> So, strong Nack to this. I guess you can use page->private or some entries
>> in struct page; you have many pages per accounting unit. Please make an
>> effort to avoid using page_cgroup.
>>
>
> HugeTLB already uses page->private of the compound page head to track
> the subpool pointer, so we won't be able to use page->private.
>
You can use pages other than the head/tail. For example, I think you
have 512 pages per 2MB huge page.
Thanks,
-Kame
Kamezawa Hiroyuki <[email protected]> writes:
> You can use pages other than the head/tail. For example, I think you
> have 512 pages per 2MB huge page.
How about the below? This limits hugetlb cgroup usage to hugepages of
compound order 3 or higher (at least 8 normal pages). I guess that is an
acceptable limitation.
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
	if (!PageHuge(page))
		return NULL;
	if (compound_order(page) < 3)
		return NULL;
	return (struct hugetlb_cgroup *)page[2].lru.next;
}

static inline
int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
{
	if (!PageHuge(page))
		return -1;
	if (compound_order(page) < 3)
		return -1;
	page[2].lru.next = (void *)h_cg;
	return 0;
}
-aneesh