2012-06-13 10:28:05

by Aneesh Kumar K.V

Subject: [PATCH -V9 00/15] hugetlb: Add HugeTLB controller to control HugeTLB allocation

Hi,

This patchset implements a cgroup resource controller for HugeTLB
pages. The controller allows limiting HugeTLB usage per control
group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault
time implies that the application will get a SIGBUS signal if it
tries to access HugeTLB pages beyond its limit. This requires the
application to know beforehand how many HugeTLB pages it would
require for its use.

The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be looked at as an extension of the existing quota
interface, which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job schedulers require jobs to specify their resource
requirements in the job file. Once those requirements can be met,
job schedulers (like SLURM) will schedule the job. We need to make sure
that the jobs won't consume more resources than requested. If they do,
we should either error out or kill the application.
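
A minimal sketch of such a launcher is below. It only illustrates the
intended workflow and is not part of this series: it assumes the hugetlb
controller is already mounted at /sys/fs/cgroup (as in the documentation
patch at the end of the series), that the system uses 2MB huge pages, and
the group name "g1" and the 1GB limit are arbitrary.

/*
 * Illustrative launcher sketch (not part of this series): confine a job
 * to a hugetlb cgroup and cap it at 1GB worth of 2MB huge pages before
 * exec'ing it.  Paths, group name and page size are assumptions.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char pid[32];

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	/* create the control group; ignore "already exists" */
	if (mkdir("/sys/fs/cgroup/g1", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* cap the job at 1GB worth of 2MB huge pages */
	write_file("/sys/fs/cgroup/g1/hugetlb.2MB.limit_in_bytes",
		   "1073741824");

	/* move ourselves into the group, then exec the job */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	write_file("/sys/fs/cgroup/g1/tasks", pid);

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}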

Patches are on top of v3.5-rc2

Changes from V8:
* Address review feedback

Changes from V7:
* Remove dependency on page_cgroup.
* Use page[2].lru.next to store HugeTLB cgroup information.

Changes from V6:
* Implement the controller as a separate HugeTLB cgroup.
* Folded fixup patches in -mm to the original patches

Changes from V5:
* Address review feedback.

Changes from V4:
* Add support for charge/uncharge during page migration
* Drop the usage of page->lru in unmap_hugepage_range.

Changes from v3:
* Address review feedback.
* Fix a bug in parent charging during cgroup removal with use_hierarchy set

Changes from V2:
* Changed the implementation to limit the HugeTLB usage during page
fault time. This simplifies the extension and keeps it closer to the
memcg design. It also allows supporting cgroup removal with less
complexity. The only caveat is that the application should ensure its
HugeTLB usage doesn't cross the cgroup limit.

Changes from V1:
* Changed the implementation to a memcg extension. We still use
the same logic to track the cgroup and range.

Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes

-aneesh


2012-06-13 10:28:25

by Aneesh Kumar K.V

Subject: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

From: "Aneesh Kumar K.V" <[email protected]>

Use an mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
include/linux/hugetlb.h | 22 ++++++++++++++----
mm/hugetlb.c | 59 ++++++++++++++++++++++++++++-------------------
mm/memory.c | 7 ++++--
4 files changed, 59 insertions(+), 33 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cc9281b..ff233e4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
else
v_offset = 0;

- __unmap_hugepage_range(vma,
- vma->vm_start + v_offset, vma->vm_end, NULL);
+ unmap_hugepage_range(vma, vma->vm_start + v_offset,
+ vma->vm_end, NULL);
}
}

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 217f528..0f23c18 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -7,6 +7,7 @@

struct ctl_table;
struct user_struct;
+struct mmu_gather;

#ifdef CONFIG_HUGETLB_PAGE

@@ -40,9 +41,10 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
struct page **, struct vm_area_struct **,
unsigned long *, int *, int, unsigned int flags);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(int, char *);
@@ -98,7 +100,6 @@ static inline unsigned long hugetlb_total_pages(void)
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end, page) BUG()
static inline void hugetlb_report_meminfo(struct seq_file *m)
{
}
@@ -112,13 +113,24 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
#define huge_pte_offset(mm, address) 0
-#define dequeue_hwpoisoned_huge_page(page) 0
+static inline int dequeue_hwpoisoned_huge_page(struct page *page)
+{
+ return 0;
+}
+
static inline void copy_huge_page(struct page *dst, struct page *src)
{
}

#define hugetlb_change_protection(vma, address, end, newprot)

+static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct page *ref_page)
+{
+ BUG();
+}
+
#endif /* !CONFIG_HUGETLB_PAGE */

#define HUGETLB_ANON_FILE "anon_hugepage"
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b1e0ed1..e54b695 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,8 +24,9 @@

#include <asm/page.h>
#include <asm/pgtable.h>
-#include <linux/io.h>
+#include <asm/tlb.h>

+#include <linux/io.h>
#include <linux/hugetlb.h>
#include <linux/node.h>
#include "internal.h"
@@ -2310,30 +2311,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
return 0;
}

-void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page)
{
+ int force_flush = 0;
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
pte_t *ptep;
pte_t pte;
struct page *page;
- struct page *tmp;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);

- /*
- * A page gathering list, protected by per file i_mmap_mutex. The
- * lock is used to avoid list corruption from multiple unmapping
- * of the same page since we are using page->lru.
- */
- LIST_HEAD(page_list);
-
WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));

+ tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, start, end);
+again:
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2372,30 +2369,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
}

pte = huge_ptep_get_and_clear(mm, address, ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address);
if (pte_dirty(pte))
set_page_dirty(page);
- list_add(&page->lru, &page_list);

+ page_remove_rmap(page);
+ force_flush = !__tlb_remove_page(tlb, page);
+ if (force_flush)
+ break;
/* Bail out after unmapping reference page if supplied */
if (ref_page)
break;
}
- flush_tlb_range(vma, start, end);
spin_unlock(&mm->page_table_lock);
- mmu_notifier_invalidate_range_end(mm, start, end);
- list_for_each_entry_safe(page, tmp, &page_list, lru) {
- page_remove_rmap(page);
- list_del(&page->lru);
- put_page(page);
+ /*
+ * mmu_gather ran out of room to batch pages, we break out of
+ * the PTE lock to avoid doing the potential expensive TLB invalidate
+ * and page-free while holding it.
+ */
+ if (force_flush) {
+ force_flush = 0;
+ tlb_flush_mmu(tlb);
+ if (address < end && !ref_page)
+ goto again;
}
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_end_vma(tlb, vma);
}

void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page)
{
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- __unmap_hugepage_range(vma, start, end, ref_page);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ struct mm_struct *mm;
+ struct mmu_gather tlb;
+
+ mm = vma->vm_mm;
+
+ tlb_gather_mmu(&tlb, mm, 0);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ tlb_finish_mmu(&tlb, start, end);
}

/*
@@ -2440,9 +2452,8 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* from the time of fork. This would look like data corruption
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
- __unmap_hugepage_range(iter_vma,
- address, address + huge_page_size(h),
- page);
+ unmap_hugepage_range(iter_vma, address,
+ address + huge_page_size(h), page);
}
mutex_unlock(&mapping->i_mmap_mutex);

diff --git a/mm/memory.c b/mm/memory.c
index 1b7dc66..545e18a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file)
- unmap_hugepage_range(vma, start, end, NULL);
+ if (vma->vm_file) {
+ mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ __unmap_hugepage_range(tlb, vma, start, end, NULL);
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10

2012-06-13 10:28:36

by Aneesh Kumar K.V

Subject: [PATCH -V9 08/15] hugetlb: Make some static variables global

From: "Aneesh Kumar K.V" <[email protected]>

We will use them later in hugetlb_cgroup.c

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 7 ++-----
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ed550d8..4aca057 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -21,6 +21,11 @@ struct hugepage_subpool {
long max_hpages, used_hpages;
};

+extern spinlock_t hugetlb_lock;
+extern int hugetlb_max_hstate;
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
+
struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
void hugepage_put_subpool(struct hugepage_subpool *spool);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b5b6e15..e899a2d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;

-static int hugetlb_max_hstate;
+int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];

@@ -46,13 +46,10 @@ static struct hstate * __initdata parsed_hstate;
static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;

-#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
-
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
-static DEFINE_SPINLOCK(hugetlb_lock);
+DEFINE_SPINLOCK(hugetlb_lock);

static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
--
1.7.10

2012-06-13 10:28:39

by Aneesh Kumar K.V

Subject: [PATCH -V9 09/15] mm/hugetlb: Add new HugeTLB cgroup

From: "Aneesh Kumar K.V" <[email protected]>

This patch implements a new controller that allows us to control HugeTLB
allocations. The extension allows limiting HugeTLB usage per control
group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time implies
that the application will get a SIGBUS signal if it tries to access HugeTLB
pages beyond its limit. This requires the application to know beforehand
how many HugeTLB pages it would require for its use.

The charge/uncharge calls will be added to the HugeTLB code in a later patch.
Support for cgroup removal will be added in later patches.
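
To make the fault-time enforcement above concrete, here is a small, purely
illustrative test program (not part of this series). Run from a hugetlb
cgroup whose limit is smaller than NR_PAGES huge pages (and with at least
NR_PAGES pages in the global hugepage pool so the mmap itself succeeds),
the first touch past the cgroup limit is delivered as SIGBUS. The 2MB huge
page size is an assumption; adjust it for other systems.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed 2MB huge pages */
#define NR_PAGES	16UL

static void sigbus_handler(int sig)
{
	/* write() is async-signal-safe; printf() is not */
	static const char msg[] = "SIGBUS: hugetlb limit exceeded\n";

	(void)sig;
	if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0)
		_exit(2);
	_exit(1);
}

int main(void)
{
	unsigned long i;
	char *p;

	signal(SIGBUS, sigbus_handler);

	p = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* each first touch faults in (and charges) one huge page */
	for (i = 0; i < NR_PAGES; i++) {
		p[i * HPAGE_SIZE] = 1;
		printf("touched huge page %lu\n", i);
	}
	return 0;
}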

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/cgroup_subsys.h | 6 ++
include/linux/hugetlb_cgroup.h | 37 ++++++++++++
init/Kconfig | 15 +++++
mm/Makefile | 1 +
mm/hugetlb_cgroup.c | 122 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 181 insertions(+)
create mode 100644 include/linux/hugetlb_cgroup.h
create mode 100644 mm/hugetlb_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..895923a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,9 @@ SUBSYS(net_prio)
#endif

/* */
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+SUBSYS(hugetlb)
+#endif
+
+/* */
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
new file mode 100644
index 0000000..e9944b4
--- /dev/null
+++ b/include/linux/hugetlb_cgroup.h
@@ -0,0 +1,37 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_HUGETLB_CGROUP_H
+#define _LINUX_HUGETLB_CGROUP_H
+
+#include <linux/res_counter.h>
+
+struct hugetlb_cgroup;
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ if (hugetlb_subsys.disabled)
+ return true;
+ return false;
+}
+
+#else
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ return true;
+}
+
+#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index d07dcf9..da05fae 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -751,6 +751,21 @@ config CGROUP_MEM_RES_CTLR_KMEM
the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone.

+config CGROUP_HUGETLB_RES_CTLR
+ bool "HugeTLB Resource Controller for Control Groups"
+ depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
+ default n
+ help
+ Provides a cgroup Resource Controller for HugeTLB pages.
+ When you enable this, you can put a per cgroup limit on HugeTLB usage.
+ The limit is enforced during page fault. Since HugeTLB doesn't
+ support page reclaim, enforcing the limit at page fault time implies
+ that, the application will get SIGBUS signal if it tries to access
+ HugeTLB pages beyond its limit. This requires the application to know
+ beforehand how much HugeTLB pages it would require for its use. The
+ control group is tracked in the third page lru pointer. This means
+ that we cannot use the controller with huge page less than 3 pages.
+
config CGROUP_PERF
bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
depends on PERF_EVENTS && CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..25e8002 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -49,6 +49,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
new file mode 100644
index 0000000..5a4e71c
--- /dev/null
+++ b/mm/hugetlb_cgroup.c
@@ -0,0 +1,122 @@
+/*
+ *
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
+
+struct hugetlb_cgroup {
+ struct cgroup_subsys_state css;
+ /*
+ * the counter to account for hugepages from hugetlb.
+ */
+ struct res_counter hugepage[HUGE_MAX_HSTATE];
+};
+
+struct cgroup_subsys hugetlb_subsys __read_mostly;
+struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
+{
+ if (s)
+ return container_of(s, struct hugetlb_cgroup, css);
+ return NULL;
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
+{
+ return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
+ hugetlb_subsys_id));
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
+{
+ return hugetlb_cgroup_from_css(task_subsys_state(task,
+ hugetlb_subsys_id));
+}
+
+static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
+{
+ return (h_cg == root_h_cgroup);
+}
+
+static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
+{
+ if (!cg->parent)
+ return NULL;
+ return hugetlb_cgroup_from_cgroup(cg->parent);
+}
+
+static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
+{
+ int idx;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
+
+ for (idx = 0; idx < hugetlb_max_hstate; idx++) {
+ if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+ return true;
+ }
+ return false;
+}
+
+static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
+{
+ int idx;
+ struct cgroup *parent_cgroup;
+ struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
+
+ h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+ if (!h_cgroup)
+ return ERR_PTR(-ENOMEM);
+
+ parent_cgroup = cgroup->parent;
+ if (parent_cgroup) {
+ parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx],
+ &parent_h_cgroup->hugepage[idx]);
+ } else {
+ root_h_cgroup = h_cgroup;
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx], NULL);
+ }
+ return &h_cgroup->css;
+}
+
+static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
+{
+ struct hugetlb_cgroup *h_cgroup;
+
+ h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
+ kfree(h_cgroup);
+}
+
+static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
+{
+ /* We will add the cgroup removal support in later patches */
+ return -EBUSY;
+}
+
+struct cgroup_subsys hugetlb_subsys = {
+ .name = "hugetlb",
+ .create = hugetlb_cgroup_create,
+ .pre_destroy = hugetlb_cgroup_pre_destroy,
+ .destroy = hugetlb_cgroup_destroy,
+ .subsys_id = hugetlb_subsys_id,
+};
--
1.7.10

2012-06-13 10:28:51

by Aneesh Kumar K.V

Subject: [PATCH -V9 07/15] hugetlb: add a list for tracking in-use HugeTLB pages

From: "Aneesh Kumar K.V" <[email protected]>

hugepage_activelist will be used to track currently used HugeTLB pages.
We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
On cgroup removal we update the page's HugeTLB cgroup to point to the
parent cgroup.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 12 +++++++-----
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0f23c18..ed550d8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -211,6 +211,7 @@ struct hstate {
unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_activelist;
struct list_head hugepage_freelists[MAX_NUMNODES];
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e54b695..b5b6e15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -510,7 +510,7 @@ void copy_huge_page(struct page *dst, struct page *src)
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &h->hugepage_freelists[nid]);
+ list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
}
@@ -522,7 +522,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
if (list_empty(&h->hugepage_freelists[nid]))
return NULL;
page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
- list_del(&page->lru);
+ list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
@@ -626,10 +626,11 @@ static void free_huge_page(struct page *page)
page->mapping = NULL;
BUG_ON(page_count(page));
BUG_ON(page_mapcount(page));
- INIT_LIST_HEAD(&page->lru);

spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+ /* remove the page from active list */
+ list_del(&page->lru);
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -642,6 +643,7 @@ static void free_huge_page(struct page *page)

static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
+ INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
@@ -890,6 +892,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)

spin_lock(&hugetlb_lock);
if (page) {
+ INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
/*
@@ -994,7 +997,6 @@ retry:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
if ((--needed) < 0)
break;
- list_del(&page->lru);
/*
* This page is now managed by the hugetlb allocator and has
* no users -- drop the buddy allocator's reference.
@@ -1009,7 +1011,6 @@ free:
/* Free unnecessary surplus pages to the buddy allocator */
if (!list_empty(&surplus_list)) {
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
- list_del(&page->lru);
put_page(page);
}
}
@@ -1909,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_activelist);
h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
--
1.7.10

2012-06-13 10:28:35

by Aneesh Kumar K.V

Subject: [PATCH -V9 10/15] hugetlb/cgroup: Add the cgroup pointer to page lru

From: "Aneesh Kumar K.V" <[email protected]>

Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
hugetlb cgroup usage to hugepages with 3 or more normal pages. I guess
that is an acceptable limitation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 37 +++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 4 ++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index e9944b4..be1a9f8 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -20,6 +20,32 @@
struct hugetlb_cgroup;

#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+/*
+ * Minimum page order trackable by hugetlb cgroup.
+ * At least 3 pages are necessary for all the tracking information.
+ */
+#define HUGETLB_CGROUP_MIN_ORDER 2
+
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ VM_BUG_ON(!PageHuge(page));
+
+ if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
+ return NULL;
+ return (struct hugetlb_cgroup *)page[2].lru.next;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ VM_BUG_ON(!PageHuge(page));
+
+ if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
+ return -1;
+ page[2].lru.next = (void *)h_cg;
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
if (hugetlb_subsys.disabled)
@@ -28,6 +54,17 @@ static inline bool hugetlb_cgroup_disabled(void)
}

#else
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
return true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e899a2d..6a449c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,7 @@

#include <linux/io.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
#include "internal.h"

@@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1 << PG_writeback);
}
+ VM_BUG_ON(hugetlb_cgroup_from_page(page));
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
arch_release_hugepage(page);
@@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
+ set_hugetlb_cgroup(page, NULL);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
@@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
+ set_hugetlb_cgroup(page, NULL);
/*
* We incremented the global counters already
*/
--
1.7.10

2012-06-13 10:28:33

by Aneesh Kumar K.V

Subject: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

From: "Aneesh Kumar K.V" <[email protected]>

This patch adds the charge and uncharge routines for the hugetlb cgroup.
We do cgroup charging in page alloc and uncharge in the compound page
destructor. Assigning a page's hugetlb cgroup is protected by hugetlb_lock.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 38 +++++++++++++++++++
mm/hugetlb.c | 16 +++++++-
mm/hugetlb_cgroup.c | 80 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index be1a9f8..e05871c 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -53,6 +53,16 @@ static inline bool hugetlb_cgroup_disabled(void)
return false;
}

+extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr);
+extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg);
+
#else
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
@@ -70,5 +80,33 @@ static inline bool hugetlb_cgroup_disabled(void)
return true;
}

+static inline int
+hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline void
+hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ return;
+}
+
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a449c5..59720b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -627,6 +627,8 @@ static void free_huge_page(struct page *page)
BUG_ON(page_mapcount(page));

spin_lock(&hugetlb_lock);
+ hugetlb_cgroup_uncharge_page(hstate_index(h),
+ pages_per_huge_page(h), page);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
/* remove the page from active list */
list_del(&page->lru);
@@ -1115,7 +1117,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *page;
long chg;
+ int ret, idx;
+ struct hugetlb_cgroup *h_cg;

+ idx = hstate_index(h);
/*
* Processes that did not create the mapping will have no
* reserves and will not have accounted against subpool
@@ -1131,6 +1136,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (hugepage_subpool_get_pages(spool, chg))
return ERR_PTR(-ENOSPC);

+ ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+ if (ret) {
+ hugepage_subpool_put_pages(spool, chg);
+ return ERR_PTR(-ENOSPC);
+ }
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
spin_unlock(&hugetlb_lock);
@@ -1138,6 +1148,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
+ hugetlb_cgroup_uncharge_cgroup(idx,
+ pages_per_huge_page(h),
+ h_cg);
hugepage_subpool_put_pages(spool, chg);
return ERR_PTR(-ENOSPC);
}
@@ -1146,7 +1159,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long)spool);

vma_commit_reservation(h, vma, addr);
-
+ /* update page cgroup details */
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
return page;
}

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 5a4e71c..0f2f6ac 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -113,6 +113,86 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
return -EBUSY;
}

+int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ int ret = 0;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *h_cg = NULL;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ goto done;
+ /*
+ * We don't charge any cgroup if the compound page have less
+ * than 3 pages.
+ */
+ if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
+ goto done;
+again:
+ rcu_read_lock();
+ h_cg = hugetlb_cgroup_from_task(current);
+ if (!h_cg)
+ h_cg = root_h_cgroup;
+
+ if (!css_tryget(&h_cg->css)) {
+ rcu_read_unlock();
+ goto again;
+ }
+ rcu_read_unlock();
+
+ ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+ css_put(&h_cg->css);
+done:
+ *ptr = h_cg;
+ return ret;
+}
+
+void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ if (hugetlb_cgroup_disabled() || !h_cg)
+ return;
+
+ spin_lock(&hugetlb_lock);
+ set_hugetlb_cgroup(page, h_cg);
+ spin_unlock(&hugetlb_lock);
+ return;
+}
+
+/*
+ * Should be called with hugetlb_lock held
+ */
+void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page)
+{
+ struct hugetlb_cgroup *h_cg;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+ VM_BUG_ON(!spin_is_locked(&hugetlb_lock));
+ h_cg = hugetlb_cgroup_from_page(page);
+ if (unlikely(!h_cg))
+ return;
+ set_hugetlb_cgroup(page, NULL);
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
+void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled() || !h_cg)
+ return;
+
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
--
1.7.10

2012-06-13 10:29:53

by Aneesh Kumar K.V

Subject: [PATCH -V9 06/15] hugetlb: simplify migrate_huge_page()

From: "Aneesh Kumar K.V" <[email protected]>

Since we migrate only one hugepage, don't use a linked list for passing the
page around. Directly pass the page that needs to be migrated as an argument.
This also removes the usage of page->lru in the migrate path.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/migrate.h | 4 +--
mm/memory-failure.c | 13 ++--------
mm/migrate.c | 65 +++++++++++++++--------------------------------
3 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 855c337..ce7e667 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
-extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+extern int migrate_huge_page(struct page *, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);

@@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
-static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..53a1495 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
int ret;
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);
- LIST_HEAD(pagelist);

ret = get_any_page(page, pfn, flags);
if (ret < 0)
@@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
}

/* Keep page count to indicate a given hugepage is isolated. */
-
- list_add(&hpage->lru, &pagelist);
- ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
- true);
+ ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
+ put_page(hpage);
if (ret) {
- struct page *page1, *page2;
- list_for_each_entry_safe(page1, page2, &pagelist, lru)
- put_page(page1);
-
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
- if (ret > 0)
- ret = -EIO;
return ret;
}
done:
diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..fdce3a2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -932,15 +932,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (anon_vma)
put_anon_vma(anon_vma);
unlock_page(hpage);
-
out:
- if (rc != -EAGAIN) {
- list_del(&hpage->lru);
- put_page(hpage);
- }
-
put_page(new_hpage);
-
if (result) {
if (rc)
*result = rc;
@@ -1016,48 +1009,32 @@ out:
return nr_failed + retry;
}

-int migrate_huge_pages(struct list_head *from,
- new_page_t get_new_page, unsigned long private, bool offlining,
- enum migrate_mode mode)
+int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
+ unsigned long private, bool offlining,
+ enum migrate_mode mode)
{
- int retry = 1;
- int nr_failed = 0;
- int pass = 0;
- struct page *page;
- struct page *page2;
- int rc;
-
- for (pass = 0; pass < 10 && retry; pass++) {
- retry = 0;
-
- list_for_each_entry_safe(page, page2, from, lru) {
+ int pass, rc;
+
+ for (pass = 0; pass < 10; pass++) {
+ rc = unmap_and_move_huge_page(get_new_page,
+ private, hpage, pass > 2, offlining,
+ mode);
+ switch (rc) {
+ case -ENOMEM:
+ goto out;
+ case -EAGAIN:
+ /* try again */
cond_resched();
-
- rc = unmap_and_move_huge_page(get_new_page,
- private, page, pass > 2, offlining,
- mode);
-
- switch(rc) {
- case -ENOMEM:
- goto out;
- case -EAGAIN:
- retry++;
- break;
- case 0:
- break;
- default:
- /* Permanent failure */
- nr_failed++;
- break;
- }
+ break;
+ case 0:
+ goto out;
+ default:
+ rc = -EIO;
+ goto out;
}
}
- rc = 0;
out:
- if (rc)
- return rc;
-
- return nr_failed + retry;
+ return rc;
}

#ifdef CONFIG_NUMA
--
1.7.10

2012-06-13 10:28:23

by Aneesh Kumar K.V

Subject: [PATCH -V9 02/15] hugetlb: don't use ERR_PTR with VM_FAULT* values

From: "Aneesh Kumar K.V" <[email protected]>

The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure that
VM_FAULT_* values will not exceed the MAX_ERRNO value. Decouple the
VM_FAULT_* values from MAX_ERRNO.

Acked-by: Hillf Danton <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c868309..34a7e23 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
*/
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
- return ERR_PTR(-VM_FAULT_OOM);
+ return ERR_PTR(-ENOMEM);
if (chg)
if (hugepage_subpool_get_pages(spool, chg))
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);

spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
hugepage_subpool_put_pages(spool, chg);
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);
}
}

@@ -2496,6 +2496,7 @@ retry_avoidcopy:
new_page = alloc_huge_page(vma, address, outside_reserve);

if (IS_ERR(new_page)) {
+ long err = PTR_ERR(new_page);
page_cache_release(old_page);

/*
@@ -2524,7 +2525,10 @@ retry_avoidcopy:

/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
- return -PTR_ERR(new_page);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ else
+ return VM_FAULT_SIGBUS;
}

/*
@@ -2642,7 +2646,11 @@ retry:
goto out;
page = alloc_huge_page(vma, address, 0);
if (IS_ERR(page)) {
- ret = -PTR_ERR(page);
+ ret = PTR_ERR(page);
+ if (ret == -ENOMEM)
+ ret = VM_FAULT_OOM;
+ else
+ ret = VM_FAULT_SIGBUS;
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
--
1.7.10

2012-06-13 10:30:16

by Aneesh Kumar K.V

Subject: [PATCH -V9 13/15] hugetlb/cgroup: add hugetlb cgroup control files

From: "Aneesh Kumar K.V" <[email protected]>

Add the control files for the hugetlb controller.
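
As a usage note (illustrative only, not part of this patch): once these
files exist, user space can poll the usage and failure counts to see
whether a job has hit its limit. The sketch below assumes 2MB huge pages
and a group named "g1" under a cgroup hierarchy mounted at /sys/fs/cgroup.

#include <stdio.h>

/* print the single-line contents of one hugetlb cgroup control file */
static void print_file(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	print_file("/sys/fs/cgroup/g1/hugetlb.2MB.usage_in_bytes");
	print_file("/sys/fs/cgroup/g1/hugetlb.2MB.max_usage_in_bytes");
	print_file("/sys/fs/cgroup/g1/hugetlb.2MB.failcnt");
	return 0;
}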

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 ++
include/linux/hugetlb_cgroup.h | 6 ++
mm/hugetlb.c | 8 +++
mm/hugetlb_cgroup.c | 129 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 148 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4aca057..9650bb1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>

struct ctl_table;
struct user_struct;
@@ -221,6 +222,10 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+ /* cgroup control files */
+ struct cftype cgroup_files[5];
+#endif
char name[HSTATE_NAME_LEN];
};

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index e05871c..bd8bc98 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -62,6 +62,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page);
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
+extern int hugetlb_cgroup_file_init(int idx) __init;

#else
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -108,5 +109,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}

+static inline int __init hugetlb_cgroup_file_init(int idx)
+{
+ return 0;
+}
+
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59720b1..a5a30bf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -30,6 +30,7 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
+#include <linux/hugetlb_cgroup.h>
#include "internal.h"

const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1930,6 +1931,13 @@ void __init hugetlb_add_hstate(unsigned order)
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
+ /*
+ * Add cgroup control files only if the huge page consists
+ * of more than two normal pages. This is because we use
+ * page[2].lru.next for storing cgoup details.
+ */
+ if (order >= HUGETLB_CGROUP_MIN_ORDER)
+ hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);

parsed_hstate = h;
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a3a68a4..64e93e0 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -26,6 +26,10 @@ struct hugetlb_cgroup {
struct res_counter hugepage[HUGE_MAX_HSTATE];
};

+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
struct cgroup_subsys hugetlb_subsys __read_mostly;
struct hugetlb_cgroup *root_h_cgroup __read_mostly;

@@ -259,6 +263,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}

+static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ u64 val;
+ char str[64];
+ int idx, name, len;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ val = res_counter_read_u64(&h_cg->hugepage[idx], name);
+ len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
+ return simple_read_from_buffer(buf, nbytes, ppos, str, len);
+}
+
+static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+ const char *buffer)
+{
+ int idx, name, ret;
+ unsigned long long val;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ switch (name) {
+ case RES_LIMIT:
+ if (hugetlb_cgroup_is_root(h_cg)) {
+ /* Can't set limit on root */
+ ret = -EINVAL;
+ break;
+ }
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
+{
+ int idx, name, ret = 0;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(event);
+ name = MEMFILE_ATTR(event);
+
+ switch (name) {
+ case RES_MAX_USAGE:
+ res_counter_reset_max(&h_cg->hugepage[idx]);
+ break;
+ case RES_FAILCNT:
+ res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static char *mem_fmt(char *buf, int size, unsigned long hsize)
+{
+ if (hsize >= (1UL << 30))
+ snprintf(buf, size, "%luGB", hsize >> 30);
+ else if (hsize >= (1UL << 20))
+ snprintf(buf, size, "%luMB", hsize >> 20);
+ else
+ snprintf(buf, size, "%luKB", hsize >> 10);
+ return buf;
+}
+
+int __init hugetlb_cgroup_file_init(int idx)
+{
+ char buf[32];
+ struct cftype *cft;
+ struct hstate *h = &hstates[idx];
+
+ /* format the size */
+ mem_fmt(buf, 32, huge_page_size(h));
+
+ /* Add the limit file */
+ cft = &h->cgroup_files[0];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->read = hugetlb_cgroup_read;
+ cft->write_string = hugetlb_cgroup_write;
+
+ /* Add the usage file */
+ cft = &h->cgroup_files[1];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->read = hugetlb_cgroup_read;
+
+ /* Add the MAX usage file */
+ cft = &h->cgroup_files[2];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+ /* Add the failcntfile */
+ cft = &h->cgroup_files[3];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+ /* NULL terminate the last cft */
+ cft = &h->cgroup_files[4];
+ memset(cft, 0, sizeof(*cft));
+
+ WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
+
+ return 0;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
--
1.7.10

2012-06-13 10:30:40

by Aneesh Kumar K.V

Subject: [PATCH -V9 15/15] hugetlb/cgroup: add HugeTLB controller documentation

From: "Aneesh Kumar K.V" <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
Documentation/cgroups/hugetlb.txt | 45 +++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
create mode 100644 Documentation/cgroups/hugetlb.txt

diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
new file mode 100644
index 0000000..a9faaca
--- /dev/null
+++ b/Documentation/cgroups/hugetlb.txt
@@ -0,0 +1,45 @@
+HugeTLB Controller
+-------------------
+
+The HugeTLB controller allows to limit the HugeTLB usage per control group and
+enforces the controller limit during page fault. Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that,
+the application will get SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how much
+HugeTLB pages it would require for its use.
+
+HugeTLB controller can be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup.
+
+# cd /sys/fs/cgroup
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files
+
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
+
+For a system supporting two hugepage size (16M and 16G) the control
+files include:
+
+hugetlb.16GB.limit_in_bytes
+hugetlb.16GB.max_usage_in_bytes
+hugetlb.16GB.usage_in_bytes
+hugetlb.16GB.failcnt
+hugetlb.16MB.limit_in_bytes
+hugetlb.16MB.max_usage_in_bytes
+hugetlb.16MB.usage_in_bytes
+hugetlb.16MB.failcnt
--
1.7.10

2012-06-13 10:28:21

by Aneesh Kumar K.V

Subject: [PATCH -V9 03/15] hugetlb: add an inline helper for finding hstate index

From: "Aneesh Kumar K.V" <[email protected]>

Add an inline helper and use it in the code.

Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 20 +++++++++++---------
2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5d6bbe..217f528 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -302,6 +302,11 @@ static inline unsigned hstate_index_to_shift(unsigned index)
return hstates[index].order + PAGE_SHIFT;
}

+static inline int hstate_index(struct hstate *h)
+{
+ return h - hstates;
+}
+
#else
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -320,6 +325,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
return 1;
}
#define hstate_index_to_shift(index) 0
+#define hstate_index(h) 0
#endif

#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 34a7e23..b1e0ed1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1646,7 +1646,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct attribute_group *hstate_attr_group)
{
int retval;
- int hi = h - hstates;
+ int hi = hstate_index(h);

hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
if (!hstate_kobjs[hi])
@@ -1741,11 +1741,13 @@ void hugetlb_unregister_node(struct node *node)
if (!nhs->hugepages_kobj)
return; /* no hstate attributes */

- for_each_hstate(h)
- if (nhs->hstate_kobjs[h - hstates]) {
- kobject_put(nhs->hstate_kobjs[h - hstates]);
- nhs->hstate_kobjs[h - hstates] = NULL;
+ for_each_hstate(h) {
+ int idx = hstate_index(h);
+ if (nhs->hstate_kobjs[idx]) {
+ kobject_put(nhs->hstate_kobjs[idx]);
+ nhs->hstate_kobjs[idx] = NULL;
}
+ }

kobject_put(nhs->hugepages_kobj);
nhs->hugepages_kobj = NULL;
@@ -1848,7 +1850,7 @@ static void __exit hugetlb_exit(void)
hugetlb_unregister_all_nodes();

for_each_hstate(h) {
- kobject_put(hstate_kobjs[h - hstates]);
+ kobject_put(hstate_kobjs[hstate_index(h)]);
}

kobject_put(hugepages_kobj);
@@ -1869,7 +1871,7 @@ static int __init hugetlb_init(void)
if (!size_to_hstate(default_hstate_size))
hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
}
- default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
if (default_hstate_max_huge_pages)
default_hstate.max_huge_pages = default_hstate_max_huge_pages;

@@ -2687,7 +2689,7 @@ retry:
*/
if (unlikely(PageHWPoison(page))) {
ret = VM_FAULT_HWPOISON |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto backout_unlocked;
}
}
@@ -2760,7 +2762,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
}

ptep = huge_pte_alloc(mm, address, huge_page_size(h));
--
1.7.10

2012-06-13 10:30:57

by Aneesh Kumar K.V

Subject: [PATCH -V9 12/15] hugetlb/cgroup: Add support for cgroup removal

From: "Aneesh Kumar K.V" <[email protected]>

This patch adds support for cgroup removal. If we don't have a parent
cgroup, the charges are moved to the root cgroup.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb_cgroup.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 0f2f6ac..a3a68a4 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -107,10 +107,76 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
kfree(h_cgroup);
}

+
+/*
+ * Should be called with hugetlb_lock held.
+ * Since we are holding hugetlb_lock, pages cannot get moved from
+ * active list or uncharged from the cgroup, So no need to get
+ * page reference and test for page active here. This function
+ * cannot fail.
+ */
+static void hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
+ struct page *page)
+{
+ int csize;
+ struct res_counter *counter;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *page_hcg;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+ struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
+
+ page_hcg = hugetlb_cgroup_from_page(page);
+ /*
+ * We can have pages in active list without any cgroup
+ * ie, hugepage with less than 3 pages. We can safely
+ * ignore those pages.
+ */
+ if (!page_hcg || page_hcg != h_cg)
+ goto out;
+
+ csize = PAGE_SIZE << compound_order(page);
+ if (!parent) {
+ parent = root_h_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);
+ }
+ counter = &h_cg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
+ set_hugetlb_cgroup(page, parent);
+out:
+ return;
+}
+
+/*
+ * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
+ * the parent cgroup.
+ */
static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
- /* We will add the cgroup removal support in later patches */
- return -EBUSY;
+ struct hstate *h;
+ struct page *page;
+ int ret = 0, idx = 0;
+
+ do {
+ if (cgroup_task_count(cgroup) ||
+ !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ for_each_hstate(h) {
+ spin_lock(&hugetlb_lock);
+ list_for_each_entry(page, &h->hugepage_activelist, lru)
+ hugetlb_cgroup_move_parent(idx, cgroup, page);
+
+ spin_unlock(&hugetlb_lock);
+ idx++;
+ }
+ cond_resched();
+ } while (hugetlb_cgroup_have_usage(cgroup));
+out:
+ return ret;
}

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
--
1.7.10

2012-06-13 10:31:15

by Aneesh Kumar K.V

Subject: [PATCH -V9 14/15] hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

From: "Aneesh Kumar K.V" <[email protected]>

With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
destructor. Since we are holding a hugepage reference, we can be sure that
the old page won't get uncharged till the last put_page().

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 8 ++++++++
mm/hugetlb_cgroup.c | 20 ++++++++++++++++++++
mm/migrate.c | 5 +++++
3 files changed, 33 insertions(+)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index bd8bc98..e9e6d74 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -63,6 +63,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
extern int hugetlb_cgroup_file_init(int idx) __init;
+extern void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage);

#else
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -114,5 +116,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
return 0;
}

+static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage)
+{
+ return;
+}
+
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 64e93e0..8e7ca0a 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -388,6 +388,26 @@ int __init hugetlb_cgroup_file_init(int idx)
return 0;
}

+void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
+{
+ struct hugetlb_cgroup *h_cg;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ VM_BUG_ON(!PageHuge(oldhpage));
+ spin_lock(&hugetlb_lock);
+ h_cg = hugetlb_cgroup_from_page(oldhpage);
+ set_hugetlb_cgroup(oldhpage, NULL);
+ cgroup_exclude_rmdir(&h_cg->css);
+
+ /* move the h_cg details to new cgroup */
+ set_hugetlb_cgroup(newhpage, h_cg);
+ spin_unlock(&hugetlb_lock);
+ cgroup_release_and_wakeup_rmdir(&h_cg->css);
+ return;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
diff --git a/mm/migrate.c b/mm/migrate.c
index fdce3a2..6c37c51 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -33,6 +33,7 @@
#include <linux/memcontrol.h>
#include <linux/syscalls.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>

#include <asm/tlbflush.h>
@@ -931,6 +932,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,

if (anon_vma)
put_anon_vma(anon_vma);
+
+ if (!rc)
+ hugetlb_cgroup_migrate(hpage, new_hpage);
+
unlock_page(hpage);
out:
put_page(new_hpage);
--
1.7.10

2012-06-13 10:28:19

by Aneesh Kumar K.V

Subject: [PATCH -V9 05/15] hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb

From: "Aneesh Kumar K.V" <[email protected]>

The i_mmap_mutex lock was added in unmap_single_vma() by commit 502717f4e
("hugetlb: fix linked list corruption in unmap_hugepage_range()"), but we
don't use page->lru in unmap_hugepage_range() any more. Also, the lock is
taken higher up in the stack in some code paths, which would result in a
deadlock:

unmap_mapping_range (i_mmap_mutex)
-> unmap_mapping_range_tree
-> unmap_mapping_range_vma
-> zap_page_range_single
-> unmap_single_vma
-> unmap_hugepage_range (i_mmap_mutex)

For shared page table support for huge pages, since page table pages are
refcounted we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in the mapping
(39dde65c9940c97f ("shared page table for hugetlb page")).

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/memory.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 545e18a..f6bc04f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file) {
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ if (vma->vm_file)
__unmap_hugepage_range(tlb, vma, start, end, NULL);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10

2012-06-13 10:28:16

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V9 01/15] hugetlb: rename max_hstate to hugetlb_max_hstate

From: "Aneesh Kumar K.V" <[email protected]>

Rename max_hstate to hugetlb_max_hstate. We will be using this from other
subsystems, like the hugetlb controller, in later patches.

Acked-by: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e198831..c868309 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;

-static int max_hstate;
+static int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];

@@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;

#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)

/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
return;
}
- BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
- h = &hstates[max_hstate++];
+ h = &hstates[hugetlb_max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
h->nr_huge_pages = 0;
@@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
static unsigned long *last_mhp;

/*
- * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+ * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
* so this hugepages= parameter goes to the "default hstate".
*/
- if (!max_hstate)
+ if (!hugetlb_max_hstate)
mhp = &default_hstate_max_huge_pages;
else
mhp = &parsed_hstate->max_huge_pages;
@@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
* But we need to allocate >= MAX_ORDER hstates here early to still
* use the bootmem allocator.
*/
- if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+ if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
hugetlb_hstate_alloc_pages(parsed_hstate);

last_mhp = mhp;
--
1.7.10

2012-06-13 11:32:57

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 10/15] hugetlb/cgroup: Add the cgroup pointer to page lru


We need this patch for the case where the hugetlb cgroup is disabled. I will send an
updated patch in reply.

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index e9e6d74..bc30413 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -18,14 +18,14 @@
#include <linux/res_counter.h>

struct hugetlb_cgroup;
-
-#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
/*
* Minimum page order trackable by hugetlb cgroup.
* At least 3 pages are necessary for all the tracking information.
*/
#define HUGETLB_CGROUP_MIN_ORDER 2

+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
VM_BUG_ON(!PageHuge(page));

2012-06-13 11:34:43

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V9 [updated] 10/15] hugetlb/cgroup: Add the cgroup pointer to page lru

From: "Aneesh Kumar K.V" <[email protected]>

Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
hugetlb cgroup usage to hugepages consisting of 3 or more
normal pages. I guess that is an acceptable limitation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 37 +++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 4 ++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index e9944b4..2e4cb6b 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -18,8 +18,34 @@
#include <linux/res_counter.h>

struct hugetlb_cgroup;
+/*
+ * Minimum page order trackable by hugetlb cgroup.
+ * At least 3 pages are necessary for all the tracking information.
+ */
+#define HUGETLB_CGROUP_MIN_ORDER 2

#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ VM_BUG_ON(!PageHuge(page));
+
+ if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
+ return NULL;
+ return (struct hugetlb_cgroup *)page[2].lru.next;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ VM_BUG_ON(!PageHuge(page));
+
+ if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
+ return -1;
+ page[2].lru.next = (void *)h_cg;
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
if (hugetlb_subsys.disabled)
@@ -28,6 +54,17 @@ static inline bool hugetlb_cgroup_disabled(void)
}

#else
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
return true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e899a2d..6a449c5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,7 @@

#include <linux/io.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
#include "internal.h"

@@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1 << PG_writeback);
}
+ VM_BUG_ON(hugetlb_cgroup_from_page(page));
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
arch_release_hugepage(page);
@@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
+ set_hugetlb_cgroup(page, NULL);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
@@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
+ set_hugetlb_cgroup(page, NULL);
/*
* We incremented the global counters already
*/
--
1.7.10

2012-06-13 14:59:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Use a mmu_gather instead of a temporary linked list for accumulating
> pages when we unmap a hugepage range

Sorry for coming up with the comment that late but you owe us an
explanation _why_ you are doing this.

I assume that this fixes a real problem when we take i_mmap_mutex
already up in
unmap_mapping_range
mutex_lock(&mapping->i_mmap_mutex);
unmap_mapping_range_tree | unmap_mapping_range_list
unmap_mapping_range_vma
zap_page_range_single
unmap_single_vma
unmap_hugepage_range
mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);

And that this should have been marked for stable as well (I haven't
checked when this has been introduced).

But then I do not see how this helps when you still do this:
[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 1b7dc66..545e18a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> * Since no pte has actually been setup, it is
> * safe to do nothing in this case.
> */
> - if (vma->vm_file)
> - unmap_hugepage_range(vma, start, end, NULL);
> + if (vma->vm_file) {
> + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> + __unmap_hugepage_range(tlb, vma, start, end, NULL);
> + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> + }
> } else
> unmap_page_range(tlb, vma, start, end, details);
> }

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-13 15:03:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

On Wed 13-06-12 16:59:23, Michal Hocko wrote:
> On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > Use a mmu_gather instead of a temporary linked list for accumulating
> > pages when we unmap a hugepage range
>
> Sorry for coming up with the comment that late but you owe us an
> explanation _why_ you are doing this.
>
> I assume that this fixes a real problem when we take i_mmap_mutex
> already up in
> unmap_mapping_range
> mutex_lock(&mapping->i_mmap_mutex);
> unmap_mapping_range_tree | unmap_mapping_range_list
> unmap_mapping_range_vma
> zap_page_range_single
> unmap_single_vma
> unmap_hugepage_range
> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>
> And that this should have been marked for stable as well (I haven't
> checked when this has been introduced).
>
> But then I do not see how this helps when you still do this:
> [...]
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 1b7dc66..545e18a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > * Since no pte has actually been setup, it is
> > * safe to do nothing in this case.
> > */
> > - if (vma->vm_file)
> > - unmap_hugepage_range(vma, start, end, NULL);
> > + if (vma->vm_file) {
> > + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > + __unmap_hugepage_range(tlb, vma, start, end, NULL);
> > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> > + }
> > } else
> > unmap_page_range(tlb, vma, start, end, details);
> > }

Ahhh, you are removing the lock in the next patch. Really confusing and
not nice for the stable backport.
Could you merge those two patches and add Cc: stable?
Then you can add my
Reviewed-by: Michal Hocko <[email protected]>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-13 16:37:18

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

Michal Hocko <[email protected]> writes:

> On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> Use a mmu_gather instead of a temporary linked list for accumulating
>> pages when we unmap a hugepage range
>
> Sorry for coming up with the comment that late but you owe us an
> explanation _why_ you are doing this.
>
> I assume that this fixes a real problem when we take i_mmap_mutex
> already up in
> unmap_mapping_range
> mutex_lock(&mapping->i_mmap_mutex);
> unmap_mapping_range_tree | unmap_mapping_range_list
> unmap_mapping_range_vma
> zap_page_range_single
> unmap_single_vma
> unmap_hugepage_range
> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>
> And that this should have been marked for stable as well (I haven't
> checked when this has been introduced).

The switch to mmu_gather is to get rid of the use of page->lru so that I can use it for
the active list.
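
For anyone skimming the thread, here is a rough sketch of the pattern being
described (illustrative only -- approximate signatures for this kernel era,
not the posted diff): the caller sets up an mmu_gather and the unmap path
batches pages into it via tlb_remove_page() instead of collecting them on a
temporary list_head, which leaves page->lru free for the hstate active list.

/* sketch only: assumed shape of a wrapper in mm/hugetlb.c, not the actual patch */
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
			  unsigned long end, struct page *ref_page)
{
	struct mm_struct *mm = vma->vm_mm;
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, 0);	/* not a full-mm teardown */
	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
	tlb_finish_mmu(&tlb, start, end);
}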


-aneesh

2012-06-13 16:43:38

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

Michal Hocko <[email protected]> writes:

> On Wed 13-06-12 16:59:23, Michal Hocko wrote:
>> On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
>> > From: "Aneesh Kumar K.V" <[email protected]>
>> >
>> > Use a mmu_gather instead of a temporary linked list for accumulating
>> > pages when we unmap a hugepage range
>>
>> Sorry for coming up with the comment that late but you owe us an
>> explanation _why_ you are doing this.
>>
>> I assume that this fixes a real problem when we take i_mmap_mutex
>> already up in
>> unmap_mapping_range
>> mutex_lock(&mapping->i_mmap_mutex);
>> unmap_mapping_range_tree | unmap_mapping_range_list
>> unmap_mapping_range_vma
>> zap_page_range_single
>> unmap_single_vma
>> unmap_hugepage_range
>> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>>
>> And that this should have been marked for stable as well (I haven't
>> checked when this has been introduced).
>>
>> But then I do not see how this helps when you still do this:
>> [...]
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 1b7dc66..545e18a 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>> > * Since no pte has actually been setup, it is
>> > * safe to do nothing in this case.
>> > */
>> > - if (vma->vm_file)
>> > - unmap_hugepage_range(vma, start, end, NULL);
>> > + if (vma->vm_file) {
>> > + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
>> > + __unmap_hugepage_range(tlb, vma, start, end, NULL);
>> > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
>> > + }
>> > } else
>> > unmap_page_range(tlb, vma, start, end, details);
>> > }
>
> Ahhh, you are removing the lock in the next patch. Really confusing and
> not nice for the stable backport.
> Could you merge those two patches and add Cc: stable?
> Then you can add my
> Reviewed-by: Michal Hocko <[email protected]>
>

In the last review cycle I was asked to see whether we can get a lockdep
report for the above, and what I found is that we don't actually hit the
above deadlock with the current codebase, because for hugetlb we don't
call unmap_mapping_range directly. Still, it is good to remove the
i_mmap_mutex, because we don't need that protection any more. I didn't
mark it for stable for that reason.

-aneesh

2012-06-14 03:11:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 05/15] hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> fix linked list corruption in unmap_hugepage_range()") but we don't use
> page->lru in unmap_hugepage_range any more. Also the lock was taken
> higher up in the stack in some code path. That would result in deadlock.
>
> unmap_mapping_range (i_mmap_mutex)
> -> unmap_mapping_range_tree
> -> unmap_mapping_range_vma
> -> zap_page_range_single
> -> unmap_single_vma
> -> unmap_hugepage_range (i_mmap_mutex)
>
> For shared pagetable support for huge pages, since pagetable pages are ref
> counted we don't need any lock during huge_pmd_unshare. We do take
> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> (39dde65c9940c97f ("shared page table for hugetlb page")).
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 03:13:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 08/15] hugetlb: Make some static variables global

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> We will use them later in hugetlb_cgroup.c
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 04:07:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 [updated] 10/15] hugetlb/cgroup: Add the cgroup pointer to page lru

(2012/06/13 20:34), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
> hugetlb cgroup usage to hugepages consisting of 3 or more
> normal pages. I guess that is an acceptable limitation.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 04:09:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> This patch adds the charge and uncharge routines for the hugetlb cgroup.
> We do cgroup charging in page alloc and uncharge in compound page
> destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 04:11:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 12/15] hugetlb/cgroup: Add support for cgroup removal

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> This patch adds support for cgroup removal. If we don't have a parent
> cgroup, the charges are moved to the root cgroup.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 04:12:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 13/15] hugetlb/cgroup: add hugetlb cgroup control files

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Add the control files for hugetlb controller
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 04:15:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH -V9 14/15] hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

(2012/06/13 19:27), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page destructor. Since
> we are holding a hugepage reference, we can be sure that the old page won't
> get uncharged until the last put_page().
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 07:14:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

On Wed 13-06-12 22:13:00, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Wed 13-06-12 16:59:23, Michal Hocko wrote:
> >> On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
> >> > From: "Aneesh Kumar K.V" <[email protected]>
> >> >
> >> > Use a mmu_gather instead of a temporary linked list for accumulating
> >> > pages when we unmap a hugepage range
> >>
> >> Sorry for coming up with the comment that late but you owe us an
> >> explanation _why_ you are doing this.
> >>
> >> I assume that this fixes a real problem when we take i_mmap_mutex
> >> already up in
> >> unmap_mapping_range
> >> mutex_lock(&mapping->i_mmap_mutex);
> >> unmap_mapping_range_tree | unmap_mapping_range_list
> >> unmap_mapping_range_vma
> >> zap_page_range_single
> >> unmap_single_vma
> >> unmap_hugepage_range
> >> mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >>
> >> And that this should have been marked for stable as well (I haven't
> >> checked when this has been introduced).
> >>
> >> But then I do not see how this helps when you still do this:
> >> [...]
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 1b7dc66..545e18a 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >> > * Since no pte has actually been setup, it is
> >> > * safe to do nothing in this case.
> >> > */
> >> > - if (vma->vm_file)
> >> > - unmap_hugepage_range(vma, start, end, NULL);
> >> > + if (vma->vm_file) {
> >> > + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >> > + __unmap_hugepage_range(tlb, vma, start, end, NULL);
> >> > + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >> > + }
> >> > } else
> >> > unmap_page_range(tlb, vma, start, end, details);
> >> > }
> >
> > Ahhh, you are removing the lock in the next patch. Really confusing and
> > not nice for the stable backport.
> > Could you merge those two patches and add Cc: stable?
> > Then you can add my
> > Reviewed-by: Michal Hocko <[email protected]>
> >
>
> In the last review cycle I was asked to see if we can get a lockdep
> report for the above and what I found was we don't really cause the
> above deadlock with the current codebase because for hugetlb we don't
> directly call unmap_mapping_range.

Ahh, ok I missed that.

> But still it is good to remove the i_mmap_mutex, because we don't need
> that protection now. I didn't mark it for stable because of the above
> reason.

Thanks for clarification

>
> -aneesh
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 07:16:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 04/15] hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages

On Wed 13-06-12 22:07:06, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Wed 13-06-12 15:57:23, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> Use a mmu_gather instead of a temporary linked list for accumulating
> >> pages when we unmap a hugepage range
> >
> > Sorry for coming up with the comment that late but you owe us an
> > explanation _why_ you are doing this.
> >
> > I assume that this fixes a real problem when we take i_mmap_mutex
> > already up in
> > unmap_mapping_range
> > mutex_lock(&mapping->i_mmap_mutex);
> > unmap_mapping_range_tree | unmap_mapping_range_list
> > unmap_mapping_range_vma
> > zap_page_range_single
> > unmap_single_vma
> > unmap_hugepage_range
> > mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> >
> > And that this should have been marked for stable as well (I haven't
> > checked when this has been introduced).
>
> Switch to mmu_gather is to get rid of the use of page->lru so that i can use it for
> active list.

So can we get this into the changelog, please?

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 07:20:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 05/15] hugetlb: avoid taking i_mmap_mutex in unmap_single_vma() for hugetlb

On Wed 13-06-12 15:57:24, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> fix linked list corruption in unmap_hugepage_range()") but we don't use
> page->lru in unmap_hugepage_range any more. Also the lock was taken
> higher up in the stack in some code path. That would result in deadlock.

This sounds like the deadlock is real, but in the other email you wrote
that the deadlock cannot happen, so it would be good to mention that here.

> unmap_mapping_range (i_mmap_mutex)
> -> unmap_mapping_range_tree
> -> unmap_mapping_range_vma
> -> zap_page_range_single
> -> unmap_single_vma
> -> unmap_hugepage_range (i_mmap_mutex)
>
> For shared pagetable support for huge pages, since pagetable pages are ref
> counted we don't need any lock during huge_pmd_unshare. We do take
> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> (39dde65c9940c97f ("shared page table for hugetlb page")).
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/memory.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 545e18a..f6bc04f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> * Since no pte has actually been setup, it is
> * safe to do nothing in this case.
> */
> - if (vma->vm_file) {
> - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
> + if (vma->vm_file)
> __unmap_hugepage_range(tlb, vma, start, end, NULL);
> - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
> - }
> } else
> unmap_page_range(tlb, vma, start, end, details);
> }
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 07:28:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 06/15] hugetlb: simplify migrate_huge_page()

On Wed 13-06-12 15:57:25, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Since we migrate only one hugepage, don't use a linked list for passing the
> page around. Directly pass the page that needs to be migrated as an argument.
> This also removes the usage of page->lru in the migrate path.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Yes nice.
Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/migrate.h | 4 +--
> mm/memory-failure.c | 13 ++--------
> mm/migrate.c | 65 +++++++++++++++--------------------------------
> 3 files changed, 25 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 855c337..ce7e667 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
> extern int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode);
> -extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> +extern int migrate_huge_page(struct page *, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode);
>
> @@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
> static inline int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode) { return -ENOSYS; }
> -static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> +static inline int migrate_huge_page(struct page *page, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode) { return -ENOSYS; }
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ab1e714..53a1495 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
> int ret;
> unsigned long pfn = page_to_pfn(page);
> struct page *hpage = compound_head(page);
> - LIST_HEAD(pagelist);
>
> ret = get_any_page(page, pfn, flags);
> if (ret < 0)
> @@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
> }
>
> /* Keep page count to indicate a given hugepage is isolated. */
> -
> - list_add(&hpage->lru, &pagelist);
> - ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
> - true);
> + ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
> + put_page(hpage);
> if (ret) {
> - struct page *page1, *page2;
> - list_for_each_entry_safe(page1, page2, &pagelist, lru)
> - put_page(page1);
> -
> pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> pfn, ret, page->flags);
> - if (ret > 0)
> - ret = -EIO;
> return ret;
> }
> done:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index be26d5c..fdce3a2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -932,15 +932,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> if (anon_vma)
> put_anon_vma(anon_vma);
> unlock_page(hpage);
> -
> out:
> - if (rc != -EAGAIN) {
> - list_del(&hpage->lru);
> - put_page(hpage);
> - }
> -
> put_page(new_hpage);
> -
> if (result) {
> if (rc)
> *result = rc;
> @@ -1016,48 +1009,32 @@ out:
> return nr_failed + retry;
> }
>
> -int migrate_huge_pages(struct list_head *from,
> - new_page_t get_new_page, unsigned long private, bool offlining,
> - enum migrate_mode mode)
> +int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
> + unsigned long private, bool offlining,
> + enum migrate_mode mode)
> {
> - int retry = 1;
> - int nr_failed = 0;
> - int pass = 0;
> - struct page *page;
> - struct page *page2;
> - int rc;
> -
> - for (pass = 0; pass < 10 && retry; pass++) {
> - retry = 0;
> -
> - list_for_each_entry_safe(page, page2, from, lru) {
> + int pass, rc;
> +
> + for (pass = 0; pass < 10; pass++) {
> + rc = unmap_and_move_huge_page(get_new_page,
> + private, hpage, pass > 2, offlining,
> + mode);
> + switch (rc) {
> + case -ENOMEM:
> + goto out;
> + case -EAGAIN:
> + /* try again */
> cond_resched();
> -
> - rc = unmap_and_move_huge_page(get_new_page,
> - private, page, pass > 2, offlining,
> - mode);
> -
> - switch(rc) {
> - case -ENOMEM:
> - goto out;
> - case -EAGAIN:
> - retry++;
> - break;
> - case 0:
> - break;
> - default:
> - /* Permanent failure */
> - nr_failed++;
> - break;
> - }
> + break;
> + case 0:
> + goto out;
> + default:
> + rc = -EIO;
> + goto out;
> }
> }
> - rc = 0;
> out:
> - if (rc)
> - return rc;
> -
> - return nr_failed + retry;
> + return rc;
> }
>
> #ifdef CONFIG_NUMA
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 07:33:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 07/15] hugetlb: add a list for tracking in-use HugeTLB pages

On Wed 13-06-12 15:57:26, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> hugepage_activelist will be used to track currently used HugeTLB pages.
> We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal.
> On cgroup removal we update the page's HugeTLB cgroup to point to parent
> cgroup.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/hugetlb.h | 1 +
> mm/hugetlb.c | 12 +++++++-----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 0f23c18..ed550d8 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -211,6 +211,7 @@ struct hstate {
> unsigned long resv_huge_pages;
> unsigned long surplus_huge_pages;
> unsigned long nr_overcommit_huge_pages;
> + struct list_head hugepage_activelist;
> struct list_head hugepage_freelists[MAX_NUMNODES];
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> unsigned int free_huge_pages_node[MAX_NUMNODES];
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e54b695..b5b6e15 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -510,7 +510,7 @@ void copy_huge_page(struct page *dst, struct page *src)
> static void enqueue_huge_page(struct hstate *h, struct page *page)
> {
> int nid = page_to_nid(page);
> - list_add(&page->lru, &h->hugepage_freelists[nid]);
> + list_move(&page->lru, &h->hugepage_freelists[nid]);
> h->free_huge_pages++;
> h->free_huge_pages_node[nid]++;
> }
> @@ -522,7 +522,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> if (list_empty(&h->hugepage_freelists[nid]))
> return NULL;
> page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
> - list_del(&page->lru);
> + list_move(&page->lru, &h->hugepage_activelist);
> set_page_refcounted(page);
> h->free_huge_pages--;
> h->free_huge_pages_node[nid]--;
> @@ -626,10 +626,11 @@ static void free_huge_page(struct page *page)
> page->mapping = NULL;
> BUG_ON(page_count(page));
> BUG_ON(page_mapcount(page));
> - INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
> + /* remove the page from active list */
> + list_del(&page->lru);
> update_and_free_page(h, page);
> h->surplus_huge_pages--;
> h->surplus_huge_pages_node[nid]--;
> @@ -642,6 +643,7 @@ static void free_huge_page(struct page *page)
>
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> {
> + INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> h->nr_huge_pages++;
> @@ -890,6 +892,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>
> spin_lock(&hugetlb_lock);
> if (page) {
> + INIT_LIST_HEAD(&page->lru);
> r_nid = page_to_nid(page);
> set_compound_page_dtor(page, free_huge_page);
> /*
> @@ -994,7 +997,6 @@ retry:
> list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> if ((--needed) < 0)
> break;
> - list_del(&page->lru);
> /*
> * This page is now managed by the hugetlb allocator and has
> * no users -- drop the buddy allocator's reference.
> @@ -1009,7 +1011,6 @@ free:
> /* Free unnecessary surplus pages to the buddy allocator */
> if (!list_empty(&surplus_list)) {
> list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> - list_del(&page->lru);
> put_page(page);
> }
> }
> @@ -1909,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
> h->free_huge_pages = 0;
> for (i = 0; i < MAX_NUMNODES; ++i)
> INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> + INIT_LIST_HEAD(&h->hugepage_activelist);
> h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 07:38:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 08/15] hugetlb: Make some static variables global

On Wed 13-06-12 15:57:27, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> We will use them later in hugetlb_cgroup.c
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

Just a nit
[...]
> +extern int hugetlb_max_hstate;

Maybe we can mark it __read_mostly as it is modified only during
initialization and then it is just a constant.
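
Concretely, the nit amounts to something like the following in mm/hugetlb.c
(a sketch of the suggestion, not a tested change):

/* written only while parsing boot parameters, effectively constant afterwards */
int hugetlb_max_hstate __read_mostly;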

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 08:24:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 09/15] mm/hugetlb: Add new HugeTLB cgroup

On Wed 13-06-12 15:57:28, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patch implements a new controller that allows us to control HugeTLB
> allocations. The extension allows to limit the HugeTLB usage per control
> group and enforces the controller limit during page fault. Since HugeTLB
> doesn't support page reclaim, enforcing the limit at page fault time implies
> that, the application will get SIGBUS signal if it tries to access HugeTLB
> pages beyond its limit. This requires the application to know beforehand
> how much HugeTLB pages it would require for its use.
>
> The charge/uncharge calls will be added to HugeTLB code in later patch.
> Support for cgroup removal will be added in later patches.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Looks good
Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/cgroup_subsys.h | 6 ++
> include/linux/hugetlb_cgroup.h | 37 ++++++++++++
> init/Kconfig | 15 +++++
> mm/Makefile | 1 +
> mm/hugetlb_cgroup.c | 122 ++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 181 insertions(+)
> create mode 100644 include/linux/hugetlb_cgroup.h
> create mode 100644 mm/hugetlb_cgroup.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 0bd390c..895923a 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -72,3 +72,9 @@ SUBSYS(net_prio)
> #endif
>
> /* */
> +
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +SUBSYS(hugetlb)
> +#endif
> +
> +/* */
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> new file mode 100644
> index 0000000..e9944b4
> --- /dev/null
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -0,0 +1,37 @@
> +/*
> + * Copyright IBM Corporation, 2012
> + * Author Aneesh Kumar K.V <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +#ifndef _LINUX_HUGETLB_CGROUP_H
> +#define _LINUX_HUGETLB_CGROUP_H
> +
> +#include <linux/res_counter.h>
> +
> +struct hugetlb_cgroup;
> +
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline bool hugetlb_cgroup_disabled(void)
> +{
> + if (hugetlb_subsys.disabled)
> + return true;
> + return false;
> +}
> +
> +#else
> +static inline bool hugetlb_cgroup_disabled(void)
> +{
> + return true;
> +}
> +
> +#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> +#endif
> diff --git a/init/Kconfig b/init/Kconfig
> index d07dcf9..da05fae 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -751,6 +751,21 @@ config CGROUP_MEM_RES_CTLR_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
> +config CGROUP_HUGETLB_RES_CTLR
> + bool "HugeTLB Resource Controller for Control Groups"
> + depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
> + default n
> + help
> + Provides a cgroup Resource Controller for HugeTLB pages.
> + When you enable this, you can put a per cgroup limit on HugeTLB usage.
> + The limit is enforced during page fault. Since HugeTLB doesn't
> + support page reclaim, enforcing the limit at page fault time implies
> + that, the application will get SIGBUS signal if it tries to access
> + HugeTLB pages beyond its limit. This requires the application to know
> + beforehand how much HugeTLB pages it would require for its use. The
> + control group is tracked in the third page lru pointer. This means
> + that we cannot use the controller with huge page less than 3 pages.
> +
> config CGROUP_PERF
> bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
> depends on PERF_EVENTS && CGROUPS
> diff --git a/mm/Makefile b/mm/Makefile
> index 2e2fbbe..25e8002 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -49,6 +49,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> new file mode 100644
> index 0000000..5a4e71c
> --- /dev/null
> +++ b/mm/hugetlb_cgroup.c
> @@ -0,0 +1,122 @@
> +/*
> + *
> + * Copyright IBM Corporation, 2012
> + * Author Aneesh Kumar K.V <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> +
> +struct hugetlb_cgroup {
> + struct cgroup_subsys_state css;
> + /*
> + * the counter to account for hugepages from hugetlb.
> + */
> + struct res_counter hugepage[HUGE_MAX_HSTATE];
> +};
> +
> +struct cgroup_subsys hugetlb_subsys __read_mostly;
> +struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> +{
> + if (s)
> + return container_of(s, struct hugetlb_cgroup, css);
> + return NULL;
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
> +{
> + return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
> + hugetlb_subsys_id));
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
> +{
> + return hugetlb_cgroup_from_css(task_subsys_state(task,
> + hugetlb_subsys_id));
> +}
> +
> +static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
> +{
> + return (h_cg == root_h_cgroup);
> +}
> +
> +static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
> +{
> + if (!cg->parent)
> + return NULL;
> + return hugetlb_cgroup_from_cgroup(cg->parent);
> +}
> +
> +static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
> +{
> + int idx;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
> +
> + for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> + return true;
> + }
> + return false;
> +}
> +
> +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
> +{
> + int idx;
> + struct cgroup *parent_cgroup;
> + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
> +
> + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> + if (!h_cgroup)
> + return ERR_PTR(-ENOMEM);
> +
> + parent_cgroup = cgroup->parent;
> + if (parent_cgroup) {
> + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
> + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> + res_counter_init(&h_cgroup->hugepage[idx],
> + &parent_h_cgroup->hugepage[idx]);
> + } else {
> + root_h_cgroup = h_cgroup;
> + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
> + res_counter_init(&h_cgroup->hugepage[idx], NULL);
> + }
> + return &h_cgroup->css;
> +}
> +
> +static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> +{
> + struct hugetlb_cgroup *h_cgroup;
> +
> + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
> + kfree(h_cgroup);
> +}
> +
> +static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> +{
> + /* We will add the cgroup removal support in later patches */
> + return -EBUSY;
> +}
> +
> +struct cgroup_subsys hugetlb_subsys = {
> + .name = "hugetlb",
> + .create = hugetlb_cgroup_create,
> + .pre_destroy = hugetlb_cgroup_pre_destroy,
> + .destroy = hugetlb_cgroup_destroy,
> + .subsys_id = hugetlb_subsys_id,
> +};
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 08:44:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 [updated] 10/15] hugetlb/cgroup: Add the cgroup pointer to page lru

On Wed 13-06-12 17:04:30, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
> hugetlb cgroup usage to hugepages consisting of 3 or more
> normal pages. I guess that is an acceptable limitation.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

I would be happier if you explicitly mentioned that both
hugetlb_cgroup_from_page and set_hugetlb_cgroup need hugetlb_lock held,
but

Reviewed-by: Michal Hocko <[email protected]>
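
For illustration, making that requirement explicit could be as simple as a
comment on the helpers (a sketch only, not part of the posted patch;
lockdep_assert_held() could also be added wherever hugetlb_lock is visible):

/*
 * Both helpers require hugetlb_lock to be held by the caller:
 * the cgroup pointer stashed in page[2].lru.next is only stable
 * under that lock.
 */
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
	VM_BUG_ON(!PageHuge(page));

	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
		return NULL;
	return (struct hugetlb_cgroup *)page[2].lru.next;
}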

> ---
> include/linux/hugetlb_cgroup.h | 37 +++++++++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index e9944b4..2e4cb6b 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -18,8 +18,34 @@
> #include <linux/res_counter.h>
>
> struct hugetlb_cgroup;
> +/*
> + * Minimum page order trackable by hugetlb cgroup.
> + * At least 3 pages are necessary for all the tracking information.
> + */
> +#define HUGETLB_CGROUP_MIN_ORDER 2
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + VM_BUG_ON(!PageHuge(page));
> +
> + if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
> + return NULL;
> + return (struct hugetlb_cgroup *)page[2].lru.next;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + VM_BUG_ON(!PageHuge(page));
> +
> + if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
> + return -1;
> + page[2].lru.next = (void *)h_cg;
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> if (hugetlb_subsys.disabled)
> @@ -28,6 +54,17 @@ static inline bool hugetlb_cgroup_disabled(void)
> }
>
> #else
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> return true;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e899a2d..6a449c5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -28,6 +28,7 @@
>
> #include <linux/io.h>
> #include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> #include "internal.h"
>
> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1 << PG_writeback);
> }
> + VM_BUG_ON(hugetlb_cgroup_from_page(page));
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> arch_release_hugepage(page);
> @@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> + set_hugetlb_cgroup(page, NULL);
> h->nr_huge_pages++;
> h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> @@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> INIT_LIST_HEAD(&page->lru);
> r_nid = page_to_nid(page);
> set_compound_page_dtor(page, free_huge_page);
> + set_hugetlb_cgroup(page, NULL);
> /*
> * We incremented the global counters already
> */
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 08:58:16

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH -V9 09/15] mm/hugetlb: Add new HugeTLB cgroup

> +static inline

> +struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> +{
> + if (s)


Neither cgroup_subsys_state() nor task_subsys_state() will ever return NULL,
so 's' won't be NULL here (see the sketch after the quoted code below).

> + return container_of(s, struct hugetlb_cgroup, css);
> + return NULL;
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
> +{
> + return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
> + hugetlb_subsys_id));
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
> +{
> + return hugetlb_cgroup_from_css(task_subsys_state(task,
> + hugetlb_subsys_id));
> +}

2012-06-14 09:03:09

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

> +int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,

> + struct hugetlb_cgroup **ptr)
> +{
> + int ret = 0;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *h_cg = NULL;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled())
> + goto done;
> + /*
> + * We don't charge any cgroup if the compound page have less
> + * than 3 pages.
> + */
> + if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> + goto done;
> +again:
> + rcu_read_lock();
> + h_cg = hugetlb_cgroup_from_task(current);
> + if (!h_cg)


In no circumstances should h_cg be NULL.

> + h_cg = root_h_cgroup;
> +
> + if (!css_tryget(&h_cg->css)) {
> + rcu_read_unlock();
> + goto again;
> + }
> + rcu_read_unlock();
> +
> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> + css_put(&h_cg->css);
> +done:
> + *ptr = h_cg;
> + return ret;
> +}

2012-06-14 09:25:44

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

On Wed 13-06-12 15:57:30, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patch adds the charge and uncharge routines for the hugetlb cgroup.
> We do cgroup charging in page alloc and uncharge in compound page
> destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

One minor comment
[...]
> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg,
> + struct page *page)
> +{
> + if (hugetlb_cgroup_disabled() || !h_cg)
> + return;
> +
> + spin_lock(&hugetlb_lock);
> + set_hugetlb_cgroup(page, h_cg);
> + spin_unlock(&hugetlb_lock);
> + return;
> +}

I guess we can remove the lock here because nobody can see the page yet,
right?
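
For reference, the simplification being suggested would look roughly like this
(sketch only; it relies on the observation above that the page is not yet
visible to anyone else):

void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
				  struct hugetlb_cgroup *h_cg,
				  struct page *page)
{
	if (hugetlb_cgroup_disabled() || !h_cg)
		return;
	/* no hugetlb_lock: the freshly allocated page has no other users yet */
	set_hugetlb_cgroup(page, h_cg);
}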

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 09:31:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 12/15] hugetlb/cgroup: Add support for cgroup removal

On Wed 13-06-12 15:57:31, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patch adds support for cgroup removal. If we don't have a parent
> cgroup, the charges are moved to the root cgroup.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> mm/hugetlb_cgroup.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 0f2f6ac..a3a68a4 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -107,10 +107,76 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> kfree(h_cgroup);
> }
>
> +
> +/*
> + * Should be called with hugetlb_lock held.
> + * Since we are holding hugetlb_lock, pages cannot get moved from
> + * active list or uncharged from the cgroup, So no need to get
> + * page reference and test for page active here. This function
> + * cannot fail.
> + */
> +static void hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> + struct page *page)
> +{
> + int csize;
> + struct res_counter *counter;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *page_hcg;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> +
> + page_hcg = hugetlb_cgroup_from_page(page);
> + /*
> + * We can have pages in active list without any cgroup
> + * ie, hugepage with less than 3 pages. We can safely
> + * ignore those pages.
> + */
> + if (!page_hcg || page_hcg != h_cg)
> + goto out;
> +
> + csize = PAGE_SIZE << compound_order(page);
> + if (!parent) {
> + parent = root_h_cgroup;
> + /* root has no limit */
> + res_counter_charge_nofail(&parent->hugepage[idx],
> + csize, &fail_res);
> + }
> + counter = &h_cg->hugepage[idx];
> + res_counter_uncharge_until(counter, counter->parent, csize);
> +
> + set_hugetlb_cgroup(page, parent);
> +out:
> + return;
> +}
> +
> +/*
> + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
> + * the parent cgroup.
> + */
> static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> {
> - /* We will add the cgroup removal support in later patches */
> - return -EBUSY;
> + struct hstate *h;
> + struct page *page;
> + int ret = 0, idx = 0;
> +
> + do {
> + if (cgroup_task_count(cgroup) ||
> + !list_empty(&cgroup->children)) {
> + ret = -EBUSY;
> + goto out;
> + }
> + for_each_hstate(h) {
> + spin_lock(&hugetlb_lock);
> + list_for_each_entry(page, &h->hugepage_activelist, lru)
> + hugetlb_cgroup_move_parent(idx, cgroup, page);
> +
> + spin_unlock(&hugetlb_lock);
> + idx++;
> + }
> + cond_resched();
> + } while (hugetlb_cgroup_have_usage(cgroup));
> +out:
> + return ret;
> }
>
> int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 09:36:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 13/15] hugetlb/cgroup: add hugetlb cgroup control files

On Wed 13-06-12 15:57:32, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the control files for hugetlb controller
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/hugetlb.h | 5 ++
> include/linux/hugetlb_cgroup.h | 6 ++
> mm/hugetlb.c | 8 +++
> mm/hugetlb_cgroup.c | 129 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 148 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4aca057..9650bb1 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -4,6 +4,7 @@
> #include <linux/mm_types.h>
> #include <linux/fs.h>
> #include <linux/hugetlb_inline.h>
> +#include <linux/cgroup.h>
>
> struct ctl_table;
> struct user_struct;
> @@ -221,6 +222,10 @@ struct hstate {
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> unsigned int free_huge_pages_node[MAX_NUMNODES];
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> + /* cgroup control files */
> + struct cftype cgroup_files[5];
> +#endif
> char name[HSTATE_NAME_LEN];
> };
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index e05871c..bd8bc98 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -62,6 +62,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> struct page *page);
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> +extern int hugetlb_cgroup_file_init(int idx) __init;
>
> #else
> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> @@ -108,5 +109,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> return;
> }
>
> +static inline int __init hugetlb_cgroup_file_init(int idx)
> +{
> + return 0;
> +}
> +
> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 59720b1..a5a30bf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -30,6 +30,7 @@
> #include <linux/hugetlb.h>
> #include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> +#include <linux/hugetlb_cgroup.h>
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1930,6 +1931,13 @@ void __init hugetlb_add_hstate(unsigned order)
> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
> + /*
> + * Add cgroup control files only if the huge page consists
> + * of more than two normal pages. This is because we use
> + * page[2].lru.next for storing cgoup details.
> + */
> + if (order >= HUGETLB_CGROUP_MIN_ORDER)
> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>
> parsed_hstate = h;
> }
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index a3a68a4..64e93e0 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -26,6 +26,10 @@ struct hugetlb_cgroup {
> struct res_counter hugepage[HUGE_MAX_HSTATE];
> };
>
> +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> +#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val) ((val) & 0xffff)
> +
> struct cgroup_subsys hugetlb_subsys __read_mostly;
> struct hugetlb_cgroup *root_h_cgroup __read_mostly;
>
> @@ -259,6 +263,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> return;
> }
>
> +static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
> + struct file *file, char __user *buf,
> + size_t nbytes, loff_t *ppos)
> +{
> + u64 val;
> + char str[64];
> + int idx, name, len;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + val = res_counter_read_u64(&h_cg->hugepage[idx], name);
> + len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> + return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> +}
> +
> +static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
> + const char *buffer)
> +{
> + int idx, name, ret;
> + unsigned long long val;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + switch (name) {
> + case RES_LIMIT:
> + if (hugetlb_cgroup_is_root(h_cg)) {
> + /* Can't set limit on root */
> + ret = -EINVAL;
> + break;
> + }
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + break;
> + ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
> +{
> + int idx, name, ret = 0;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(event);
> + name = MEMFILE_ATTR(event);
> +
> + switch (name) {
> + case RES_MAX_USAGE:
> + res_counter_reset_max(&h_cg->hugepage[idx]);
> + break;
> + case RES_FAILCNT:
> + res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static char *mem_fmt(char *buf, int size, unsigned long hsize)
> +{
> + if (hsize >= (1UL << 30))
> + snprintf(buf, size, "%luGB", hsize >> 30);
> + else if (hsize >= (1UL << 20))
> + snprintf(buf, size, "%luMB", hsize >> 20);
> + else
> + snprintf(buf, size, "%luKB", hsize >> 10);
> + return buf;
> +}
> +
> +int __init hugetlb_cgroup_file_init(int idx)
> +{
> + char buf[32];
> + struct cftype *cft;
> + struct hstate *h = &hstates[idx];
> +
> + /* format the size */
> + mem_fmt(buf, 32, huge_page_size(h));
> +
> + /* Add the limit file */
> + cft = &h->cgroup_files[0];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> + cft->read = hugetlb_cgroup_read;
> + cft->write_string = hugetlb_cgroup_write;
> +
> + /* Add the usage file */
> + cft = &h->cgroup_files[1];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the MAX usage file */
> + cft = &h->cgroup_files[2];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the failcnt file */
> + cft = &h->cgroup_files[3];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* NULL terminate the last cft */
> + cft = &h->cgroup_files[4];
> + memset(cft, 0, sizeof(*cft));
> +
> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> +
> + return 0;
> +}
> +
> struct cgroup_subsys hugetlb_subsys = {
> .name = "hugetlb",
> .create = hugetlb_cgroup_create,
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
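
For reference, a standalone sketch of the cft->private encoding used above:
the hstate index lives in the upper 16 bits and the res_counter attribute in
the lower 16, so a single integer selects both the hstate and the file type.
The RES_* values below are local stand-ins for the enum in res_counter.h,
added only so the snippet compiles on its own.

#include <assert.h>
#include <stdio.h>

enum { RES_USAGE, RES_MAX_USAGE, RES_LIMIT, RES_FAILCNT };	/* stand-ins */

#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

int main(void)
{
	/* e.g. the limit_in_bytes file of hstate 1 */
	int priv = MEMFILE_PRIVATE(1, RES_LIMIT);

	assert(MEMFILE_IDX(priv) == 1);
	assert(MEMFILE_ATTR(priv) == RES_LIMIT);
	printf("private=%#x idx=%d attr=%d\n",
	       priv, MEMFILE_IDX(priv), MEMFILE_ATTR(priv));
	return 0;
}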

2012-06-14 10:04:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 14/15] hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

On Wed 13-06-12 15:57:33, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
> destructor. Since we are holding a hugepage reference, we can be sure that
> the old page won't get uncharged till the last put_page().
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

One question below
[...]
> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
> +{
> + struct hugetlb_cgroup *h_cg;
> +
> + if (hugetlb_cgroup_disabled())
> + return;
> +
> + VM_BUG_ON(!PageHuge(oldhpage));
> + spin_lock(&hugetlb_lock);
> + h_cg = hugetlb_cgroup_from_page(oldhpage);
> + set_hugetlb_cgroup(oldhpage, NULL);
> + cgroup_exclude_rmdir(&h_cg->css);
> +
> + /* move the h_cg details to new cgroup */
> + set_hugetlb_cgroup(newhpage, h_cg);
> + spin_unlock(&hugetlb_lock);
> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
> + return;
> +}
> +

The changelog says that the old page won't get uncharged - which means
that the cgroup cannot go away (even if we raced with the move to the
parent, hugetlb_lock makes sure we either see the old or the new cgroup),
so why do we need to play with css ref. counting?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 10:07:59

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH -V9 15/15] hugetlb/cgroup: add HugeTLB controller documentation

On Wed 13-06-12 15:57:34, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

Minor nit below
> ---
> Documentation/cgroups/hugetlb.txt | 45 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
> create mode 100644 Documentation/cgroups/hugetlb.txt
>
> diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
> new file mode 100644
> index 0000000..a9faaca
> --- /dev/null
> +++ b/Documentation/cgroups/hugetlb.txt
[...]
> +With the above step, the initial or the parent HugeTLB group becomes
> +visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
> +the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
> +
> +New groups can be created under the parent group /sys/fs/cgroup.
> +
> +# cd /sys/fs/cgroup
> +# mkdir g1
> +# echo $$ > g1/tasks
> +
> +The above steps create a new group g1 and move the current shell
> +process (bash) into it.

This is probably not needed as it is already described in the generic
cgroups documentation.

> +
> +Brief summary of control files
> +
> + hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
> + hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
> + hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to the HugeTLB limit
> +
> +For a system supporting two hugepage sizes (16M and 16G), the control
> +files include:
> +
> +hugetlb.16GB.limit_in_bytes
> +hugetlb.16GB.max_usage_in_bytes
> +hugetlb.16GB.usage_in_bytes
> +hugetlb.16GB.failcnt
> +hugetlb.16MB.limit_in_bytes
> +hugetlb.16MB.max_usage_in_bytes
> +hugetlb.16MB.usage_in_bytes
> +hugetlb.16MB.failcnt
> --
> 1.7.10
>

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
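
As a usage illustration of the files listed in the document, a minimal
userspace sketch follows. It assumes a 2MB hstate (so the files are named
hugetlb.2MB.*) and a child group g1 created under /sys/fs/cgroup as in the
documentation; both the path and the page size are illustrative. The write
path goes through res_counter_memparse_write_strategy(), so a suffixed value
such as "1G" should be accepted.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *limit = "/sys/fs/cgroup/g1/hugetlb.2MB.limit_in_bytes";
	const char *failcnt = "/sys/fs/cgroup/g1/hugetlb.2MB.failcnt";
	char buf[64] = "";
	int fd;

	/* cap the group's 2MB hugepage usage at 1G */
	fd = open(limit, O_WRONLY);
	if (fd < 0 || write(fd, "1G", 2) < 0)
		perror(limit);
	if (fd >= 0)
		close(fd);

	/* read back how many allocations failed against the limit */
	fd = open(failcnt, O_RDONLY);
	if (fd >= 0 && read(fd, buf, sizeof(buf) - 1) > 0)
		printf("failcnt: %s", buf);
	if (fd >= 0)
		close(fd);
	return 0;
}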

2012-06-15 06:21:04

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 09/15] mm/hugetlb: Add new HugeTLB cgroup

Li Zefan <[email protected]> writes:

>> +static inline
>> +struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
>> +{
>> + if (s)
>
>
> Neither cgroup_subsys_state() nor task_subsys_state() will ever return NULL,
> so 's' won't be NULL here.
>

That is a change that didn't get updated when I dropped the page_cgroup
changes. I had a series that tracked the cgroup_subsys_state in
page_cgroup. I will send a fix on top.

Thanks for the review.
-aneesh
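
For completeness, a sketch of how the helper could look once the dead NULL
test is dropped (the actual follow-up fix may differ); with callers always
passing a valid css, it reduces to a plain container_of():

static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
{
	/* every task/cgroup has a hugetlb css, so no NULL check is needed */
	return container_of(s, struct hugetlb_cgroup, css);
}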

2012-06-15 10:06:23

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

Michal Hocko <[email protected]> writes:

> On Wed 13-06-12 15:57:30, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> This patchset adds the charge and uncharge routines for the hugetlb cgroup.
>> We do cgroup charging in page alloc and uncharge in compound page
>> destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock.
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>
>
> One minor comment
> [...]
>> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg,
>> + struct page *page)
>> +{
>> + if (hugetlb_cgroup_disabled() || !h_cg)
>> + return;
>> +
>> + spin_lock(&hugetlb_lock);
>> + set_hugetlb_cgroup(page, h_cg);
>> + spin_unlock(&hugetlb_lock);
>> + return;
>> +}
>
> I guess we can remove the lock here because nobody can see the page yet,
> right?
>

We need that to make sure that when we remove the cgroup we find the
correct hugetlb cgroup values for the page. But I guess we have a bug
here. How about the below?

NOTE: We also need another patch to update the active list during soft
offline. I will send that in a reply.

commit e4c3fd3cc0f0faa30ea283cb48ba478a5c0d3e74
Author: Aneesh Kumar K.V <[email protected]>
Date: Fri Jun 15 14:42:27 2012 +0530

hugetlb/cgroup: Assign the page hugetlb cgroup when we move the page to active list.

A page's hugetlb cgroup assignment and its move to the active list should
happen with hugetlb_lock held. Otherwise, when we remove the hugetlb cgroup,
we would iterate the active list and find pages with NULL hugetlb cgroup
values.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ee4da3b..b90dfb4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1146,9 +1146,12 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
}
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
- spin_unlock(&hugetlb_lock);
-
- if (!page) {
+ if (page) {
+ /* update page cgroup details */
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
+ spin_unlock(&hugetlb_lock);
+ } else {
+ spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
hugetlb_cgroup_uncharge_cgroup(idx,
@@ -1159,14 +1162,13 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
}
spin_lock(&hugetlb_lock);
list_move(&page->lru, &h->hugepage_activelist);
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
spin_unlock(&hugetlb_lock);
}

set_page_private(page, (unsigned long)spool);

vma_commit_reservation(h, vma, addr);
- /* update page cgroup details */
- hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
return page;
}

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 8e7ca0a..d4f3f7b 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -218,6 +218,7 @@ done:
return ret;
}

+/* Should be called with hugetlb_lock held */
void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page)
@@ -225,9 +226,7 @@ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
if (hugetlb_cgroup_disabled() || !h_cg)
return;

- spin_lock(&hugetlb_lock);
set_hugetlb_cgroup(page, h_cg);
- spin_unlock(&hugetlb_lock);
return;
}
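
A possible way to enforce the new locking rule at runtime rather than only in
a comment is a lockdep assertion; this is not part of the patch above, just a
sketch of what the function could look like with one added:

void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
				  struct hugetlb_cgroup *h_cg,
				  struct page *page)
{
	if (hugetlb_cgroup_disabled() || !h_cg)
		return;

	/* warn if a caller forgot hugetlb_lock (only with lockdep enabled) */
	lockdep_assert_held(&hugetlb_lock);
	set_hugetlb_cgroup(page, h_cg);
}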

2012-06-15 10:50:42

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 14/15] hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration

Michal Hocko <[email protected]> writes:

> On Wed 13-06-12 15:57:33, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
>> destructor. Since we are holding a hugepage reference, we can be sure that
>> the old page won't get uncharged till the last put_page().
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>
>
> One question below
> [...]
>> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>> +{
>> + struct hugetlb_cgroup *h_cg;
>> +
>> + if (hugetlb_cgroup_disabled())
>> + return;
>> +
>> + VM_BUG_ON(!PageHuge(oldhpage));
>> + spin_lock(&hugetlb_lock);
>> + h_cg = hugetlb_cgroup_from_page(oldhpage);
>> + set_hugetlb_cgroup(oldhpage, NULL);
>> + cgroup_exclude_rmdir(&h_cg->css);
>> +
>> + /* move the h_cg details to new cgroup */
>> + set_hugetlb_cgroup(newhpage, h_cg);
>> + spin_unlock(&hugetlb_lock);
>> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
>> + return;
>> +}
>> +
>
> The changelog says that the old page won't get uncharged - which means
> that the cgroup cannot go away (even if we raced with the move to the
> parent, hugetlb_lock makes sure we either see the old or the new cgroup),
> so why do we need to play with css ref. counting?

OK, hugetlb_lock should be sufficient here, I guess. I will send a patch on
top to remove the cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()
calls.

-aneesh
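
A sketch of what the simplified version could look like (the follow-up patch
may differ in detail): since hugetlb_lock pins the cgroup for the duration of
the transfer, the rmdir exclude/release pair is simply dropped.

void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
{
	struct hugetlb_cgroup *h_cg;

	if (hugetlb_cgroup_disabled())
		return;

	VM_BUG_ON(!PageHuge(oldhpage));
	spin_lock(&hugetlb_lock);
	h_cg = hugetlb_cgroup_from_page(oldhpage);
	set_hugetlb_cgroup(oldhpage, NULL);

	/* move the h_cg details to the new page */
	set_hugetlb_cgroup(newhpage, h_cg);
	spin_unlock(&hugetlb_lock);
}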

2012-06-22 22:11:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup

On Thu, 14 Jun 2012 16:58:05 +0800
Li Zefan <[email protected]> wrote:

> > +int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> > + struct hugetlb_cgroup **ptr)
> > +{
> > + int ret = 0;
> > + struct res_counter *fail_res;
> > + struct hugetlb_cgroup *h_cg = NULL;
> > + unsigned long csize = nr_pages * PAGE_SIZE;
> > +
> > + if (hugetlb_cgroup_disabled())
> > + goto done;
> > + /*
> > + * We don't charge any cgroup if the compound page has fewer
> > + * than 3 pages.
> > + */
> > + if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
> > + goto done;
> > +again:
> > + rcu_read_lock();
> > + h_cg = hugetlb_cgroup_from_task(current);
> > + if (!h_cg)
>
>
> In no circumstances should h_cg be NULL.
>

Aneesh?

2012-06-24 16:45:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V9 11/15] hugetlb/cgroup: Add charge/uncharge routines for hugetlb cgroup



Hi Andrew,

Andrew Morton <[email protected]> writes:

> On Thu, 14 Jun 2012 16:58:05 +0800
> Li Zefan <[email protected]> wrote:
>
>> > +int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
>> > + struct hugetlb_cgroup **ptr)
>> > +{
>> > + int ret = 0;
>> > + struct res_counter *fail_res;
>> > + struct hugetlb_cgroup *h_cg = NULL;
>> > + unsigned long csize = nr_pages * PAGE_SIZE;
>> > +
>> > + if (hugetlb_cgroup_disabled())
>> > + goto done;
>> > + /*
>> > + * We don't charge any cgroup if the compound page have less
>> > + * than 3 pages.
>> > + */
>> > + if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
>> > + goto done;
>> > +again:
>> > + rcu_read_lock();
>> > + h_cg = hugetlb_cgroup_from_task(current);
>> > + if (!h_cg)
>>
>>
>> In no circumstances should h_cg be NULL.
>>
>
> Aneesh?

I missed this in the last review. Thanks for the reminder. I will send a
patch addressing this and another related comment in
[email protected] as a separate mail.

-aneesh
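
For reference, an illustrative sketch of how the lookup in
hugetlb_cgroup_charge_cgroup() could read once the dead NULL test is removed
(the actual fix may differ): a task always belongs to some hugetlb cgroup, so
only the css_tryget() retry loop is needed.

again:
	rcu_read_lock();
	h_cg = hugetlb_cgroup_from_task(current);
	if (!css_tryget(&h_cg->css)) {
		/* raced with cgroup removal, retry the lookup */
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();

	ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
	css_put(&h_cg->css);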