Hi,
This patchset implements a cgroup resource controller for HugeTLB
pages. The controller allows limiting HugeTLB usage per control
group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault
time implies that the application will receive a SIGBUS signal if
it tries to access HugeTLB pages beyond its limit. This requires
the application to know beforehand how many HugeTLB pages it will
need.
The goal is to control how many HugeTLB pages a group of tasks can
allocate. It can be seen as an extension of the existing quota
interface, which limits the number of HugeTLB pages per hugetlbfs
superblock. HPC job schedulers require jobs to specify their resource
requirements in the job file. Once those requirements can be met,
job schedulers such as SLURM will schedule the job. We need to make
sure that the jobs won't consume more resources than requested; if
they do, we should either error out or kill the application.
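To make the enforcement concrete, here is a minimal userspace sketch
(illustrative only, not part of the series; the 2MB hugepage size and the
MAP_HUGETLB anonymous mapping are assumptions for the example):

	#include <signal.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void on_sigbus(int sig)
	{
		static const char msg[] = "hugetlb cgroup limit exceeded\n";
		/* write() is async-signal-safe; fprintf() is not */
		write(STDERR_FILENO, msg, sizeof(msg) - 1);
		_exit(1);
	}

	int main(void)
	{
		size_t len = 2UL * 1024 * 1024;	/* one 2MB hugepage */
		char *p;

		signal(SIGBUS, on_sigbus);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* the fault here is where the limit is checked */
		return 0;
	}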
Changes from V7:
* Remove dependency on page cgroup.
* Use page[2].lru.next to store HugeTLB cgroup information.
Changes from V6:
* Implement the controller as a separate HugeTLB cgroup.
* Folded fixup patches from -mm into the original patches
Changes from V5:
* Address review feedback.
Changes from V4:
* Add support for charge/uncharge during page migration
* Drop the usage of page->lru in unmap_hugepage_range.
Changes from v3:
* Address review feedback.
* Fix a bug in parent charging during cgroup removal with use_hierarchy set
Changes from V2:
* Changed the implementation to limit the HugeTLB usage at page
fault time. This simplifies the extension and keeps it closer to
the memcg design. It also allows us to support cgroup removal with
less complexity. The only caveat is that the application should
ensure its HugeTLB usage doesn't cross the cgroup limit.
Changes from V1:
* Reimplemented the controller as a memcg extension. We still use
the same logic to track the cgroup and range.
Changes from RFC post:
* Added support for HugeTLB cgroup hierarchy
* Added support for task migration
* Added documentation patch
* Other bug fixes
-aneesh
From: "Aneesh Kumar K.V" <[email protected]>
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
that VM_FAULT_* values will not exceed MAX_ERRNO. Decouple the VM_FAULT_*
values from MAX_ERRNO by returning real error codes from alloc_huge_page()
and translating them back to VM_FAULT_* codes at the fault handlers.
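For context, a sketch of the fragile pattern being removed (illustrative,
not part of the diff): IS_ERR() only recognises values in the range
[-MAX_ERRNO, -1], so encoding a VM_FAULT_* code via ERR_PTR() silently
stops being detected as an error the moment that code grows past
MAX_ERRNO:

	/* old: only safe while -VM_FAULT_OOM stays within [-MAX_ERRNO, -1] */
	return ERR_PTR(-VM_FAULT_OOM);

	/* new: return a real errno, translate to VM_FAULT_* in the caller */
	return ERR_PTR(-ENOMEM);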
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c868309..34a7e23 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
*/
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
- return ERR_PTR(-VM_FAULT_OOM);
+ return ERR_PTR(-ENOMEM);
if (chg)
if (hugepage_subpool_get_pages(spool, chg))
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
@@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
hugepage_subpool_put_pages(spool, chg);
- return ERR_PTR(-VM_FAULT_SIGBUS);
+ return ERR_PTR(-ENOSPC);
}
}
@@ -2496,6 +2496,7 @@ retry_avoidcopy:
new_page = alloc_huge_page(vma, address, outside_reserve);
if (IS_ERR(new_page)) {
+ long err = PTR_ERR(new_page);
page_cache_release(old_page);
/*
@@ -2524,7 +2525,10 @@ retry_avoidcopy:
/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
- return -PTR_ERR(new_page);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ else
+ return VM_FAULT_SIGBUS;
}
/*
@@ -2642,7 +2646,11 @@ retry:
goto out;
page = alloc_huge_page(vma, address, 0);
if (IS_ERR(page)) {
- ret = -PTR_ERR(page);
+ ret = PTR_ERR(page);
+ if (ret == -ENOMEM)
+ ret = VM_FAULT_OOM;
+ else
+ ret = VM_FAULT_SIGBUS;
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Use an mmu_gather instead of a temporary linked list for accumulating
pages when we unmap a hugepage range. This also allows us to get rid of
the i_mmap_mutex in unmap_hugepage_range in the following patch.
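The calling convention the patch converges on, extracted as a simplified
sketch (see unmap_hugepage_range() in the diff below):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, 0);		/* start batching pages */
	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
	tlb_finish_mmu(&tlb, start, end);	/* flush TLB, free pages */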
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
include/linux/hugetlb.h | 22 ++++++++++++++----
mm/hugetlb.c | 59 ++++++++++++++++++++++++++++-------------------
mm/memory.c | 7 ++++--
4 files changed, 59 insertions(+), 33 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cc9281b..ff233e4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -416,8 +416,8 @@ hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
else
v_offset = 0;
- __unmap_hugepage_range(vma,
- vma->vm_start + v_offset, vma->vm_end, NULL);
+ unmap_hugepage_range(vma, vma->vm_start + v_offset,
+ vma->vm_end, NULL);
}
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 217f528..0f23c18 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -7,6 +7,7 @@
struct ctl_table;
struct user_struct;
+struct mmu_gather;
#ifdef CONFIG_HUGETLB_PAGE
@@ -40,9 +41,10 @@ int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
struct page **, struct vm_area_struct **,
unsigned long *, int *, int, unsigned int flags);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
-void __unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *);
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page);
int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(int, char *);
@@ -98,7 +100,6 @@ static inline unsigned long hugetlb_total_pages(void)
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
-#define unmap_hugepage_range(vma, start, end, page) BUG()
static inline void hugetlb_report_meminfo(struct seq_file *m)
{
}
@@ -112,13 +113,24 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
#define huge_pte_offset(mm, address) 0
-#define dequeue_hwpoisoned_huge_page(page) 0
+static inline int dequeue_hwpoisoned_huge_page(struct page *page)
+{
+ return 0;
+}
+
static inline void copy_huge_page(struct page *dst, struct page *src)
{
}
#define hugetlb_change_protection(vma, address, end, newprot)
+static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct page *ref_page)
+{
+ BUG();
+}
+
#endif /* !CONFIG_HUGETLB_PAGE */
#define HUGETLB_ANON_FILE "anon_hugepage"
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b1e0ed1..e54b695 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,8 +24,9 @@
#include <asm/page.h>
#include <asm/pgtable.h>
-#include <linux/io.h>
+#include <asm/tlb.h>
+#include <linux/io.h>
#include <linux/hugetlb.h>
#include <linux/node.h>
#include "internal.h"
@@ -2310,30 +2311,26 @@ static int is_hugetlb_entry_hwpoisoned(pte_t pte)
return 0;
}
-void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page)
{
+ int force_flush = 0;
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
pte_t *ptep;
pte_t pte;
struct page *page;
- struct page *tmp;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- /*
- * A page gathering list, protected by per file i_mmap_mutex. The
- * lock is used to avoid list corruption from multiple unmapping
- * of the same page since we are using page->lru.
- */
- LIST_HEAD(page_list);
-
WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));
+ tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, start, end);
+again:
spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
@@ -2372,30 +2369,45 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
}
pte = huge_ptep_get_and_clear(mm, address, ptep);
+ tlb_remove_tlb_entry(tlb, ptep, address);
if (pte_dirty(pte))
set_page_dirty(page);
- list_add(&page->lru, &page_list);
+ page_remove_rmap(page);
+ force_flush = !__tlb_remove_page(tlb, page);
+ if (force_flush)
+ break;
/* Bail out after unmapping reference page if supplied */
if (ref_page)
break;
}
- flush_tlb_range(vma, start, end);
spin_unlock(&mm->page_table_lock);
- mmu_notifier_invalidate_range_end(mm, start, end);
- list_for_each_entry_safe(page, tmp, &page_list, lru) {
- page_remove_rmap(page);
- list_del(&page->lru);
- put_page(page);
+ /*
+ * mmu_gather ran out of room to batch pages, so we break out of
+ * the PTE lock to avoid doing the potentially expensive TLB invalidate
+ * and page-free while holding it.
+ */
+ if (force_flush) {
+ force_flush = 0;
+ tlb_flush_mmu(tlb);
+ if (address < end && !ref_page)
+ goto again;
}
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_end_vma(tlb, vma);
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct page *ref_page)
{
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- __unmap_hugepage_range(vma, start, end, ref_page);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ struct mm_struct *mm;
+ struct mmu_gather tlb;
+
+ mm = vma->vm_mm;
+
+ tlb_gather_mmu(&tlb, mm, 0);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ tlb_finish_mmu(&tlb, start, end);
}
/*
@@ -2440,9 +2452,8 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* from the time of fork. This would look like data corruption
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
- __unmap_hugepage_range(iter_vma,
- address, address + huge_page_size(h),
- page);
+ unmap_hugepage_range(iter_vma, address,
+ address + huge_page_size(h), page);
}
mutex_unlock(&mapping->i_mmap_mutex);
diff --git a/mm/memory.c b/mm/memory.c
index 1b7dc66..545e18a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,8 +1326,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file)
- unmap_hugepage_range(vma, start, end, NULL);
+ if (vma->vm_file) {
+ mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ __unmap_hugepage_range(tlb, vma, start, end, NULL);
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This patch adds support for cgroup removal. If there is no parent
cgroup, the charges are moved to the root cgroup.
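The accounting move itself, extracted from the diff below as a simplified
sketch: the child is uncharged only up to (but not including) its parent,
and the page is re-tagged, so usage stays charged at the parent level
while the child's counter drops to zero:

	counter = &h_cg->hugepage[idx];
	res_counter_uncharge_until(counter, counter->parent, csize);
	set_hugetlb_cgroup(page, parent);	/* page now billed to parent */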
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb_cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 79 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 48efd5a..9458fe3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -99,10 +99,87 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
kfree(h_cgroup);
}
+
+static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
+ struct page *page)
+{
+ int csize;
+ struct res_counter *counter;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *page_hcg;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+ struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
+
+ if (!get_page_unless_zero(page))
+ goto out;
+
+ page_hcg = hugetlb_cgroup_from_page(page);
+ /*
+ * We can have pages in the active list without any cgroup,
+ * i.e., hugepages with fewer than 3 normal pages. We can
+ * safely ignore those pages.
+ */
+ if (!page_hcg || page_hcg != h_cg)
+ goto err_out;
+
+ csize = PAGE_SIZE << compound_order(page);
+ if (!parent) {
+ parent = root_h_cgroup;
+ /* root has no limit */
+ res_counter_charge_nofail(&parent->hugepage[idx],
+ csize, &fail_res);
+ }
+ counter = &h_cg->hugepage[idx];
+ res_counter_uncharge_until(counter, counter->parent, csize);
+
+ set_hugetlb_cgroup(page, parent);
+err_out:
+ put_page(page);
+out:
+ return 0;
+}
+
+/*
+ * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
+ * the parent cgroup.
+ */
static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
- /* We will add the cgroup removal support in later patches */
- return -EBUSY;
+ struct hstate *h;
+ struct page *page;
+ int ret = 0, idx = 0;
+
+ do {
+ if (cgroup_task_count(cgroup) ||
+ !list_empty(&cgroup->children)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /*
+ * If the task doing the cgroup_rmdir got a signal,
+ * we don't really need to loop until the hugetlb
+ * resource usage becomes zero.
+ */
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+ for_each_hstate(h) {
+ spin_lock(&hugetlb_lock);
+ list_for_each_entry(page, &h->hugepage_activelist, lru) {
+ ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
+ if (ret) {
+ spin_unlock(&hugetlb_lock);
+ goto out;
+ }
+ }
+ spin_unlock(&hugetlb_lock);
+ idx++;
+ }
+ cond_resched();
+ } while (hugetlb_cgroup_have_usage(cgroup));
+out:
+ return ret;
}
int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Since we migrate only one hugepage, don't use a linked list to pass
the page around; directly pass the page that needs to be migrated as
an argument. This also removes the use of page->lru in the migrate path.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/migrate.h | 4 +--
mm/memory-failure.c | 13 ++--------
mm/migrate.c | 65 +++++++++++++++--------------------------------
3 files changed, 25 insertions(+), 57 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 855c337..ce7e667 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
-extern int migrate_huge_pages(struct list_head *l, new_page_t x,
+extern int migrate_huge_page(struct page *, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
@@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
-static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
+static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..53a1495 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
int ret;
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);
- LIST_HEAD(pagelist);
ret = get_any_page(page, pfn, flags);
if (ret < 0)
@@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
}
/* Keep page count to indicate a given hugepage is isolated. */
-
- list_add(&hpage->lru, &pagelist);
- ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
- true);
+ ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
+ put_page(hpage);
if (ret) {
- struct page *page1, *page2;
- list_for_each_entry_safe(page1, page2, &pagelist, lru)
- put_page(page1);
-
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
- if (ret > 0)
- ret = -EIO;
return ret;
}
done:
diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..fdce3a2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -932,15 +932,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (anon_vma)
put_anon_vma(anon_vma);
unlock_page(hpage);
-
out:
- if (rc != -EAGAIN) {
- list_del(&hpage->lru);
- put_page(hpage);
- }
-
put_page(new_hpage);
-
if (result) {
if (rc)
*result = rc;
@@ -1016,48 +1009,32 @@ out:
return nr_failed + retry;
}
-int migrate_huge_pages(struct list_head *from,
- new_page_t get_new_page, unsigned long private, bool offlining,
- enum migrate_mode mode)
+int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
+ unsigned long private, bool offlining,
+ enum migrate_mode mode)
{
- int retry = 1;
- int nr_failed = 0;
- int pass = 0;
- struct page *page;
- struct page *page2;
- int rc;
-
- for (pass = 0; pass < 10 && retry; pass++) {
- retry = 0;
-
- list_for_each_entry_safe(page, page2, from, lru) {
+ int pass, rc;
+
+ for (pass = 0; pass < 10; pass++) {
+ rc = unmap_and_move_huge_page(get_new_page,
+ private, hpage, pass > 2, offlining,
+ mode);
+ switch (rc) {
+ case -ENOMEM:
+ goto out;
+ case -EAGAIN:
+ /* try again */
cond_resched();
-
- rc = unmap_and_move_huge_page(get_new_page,
- private, page, pass > 2, offlining,
- mode);
-
- switch(rc) {
- case -ENOMEM:
- goto out;
- case -EAGAIN:
- retry++;
- break;
- case 0:
- break;
- default:
- /* Permanent failure */
- nr_failed++;
- break;
- }
+ break;
+ case 0:
+ goto out;
+ default:
+ rc = -EIO;
+ goto out;
}
}
- rc = 0;
out:
- if (rc)
- return rc;
-
- return nr_failed + retry;
+ return rc;
}
#ifdef CONFIG_NUMA
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This adds the necessary charge/uncharge calls to the HugeTLB code. We
charge the hugetlb cgroup at page allocation and uncharge it in the
compound page destructor.
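The resulting charge life cycle, sketched (simplified from the diff
below):

	/* alloc_huge_page() */
	ret = hugetlb_cgroup_charge_page(idx, pages_per_huge_page(h), &h_cg);
	if (ret)
		return ERR_PTR(-ENOSPC);	/* over the cgroup limit */
	/* ... dequeue or allocate the page ... */
	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);

	/* free_huge_page(), the compound page destructor */
	hugetlb_cgroup_uncharge_page(hstate_index(h),
				     pages_per_huge_page(h), page);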
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb.c | 16 +++++++++++++++-
mm/hugetlb_cgroup.c | 7 +------
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf79131..4ca92a9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
BUG_ON(page_mapcount(page));
spin_lock(&hugetlb_lock);
+ hugetlb_cgroup_uncharge_page(hstate_index(h),
+ pages_per_huge_page(h), page);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
/* remove the page from active list */
list_del(&page->lru);
@@ -1116,7 +1118,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *page;
long chg;
+ int ret, idx;
+ struct hugetlb_cgroup *h_cg;
+ idx = hstate_index(h);
/*
* Processes that did not create the mapping will have no
* reserves and will not have accounted against subpool
@@ -1132,6 +1137,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (hugepage_subpool_get_pages(spool, chg))
return ERR_PTR(-ENOSPC);
+ ret = hugetlb_cgroup_charge_page(idx, pages_per_huge_page(h), &h_cg);
+ if (ret) {
+ hugepage_subpool_put_pages(spool, chg);
+ return ERR_PTR(-ENOSPC);
+ }
spin_lock(&hugetlb_lock);
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
spin_unlock(&hugetlb_lock);
@@ -1139,6 +1149,9 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (!page) {
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
if (!page) {
+ hugetlb_cgroup_uncharge_cgroup(idx,
+ pages_per_huge_page(h),
+ h_cg);
hugepage_subpool_put_pages(spool, chg);
return ERR_PTR(-ENOSPC);
}
@@ -1147,7 +1160,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long)spool);
vma_commit_reservation(h, vma, addr);
-
+ /* record the hugetlb cgroup on the page */
+ hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
return page;
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 2a4881d..c2b7b8e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -249,15 +249,10 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
if (hugetlb_cgroup_disabled())
return;
- spin_lock(&hugetlb_lock);
h_cg = hugetlb_cgroup_from_page(page);
- if (unlikely(!h_cg)) {
- spin_unlock(&hugetlb_lock);
+ if (unlikely(!h_cg))
return;
- }
set_hugetlb_cgroup(page, NULL);
- spin_unlock(&hugetlb_lock);
-
res_counter_uncharge(&h_cg->hugepage[idx], csize);
return;
}
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
Documentation/cgroups/hugetlb.txt | 45 +++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
create mode 100644 Documentation/cgroups/hugetlb.txt
diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
new file mode 100644
index 0000000..a9faaca
--- /dev/null
+++ b/Documentation/cgroups/hugetlb.txt
@@ -0,0 +1,45 @@
+HugeTLB Controller
+-------------------
+
+The HugeTLB controller allows limiting the HugeTLB usage per control group
+and enforces the limit during page fault. Since HugeTLB doesn't support
+page reclaim, enforcing the limit at page fault time implies that the
+application will receive a SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand
+how many HugeTLB pages it would require for its use.
+
+The HugeTLB controller can be enabled by first mounting the cgroup filesystem.
+
+# mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup.
+
+# cd /sys/fs/cgroup
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files
+
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to the HugeTLB limit
+
+For a system supporting two hugepage sizes (16MB and 16GB), the control
+files include:
+
+hugetlb.16GB.limit_in_bytes
+hugetlb.16GB.max_usage_in_bytes
+hugetlb.16GB.usage_in_bytes
+hugetlb.16GB.failcnt
+hugetlb.16MB.limit_in_bytes
+hugetlb.16MB.max_usage_in_bytes
+hugetlb.16MB.usage_in_bytes
+hugetlb.16MB.failcnt
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This patch moves the hugetlb cgroup charge along with the page during
migration. With HugeTLB pages, the hugetlb cgroup is uncharged in the
compound page destructor; since we hold a hugepage reference, we can be
sure that the old page won't get uncharged till the last put_page().
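The transfer itself, sketched from the diff below: under hugetlb_lock the
cgroup pointer simply moves from the old page to the new one, and rmdir
of the cgroup is excluded while its css reference is live:

	spin_lock(&hugetlb_lock);
	h_cg = hugetlb_cgroup_from_page(oldhpage);
	set_hugetlb_cgroup(oldhpage, NULL);
	set_hugetlb_cgroup(newhpage, h_cg);	/* charge follows the data */
	spin_unlock(&hugetlb_lock);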
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 8 ++++++++
mm/hugetlb_cgroup.c | 21 +++++++++++++++++++++
mm/migrate.c | 5 +++++
3 files changed, 34 insertions(+)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index ba4836f..b64d067 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -63,6 +63,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
extern int hugetlb_cgroup_file_init(int idx) __init;
+extern void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage);
#else
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
@@ -112,5 +114,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
{
return 0;
}
+
+static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
+ struct page *newhpage)
+{
+ return;
+}
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index c2b7b8e..2d384fe 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -394,6 +394,27 @@ int __init hugetlb_cgroup_file_init(int idx)
return 0;
}
+void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
+{
+ struct hugetlb_cgroup *h_cg;
+
+ VM_BUG_ON(!PageHuge(oldhpage));
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ spin_lock(&hugetlb_lock);
+ h_cg = hugetlb_cgroup_from_page(oldhpage);
+ set_hugetlb_cgroup(oldhpage, NULL);
+ cgroup_exclude_rmdir(&h_cg->css);
+
+ /* move the h_cg details to new cgroup */
+ set_hugetlb_cgroup(newhpage, h_cg);
+ spin_unlock(&hugetlb_lock);
+ cgroup_release_and_wakeup_rmdir(&h_cg->css);
+ return;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
diff --git a/mm/migrate.c b/mm/migrate.c
index fdce3a2..6c37c51 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -33,6 +33,7 @@
#include <linux/memcontrol.h>
#include <linux/syscalls.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/gfp.h>
#include <asm/tlbflush.h>
@@ -931,6 +932,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
if (anon_vma)
put_anon_vma(anon_vma);
+
+ if (!rc)
+ hugetlb_cgroup_migrate(hpage, new_hpage);
+
unlock_page(hpage);
out:
put_page(new_hpage);
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
The i_mmap_mutex lock was added to unmap_single_vma by commit 502717f4e
("hugetlb: fix linked list corruption in unmap_hugepage_range()"), but we
don't use page->lru in unmap_hugepage_range any more. Also, the lock is
taken higher up in the stack in some code paths, which would result in a
deadlock:
unmap_mapping_range (i_mmap_mutex)
-> unmap_mapping_range_tree
-> unmap_mapping_range_vma
-> zap_page_range_single
-> unmap_single_vma
-> unmap_hugepage_range (i_mmap_mutex)
Regarding shared pagetable support for huge pages: since pagetable pages
are refcounted, we don't need any lock during huge_pmd_unshare. We do take
i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in the
mapping (39dde65c9940c97f ("shared page table for hugetlb page")).
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/memory.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 545e18a..f6bc04f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,11 +1326,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* Since no pte has actually been setup, it is
* safe to do nothing in this case.
*/
- if (vma->vm_file) {
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ if (vma->vm_file)
__unmap_hugepage_range(tlb, vma, start, end, NULL);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
- }
} else
unmap_page_range(tlb, vma, start, end, details);
}
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Add the control files for the hugetlb controller.
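Each control file packs the hstate index and the resource type into
cftype->private; a sketch of the encoding used in the patch below:

	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);	/* (idx << 16) | attr */

	idx  = MEMFILE_IDX(cft->private);	/* upper 16 bits: hstate index */
	name = MEMFILE_ATTR(cft->private);	/* lower 16 bits: resource id  */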
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 ++
include/linux/hugetlb_cgroup.h | 6 ++
mm/hugetlb.c | 8 +++
mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 149 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4aca057..9650bb1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -4,6 +4,7 @@
#include <linux/mm_types.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
+#include <linux/cgroup.h>
struct ctl_table;
struct user_struct;
@@ -221,6 +222,10 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+ /* cgroup control files */
+ struct cftype cgroup_files[5];
+#endif
char name[HSTATE_NAME_LEN];
};
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index ceff1d5..ba4836f 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -62,6 +62,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page);
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
+extern int hugetlb_cgroup_file_init(int idx) __init;
#else
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
@@ -106,5 +107,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
{
return;
}
+
+static inline int __init hugetlb_cgroup_file_init(int idx)
+{
+ return 0;
+}
#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1ca2d8f..bf79131 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -30,6 +30,7 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
+#include <linux/hugetlb_cgroup.h>
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
@@ -1916,6 +1917,13 @@ void __init hugetlb_add_hstate(unsigned order)
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
+ /*
+ * Add cgroup control files only if the huge page consists
+ * of more than two normal pages. This is because we use
+ * page[2].lru.next for storing cgroup details.
+ */
+ if (order >= 2)
+ hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
parsed_hstate = h;
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 9458fe3..2a4881d 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -18,6 +18,11 @@
#include <linux/hugetlb.h>
#include <linux/hugetlb_cgroup.h>
+/* lifted from mem control */
+#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
+#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
+#define MEMFILE_ATTR(val) ((val) & 0xffff)
+
struct cgroup_subsys hugetlb_subsys __read_mostly;
struct hugetlb_cgroup *root_h_cgroup __read_mostly;
@@ -269,6 +274,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
}
+static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ u64 val;
+ char str[64];
+ int idx, name, len;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ val = res_counter_read_u64(&h_cg->hugepage[idx], name);
+ len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
+ return simple_read_from_buffer(buf, nbytes, ppos, str, len);
+}
+
+static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
+ const char *buffer)
+{
+ int idx, name, ret;
+ unsigned long long val;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(cft->private);
+ name = MEMFILE_ATTR(cft->private);
+
+ switch (name) {
+ case RES_LIMIT:
+ if (hugetlb_cgroup_is_root(h_cg)) {
+ /* Can't set limit on root */
+ ret = -EINVAL;
+ break;
+ }
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
+{
+ int idx, name, ret = 0;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
+
+ idx = MEMFILE_IDX(event);
+ name = MEMFILE_ATTR(event);
+
+ switch (name) {
+ case RES_MAX_USAGE:
+ res_counter_reset_max(&h_cg->hugepage[idx]);
+ break;
+ case RES_FAILCNT:
+ res_counter_reset_failcnt(&h_cg->hugepage[idx]);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+ return ret;
+}
+
+static char *mem_fmt(char *buf, int size, unsigned long hsize)
+{
+ if (hsize >= (1UL << 30))
+ snprintf(buf, size, "%luGB", hsize >> 30);
+ else if (hsize >= (1UL << 20))
+ snprintf(buf, size, "%luMB", hsize >> 20);
+ else
+ snprintf(buf, size, "%luKB", hsize >> 10);
+ return buf;
+}
+
+int __init hugetlb_cgroup_file_init(int idx)
+{
+ char buf[32];
+ struct cftype *cft;
+ struct hstate *h = &hstates[idx];
+
+ /* format the size */
+ mem_fmt(buf, 32, huge_page_size(h));
+
+ /* Add the limit file */
+ cft = &h->cgroup_files[0];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+ cft->read = hugetlb_cgroup_read;
+ cft->write_string = hugetlb_cgroup_write;
+
+ /* Add the usage file */
+ cft = &h->cgroup_files[1];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+ cft->read = hugetlb_cgroup_read;
+
+ /* Add the MAX usage file */
+ cft = &h->cgroup_files[2];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+ /* Add the failcnt file */
+ cft = &h->cgroup_files[3];
+ snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
+ cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+ cft->trigger = hugetlb_cgroup_reset;
+ cft->read = hugetlb_cgroup_read;
+
+ /* NULL terminate the last cft */
+ cft = &h->cgroup_files[4];
+ memset(cft, 0, sizeof(*cft));
+
+ WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
+
+ return 0;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This patch adds the charge and uncharge routines for the hugetlb cgroup.
They will be used in later patches when we allocate/free HugeTLB
pages.
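One subtlety worth calling out, sketched from the charge path below: the
task's cgroup can change (or start going away) between the RCU lookup and
taking a css reference, so the lookup is retried until css_tryget()
succeeds:

	again:
		rcu_read_lock();
		h_cg = hugetlb_cgroup_from_task(current);
		if (!h_cg)
			h_cg = root_h_cgroup;
		if (!css_tryget(&h_cg->css)) {	/* cgroup may be going away */
			rcu_read_unlock();
			goto again;		/* re-lookup and retry */
		}
		rcu_read_unlock();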
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 20a32c5..48efd5a 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
return -EBUSY;
}
+int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ int ret = 0;
+ struct res_counter *fail_res;
+ struct hugetlb_cgroup *h_cg = NULL;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ goto done;
+ /*
+ * We don't charge any cgroup if the compound page have less
+ * than 3 pages.
+ */
+ if (hstates[idx].order < 2)
+ goto done;
+again:
+ rcu_read_lock();
+ h_cg = hugetlb_cgroup_from_task(current);
+ if (!h_cg)
+ h_cg = root_h_cgroup;
+
+ if (!css_tryget(&h_cg->css)) {
+ rcu_read_unlock();
+ goto again;
+ }
+ rcu_read_unlock();
+
+ ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
+ css_put(&h_cg->css);
+done:
+ *ptr = h_cg;
+ return ret;
+}
+
+void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ if (hugetlb_cgroup_disabled() || !h_cg)
+ return;
+
+ spin_lock(&hugetlb_lock);
+ if (hugetlb_cgroup_from_page(page)) {
+ hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
+ goto done;
+ }
+ set_hugetlb_cgroup(page, h_cg);
+done:
+ spin_unlock(&hugetlb_lock);
+ return;
+}
+
+void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page)
+{
+ struct hugetlb_cgroup *h_cg;
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled())
+ return;
+
+ spin_lock(&hugetlb_lock);
+ h_cg = hugetlb_cgroup_from_page(page);
+ if (unlikely(!h_cg)) {
+ spin_unlock(&hugetlb_lock);
+ return;
+ }
+ set_hugetlb_cgroup(page, NULL);
+ spin_unlock(&hugetlb_lock);
+
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
+void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ unsigned long csize = nr_pages * PAGE_SIZE;
+
+ if (hugetlb_cgroup_disabled() || !h_cg)
+ return;
+
+ res_counter_uncharge(&h_cg->hugepage[idx], csize);
+ return;
+}
+
struct cgroup_subsys hugetlb_subsys = {
.name = "hugetlb",
.create = hugetlb_cgroup_create,
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Add an hstate_index() inline helper and use it in the code.
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 20 +++++++++++---------
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d5d6bbe..217f528 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -302,6 +302,11 @@ static inline unsigned hstate_index_to_shift(unsigned index)
return hstates[index].order + PAGE_SHIFT;
}
+static inline int hstate_index(struct hstate *h)
+{
+ return h - hstates;
+}
+
#else
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -320,6 +325,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
return 1;
}
#define hstate_index_to_shift(index) 0
+#define hstate_index(h) 0
#endif
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 34a7e23..b1e0ed1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1646,7 +1646,7 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct attribute_group *hstate_attr_group)
{
int retval;
- int hi = h - hstates;
+ int hi = hstate_index(h);
hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
if (!hstate_kobjs[hi])
@@ -1741,11 +1741,13 @@ void hugetlb_unregister_node(struct node *node)
if (!nhs->hugepages_kobj)
return; /* no hstate attributes */
- for_each_hstate(h)
- if (nhs->hstate_kobjs[h - hstates]) {
- kobject_put(nhs->hstate_kobjs[h - hstates]);
- nhs->hstate_kobjs[h - hstates] = NULL;
+ for_each_hstate(h) {
+ int idx = hstate_index(h);
+ if (nhs->hstate_kobjs[idx]) {
+ kobject_put(nhs->hstate_kobjs[idx]);
+ nhs->hstate_kobjs[idx] = NULL;
}
+ }
kobject_put(nhs->hugepages_kobj);
nhs->hugepages_kobj = NULL;
@@ -1848,7 +1850,7 @@ static void __exit hugetlb_exit(void)
hugetlb_unregister_all_nodes();
for_each_hstate(h) {
- kobject_put(hstate_kobjs[h - hstates]);
+ kobject_put(hstate_kobjs[hstate_index(h)]);
}
kobject_put(hugepages_kobj);
@@ -1869,7 +1871,7 @@ static int __init hugetlb_init(void)
if (!size_to_hstate(default_hstate_size))
hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
}
- default_hstate_idx = size_to_hstate(default_hstate_size) - hstates;
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
if (default_hstate_max_huge_pages)
default_hstate.max_huge_pages = default_hstate_max_huge_pages;
@@ -2687,7 +2689,7 @@ retry:
*/
if (unlikely(PageHWPoison(page))) {
ret = VM_FAULT_HWPOISON |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
goto backout_unlocked;
}
}
@@ -2760,7 +2762,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(h - hstates);
+ VM_FAULT_SET_HINDEX(hstate_index(h));
}
ptep = huge_pte_alloc(mm, address, huge_page_size(h));
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Store the hugetlb cgroup pointer in the lru.next field of the third page
of the compound hugepage. This limits hugetlb cgroup usage to hugepages
with 3 or more normal pages, which should be an acceptable limitation.
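The constraint, spelled out as a sketch of the accessor: page[2] only
exists when the compound page spans at least four base pages, i.e. when
compound_order(page) >= 2 (compound pages come in power-of-two sizes):

	if (!PageHuge(page))
		return NULL;
	if (compound_order(page) < 2)	/* 1 or 2 base pages: no page[2] */
		return NULL;
	return (struct hugetlb_cgroup *)page[2].lru.next;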
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
mm/hugetlb.c | 4 ++++
2 files changed, 35 insertions(+)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 5794be4..ceff1d5 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -26,6 +26,26 @@ struct hugetlb_cgroup {
};
#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ if (!PageHuge(page))
+ return NULL;
+ if (compound_order(page) < 2)
+ return NULL;
+ return (struct hugetlb_cgroup *)page[2].lru.next;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ if (!PageHuge(page))
+ return -1;
+ if (compound_order(page) < 2)
+ return -1;
+ page[2].lru.next = (void *)h_cg;
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
if (hugetlb_subsys.disabled)
@@ -43,6 +63,17 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
#else
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+static inline
+int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+{
+ return 0;
+}
+
static inline bool hugetlb_cgroup_disabled(void)
{
return true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e899a2d..1ca2d8f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,7 @@
#include <linux/io.h>
#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
#include "internal.h"
@@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1 << PG_writeback);
}
+ BUG_ON(hugetlb_cgroup_from_page(page));
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
arch_release_hugepage(page);
@@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
+ set_hugetlb_cgroup(page, NULL);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
@@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
+ set_hugetlb_cgroup(page, NULL);
/*
* We incremented the global counters already
*/
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
This patch implements a new controller that allows us to control HugeTLB
allocations. The extension allows limiting the HugeTLB usage per control
group and enforces the limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that
the application will receive a SIGBUS signal if it tries to access HugeTLB
pages beyond its limit. This requires the application to know beforehand
how many HugeTLB pages it would require for its use.
The charge/uncharge calls will be added to the HugeTLB code in a later
patch. Support for cgroup removal will be added in later patches.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/cgroup_subsys.h | 6 +++
include/linux/hugetlb_cgroup.h | 79 ++++++++++++++++++++++++++++
init/Kconfig | 16 ++++++
mm/Makefile | 1 +
mm/hugetlb_cgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 216 insertions(+)
create mode 100644 include/linux/hugetlb_cgroup.h
create mode 100644 mm/hugetlb_cgroup.c
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0bd390c..895923a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -72,3 +72,9 @@ SUBSYS(net_prio)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+SUBSYS(hugetlb)
+#endif
+
+/* */
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
new file mode 100644
index 0000000..5794be4
--- /dev/null
+++ b/include/linux/hugetlb_cgroup.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#ifndef _LINUX_HUGETLB_CGROUP_H
+#define _LINUX_HUGETLB_CGROUP_H
+
+#include <linux/res_counter.h>
+
+struct hugetlb_cgroup {
+ struct cgroup_subsys_state css;
+ /*
+ * the counter to account for hugepages from hugetlb.
+ */
+ struct res_counter hugepage[HUGE_MAX_HSTATE];
+};
+
+#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ if (hugetlb_subsys.disabled)
+ return true;
+ return false;
+}
+
+extern int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr);
+extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+ struct page *page);
+extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg);
+#else
+static inline bool hugetlb_cgroup_disabled(void)
+{
+ return true;
+}
+
+static inline int
+hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup **ptr)
+{
+ return 0;
+}
+
+static inline void
+hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
+{
+ return;
+}
+
+static inline void
+hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg)
+{
+ return;
+}
+#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index d07dcf9..b9a0d0a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -751,6 +751,22 @@ config CGROUP_MEM_RES_CTLR_KMEM
the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone.
+config CGROUP_HUGETLB_RES_CTLR
+ bool "HugeTLB Resource Controller for Control Groups"
+ depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
+ select PAGE_CGROUP
+ default n
+ help
+ Provides a simple cgroup Resource Controller for HugeTLB pages.
+ When you enable this, you can put a per cgroup limit on HugeTLB usage.
+ The limit is enforced during page fault. Since HugeTLB doesn't
+ support page reclaim, enforcing the limit at page fault time implies
+ that the application will get a SIGBUS signal if it tries to access
+ HugeTLB pages beyond its limit. This requires the application to know
+ beforehand how many HugeTLB pages it would require for its use. The
+ control group is tracked in the third page lru pointer, so we cannot
+ use the controller with huge pages of fewer than 3 normal pages.
+
config CGROUP_PERF
bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
depends on PERF_EVENTS && CGROUPS
diff --git a/mm/Makefile b/mm/Makefile
index a156285..a8dd8d5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
new file mode 100644
index 0000000..20a32c5
--- /dev/null
+++ b/mm/hugetlb_cgroup.c
@@ -0,0 +1,114 @@
+/*
+ *
+ * Copyright IBM Corporation, 2012
+ * Author Aneesh Kumar K.V <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
+
+struct cgroup_subsys hugetlb_subsys __read_mostly;
+struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
+{
+ if (s)
+ return container_of(s, struct hugetlb_cgroup, css);
+ return NULL;
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
+{
+ return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
+ hugetlb_subsys_id));
+}
+
+static inline
+struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
+{
+ return hugetlb_cgroup_from_css(task_subsys_state(task,
+ hugetlb_subsys_id));
+}
+
+static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
+{
+ return (h_cg == root_h_cgroup);
+}
+
+static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
+{
+ if (!cg->parent)
+ return NULL;
+ return hugetlb_cgroup_from_cgroup(cg->parent);
+}
+
+static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
+{
+ int idx;
+ struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
+
+ for (idx = 0; idx < hugetlb_max_hstate; idx++) {
+ if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
+ return 1;
+ }
+ return 0;
+}
+
+static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
+{
+ int idx;
+ struct cgroup *parent_cgroup;
+ struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
+
+ h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+ if (!h_cgroup)
+ return ERR_PTR(-ENOMEM);
+
+ parent_cgroup = cgroup->parent;
+ if (parent_cgroup) {
+ parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx],
+ &parent_h_cgroup->hugepage[idx]);
+ } else {
+ root_h_cgroup = h_cgroup;
+ for (idx = 0; idx < HUGE_MAX_HSTATE; idx++)
+ res_counter_init(&h_cgroup->hugepage[idx], NULL);
+ }
+ return &h_cgroup->css;
+}
+
+static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
+{
+ struct hugetlb_cgroup *h_cgroup;
+
+ h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
+ kfree(h_cgroup);
+}
+
+static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
+{
+ /* We will add the cgroup removal support in later patches */
+ return -EBUSY;
+}
+
+struct cgroup_subsys hugetlb_subsys = {
+ .name = "hugetlb",
+ .create = hugetlb_cgroup_create,
+ .pre_destroy = hugetlb_cgroup_pre_destroy,
+ .destroy = hugetlb_cgroup_destroy,
+ .subsys_id = hugetlb_subsys_id,
+};
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Make hugetlb_lock, hugetlb_max_hstate and the for_each_hstate() macro
available outside mm/hugetlb.c; we will use them later in hugetlb_cgroup.c.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 7 ++-----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ed550d8..4aca057 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -21,6 +21,11 @@ struct hugepage_subpool {
long max_hpages, used_hpages;
};
+extern spinlock_t hugetlb_lock;
+extern int hugetlb_max_hstate;
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
+
struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
void hugepage_put_subpool(struct hugepage_subpool *spool);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b5b6e15..e899a2d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,7 +35,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_max_hstate;
+int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
@@ -46,13 +46,10 @@ static struct hstate * __initdata parsed_hstate;
static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
-#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
-
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
-static DEFINE_SPINLOCK(hugetlb_lock);
+DEFINE_SPINLOCK(hugetlb_lock);
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
hugepage_activelist will be used to track HugeTLB pages that are
currently in use. We need to find the in-use HugeTLB pages to support
HugeTLB cgroup removal: on cgroup removal we update each such page's
HugeTLB cgroup to point to the parent cgroup.
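The bookkeeping is a single list_move() in each direction, sketched from
the diff below (a hugepage is always on exactly one of the two lists):

	/* dequeue for use: free list -> active list */
	list_move(&page->lru, &h->hugepage_activelist);

	/* back to the pool: active list -> free list */
	list_move(&page->lru, &h->hugepage_freelists[nid]);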
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 12 +++++++-----
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0f23c18..ed550d8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -211,6 +211,7 @@ struct hstate {
unsigned long resv_huge_pages;
unsigned long surplus_huge_pages;
unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_activelist;
struct list_head hugepage_freelists[MAX_NUMNODES];
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e54b695..b5b6e15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -510,7 +510,7 @@ void copy_huge_page(struct page *dst, struct page *src)
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &h->hugepage_freelists[nid]);
+ list_move(&page->lru, &h->hugepage_freelists[nid]);
h->free_huge_pages++;
h->free_huge_pages_node[nid]++;
}
@@ -522,7 +522,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
if (list_empty(&h->hugepage_freelists[nid]))
return NULL;
page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
- list_del(&page->lru);
+ list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
h->free_huge_pages--;
h->free_huge_pages_node[nid]--;
@@ -626,10 +626,11 @@ static void free_huge_page(struct page *page)
page->mapping = NULL;
BUG_ON(page_count(page));
BUG_ON(page_mapcount(page));
- INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+ /* remove the page from active list */
+ list_del(&page->lru);
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -642,6 +643,7 @@ static void free_huge_page(struct page *page)
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
+ INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
h->nr_huge_pages++;
@@ -890,6 +892,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
spin_lock(&hugetlb_lock);
if (page) {
+ INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, free_huge_page);
/*
@@ -994,7 +997,6 @@ retry:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
if ((--needed) < 0)
break;
- list_del(&page->lru);
/*
* This page is now managed by the hugetlb allocator and has
* no users -- drop the buddy allocator's reference.
@@ -1009,7 +1011,6 @@ free:
/* Free unnecessary surplus pages to the buddy allocator */
if (!list_empty(&surplus_list)) {
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
- list_del(&page->lru);
put_page(page);
}
}
@@ -1909,6 +1910,7 @@ void __init hugetlb_add_hstate(unsigned order)
h->free_huge_pages = 0;
for (i = 0; i < MAX_NUMNODES; ++i)
INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_activelist);
h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]);
h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
--
1.7.10
From: "Aneesh Kumar K.V" <[email protected]>
Rename max_hstate to hugetlb_max_hstate. We will be using it from other
subsystems, like the hugetlb controller, in later patches.
Acked-by: David Rientjes <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
mm/hugetlb.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e198831..c868309 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,7 +34,7 @@ const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int max_hstate;
+static int hugetlb_max_hstate;
unsigned int default_hstate_idx;
struct hstate hstates[HUGE_MAX_HSTATE];
@@ -46,7 +46,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
static unsigned long __initdata default_hstate_size;
#define for_each_hstate(h) \
- for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
+ for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -1897,9 +1897,9 @@ void __init hugetlb_add_hstate(unsigned order)
printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n");
return;
}
- BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order == 0);
- h = &hstates[max_hstate++];
+ h = &hstates[hugetlb_max_hstate++];
h->order = order;
h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
h->nr_huge_pages = 0;
@@ -1920,10 +1920,10 @@ static int __init hugetlb_nrpages_setup(char *s)
static unsigned long *last_mhp;
/*
- * !max_hstate means we haven't parsed a hugepagesz= parameter yet,
+ * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet,
* so this hugepages= parameter goes to the "default hstate".
*/
- if (!max_hstate)
+ if (!hugetlb_max_hstate)
mhp = &default_hstate_max_huge_pages;
else
mhp = &parsed_hstate->max_huge_pages;
@@ -1942,7 +1942,7 @@ static int __init hugetlb_nrpages_setup(char *s)
* But we need to allocate >= MAX_ORDER hstates here early to still
* use the bootmem allocator.
*/
- if (max_hstate && parsed_hstate->order >= MAX_ORDER)
+ if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER)
hugetlb_hstate_alloc_pages(parsed_hstate);
last_mhp = mhp;
--
1.7.10
On Sat, Jun 09, 2012 at 02:29:59PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This adds necessary charge/uncharge calls in the HugeTLB code. We do
> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/hugetlb.c | 16 +++++++++++++++-
> mm/hugetlb_cgroup.c | 7 +------
> 2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf79131..4ca92a9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
> BUG_ON(page_mapcount(page));
>
> spin_lock(&hugetlb_lock);
> + hugetlb_cgroup_uncharge_page(hstate_index(h),
> + pages_per_huge_page(h), page);
hugetlb_cgroup_uncharge_page() takes the hugetlb_lock, no?
It's quite hard to review code that is split up like this. Please
always keep the introduction of new functions in the same patch that
adds the callsite(s).
On Sat, Jun 09, 2012 at 02:29:50PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> fix linked list corruption in unmap_hugepage_range()") but we don't use
> page->lru in unmap_hugepage_range any more. Also the lock was taken
> higher up in the stack in some code path. That would result in deadlock.
>
> unmap_mapping_range (i_mmap_mutex)
> -> unmap_mapping_range_tree
> -> unmap_mapping_range_vma
> -> zap_page_range_single
> -> unmap_single_vma
> -> unmap_hugepage_range (i_mmap_mutex)
>
> For shared pagetable support for huge pages, since pagetable pages are ref
> counted we don't need any lock during huge_pmd_unshare. We do take
> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> (39dde65c9940c97f ("shared page table for hugetlb page")).
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
This patch (together with the previous one) seems like a bugfix that's
not really related to the hugetlb controller, unless I miss something.
Could you please submit the fix separately?
Maybe also fold the two patches into one and make it a single bugfix
change that gets rid of the lock by switching away from page->lru.
Thanks
On Sat, Jun 09, 2012 at 02:29:47PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> VM_FAULT_* values from MAX_ERRNO.
I see you using -ENOMEM/-ENOSPC, but I don't see any reference in the
code to MAX_ERRNO. Can you provide a comment explaining a little bit
about the interaction of MAX_ERRNO and VM_FAULT?
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/hugetlb.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c868309..34a7e23 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1123,10 +1123,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> */
> chg = vma_needs_reservation(h, vma, addr);
> if (chg < 0)
> - return ERR_PTR(-VM_FAULT_OOM);
> + return ERR_PTR(-ENOMEM);
> if (chg)
> if (hugepage_subpool_get_pages(spool, chg))
> - return ERR_PTR(-VM_FAULT_SIGBUS);
> + return ERR_PTR(-ENOSPC);
>
> spin_lock(&hugetlb_lock);
> page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> @@ -1136,7 +1136,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> if (!page) {
> hugepage_subpool_put_pages(spool, chg);
> - return ERR_PTR(-VM_FAULT_SIGBUS);
> + return ERR_PTR(-ENOSPC);
> }
> }
>
> @@ -2496,6 +2496,7 @@ retry_avoidcopy:
> new_page = alloc_huge_page(vma, address, outside_reserve);
>
> if (IS_ERR(new_page)) {
> + long err = PTR_ERR(new_page);
> page_cache_release(old_page);
>
> /*
> @@ -2524,7 +2525,10 @@ retry_avoidcopy:
>
> /* Caller expects lock to be held */
> spin_lock(&mm->page_table_lock);
> - return -PTR_ERR(new_page);
> + if (err == -ENOMEM)
> + return VM_FAULT_OOM;
> + else
> + return VM_FAULT_SIGBUS;
> }
>
> /*
> @@ -2642,7 +2646,11 @@ retry:
> goto out;
> page = alloc_huge_page(vma, address, 0);
> if (IS_ERR(page)) {
> - ret = -PTR_ERR(page);
> + ret = PTR_ERR(page);
> + if (ret == -ENOMEM)
> + ret = VM_FAULT_OOM;
> + else
> + ret = VM_FAULT_SIGBUS;
> goto out;
> }
> clear_huge_page(page, address, pages_per_huge_page(h));
> --
> 1.7.10
>
On Sat, Jun 09, 2012 at 02:29:55PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the hugetlb cgroup pointer to 3rd page lru.next. This limit
> the usage to hugetlb cgroup to only hugepages with 3 or more
> normal pages. I guess that is an acceptable limitation.
Can you explain the reasoning behind this in a bit more detail,
please? Either in the code in hugetlb_cgroup_from_page or in the git
commit (or better - in both).
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..ceff1d5 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> };
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + if (!PageHuge(page))
> + return NULL;
> + if (compound_order(page) < 3)
> + return NULL;
> + return (struct hugetlb_cgroup *)page[2].lru.next;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + if (!PageHuge(page))
> + return -1;
> + if (compound_order(page) < 3)
> + return -1;
> + page[2].lru.next = (void *)h_cg;
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> if (hugetlb_subsys.disabled)
> @@ -43,6 +63,17 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> #else
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> return true;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e899a2d..1ca2d8f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -28,6 +28,7 @@
>
> #include <linux/io.h>
> #include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> #include "internal.h"
>
> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1 << PG_writeback);
> }
> + BUG_ON(hugetlb_cgroup_from_page(page));
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> arch_release_hugepage(page);
> @@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> + set_hugetlb_cgroup(page, NULL);
> h->nr_huge_pages++;
> h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> @@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> INIT_LIST_HEAD(&page->lru);
> r_nid = page_to_nid(page);
> set_compound_page_dtor(page, free_huge_page);
> + set_hugetlb_cgroup(page, NULL);
> /*
> * We incremented the global counters already
> */
> --
> 1.7.10
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1916,6 +1917,13 @@ void __init hugetlb_add_hstate(unsigned order)
> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
> + /*
> + * Add cgroup control files only if the huge page consists
> + * of more than two normal pages. This is because we use
Not three? I thought the earlier patches said three?
> + * page[2].lru.next for storing cgoup details.
cgoup?
> + */
> + if (order >= 2)
> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>
> parsed_hstate = h;
> }
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 9458fe3..2a4881d 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -18,6 +18,11 @@
> #include <linux/hugetlb.h>
> #include <linux/hugetlb_cgroup.h>
>
> +/* lifted from mem control */
And you can also lift the comment from mem control explaining
what these defines are good for.
> +#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
> +#define MEMFILE_IDX(val) (((val) >> 16) & 0xffff)
> +#define MEMFILE_ATTR(val) ((val) & 0xffff)
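Something along these lines would do (sketch; the wording is mine, not
lifted verbatim from memcontrol.c):

/*
 * cftype->private carries two values packed together: the hstate index
 * in the upper 16 bits and the RES_* attribute (limit, usage, max
 * usage, failcnt) in the lower 16 bits.
 */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)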
> +
> struct cgroup_subsys hugetlb_subsys __read_mostly;
> struct hugetlb_cgroup *root_h_cgroup __read_mostly;
>
> @@ -269,6 +274,131 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> return;
> }
>
> +static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft,
> + struct file *file, char __user *buf,
> + size_t nbytes, loff_t *ppos)
> +{
> + u64 val;
> + char str[64];
Why no #define? Wait a minute - didn't I provide a similar comment last
time? So why stick without the #defines?
> + int idx, name, len;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + val = res_counter_read_u64(&h_cg->hugepage[idx], name);
> + len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
> + return simple_read_from_buffer(buf, nbytes, ppos, str, len);
> +}
> +
> +static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft,
> + const char *buffer)
> +{
> + int idx, name, ret;
> + unsigned long long val;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(cft->private);
> + name = MEMFILE_ATTR(cft->private);
> +
> + switch (name) {
> + case RES_LIMIT:
> + if (hugetlb_cgroup_is_root(h_cg)) {
> + /* Can't set limit on root */
> + ret = -EINVAL;
> + break;
> + }
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + break;
> + ret = res_counter_set_limit(&h_cg->hugepage[idx], val);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event)
> +{
> + int idx, name, ret = 0;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> +
> + idx = MEMFILE_IDX(event);
> + name = MEMFILE_ATTR(event);
> +
> + switch (name) {
> + case RES_MAX_USAGE:
> + res_counter_reset_max(&h_cg->hugepage[idx]);
> + break;
> + case RES_FAILCNT:
> + res_counter_reset_failcnt(&h_cg->hugepage[idx]);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + }
> + return ret;
> +}
> +
> +static char *mem_fmt(char *buf, int size, unsigned long hsize)
> +{
> + if (hsize >= (1UL << 30))
> + snprintf(buf, size, "%luGB", hsize >> 30);
> + else if (hsize >= (1UL << 20))
> + snprintf(buf, size, "%luMB", hsize >> 20);
> + else
> + snprintf(buf, size, "%luKB", hsize >> 10);
> + return buf;
> +}
> +
> +int __init hugetlb_cgroup_file_init(int idx)
> +{
> + char buf[32];
Ditto.
> + struct cftype *cft;
> + struct hstate *h = &hstates[idx];
> +
> + /* format the size */
> + mem_fmt(buf, 32, huge_page_size(h));
Ditto.
> +
> + /* Add the limit file */
> + cft = &h->cgroup_files[0];
Can't this be just:
cft = h->cgroup_files;
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> + cft->read = hugetlb_cgroup_read;
> + cft->write_string = hugetlb_cgroup_write;
> +
> + /* Add the usage file */
> + cft = &h->cgroup_files[1];
and this be:
cft++;
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the MAX usage file */
> + cft = &h->cgroup_files[2];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the failcntfile */
> + cft = &h->cgroup_files[3];
and the same for this one.
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* NULL terminate the last cft */
> + cft = &h->cgroup_files[4];
and for that one?
> + memset(cft, 0, sizeof(*cft));
> +
> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> +
> + return 0;
> +}
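In other words, something like this (untested, just to illustrate the
pointer-increment style; the body is otherwise your code):

int __init hugetlb_cgroup_file_init(int idx)
{
	char buf[32];
	struct cftype *cft;
	struct hstate *h = &hstates[idx];

	/* format the size */
	mem_fmt(buf, 32, huge_page_size(h));

	/* Add the limit file */
	cft = h->cgroup_files;
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
	cft->read = hugetlb_cgroup_read;
	cft->write_string = hugetlb_cgroup_write;

	/* Add the usage file */
	cft++;
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
	cft->read = hugetlb_cgroup_read;

	/* Add the max usage file */
	cft++;
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
	cft->trigger = hugetlb_cgroup_reset;
	cft->read = hugetlb_cgroup_read;

	/* Add the failcnt file */
	cft++;
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
	cft->trigger = hugetlb_cgroup_reset;
	cft->read = hugetlb_cgroup_read;

	/* NULL terminate the last cft */
	cft++;
	memset(cft, 0, sizeof(*cft));

	WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));

	return 0;
}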
> +
> struct cgroup_subsys hugetlb_subsys = {
> .name = "hugetlb",
> .create = hugetlb_cgroup_create,
> --
> 1.7.10
>
Johannes Weiner <[email protected]> writes:
> On Sat, Jun 09, 2012 at 02:29:50PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
>> fix linked list corruption in unmap_hugepage_range()") but we don't use
>> page->lru in unmap_hugepage_range any more. Also the lock was taken
>> higher up in the stack in some code path. That would result in deadlock.
>>
>> unmap_mapping_range (i_mmap_mutex)
>> -> unmap_mapping_range_tree
>> -> unmap_mapping_range_vma
>> -> zap_page_range_single
>> -> unmap_single_vma
>> -> unmap_hugepage_range (i_mmap_mutex)
>>
>> For shared pagetable support for huge pages, since pagetable pages are ref
>> counted we don't need any lock during huge_pmd_unshare. We do take
>> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
>> (39dde65c9940c97f ("shared page table for hugetlb page")).
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> This patch (together with the previous one) seems like a bugfix that's
> not really related to the hugetlb controller, unless I miss something.
>
> Could you please submit the fix separately?
Patches up to 6 can really go in a separate series. I was not sure
whether I should split them. I will post them as a separate series now.
>
> Maybe also fold the two patches into one and make it a single bugfix
> change that gets rid of the lock by switching away from page->lru.
I wanted to make sure the patch that drops i_mmap_mutex is a separate one,
so that we understand and document the locking details separately.
-aneesh
Johannes Weiner <[email protected]> writes:
> On Sat, Jun 09, 2012 at 02:29:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> This adds necessary charge/uncharge calls in the HugeTLB code. We do
>> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>> mm/hugetlb.c | 16 +++++++++++++++-
>> mm/hugetlb_cgroup.c | 7 +------
>> 2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index bf79131..4ca92a9 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
>> BUG_ON(page_mapcount(page));
>>
>> spin_lock(&hugetlb_lock);
>> + hugetlb_cgroup_uncharge_page(hstate_index(h),
>> + pages_per_huge_page(h), page);
>
> hugetlb_cgroup_uncharge_page() takes the hugetlb_lock, no?
Yes, but this patch also modifies it to not take the lock, because we
already hold the spinlock at the call site. I didn't want to drop the
lock and take it again.
>
> It's quite hard to review code that is split up like this. Please
> always keep the introduction of new functions in the same patch that
> adds the callsite(s).
One of the reasons I split the charge/uncharge routines and the callers
into separate patches is to make review easier. Irrespective of the call
site, the charge/uncharge routines should be correct with respect to
locking and other details. What I did in this patch is a small
optimization to avoid dropping and retaking the lock. Maybe the right
approach would have been to name it __hugetlb_cgroup_uncharge_page and
make sure hugetlb_cgroup_uncharge_page still takes the spinlock. But
then we wouldn't have any callers for that.
-aneesh
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Sat, Jun 09, 2012 at 02:29:47PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
>> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
>> VM_FAULT_* values from MAX_ERRNO.
>
> I see you using -ENOMEM/-ENOSPC, but I don't see any reference in the
> code to MAX_ERRNO. Can you provide a comment explaining a little bit
> about the interaction of MAX_ERRNO and VM_FAULT?
That comes from this:
#define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)
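To make that concrete, here is a minimal sketch of the relevant err.h
machinery (paraphrased; see include/linux/err.h for the real thing):

#define MAX_ERRNO	4095

#define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)
{
	/* any value in [-MAX_ERRNO, -1] round-trips through the pointer */
	return (void *) error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long) ptr;
}

static inline long IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

IS_ERR() only treats the top MAX_ERRNO values of the address space as
encoded errors, so stuffing a VM_FAULT_* code into ERR_PTR() is only
safe as long as that code never exceeds MAX_ERRNO. Returning plain
-ENOMEM/-ENOSPC removes that hidden dependency.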
-aneesh
On Sat, Jun 09, 2012 at 06:39:06PM +0530, Aneesh Kumar K.V wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Sat, Jun 09, 2012 at 02:29:59PM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> This adds necessary charge/uncharge calls in the HugeTLB code. We do
> >> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
> >>
> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >> ---
> >> mm/hugetlb.c | 16 +++++++++++++++-
> >> mm/hugetlb_cgroup.c | 7 +------
> >> 2 files changed, 16 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index bf79131..4ca92a9 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
> >> BUG_ON(page_mapcount(page));
> >>
> >> spin_lock(&hugetlb_lock);
> >> + hugetlb_cgroup_uncharge_page(hstate_index(h),
> >> + pages_per_huge_page(h), page);
> >
> > hugetlb_cgroup_uncharge_page() takes the hugetlb_lock, no?
>
> Yes, but this patch also modifies it to not take the lock, because we
> already hold the spinlock at the call site. I didn't want to drop the
> lock and take it again.
Sorry, I missed that.
> > It's quite hard to review code that is split up like this. Please
> > always keep the introduction of new functions in the same patch that
> > adds the callsite(s).
>
> One of the reasons I split the charge/uncharge routines and the callers
> into separate patches is to make review easier. Irrespective of the call
> site, the charge/uncharge routines should be correct with respect to
> locking and other details. What I did in this patch is a small
> optimization to avoid dropping and retaking the lock. Maybe the right
> approach would have been to name it __hugetlb_cgroup_uncharge_page and
> make sure hugetlb_cgroup_uncharge_page still takes the spinlock. But
> then we wouldn't have any callers for that.
I think this makes it needlessly complicated and there is no correct
or incorrect locking in (initially) dead code :-)
The callsites are just a few lines. It's harder to review if you
introduce an API and then change it again mid-patchset.
If there are no callers for a function that grabs the lock itself,
don't add it. Just add a note to the kerneldoc that explains the
requirement or put VM_BUG_ON(!spin_is_locked(&hugetlb_lock)); in
there or so.
On Sat, Jun 09, 2012 at 06:33:05PM +0530, Aneesh Kumar K.V wrote:
> Johannes Weiner <[email protected]> writes:
>
> > On Sat, Jun 09, 2012 at 02:29:50PM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> i_mmap_mutex lock was added in unmap_single_vma by 502717f4e ("hugetlb:
> >> fix linked list corruption in unmap_hugepage_range()") but we don't use
> >> page->lru in unmap_hugepage_range any more. Also the lock was taken
> >> higher up in the stack in some code path. That would result in deadlock.
> >>
> >> unmap_mapping_range (i_mmap_mutex)
> >> -> unmap_mapping_range_tree
> >> -> unmap_mapping_range_vma
> >> -> zap_page_range_single
> >> -> unmap_single_vma
> >> -> unmap_hugepage_range (i_mmap_mutex)
> >>
> >> For shared pagetable support for huge pages, since pagetable pages are ref
> >> counted we don't need any lock during huge_pmd_unshare. We do take
> >> i_mmap_mutex in huge_pmd_share while walking the vma_prio_tree in mapping.
> >> (39dde65c9940c97f ("shared page table for hugetlb page")).
> >>
> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >
> > This patch (together with the previous one) seems like a bugfix that's
> > not really related to the hugetlb controller, unless I miss something.
> >
> > Could you please submit the fix separately?
>
> Patches up to 6 can really go in a separate series. I was not sure
> whether I should split them. I will post them as a separate series now.
Ok, thanks, that will make it easier to upstream the controller.
> > Maybe also fold the two patches into one and make it a single bugfix
> > change that gets rid of the lock by switching away from page->lru.
>
> I wanted to make sure the patch that drops i_mmap_mutex is a separate one,
> so that we understand and document the locking details separately.
Nothing prevents you from writing a proper changelog :-) But changing
from page->lru to an on-stack array does not have any merit by itself,
so it just seems like a needless dependency between two patches that
fix one problem (pita for backports into stable/distro kernels).
Johannes Weiner <[email protected]> writes:
> On Sat, Jun 09, 2012 at 06:39:06PM +0530, Aneesh Kumar K.V wrote:
>> Johannes Weiner <[email protected]> writes:
>>
>> > On Sat, Jun 09, 2012 at 02:29:59PM +0530, Aneesh Kumar K.V wrote:
>> >> From: "Aneesh Kumar K.V" <[email protected]>
>> >>
>> >> This adds necessary charge/uncharge calls in the HugeTLB code. We do
>> >> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
>> >>
>> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> >> ---
>> >> mm/hugetlb.c | 16 +++++++++++++++-
>> >> mm/hugetlb_cgroup.c | 7 +------
>> >> 2 files changed, 16 insertions(+), 7 deletions(-)
>> >>
>> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> >> index bf79131..4ca92a9 100644
>> >> --- a/mm/hugetlb.c
>> >> +++ b/mm/hugetlb.c
>> >> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
>> >> BUG_ON(page_mapcount(page));
>> >>
>> >> spin_lock(&hugetlb_lock);
>> >> + hugetlb_cgroup_uncharge_page(hstate_index(h),
>> >> + pages_per_huge_page(h), page);
>> >
>> > hugetlb_cgroup_uncharge_page() takes the hugetlb_lock, no?
>>
>> Yes, but this patch also modifies it to not take the lock, because we
>> already hold the spinlock at the call site. I didn't want to drop the
>> lock and take it again.
>
> Sorry, I missed that.
>
>> > It's quite hard to review code that is split up like this. Please
>> > always keep the introduction of new functions in the same patch that
>> > adds the callsite(s).
>>
>> One of the reasons I split the charge/uncharge routines and the callers
>> into separate patches is to make review easier. Irrespective of the call
>> site, the charge/uncharge routines should be correct with respect to
>> locking and other details. What I did in this patch is a small
>> optimization to avoid dropping and retaking the lock. Maybe the right
>> approach would have been to name it __hugetlb_cgroup_uncharge_page and
>> make sure hugetlb_cgroup_uncharge_page still takes the spinlock. But
>> then we wouldn't have any callers for that.
>
> I think this makes it needlessly complicated and there is no correct
> or incorrect locking in (initially) dead code :-)
>
> The callsites are just a few lines. It's harder to review if you
> introduce an API and then change it again mid-patchset.
>
I will fold the patches.
> If there are no callers for a function that grabs the lock itself,
> don't add it. Just add a note to the kerneldoc that explains the
> requirement or put VM_BUG_ON(!spin_is_locked(&hugetlb_lock)); in
> there or so.
That is excellent. I will add kerneldoc and VM_BUG_ON.
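Something like this, for example (sketch of what I intend; the body is
from the current series, with your suggested lock assertion added):

/*
 * hugetlb_cgroup_uncharge_page() - uncharge nr_pages from the cgroup
 * that @page is accounted to.
 *
 * Must be called with hugetlb_lock held; the page's cgroup pointer in
 * page[2].lru.next is only stable under that lock.
 */
void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
				  struct page *page)
{
	struct hugetlb_cgroup *h_cg;

	if (hugetlb_cgroup_disabled())
		return;
	VM_BUG_ON(!spin_is_locked(&hugetlb_lock));
	h_cg = hugetlb_cgroup_from_page(page);
	if (unlikely(!h_cg))
		return;
	set_hugetlb_cgroup(page, NULL);
	res_counter_uncharge(&h_cg->hugepage[idx], nr_pages * PAGE_SIZE);
}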
-aneesh
"Aneesh Kumar K.V" <[email protected]> writes:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the hugetlb cgroup pointer to 3rd page lru.next. This limit
> the usage to hugetlb cgroup to only hugepages with 3 or more
> normal pages. I guess that is an acceptable limitation.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..ceff1d5 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> };
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + if (!PageHuge(page))
> + return NULL;
> + if (compound_order(page) < 3)
That should be if (compound_order(page) < 2), right? I will send an updated
patchset with this fix and other review changes.
-aneesh
On Sat, Jun 9, 2012 at 4:59 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> VM_FAULT_* values from MAX_ERRNO.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
I like this much.
Acked-by: KOSAKI Motohiro <[email protected]>
On Sat, Jun 9, 2012 at 4:59 PM, Aneesh Kumar K.V
<[email protected]> wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure
> VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the
> VM_FAULT_* values from MAX_ERRNO.
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
Thank you, AKKV.
Acked-by: Hillf Danton <[email protected]>
On Sat 09-06-12 14:29:55, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the hugetlb cgroup pointer to 3rd page lru.next.
Interesting and I really like the idea much more than tracking by
page_cgroup.
> This limit the usage to hugetlb cgroup to only hugepages with 3 or
> more normal pages. I guess that is an acceptable limitation.
Agreed.
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
Other than some nits I like this.
Thanks!
> ---
> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..ceff1d5 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> };
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + if (!PageHuge(page))
> + return NULL;
> + if (compound_order(page) < 3)
Why 3? I think you wanted 2 here, right?
> + return NULL;
> + return (struct hugetlb_cgroup *)page[2].lru.next;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + if (!PageHuge(page))
> + return -1;
> + if (compound_order(page) < 3)
Here as well.
> + return -1;
> + page[2].lru.next = (void *)h_cg;
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> if (hugetlb_subsys.disabled)
> @@ -43,6 +63,17 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> #else
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> return true;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e899a2d..1ca2d8f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -28,6 +28,7 @@
>
> #include <linux/io.h>
> #include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> #include "internal.h"
>
> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1 << PG_writeback);
> }
> + BUG_ON(hugetlb_cgroup_from_page(page));
What about VM_BUG_ON?
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> arch_release_hugepage(page);
> @@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> + set_hugetlb_cgroup(page, NULL);
Why inside the spin lock?
> h->nr_huge_pages++;
> h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> @@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> INIT_LIST_HEAD(&page->lru);
> r_nid = page_to_nid(page);
> set_compound_page_dtor(page, free_huge_page);
> + set_hugetlb_cgroup(page, NULL);
> /*
> * We incremented the global counters already
> */
> --
> 1.7.10
>
--
Michal Hocko
SUSE Labs
On Sat 09-06-12 14:29:56, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patchset add the charge and uncharge routines for hugetlb cgroup.
> This will be used in later patches when we allocate/free HugeTLB
> pages.
Please describe the locking rules.
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 87 insertions(+)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 20a32c5..48efd5a 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> return -EBUSY;
> }
>
> +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup **ptr)
Missing doc.
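Something like this would do (sketch; adjust the wording as needed):

/**
 * hugetlb_cgroup_charge_page - charge nr_pages to the current task's
 * hugetlb cgroup.
 * @idx:      hstate index of the huge page being allocated
 * @nr_pages: number of base pages backing the huge page
 * @ptr:      out parameter for the cgroup the charge was made against
 *
 * Returns 0 on success.  Does not take hugetlb_lock; the res_counter
 * does its own locking, and the charge is only committed to a page
 * later, under hugetlb_lock, by hugetlb_cgroup_commit_charge().
 */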
> +{
> + int ret = 0;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *h_cg = NULL;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled())
> + goto done;
> + /*
> + * We don't charge any cgroup if the compound page have less
> + * than 3 pages.
> + */
> + if (hstates[idx].order < 2)
> + goto done;
huge_page_order here? Not that important, because we use order directly
in many places in the code, but it is easier to grep for and maybe worth
a separate cleanup patch.
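i.e. (sketch):

	if (huge_page_order(&hstates[idx]) < 2)
		goto done;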
> +again:
> + rcu_read_lock();
> + h_cg = hugetlb_cgroup_from_task(current);
> + if (!h_cg)
> + h_cg = root_h_cgroup;
> +
> + if (!css_tryget(&h_cg->css)) {
> + rcu_read_unlock();
> + goto again;
> + }
> + rcu_read_unlock();
> +
> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> + css_put(&h_cg->css);
> +done:
> + *ptr = h_cg;
> + return ret;
> +}
> +
> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg,
> + struct page *page)
> +{
> + if (hugetlb_cgroup_disabled() || !h_cg)
> + return;
> +
> + spin_lock(&hugetlb_lock);
> + if (hugetlb_cgroup_from_page(page)) {
How can this happen? Is it possible that two CPUs are trying to charge
one page?
> + hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
> + goto done;
> + }
> + set_hugetlb_cgroup(page, h_cg);
> +done:
> + spin_unlock(&hugetlb_lock);
> + return;
> +}
> +
> +void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> + struct page *page)
> +{
> + struct hugetlb_cgroup *h_cg;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled())
> + return;
> +
> + spin_lock(&hugetlb_lock);
> + h_cg = hugetlb_cgroup_from_page(page);
> + if (unlikely(!h_cg)) {
> + spin_unlock(&hugetlb_lock);
> + return;
> + }
> + set_hugetlb_cgroup(page, NULL);
> + spin_unlock(&hugetlb_lock);
> +
> + res_counter_uncharge(&h_cg->hugepage[idx], csize);
> + return;
> +}
> +
> +void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg)
> +{
Is it really worth a separate function to do the same tests again?
I will have a look at the follow-up patches. It would be much easier if
the functions were used in the same patch...
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled() || !h_cg)
> + return;
> +
> + res_counter_uncharge(&h_cg->hugepage[idx], csize);
> + return;
> +}
> +
> struct cgroup_subsys hugetlb_subsys = {
> .name = "hugetlb",
> .create = hugetlb_cgroup_create,
> --
> 1.7.10
>
--
Michal Hocko
SUSE Labs
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the hugetlb cgroup pointer to 3rd page lru.next. This limit
> the usage to hugetlb cgroup to only hugepages with 3 or more
> normal pages. I guess that is an acceptable limitation.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
This approach seems much better than using page_cgroup.
> ---
> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..ceff1d5 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> };
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + if (!PageHuge(page))
> + return NULL;
> + if (compound_order(page) < 3)
> + return NULL;
> + return (struct hugetlb_cgroup *)page[2].lru.next;
> +}
As pointed out by Michal, you can have 4 pages with order = 2, which is
already enough to reach page[2].
Thanks,
-Kame
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + if (!PageHuge(page))
> + return -1;
> + if (compound_order(page) < 3)
> + return -1;
> + page[2].lru.next = (void *)h_cg;
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> if (hugetlb_subsys.disabled)
> @@ -43,6 +63,17 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> #else
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + return 0;
> +}
> +
> static inline bool hugetlb_cgroup_disabled(void)
> {
> return true;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e899a2d..1ca2d8f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -28,6 +28,7 @@
>
> #include <linux/io.h>
> #include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> #include "internal.h"
>
> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1 << PG_writeback);
> }
> + BUG_ON(hugetlb_cgroup_from_page(page));
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> arch_release_hugepage(page);
> @@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> INIT_LIST_HEAD(&page->lru);
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> + set_hugetlb_cgroup(page, NULL);
> h->nr_huge_pages++;
> h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> @@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> INIT_LIST_HEAD(&page->lru);
> r_nid = page_to_nid(page);
> set_compound_page_dtor(page, free_huge_page);
> + set_hugetlb_cgroup(page, NULL);
> /*
> * We incremented the global counters already
> */
On Sat 09-06-12 14:29:57, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This patch add support for cgroup removal. If we don't have parent
> cgroup, the charges are moved to root cgroup.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/hugetlb_cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 48efd5a..9458fe3 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -99,10 +99,87 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> kfree(h_cgroup);
> }
>
> +
> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> + struct page *page)
Deserves a comment about the locking (needs to be called with
hugetlb_lock held).
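e.g. (sketch):

/*
 * Must be called with hugetlb_lock held.  Moves the charge for @page
 * from @cgroup to its parent (or to the root cgroup if there is no
 * parent), so that removing @cgroup does not leak charged pages.
 */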
> +{
> + int csize;
> + struct res_counter *counter;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *page_hcg;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> +
> + if (!get_page_unless_zero(page))
> + goto out;
> +
> + page_hcg = hugetlb_cgroup_from_page(page);
> + /*
> + * We can have pages in active list without any cgroup
> + * ie, hugepage with less than 3 pages. We can safely
> + * ignore those pages.
> + */
> + if (!page_hcg || page_hcg != h_cg)
> + goto err_out;
How can we have page_hcg != NULL && page_hcg != h_cg?
--
Michal Hocko
SUSE Labs
On Sat 09-06-12 14:29:58, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> Add the control files for hugetlb controller
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
[...]
> +int __init hugetlb_cgroup_file_init(int idx)
> +{
> + char buf[32];
> + struct cftype *cft;
> + struct hstate *h = &hstates[idx];
> +
> + /* format the size */
> + mem_fmt(buf, 32, huge_page_size(h));
> +
> + /* Add the limit file */
> + cft = &h->cgroup_files[0];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> + cft->read = hugetlb_cgroup_read;
> + cft->write_string = hugetlb_cgroup_write;
> +
> + /* Add the usage file */
> + cft = &h->cgroup_files[1];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the MAX usage file */
> + cft = &h->cgroup_files[2];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* Add the failcntfile */
> + cft = &h->cgroup_files[3];
> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> + cft->trigger = hugetlb_cgroup_reset;
> + cft->read = hugetlb_cgroup_read;
> +
> + /* NULL terminate the last cft */
> + cft = &h->cgroup_files[4];
> + memset(cft, 0, sizeof(*cft));
> +
> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> +
> + return 0;
> +}
> +
I am not so familiar with the recent changes in the generic cgroup
infrastructure but isn't this somehow automated?
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:29:55, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> Add the hugetlb cgroup pointer to 3rd page lru.next.
>
> Interesting and I really like the idea much more than tracking by
> page_cgroup.
>
>> This limit the usage to hugetlb cgroup to only hugepages with 3 or
>> more normal pages. I guess that is an acceptable limitation.
>
> Agreed.
>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> Other than some nits I like this.
> Thanks!
>
>> ---
>> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
>> mm/hugetlb.c | 4 ++++
>> 2 files changed, 35 insertions(+)
>>
>> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> index 5794be4..ceff1d5 100644
>> --- a/include/linux/hugetlb_cgroup.h
>> +++ b/include/linux/hugetlb_cgroup.h
>> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
>> };
>>
>> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> +{
>> + if (!PageHuge(page))
>> + return NULL;
>> + if (compound_order(page) < 3)
>
> Why 3? I think you wanted 2 here, right?
Yes, that should be 2. I updated that in an earlier reply; it's already
in the v9 version I have locally.
>
>> + return NULL;
>> + return (struct hugetlb_cgroup *)page[2].lru.next;
>> +}
>> +
>> +static inline
>> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
>> +{
>> + if (!PageHuge(page))
>> + return -1;
>> + if (compound_order(page) < 3)
>
> Here as well.
>
>> + return -1;
>> + page[2].lru.next = (void *)h_cg;
>> + return 0;
>> +}
>> +
>> static inline bool hugetlb_cgroup_disabled(void)
>> {
>> if (hugetlb_subsys.disabled)
>> @@ -43,6 +63,17 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> struct hugetlb_cgroup *h_cg);
>> #else
>> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline
>> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
>> +{
>> + return 0;
>> +}
>> +
>> static inline bool hugetlb_cgroup_disabled(void)
>> {
>> return true;
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index e899a2d..1ca2d8f 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -28,6 +28,7 @@
>>
>> #include <linux/io.h>
>> #include <linux/hugetlb.h>
>> +#include <linux/hugetlb_cgroup.h>
>> #include <linux/node.h>
>> #include "internal.h"
>>
>> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>> 1 << PG_active | 1 << PG_reserved |
>> 1 << PG_private | 1 << PG_writeback);
>> }
>> + BUG_ON(hugetlb_cgroup_from_page(page));
>
> What about VM_BUG_ON?
Will do. So when does one decide to choose VM_BUG_ON over BUG_ON?
>
>> set_compound_page_dtor(page, NULL);
>> set_page_refcounted(page);
>> arch_release_hugepage(page);
>> @@ -643,6 +645,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>> INIT_LIST_HEAD(&page->lru);
>> set_compound_page_dtor(page, free_huge_page);
>> spin_lock(&hugetlb_lock);
>> + set_hugetlb_cgroup(page, NULL);
>
> Why inside the spin lock?
All page[2].lru.next updates are protected by hugetlb_lock. It should not
really matter here, because the pages are not yet available for use.
>
>> h->nr_huge_pages++;
>> h->nr_huge_pages_node[nid]++;
>> spin_unlock(&hugetlb_lock);
>> @@ -892,6 +895,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>> INIT_LIST_HEAD(&page->lru);
>> r_nid = page_to_nid(page);
>> set_compound_page_dtor(page, free_huge_page);
>> + set_hugetlb_cgroup(page, NULL);
>> /*
>> * We incremented the global counters already
>> */
>> --
-aneesh
On Mon 11-06-12 10:38:10, Michal Hocko wrote:
> On Sat 09-06-12 14:29:56, Aneesh Kumar K.V wrote:
> > From: "Aneesh Kumar K.V" <[email protected]>
> >
> > This patchset add the charge and uncharge routines for hugetlb cgroup.
> > This will be used in later patches when we allocate/free HugeTLB
> > pages.
>
> Please describe the locking rules.
>
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 87 insertions(+)
> >
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 20a32c5..48efd5a 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> > return -EBUSY;
> > }
> >
> > +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> > + struct hugetlb_cgroup **ptr)
>
> Missing doc.
And now that I am looking at the patch which uses this function, I
realize that the name shouldn't mention page, as we do not take one as an
argument. It is more in line with hugetlb_cgroup_charge_cgroup.
--
Michal Hocko
SUSE Labs
On Mon 11-06-12 14:33:52, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Sat 09-06-12 14:29:55, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> Add the hugetlb cgroup pointer to 3rd page lru.next.
> >
> > Interesting and I really like the idea much more than tracking by
> > page_cgroup.
> >
> >> This limit the usage to hugetlb cgroup to only hugepages with 3 or
> >> more normal pages. I guess that is an acceptable limitation.
> >
> > Agreed.
> >
> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >
> > Other than some nits I like this.
> > Thanks!
> >
> >> ---
> >> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> >> mm/hugetlb.c | 4 ++++
> >> 2 files changed, 35 insertions(+)
> >>
> >> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> >> index 5794be4..ceff1d5 100644
> >> --- a/include/linux/hugetlb_cgroup.h
> >> +++ b/include/linux/hugetlb_cgroup.h
> >> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> >> };
> >>
> >> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> >> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> >> +{
> >> + if (!PageHuge(page))
> >> + return NULL;
> >> + if (compound_order(page) < 3)
> >
> > Why 3? I think you wanted 2 here, right?
>
> Yes, that should be 2. I updated that in an earlier reply; it's already
> in the v9 version I have locally.
Ohh, I should have read the replies to the patch first, where you already
mentioned that you are aware of that.
Maybe it would be worth something like:
/* Minimum page order trackable by hugetlb cgroup.
* At least 3 pages are necessary for all the tracking information.
*/
#define HUGETLB_CGROUP_MIN_ORDER 2
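Then the helpers keep the same semantics but lose the magic number,
e.g. (sketch based on your patch):

static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
	if (!PageHuge(page))
		return NULL;
	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
		return NULL;
	return (struct hugetlb_cgroup *)page[2].lru.next;
}

static inline
int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
{
	if (!PageHuge(page))
		return -1;
	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
		return -1;
	page[2].lru.next = (void *)h_cg;
	return 0;
}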
> >> @@ -591,6 +592,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> >> 1 << PG_active | 1 << PG_reserved |
> >> 1 << PG_private | 1 << PG_writeback);
> >> }
> >> + BUG_ON(hugetlb_cgroup_from_page(page));
> >
> > What about VM_BUG_ON?
>
> Will do. So when does one decide to choose VM_BUG_ON over BUG_ON?
I think the VM_ variant is more appropriate here, because it is more of a
debugging thing than a hard failure invariant.
--
Michal Hocko
SUSE Labs
On Sat 09-06-12 16:30:54, Johannes Weiner wrote:
> On Sat, Jun 09, 2012 at 06:39:06PM +0530, Aneesh Kumar K.V wrote:
> > Johannes Weiner <[email protected]> writes:
> >
> > > On Sat, Jun 09, 2012 at 02:29:59PM +0530, Aneesh Kumar K.V wrote:
> > >> From: "Aneesh Kumar K.V" <[email protected]>
> > >>
> > >> This adds necessary charge/uncharge calls in the HugeTLB code. We do
> > >> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
> > >>
> > >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > >> ---
> > >> mm/hugetlb.c | 16 +++++++++++++++-
> > >> mm/hugetlb_cgroup.c | 7 +------
> > >> 2 files changed, 16 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >> index bf79131..4ca92a9 100644
> > >> --- a/mm/hugetlb.c
> > >> +++ b/mm/hugetlb.c
> > >> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
> > >> BUG_ON(page_mapcount(page));
> > >>
> > >> spin_lock(&hugetlb_lock);
> > >> + hugetlb_cgroup_uncharge_page(hstate_index(h),
> > >> + pages_per_huge_page(h), page);
> > >
> > > hugetlb_cgroup_uncharge_page() takes the hugetlb_lock, no?
> >
> > Yes, but this patch also modifies it to not take the lock, because we
> > already hold the spinlock at the call site. I didn't want to drop the
> > lock and take it again.
>
> Sorry, I missed that.
>
> > > It's quite hard to review code that is split up like this. Please
> > > always keep the introduction of new functions in the same patch that
> > > adds the callsite(s).
> >
> > One of the reasons I split the charge/uncharge routines and the callers
> > into separate patches is to make review easier. Irrespective of the call
> > site, the charge/uncharge routines should be correct with respect to
> > locking and other details. What I did in this patch is a small
> > optimization to avoid dropping and retaking the lock. Maybe the right
> > approach would have been to name it __hugetlb_cgroup_uncharge_page and
> > make sure hugetlb_cgroup_uncharge_page still takes the spinlock. But
> > then we wouldn't have any callers for that.
>
> I think this makes it needlessly complicated and there is no correct
> or incorrect locking in (initially) dead code :-)
>
> The callsites are just a few lines. It's harder to review if you
> introduce an API and then change it again mid-patchset.
Fully agreed.
--
Michal Hocko
SUSE Labs
On Sat 09-06-12 14:29:59, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> This adds necessary charge/uncharge calls in the HugeTLB code. We do
> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> mm/hugetlb.c | 16 +++++++++++++++-
> mm/hugetlb_cgroup.c | 7 +------
> 2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf79131..4ca92a9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
> BUG_ON(page_mapcount(page));
>
> spin_lock(&hugetlb_lock);
> + hugetlb_cgroup_uncharge_page(hstate_index(h),
> + pages_per_huge_page(h), page);
> if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
> /* remove the page from active list */
> list_del(&page->lru);
> @@ -1116,7 +1118,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> long chg;
> + int ret, idx;
> + struct hugetlb_cgroup *h_cg;
>
> + idx = hstate_index(h);
> /*
> * Processes that did not create the mapping will have no
> * reserves and will not have accounted against subpool
> @@ -1132,6 +1137,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> if (hugepage_subpool_get_pages(spool, chg))
> return ERR_PTR(-ENOSPC);
>
> + ret = hugetlb_cgroup_charge_page(idx, pages_per_huge_page(h), &h_cg);
So we do not have any page yet, and hugetlb_cgroup_charge_cgroup sounds
more appropriate.
[...]
--
Michal Hocko
SUSE Labs
On Sat 09-06-12 14:30:00, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <[email protected]>
>
> With HugeTLB pages, hugetlb cgroup is uncharged in compound page
> destructor. Since we are holding a hugepage reference,
Who is holding that reference? I do not see anybody calling get_page in
this patch...
> we can be sure that old page won't get uncharged till the last
> put_page().
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> include/linux/hugetlb_cgroup.h | 8 ++++++++
> mm/hugetlb_cgroup.c | 21 +++++++++++++++++++++
> mm/migrate.c | 5 +++++
> 3 files changed, 34 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index ba4836f..b64d067 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -63,6 +63,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> extern int hugetlb_cgroup_file_init(int idx) __init;
> +extern void hugetlb_cgroup_migrate(struct page *oldhpage,
> + struct page *newhpage);
> #else
> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> {
> @@ -112,5 +114,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
> {
> return 0;
> }
> +
> +static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
> + struct page *newhpage)
> +{
> + return;
> +}
> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> #endif
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index c2b7b8e..2d384fe 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -394,6 +394,27 @@ int __init hugetlb_cgroup_file_init(int idx)
> return 0;
> }
>
> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
> +{
> + struct hugetlb_cgroup *h_cg;
> +
> + VM_BUG_ON(!PageHuge(oldhpage));
> +
> + if (hugetlb_cgroup_disabled())
> + return;
> +
> + spin_lock(&hugetlb_lock);
> + h_cg = hugetlb_cgroup_from_page(oldhpage);
> + set_hugetlb_cgroup(oldhpage, NULL);
> + cgroup_exclude_rmdir(&h_cg->css);
> +
> + /* move the h_cg details to new cgroup */
> + set_hugetlb_cgroup(newhpage, h_cg);
> + spin_unlock(&hugetlb_lock);
> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
> + return;
> +}
> +
> struct cgroup_subsys hugetlb_subsys = {
> .name = "hugetlb",
> .create = hugetlb_cgroup_create,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fdce3a2..6c37c51 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -33,6 +33,7 @@
> #include <linux/memcontrol.h>
> #include <linux/syscalls.h>
> #include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> #include <linux/gfp.h>
>
> #include <asm/tlbflush.h>
> @@ -931,6 +932,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>
> if (anon_vma)
> put_anon_vma(anon_vma);
> +
> + if (!rc)
> + hugetlb_cgroup_migrate(hpage, new_hpage);
> +
> unlock_page(hpage);
> out:
> put_page(new_hpage);
> --
> 1.7.10
>
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:29:56, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> This patchset add the charge and uncharge routines for hugetlb cgroup.
>> This will be used in later patches when we allocate/free HugeTLB
>> pages.
>
> Please describe the locking rules.
All the updates happen under hugetlb_lock.
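I will capture that in a comment at the top of mm/hugetlb_cgroup.c,
something like (sketch):

/*
 * Locking rules:
 * - The hugetlb_cgroup pointer of a huge page (stored in
 *   page[2].lru.next) may only be read or updated with hugetlb_lock
 *   held.
 * - The res_counter charge/uncharge operations are internally locked
 *   and need no additional protection.
 */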
>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>> mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 87 insertions(+)
>>
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> index 20a32c5..48efd5a 100644
>> --- a/mm/hugetlb_cgroup.c
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
>> return -EBUSY;
>> }
>>
>> +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup **ptr)
>
> Missing doc.
>
>> +{
>> + int ret = 0;
>> + struct res_counter *fail_res;
>> + struct hugetlb_cgroup *h_cg = NULL;
>> + unsigned long csize = nr_pages * PAGE_SIZE;
>> +
>> + if (hugetlb_cgroup_disabled())
>> + goto done;
>> + /*
>> + * We don't charge any cgroup if the compound page have less
>> + * than 3 pages.
>> + */
>> + if (hstates[idx].order < 2)
>> + goto done;
>
> huge_page_order here? Not that important because we are using order in
> the code directly at many places but easier for grep and maybe worth a
> separate clean up patch.
>
Fixed.
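(With that, the check presumably becomes something like:

        if (huge_page_order(&hstates[idx]) < 2)
                goto done;

This is a sketch; the exact v9 line may differ.)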
>> +again:
>> + rcu_read_lock();
>> + h_cg = hugetlb_cgroup_from_task(current);
>> + if (!h_cg)
>> + h_cg = root_h_cgroup;
>> +
>> + if (!css_tryget(&h_cg->css)) {
>> + rcu_read_unlock();
>> + goto again;
>> + }
>> + rcu_read_unlock();
>> +
>> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
>> + css_put(&h_cg->css);
>> +done:
>> + *ptr = h_cg;
>> + return ret;
>> +}
>> +
>> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg,
>> + struct page *page)
>> +{
>> + if (hugetlb_cgroup_disabled() || !h_cg)
>> + return;
>> +
>> + spin_lock(&hugetlb_lock);
>> + if (hugetlb_cgroup_from_page(page)) {
>
> How can this happen? Is it possible that two CPUs are trying to charge
> one page?
That is why I added the check. I looked at alloc_huge_page, and I
don't see how we would end up with the same page from different CPUs, but
we have similar checks in memcg, where we drop the charge if we find
the page cgroup already in use.
>
>> + hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
>> + goto done;
>> + }
>> + set_hugetlb_cgroup(page, h_cg);
>> +done:
>> + spin_unlock(&hugetlb_lock);
>> + return;
>> +}
>> +
>> +void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>> + struct page *page)
>> +{
>> + struct hugetlb_cgroup *h_cg;
>> + unsigned long csize = nr_pages * PAGE_SIZE;
>> +
>> + if (hugetlb_cgroup_disabled())
>> + return;
>> +
>> + spin_lock(&hugetlb_lock);
>> + h_cg = hugetlb_cgroup_from_page(page);
>> + if (unlikely(!h_cg)) {
>> + spin_unlock(&hugetlb_lock);
>> + return;
>> + }
>> + set_hugetlb_cgroup(page, NULL);
>> + spin_unlock(&hugetlb_lock);
>> +
>> + res_counter_uncharge(&h_cg->hugepage[idx], csize);
>> + return;
>> +}
>> +
>> +void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg)
>> +{
>
> Really worth a separate function to do the same tests again?
> Will have a look at the follow up patches. It would be much easier if
> the functions were used in the same patch...
v9 folds this into the patch that actually uses these functions.
>
>> + unsigned long csize = nr_pages * PAGE_SIZE;
>> +
>> + if (hugetlb_cgroup_disabled() || !h_cg)
>> + return;
>> +
>> + res_counter_uncharge(&h_cg->hugepage[idx], csize);
>> + return;
>> +}
>> +
>> struct cgroup_subsys hugetlb_subsys = {
>> .name = "hugetlb",
>> .create = hugetlb_cgroup_create,
>> --
-aneesh
Michal Hocko <[email protected]> writes:
> On Mon 11-06-12 14:33:52, Aneesh Kumar K.V wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Sat 09-06-12 14:29:55, Aneesh Kumar K.V wrote:
>> >> From: "Aneesh Kumar K.V" <[email protected]>
>> >>
>> >> Add the hugetlb cgroup pointer to the 3rd page's lru.next.
>> >
>> > Interesting and I really like the idea much more than tracking by
>> > page_cgroup.
>> >
>> >> This limits the usage of the hugetlb cgroup to hugepages with 3 or
>> >> more normal pages. I guess that is an acceptable limitation.
>> >
>> > Agreed.
>> >
>> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> >
>> > Other than some nits I like this.
>> > Thanks!
>> >
>> >> ---
>> >> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
>> >> mm/hugetlb.c | 4 ++++
>> >> 2 files changed, 35 insertions(+)
>> >>
>> >> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> >> index 5794be4..ceff1d5 100644
>> >> --- a/include/linux/hugetlb_cgroup.h
>> >> +++ b/include/linux/hugetlb_cgroup.h
>> >> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
>> >> };
>> >>
>> >> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> >> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> >> +{
>> >> + if (!PageHuge(page))
>> >> + return NULL;
>> >> + if (compound_order(page) < 3)
>> >
>> > Why 3? I think you wanted 2 here, right?
>>
Yes, that should be 2. I updated that earlier; it is already fixed in the
v9 version I have locally.
>
> ohh, I should have read replies to the patch first where you already
> mentioned that you are aware of that.
> Maybe it would be worth something like:
> /* Minimum page order trackable by hugetlb cgroup.
> * At least 3 pages are necessary for all the tracking information.
> */
> #define HUGETLB_CGROUP_MIN_ORDER 2
Excellent will do that.
-aneesh
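(A sketch of how the page helpers might look with the proposed macro folded
in, incorporating the VM_BUG_ON suggested elsewhere in the thread; this is
pieced together from the hunks quoted above, not the final v9 code:

/*
 * Minimum page order trackable by hugetlb cgroup.
 * At least 3 pages are necessary for all the tracking information.
 */
#define HUGETLB_CGROUP_MIN_ORDER        2

static inline
struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
{
        VM_BUG_ON(!PageHuge(page));

        if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
                return NULL;
        /* the hugetlb cgroup pointer lives in the third tail page */
        return (struct hugetlb_cgroup *)page[2].lru.next;
}

static inline
int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
{
        VM_BUG_ON(!PageHuge(page));

        if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
                return -1;
        page[2].lru.next = (void *)h_cg;
        return 0;
}
)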
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:29:57, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> This patch adds support for cgroup removal. If we don't have a parent
>> cgroup, the charges are moved to the root cgroup.
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>> mm/hugetlb_cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 79 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> index 48efd5a..9458fe3 100644
>> --- a/mm/hugetlb_cgroup.c
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -99,10 +99,87 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
>> kfree(h_cgroup);
>> }
>>
>> +
>> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
>> + struct page *page)
>
> deserves a comment about the locking (needs to be called with
> hugetlb_lock).
will do
>
>> +{
>> + int csize;
>> + struct res_counter *counter;
>> + struct res_counter *fail_res;
>> + struct hugetlb_cgroup *page_hcg;
>> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
>> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
>> +
>> + if (!get_page_unless_zero(page))
>> + goto out;
>> +
>> + page_hcg = hugetlb_cgroup_from_page(page);
>> + /*
>> + * We can have pages in active list without any cgroup
>> + * ie, hugepage with less than 3 pages. We can safely
>> + * ignore those pages.
>> + */
>> + if (!page_hcg || page_hcg != h_cg)
>> + goto err_out;
>
> How can we have page_hcg != NULL && page_hcg != h_cg?
Pages belonging to another cgroup?
-aneesh
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:29:58, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> Add the control files for hugetlb controller
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
> [...]
>> +int __init hugetlb_cgroup_file_init(int idx)
>> +{
>> + char buf[32];
>> + struct cftype *cft;
>> + struct hstate *h = &hstates[idx];
>> +
>> + /* format the size */
>> + mem_fmt(buf, 32, huge_page_size(h));
>> +
>> + /* Add the limit file */
>> + cft = &h->cgroup_files[0];
>> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
>> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
>> + cft->read = hugetlb_cgroup_read;
>> + cft->write_string = hugetlb_cgroup_write;
>> +
>> + /* Add the usage file */
>> + cft = &h->cgroup_files[1];
>> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
>> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
>> + cft->read = hugetlb_cgroup_read;
>> +
>> + /* Add the MAX usage file */
>> + cft = &h->cgroup_files[2];
>> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
>> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
>> + cft->trigger = hugetlb_cgroup_reset;
>> + cft->read = hugetlb_cgroup_read;
>> +
>> + /* Add the failcntfile */
>> + cft = &h->cgroup_files[3];
>> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
>> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
>> + cft->trigger = hugetlb_cgroup_reset;
>> + cft->read = hugetlb_cgroup_read;
>> +
>> + /* NULL terminate the last cft */
>> + cft = &h->cgroup_files[4];
>> + memset(cft, 0, sizeof(*cft));
>> +
>> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
>> +
>> + return 0;
>> +}
>> +
>
> I am not so familiar with the recent changes in the generic cgroup
> infrastructure but isn't this somehow automated?
Yes, for most of the cgroups. But in the hugetlb case we have a variable number
of control files: we have the above set of control files for each
hugetlb size supported by the architecture.
-aneesh
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:30:00, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> With HugeTLB pages, hugetlb cgroup is uncharged in compound page
>> destructor. Since we are holding a hugepage reference,
>
> Who is holding that reference? I do not see anybody calling get_page in
> this patch...
>
soft_offline_huge_page takes the reference. It does the final
put_page(hpage) there.
>> we can be sure that old page won't get uncharged till the last
>> put_page().
>>
-aneesh
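(For reference, pieced together from the migrate patch reviewed elsewhere in
this thread, the flow in soft_offline_huge_page() is roughly as follows; an
abridged sketch, not the full function:

        /* get_any_page() above took a reference on the hugepage */
        ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
        /*
         * Final put_page(): the compound page destructor, and with it the
         * hugetlb cgroup uncharge, cannot run before this.
         */
        put_page(hpage);
)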
Michal Hocko <[email protected]> writes:
> On Sat 09-06-12 14:29:59, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> This adds necessary charge/uncharge calls in the HugeTLB code. We do
>> hugetlb cgroup charge in page alloc and uncharge in compound page destructor.
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> ---
>> mm/hugetlb.c | 16 +++++++++++++++-
>> mm/hugetlb_cgroup.c | 7 +------
>> 2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index bf79131..4ca92a9 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -628,6 +628,8 @@ static void free_huge_page(struct page *page)
>> BUG_ON(page_mapcount(page));
>>
>> spin_lock(&hugetlb_lock);
>> + hugetlb_cgroup_uncharge_page(hstate_index(h),
>> + pages_per_huge_page(h), page);
>> if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
>> /* remove the page from active list */
>> list_del(&page->lru);
>> @@ -1116,7 +1118,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>> struct hstate *h = hstate_vma(vma);
>> struct page *page;
>> long chg;
>> + int ret, idx;
>> + struct hugetlb_cgroup *h_cg;
>>
>> + idx = hstate_index(h);
>> /*
>> * Processes that did not create the mapping will have no
>> * reserves and will not have accounted against subpool
>> @@ -1132,6 +1137,11 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
>> if (hugepage_subpool_get_pages(spool, chg))
>> return ERR_PTR(-ENOSPC);
>>
>> + ret = hugetlb_cgroup_charge_page(idx, pages_per_huge_page(h), &h_cg);
>
> So we do not have any page yet and hugetlb_cgroup_charge_cgroup sound
> more appropriate
>
Will do
-aneesh
On Mon 11-06-12 14:58:45, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Sat 09-06-12 14:29:56, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> This patchset adds the charge and uncharge routines for hugetlb cgroup.
> >> This will be used in later patches when we allocate/free HugeTLB
> >> pages.
> >
> > Please describe the locking rules.
>
> All the updates happen under hugetlb_lock.
Yes, I figured but it is definitely worth mentioning in the patch
description.
[...]
> >> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> >> + struct hugetlb_cgroup *h_cg,
> >> + struct page *page)
> >> +{
> >> + if (hugetlb_cgroup_disabled() || !h_cg)
> >> + return;
> >> +
> >> + spin_lock(&hugetlb_lock);
> >> + if (hugetlb_cgroup_from_page(page)) {
> >
> > How can this happen? Is it possible that two CPUs are trying to charge
> > one page?
>
> That is why I added the check. I looked at alloc_huge_page, and I
> don't see how we would end up with the same page from different CPUs, but
> we have similar checks in memcg, where we drop the charge if we find
> the page cgroup already in use.
Yes, but memcg is a little bit more complicated than hugetlb, which
doesn't have to cope with async charges. Hugetlb allocation is
serialized by hugetlb_lock so only one caller gets the page.
I do not think the check is required here; otherwise, add a comment
explaining how it can happen.
[...]
--
Michal Hocko
SUSE Labs
On Mon 11-06-12 15:10:20, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
[...]
> >> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> >> + struct page *page)
> >
> > deserves a comment about the locking (needs to be called with
> > hugetlb_lock).
>
> will do
>
> >
> >> +{
> >> + int csize;
> >> + struct res_counter *counter;
> >> + struct res_counter *fail_res;
> >> + struct hugetlb_cgroup *page_hcg;
> >> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> >> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> >> +
> >> + if (!get_page_unless_zero(page))
> >> + goto out;
> >> +
> >> + page_hcg = hugetlb_cgroup_from_page(page);
> >> + /*
> >> + * We can have pages in active list without any cgroup
> >> + * ie, hugepage with less than 3 pages. We can safely
> >> + * ignore those pages.
> >> + */
> >> + if (!page_hcg || page_hcg != h_cg)
> >> + goto err_out;
> >
> > How can we have page_hcg != NULL && page_hcg != h_cg?
>
> pages belonging to other cgroup ?
OK, I had forgotten that you are iterating over all active huge pages in
hugetlb_cgroup_pre_destroy. What prevents you from doing the filtering
in the caller?
I am also wondering why you need to play with the page reference
counting here. You are under hugetlb_lock so the page cannot disappear
in the meantime or am I missing something?
--
Michal Hocko
SUSE Labs
On Mon 11-06-12 15:13:31, Aneesh Kumar K.V wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Sat 09-06-12 14:29:58, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <[email protected]>
> >>
> >> Add the control files for hugetlb controller
> >>
> >> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >> ---
> > [...]
> >> +int __init hugetlb_cgroup_file_init(int idx)
> >> +{
> >> + char buf[32];
> >> + struct cftype *cft;
> >> + struct hstate *h = &hstates[idx];
> >> +
> >> + /* format the size */
> >> + mem_fmt(buf, 32, huge_page_size(h));
> >> +
> >> + /* Add the limit file */
> >> + cft = &h->cgroup_files[0];
> >> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
> >> + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
> >> + cft->read = hugetlb_cgroup_read;
> >> + cft->write_string = hugetlb_cgroup_write;
> >> +
> >> + /* Add the usage file */
> >> + cft = &h->cgroup_files[1];
> >> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> >> + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> >> + cft->read = hugetlb_cgroup_read;
> >> +
> >> + /* Add the MAX usage file */
> >> + cft = &h->cgroup_files[2];
> >> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> >> + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> >> + cft->trigger = hugetlb_cgroup_reset;
> >> + cft->read = hugetlb_cgroup_read;
> >> +
> >> + /* Add the failcntfile */
> >> + cft = &h->cgroup_files[3];
> >> + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> >> + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> >> + cft->trigger = hugetlb_cgroup_reset;
> >> + cft->read = hugetlb_cgroup_read;
> >> +
> >> + /* NULL terminate the last cft */
> >> + cft = &h->cgroup_files[4];
> >> + memset(cft, 0, sizeof(*cft));
> >> +
> >> + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files));
> >> +
> >> + return 0;
> >> +}
> >> +
> >
> > I am not so familiar with the recent changes in the generic cgroup
> > infrastructure but isn't this somehow automated?
>
> Yes, for most of the cgroups. But in the hugetlb case we have a variable number
> of control files: we have the above set of control files for each
> hugetlb size supported by the architecture.
OK, understood.
>
> -aneesh
>
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Mon 11-06-12 15:10:20, Aneesh Kumar K.V wrote:
>> Michal Hocko <[email protected]> writes:
> [...]
>> >> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
>> >> + struct page *page)
>> >
>> > deserves a comment about the locking (needs to be called with
>> > hugetlb_lock).
>>
>> will do
>>
>> >
>> >> +{
>> >> + int csize;
>> >> + struct res_counter *counter;
>> >> + struct res_counter *fail_res;
>> >> + struct hugetlb_cgroup *page_hcg;
>> >> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
>> >> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
>> >> +
>> >> + if (!get_page_unless_zero(page))
>> >> + goto out;
>> >> +
>> >> + page_hcg = hugetlb_cgroup_from_page(page);
>> >> + /*
>> >> + * We can have pages in active list without any cgroup
>> >> + * ie, hugepage with less than 3 pages. We can safely
>> >> + * ignore those pages.
>> >> + */
>> >> + if (!page_hcg || page_hcg != h_cg)
>> >> + goto err_out;
>> >
>> > How can we have page_hcg != NULL && page_hcg != h_cg?
>>
>> pages belonging to other cgroup ?
>
> OK, I had forgotten that you are iterating over all active huge pages in
> hugetlb_cgroup_pre_destroy. What prevents you from doing the filtering
> in the caller?
> I am also wondering why you need to play with the page reference
> counting here. You are under hugetlb_lock so the page cannot disappear
> in the meantime or am I missing something?
That is correct. I updated the patch and also added the comment below to
the function.
+
+/*
+ * Should be called with hugetlb_lock held.
+ * Since we are holding hugetlb_lock, pages cannot get moved from the
+ * active list or uncharged from the cgroup, so there is no need to take
+ * a page reference and test for page active here.
+ */
-aneesh
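(So, with the get_page/put_page pair dropped, the body would look roughly
like this; a sketch against the v8 hunk quoted above, not the final v9 code:

static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
                                      struct page *page)
{
        int csize;
        struct res_counter *counter;
        struct res_counter *fail_res;
        struct hugetlb_cgroup *page_hcg;
        struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
        struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);

        page_hcg = hugetlb_cgroup_from_page(page);
        /*
         * We can have pages in the active list without any cgroup
         * (hugepages with less than 3 pages) or pages charged to some
         * other cgroup; safely ignore both.
         */
        if (!page_hcg || page_hcg != h_cg)
                goto out;

        csize = PAGE_SIZE << compound_order(page);
        if (!parent) {
                parent = root_h_cgroup;
                /* root has no limit */
                res_counter_charge_nofail(&parent->hugepage[idx],
                                          csize, &fail_res);
        }
        counter = &h_cg->hugepage[idx];
        res_counter_uncharge_until(counter, counter->parent, csize);

        set_hugetlb_cgroup(page, parent);
out:
        return 0;
}
)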
Michal Hocko <[email protected]> writes:
> On Mon 11-06-12 14:58:45, Aneesh Kumar K.V wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Sat 09-06-12 14:29:56, Aneesh Kumar K.V wrote:
>> >> From: "Aneesh Kumar K.V" <[email protected]>
>> >>
>> >> This patchset adds the charge and uncharge routines for hugetlb cgroup.
>> >> This will be used in later patches when we allocate/free HugeTLB
>> >> pages.
>> >
>> > Please describe the locking rules.
>>
>> All the updates happen under hugetlb_lock.
>
> Yes, I figured but it is definitely worth mentioning in the patch
> description.
Done.
>
> [...]
>> >> +void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>> >> + struct hugetlb_cgroup *h_cg,
>> >> + struct page *page)
>> >> +{
>> >> + if (hugetlb_cgroup_disabled() || !h_cg)
>> >> + return;
>> >> +
>> >> + spin_lock(&hugetlb_lock);
>> >> + if (hugetlb_cgroup_from_page(page)) {
>> >
>> > How can this happen? Is it possible that two CPUs are trying to charge
>> > one page?
>>
>> That is why I added the check. I looked at alloc_huge_page, and I
>> don't see how we would end up with the same page from different CPUs, but
>> we have similar checks in memcg, where we drop the charge if we find
>> the page cgroup already in use.
>
> Yes, but memcg is a little bit more complicated than hugetlb, which
> doesn't have to cope with async charges. Hugetlb allocation is
> serialized by hugetlb_lock so only one caller gets the page.
> I do not think the check is required here; otherwise, add a comment
> explaining how it can happen.
>
updated.
-aneesh
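(With the check dropped, the commit path becomes a plain assignment under
hugetlb_lock; roughly the following sketch, not necessarily the exact v9
code:

void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
                                  struct hugetlb_cgroup *h_cg,
                                  struct page *page)
{
        if (hugetlb_cgroup_disabled() || !h_cg)
                return;

        spin_lock(&hugetlb_lock);
        set_hugetlb_cgroup(page, h_cg);
        spin_unlock(&hugetlb_lock);
}
)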
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Since we migrate only one hugepage, don't use a linked list for passing the
> page around. Directly pass the page that needs to be migrated as an argument.
> This also removes the usage of page->lru in the migrate path.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/migrate.h | 4 +--
> mm/memory-failure.c | 13 ++--------
> mm/migrate.c | 65 +++++++++++++++--------------------------------
> 3 files changed, 25 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 855c337..ce7e667 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -15,7 +15,7 @@ extern int migrate_page(struct address_space *,
> extern int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode);
> -extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> +extern int migrate_huge_page(struct page *, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode);
>
> @@ -36,7 +36,7 @@ static inline void putback_lru_pages(struct list_head *l) {}
> static inline int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode) { return -ENOSYS; }
> -static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> +static inline int migrate_huge_page(struct page *page, new_page_t x,
> unsigned long private, bool offlining,
> enum migrate_mode mode) { return -ENOSYS; }
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ab1e714..53a1495 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1414,7 +1414,6 @@ static int soft_offline_huge_page(struct page *page, int flags)
> int ret;
> unsigned long pfn = page_to_pfn(page);
> struct page *hpage = compound_head(page);
> - LIST_HEAD(pagelist);
>
> ret = get_any_page(page, pfn, flags);
> if (ret < 0)
> @@ -1429,19 +1428,11 @@ static int soft_offline_huge_page(struct page *page, int flags)
> }
>
> /* Keep page count to indicate a given hugepage is isolated. */
> -
> - list_add(&hpage->lru, &pagelist);
> - ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
> - true);
> + ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, 0, true);
> + put_page(hpage);
> if (ret) {
> - struct page *page1, *page2;
> - list_for_each_entry_safe(page1, page2, &pagelist, lru)
> - put_page(page1);
> -
> pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> pfn, ret, page->flags);
> - if (ret > 0)
> - ret = -EIO;
> return ret;
> }
> done:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index be26d5c..fdce3a2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -932,15 +932,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> if (anon_vma)
> put_anon_vma(anon_vma);
> unlock_page(hpage);
> -
> out:
> - if (rc != -EAGAIN) {
> - list_del(&hpage->lru);
> - put_page(hpage);
> - }
> -
> put_page(new_hpage);
> -
> if (result) {
> if (rc)
> *result = rc;
> @@ -1016,48 +1009,32 @@ out:
> return nr_failed + retry;
> }
>
> -int migrate_huge_pages(struct list_head *from,
> - new_page_t get_new_page, unsigned long private, bool offlining,
> - enum migrate_mode mode)
> +int migrate_huge_page(struct page *hpage, new_page_t get_new_page,
> + unsigned long private, bool offlining,
> + enum migrate_mode mode)
> {
> - int retry = 1;
> - int nr_failed = 0;
> - int pass = 0;
> - struct page *page;
> - struct page *page2;
> - int rc;
> -
> - for (pass = 0; pass < 10 && retry; pass++) {
> - retry = 0;
> -
> - list_for_each_entry_safe(page, page2, from, lru) {
> + int pass, rc;
> +
> + for (pass = 0; pass < 10; pass++) {
> + rc = unmap_and_move_huge_page(get_new_page,
> + private, hpage, pass > 2, offlining,
> + mode);
> + switch (rc) {
> + case -ENOMEM:
> + goto out;
> + case -EAGAIN:
> + /* try again */
> cond_resched();
> -
> - rc = unmap_and_move_huge_page(get_new_page,
> - private, page, pass > 2, offlining,
> - mode);
> -
> - switch(rc) {
> - case -ENOMEM:
> - goto out;
> - case -EAGAIN:
> - retry++;
> - break;
> - case 0:
> - break;
> - default:
> - /* Permanent failure */
> - nr_failed++;
> - break;
> - }
> + break;
> + case 0:
> + goto out;
> + default:
> + rc = -EIO;
> + goto out;
> }
> }
> - rc = 0;
> out:
> - if (rc)
> - return rc;
> -
> - return nr_failed + retry;
> + return rc;
> }
>
> #ifdef CONFIG_NUMA
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> We will use them later in hugetlb_cgroup.c
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> This patch implements a new controller that allows us to control HugeTLB
> allocations. The extension allows to limit the HugeTLB usage per control
> group and enforces the controller limit during page fault. Since HugeTLB
> doesn't support page reclaim, enforcing the limit at page fault time implies
> that, the application will get SIGBUS signal if it tries to access HugeTLB
> pages beyond its limit. This requires the application to know beforehand
> how much HugeTLB pages it would require for its use.
>
> The charge/uncharge calls will be added to HugeTLB code in later patch.
> Support for cgroup removal will be added in later patches.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
some nitpick below.
> ---
> include/linux/cgroup_subsys.h | 6 +++
> include/linux/hugetlb_cgroup.h | 79 ++++++++++++++++++++++++++++
> init/Kconfig | 16 ++++++
> mm/Makefile | 1 +
> mm/hugetlb_cgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 216 insertions(+)
> create mode 100644 include/linux/hugetlb_cgroup.h
> create mode 100644 mm/hugetlb_cgroup.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 0bd390c..895923a 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -72,3 +72,9 @@ SUBSYS(net_prio)
> #endif
>
> /* */
> +
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +SUBSYS(hugetlb)
> +#endif
> +
> +/* */
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> new file mode 100644
> index 0000000..5794be4
> --- /dev/null
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -0,0 +1,79 @@
> +/*
> + * Copyright IBM Corporation, 2012
> + * Author Aneesh Kumar K.V<[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +#ifndef _LINUX_HUGETLB_CGROUP_H
> +#define _LINUX_HUGETLB_CGROUP_H
> +
> +#include <linux/res_counter.h>
> +
> +struct hugetlb_cgroup {
> + struct cgroup_subsys_state css;
> + /*
> + * the counter to account for hugepages from hugetlb.
> + */
> + struct res_counter hugepage[HUGE_MAX_HSTATE];
> +};
> +
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline bool hugetlb_cgroup_disabled(void)
> +{
> + if (hugetlb_subsys.disabled)
> + return true;
> + return false;
> +}
> +
> +extern int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup **ptr);
> +extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg,
> + struct page *page);
> +extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> + struct page *page);
> +extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg);
> +#else
> +static inline bool hugetlb_cgroup_disabled(void)
> +{
> + return true;
> +}
> +
> +static inline int
> +hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup **ptr)
> +{
> + return 0;
> +}
> +
> +static inline void
> +hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg,
> + struct page *page)
> +{
> + return;
> +}
> +
> +static inline void
> +hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
> +{
> + return;
> +}
> +
> +static inline void
> +hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup *h_cg)
> +{
> + return;
> +}
> +#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> +#endif
> diff --git a/init/Kconfig b/init/Kconfig
> index d07dcf9..b9a0d0a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -751,6 +751,22 @@ config CGROUP_MEM_RES_CTLR_KMEM
> the kmem extension can use it to guarantee that no group of processes
> will ever exhaust kernel resources alone.
>
> +config CGROUP_HUGETLB_RES_CTLR
> + bool "HugeTLB Resource Controller for Control Groups"
> + depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
> + select PAGE_CGROUP
> + default n
> + help
> + Provides a simple cgroup Resource Controller for HugeTLB pages.
What do you want to say by adding 'simple' here?
> + When you enable this, you can put a per cgroup limit on HugeTLB usage.
> + The limit is enforced during page fault. Since HugeTLB doesn't
> + support page reclaim, enforcing the limit at page fault time implies
> + that, the application will get SIGBUS signal if it tries to access
> + HugeTLB pages beyond its limit. This requires the application to know
> + beforehand how much HugeTLB pages it would require for its use. The
> + control group is tracked in the third page lru pointer. This means
> + that we cannot use the controller with huge page less than 3 pages.
> +
> config CGROUP_PERF
> bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
> depends on PERF_EVENTS && CGROUPS
> diff --git a/mm/Makefile b/mm/Makefile
> index a156285..a8dd8d5 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> new file mode 100644
> index 0000000..20a32c5
> --- /dev/null
> +++ b/mm/hugetlb_cgroup.c
> @@ -0,0 +1,114 @@
> +/*
> + *
> + * Copyright IBM Corporation, 2012
> + * Author Aneesh Kumar K.V<[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> + *
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <linux/hugetlb_cgroup.h>
> +
> +struct cgroup_subsys hugetlb_subsys __read_mostly;
> +struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> +{
> + if (s)
> + return container_of(s, struct hugetlb_cgroup, css);
> + return NULL;
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
> +{
> + return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
> + hugetlb_subsys_id));
> +}
> +
> +static inline
> +struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
> +{
> + return hugetlb_cgroup_from_css(task_subsys_state(task,
> + hugetlb_subsys_id));
> +}
> +
> +static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
> +{
> + return (h_cg == root_h_cgroup);
> +}
> +
> +static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
> +{
> + if (!cg->parent)
> + return NULL;
> + return hugetlb_cgroup_from_cgroup(cg->parent);
> +}
> +
> +static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
> +{
> + int idx;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
> +
> + for (idx = 0; idx < hugetlb_max_hstate; idx++) {
> + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
> + return 1;
> + }
> + return 0;
> +}
How about using true/false rather than 0/1 ?
> +
> +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
> +{
> + int idx;
> + struct cgroup *parent_cgroup;
> + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
> +
> + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> + if (!h_cgroup)
> + return ERR_PTR(-ENOMEM);
> +
> + parent_cgroup = cgroup->parent;
> + if (parent_cgroup) {
> + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
> + for (idx = 0; idx< HUGE_MAX_HSTATE; idx++)
^^^
add space is better.
> + res_counter_init(&h_cgroup->hugepage[idx],
> + &parent_h_cgroup->hugepage[idx]);
> + } else {
> + root_h_cgroup = h_cgroup;
> + for (idx = 0; idx< HUGE_MAX_HSTATE; idx++)
^^^
> + res_counter_init(&h_cgroup->hugepage[idx], NULL);
> + }
> + return&h_cgroup->css;
^^^
ditto.
> +}
> +
> +static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> +{
> + struct hugetlb_cgroup *h_cgroup;
> +
> + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
> + kfree(h_cgroup);
> +}
> +
> +static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> +{
> + /* We will add the cgroup removal support in later patches */
> + return -EBUSY;
> +}
-ENOTSUP rather than -EBUSY :p
> +
> +struct cgroup_subsys hugetlb_subsys = {
> + .name = "hugetlb",
> + .create = hugetlb_cgroup_create,
> + .pre_destroy = hugetlb_cgroup_pre_destroy,
> + .destroy = hugetlb_cgroup_destroy,
> + .subsys_id = hugetlb_subsys_id,
> +};
Thanks,
-Kame
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
> the usage of the hugetlb cgroup to hugepages with 3 or more
> normal pages. I guess that is an acceptable limitation.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
> ---
> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
> mm/hugetlb.c | 4 ++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index 5794be4..ceff1d5 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
> };
>
> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> +{
> + if (!PageHuge(page))
> + return NULL;
I'm not very sure but....
VM_BUG_ON(!PageHuge(page)) ??
> + if (compound_order(page) < 3)
> + return NULL;
> + return (struct hugetlb_cgroup *)page[2].lru.next;
> +}
> +
> +static inline
> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
> +{
> + if (!PageHuge(page))
> + return -1;
ditto.
> + if (compound_order(page) < 3)
> + return -1;
> + page[2].lru.next = (void *)h_cg;
> + return 0;
> +}
> +
Thanks,
-Kame
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> This patchset add the charge and uncharge routines for hugetlb cgroup.
> This will be used in later patches when we allocate/free HugeTLB
> pages.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
I'm sorry if the following has already been pointed out.
> ---
> mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 87 insertions(+)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 20a32c5..48efd5a 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> return -EBUSY;
> }
>
> +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
> + struct hugetlb_cgroup **ptr)
> +{
> + int ret = 0;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *h_cg = NULL;
> + unsigned long csize = nr_pages * PAGE_SIZE;
> +
> + if (hugetlb_cgroup_disabled())
> + goto done;
> + /*
> + * We don't charge any cgroup if the compound page have less
> + * than 3 pages.
> + */
> + if (hstates[idx].order < 2)
> + goto done;
> +again:
> + rcu_read_lock();
> + h_cg = hugetlb_cgroup_from_task(current);
> + if (!h_cg)
> + h_cg = root_h_cgroup;
> +
> + if (!css_tryget(&h_cg->css)) {
> + rcu_read_unlock();
> + goto again;
> + }
> + rcu_read_unlock();
> +
> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> + css_put(&h_cg->css);
> +done:
> + *ptr = h_cg;
> + return ret;
> +}
> +
Memory cgroup uses a very complicated 'charge' routine for handling pageout,
which can sleep.
For hugetlbfs there is no sleeping path, so you can do the charge in a simple way.
I guess the get/put here is overkill.
For example, h_cg cannot be freed while it has tasks. So, if 'current'
belongs to the cgroup, it cannot disappear. Then you don't need the get/put,
which are additional atomic ops for holding the cgroup:
        rcu_read_lock();
        h_cg = hugetlb_cgroup_from_task(current);
        ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
        rcu_read_unlock();
        return ret;
Thanks,
-Kame
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> This patch adds support for cgroup removal. If we don't have a parent
> cgroup, the charges are moved to the root cgroup.
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
I'm sorry if this has already been pointed out.
> ---
> mm/hugetlb_cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 48efd5a..9458fe3 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -99,10 +99,87 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
> kfree(h_cgroup);
> }
>
> +
> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
> + struct page *page)
> +{
> + int csize;
> + struct res_counter *counter;
> + struct res_counter *fail_res;
> + struct hugetlb_cgroup *page_hcg;
> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
> +
> + if (!get_page_unless_zero(page))
> + goto out;
It seems this isn't necessary... this is under hugetlb_lock.
> +
> + page_hcg = hugetlb_cgroup_from_page(page);
> + /*
> + * We can have pages in active list without any cgroup
> + * ie, hugepage with less than 3 pages. We can safely
> + * ignore those pages.
> + */
> + if (!page_hcg || page_hcg != h_cg)
> + goto err_out;
> +
> + csize = PAGE_SIZE << compound_order(page);
> + if (!parent) {
> + parent = root_h_cgroup;
> + /* root has no limit */
> + res_counter_charge_nofail(&parent->hugepage[idx],
> + csize,&fail_res);
^^^
space ?
> + }
> + counter = &h_cg->hugepage[idx];
> + res_counter_uncharge_until(counter, counter->parent, csize);
> +
> + set_hugetlb_cgroup(page, parent);
> +err_out:
> + put_page(page);
> +out:
> + return 0;
> +}
> +
> +/*
> + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
> + * the parent cgroup.
> + */
> static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> {
> - /* We will add the cgroup removal support in later patches */
> - return -EBUSY;
> + struct hstate *h;
> + struct page *page;
> + int ret = 0, idx = 0;
> +
> + do {
> + if (cgroup_task_count(cgroup) ||
> + !list_empty(&cgroup->children)) {
> + ret = -EBUSY;
> + goto out;
> + }
> + /*
> + * If the task doing the cgroup_rmdir got a signal
> + * we don't really need to loop till the hugetlb resource
> + * usage become zero.
> + */
> + if (signal_pending(current)) {
> + ret = -EINTR;
> + goto out;
> + }
I'll post a patch to remove this check from memcg because memcg's rmdir
always succeeds now. So, could you remove this?
> + for_each_hstate(h) {
> + spin_lock(&hugetlb_lock);
> + list_for_each_entry(page, &h->hugepage_activelist, lru) {
> + ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
> + if (ret) {
When should 'ret' be non-zero?
If hugetlb_cgroup_move_parent() always returns 0, the check will not be necessary.
Thanks,
-Kame
> + spin_unlock(&hugetlb_lock);
> + goto out;
> + }
> + }
> + spin_unlock(&hugetlb_lock);
> + idx++;
> + }
> + cond_resched();
> + } while (hugetlb_cgroup_have_usage(cgroup));
> +out:
> + return ret;
> }
>
> int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
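(Folding in both comments above, no signal check and no error path from
hugetlb_cgroup_move_parent(), the pre-destroy loop would reduce to roughly
the following sketch; it assumes move_parent is made void or its return
value can be ignored:

static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
        struct hstate *h;
        struct page *page;
        int idx;

        do {
                if (cgroup_task_count(cgroup) ||
                    !list_empty(&cgroup->children))
                        return -EBUSY;

                idx = 0;
                for_each_hstate(h) {
                        spin_lock(&hugetlb_lock);
                        list_for_each_entry(page, &h->hugepage_activelist, lru)
                                hugetlb_cgroup_move_parent(idx, cgroup, page);
                        spin_unlock(&hugetlb_lock);
                        idx++;
                }
                cond_resched();
        } while (hugetlb_cgroup_have_usage(cgroup));
        return 0;
}
)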
(2012/06/09 17:59), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Add the control files for hugetlb controller
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
> ---
> include/linux/hugetlb.h | 5 ++
> include/linux/hugetlb_cgroup.h | 6 ++
> mm/hugetlb.c | 8 +++
> mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 149 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4aca057..9650bb1 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -4,6 +4,7 @@
> #include <linux/mm_types.h>
> #include <linux/fs.h>
> #include <linux/hugetlb_inline.h>
> +#include <linux/cgroup.h>
>
> struct ctl_table;
> struct user_struct;
> @@ -221,6 +222,10 @@ struct hstate {
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> unsigned int free_huge_pages_node[MAX_NUMNODES];
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
> + /* cgroup control files */
> + struct cftype cgroup_files[5];
> +#endif
> char name[HSTATE_NAME_LEN];
> };
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index ceff1d5..ba4836f 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -62,6 +62,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> struct page *page);
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> +extern int hugetlb_cgroup_file_init(int idx) __init;
> #else
> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> {
> @@ -106,5 +107,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> {
> return;
> }
> +
> +static inline int __init hugetlb_cgroup_file_init(int idx)
> +{
> + return 0;
> +}
> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1ca2d8f..bf79131 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -30,6 +30,7 @@
> #include <linux/hugetlb.h>
> #include <linux/hugetlb_cgroup.h>
> #include <linux/node.h>
> +#include <linux/hugetlb_cgroup.h>
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1916,6 +1917,13 @@ void __init hugetlb_add_hstate(unsigned order)
> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
> + /*
> + * Add cgroup control files only if the huge page consists
> + * of more than two normal pages. This is because we use
> + * page[2].lru.next for storing cgoup details.
> + */
> + if (order >= 2)
> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>
What happens at hugetlb module exit? Please see hugetlb_exit().
BTW, is module unload of hugetlbfs restricted if the hugetlb cgroup is mounted?
Thanks,
-Kame
(2012/06/09 18:00), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> With HugeTLB pages, hugetlb cgroup is uncharged in compound page destructor. Since
> we are holding a hugepage reference, we can be sure that old page won't
> get uncharged till the last put_page().
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
one comment.
> ---
> include/linux/hugetlb_cgroup.h | 8 ++++++++
> mm/hugetlb_cgroup.c | 21 +++++++++++++++++++++
> mm/migrate.c | 5 +++++
> 3 files changed, 34 insertions(+)
>
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index ba4836f..b64d067 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -63,6 +63,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> struct hugetlb_cgroup *h_cg);
> extern int hugetlb_cgroup_file_init(int idx) __init;
> +extern void hugetlb_cgroup_migrate(struct page *oldhpage,
> + struct page *newhpage);
> #else
> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
> {
> @@ -112,5 +114,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
> {
> return 0;
> }
> +
> +static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
> + struct page *newhpage)
> +{
> + return;
> +}
> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
> #endif
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index c2b7b8e..2d384fe 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -394,6 +394,27 @@ int __init hugetlb_cgroup_file_init(int idx)
> return 0;
> }
>
> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
> +{
> + struct hugetlb_cgroup *h_cg;
> +
> + VM_BUG_ON(!PageHuge(oldhpage));
> +
> + if (hugetlb_cgroup_disabled())
> + return;
> +
> + spin_lock(&hugetlb_lock);
> + h_cg = hugetlb_cgroup_from_page(oldhpage);
> + set_hugetlb_cgroup(oldhpage, NULL);
> + cgroup_exclude_rmdir(&h_cg->css);
> +
> + /* move the h_cg details to new cgroup */
> + set_hugetlb_cgroup(newhpage, h_cg);
> + spin_unlock(&hugetlb_lock);
> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
> + return;
Why do you need the cgroup exclude/release rmdir calls here? You hold
hugetlb_lock, and the charges will not be empty here.
Thanks,
-Kame
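(If the rmdir exclusion really is unnecessary here, since the cgroup still
holds charges and everything runs under hugetlb_lock, the function reduces
to the following sketch:

void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
{
        struct hugetlb_cgroup *h_cg;

        VM_BUG_ON(!PageHuge(oldhpage));

        if (hugetlb_cgroup_disabled())
                return;

        spin_lock(&hugetlb_lock);
        h_cg = hugetlb_cgroup_from_page(oldhpage);
        set_hugetlb_cgroup(oldhpage, NULL);
        /* move the h_cg details to the new page */
        set_hugetlb_cgroup(newhpage, h_cg);
        spin_unlock(&hugetlb_lock);
}
)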
(2012/06/09 18:00), Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V"<[email protected]>
>
> Signed-off-by: Aneesh Kumar K.V<[email protected]>
Documentation in patch 1/16 would help other reviewers.
> ---
> Documentation/cgroups/hugetlb.txt | 45 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 45 insertions(+)
> create mode 100644 Documentation/cgroups/hugetlb.txt
>
> diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroups/hugetlb.txt
> new file mode 100644
> index 0000000..a9faaca
> --- /dev/null
> +++ b/Documentation/cgroups/hugetlb.txt
> @@ -0,0 +1,45 @@
> +HugeTLB Controller
> +-------------------
> +
> +The HugeTLB controller allows to limit the HugeTLB usage per control group and
> +enforces the controller limit during page fault. Since HugeTLB doesn't
> +support page reclaim, enforcing the limit at page fault time implies that,
> +the application will get SIGBUS signal if it tries to access HugeTLB pages
> +beyond its limit. This requires the application to know beforehand how much
> +HugeTLB pages it would require for its use.
> +
Isn't it better to mention that the hugetlb cgroup doesn't have its own
free-huge-page list; it is just a quota. The system admin needs to set up the
hugetlb page pool regardless of whether the hugetlb cgroup is used.
> +HugeTLB controller can be created by first mounting the cgroup filesystem.
> +
> +# mount -t cgroup -o hugetlb none /sys/fs/cgroup
> +
> +With the above step, the initial or the parent HugeTLB group becomes
> +visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
> +the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
> +
> +New groups can be created under the parent group /sys/fs/cgroup.
> +
> +# cd /sys/fs/cgroup
> +# mkdir g1
> +# echo $$ > g1/tasks
> +
> +The above steps create a new group g1 and move the current shell
> +process (bash) into it.
> +
> +Brief summary of control files
> +
> + hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
> + hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
> + hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb
> + hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
^^^^^^^^
breakage in spacing.
> +
> +For a system supporting two hugepage size (16M and 16G) the control
> +files include:
> +
> +hugetlb.16GB.limit_in_bytes
> +hugetlb.16GB.max_usage_in_bytes
> +hugetlb.16GB.usage_in_bytes
> +hugetlb.16GB.failcnt
> +hugetlb.16MB.limit_in_bytes
> +hugetlb.16MB.max_usage_in_bytes
> +hugetlb.16MB.usage_in_bytes
> +hugetlb.16MB.failcnt
seems nice.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> This patch implements a new controller that allows us to control HugeTLB
>> allocations. The extension allows to limit the HugeTLB usage per control
>> group and enforces the controller limit during page fault. Since HugeTLB
>> doesn't support page reclaim, enforcing the limit at page fault time implies
>> that, the application will get SIGBUS signal if it tries to access HugeTLB
>> pages beyond its limit. This requires the application to know beforehand
>> how much HugeTLB pages it would require for its use.
>>
>> The charge/uncharge calls will be added to HugeTLB code in later patch.
>> Support for cgroup removal will be added in later patches.
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> some nitpick below.
>
>> ---
>> include/linux/cgroup_subsys.h | 6 +++
>> include/linux/hugetlb_cgroup.h | 79 ++++++++++++++++++++++++++++
>> init/Kconfig | 16 ++++++
>> mm/Makefile | 1 +
>> mm/hugetlb_cgroup.c | 114 ++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 216 insertions(+)
>> create mode 100644 include/linux/hugetlb_cgroup.h
>> create mode 100644 mm/hugetlb_cgroup.c
>>
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index 0bd390c..895923a 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -72,3 +72,9 @@ SUBSYS(net_prio)
>> #endif
>>
>> /* */
>> +
>> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> +SUBSYS(hugetlb)
>> +#endif
>> +
>> +/* */
>> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> new file mode 100644
>> index 0000000..5794be4
>> --- /dev/null
>> +++ b/include/linux/hugetlb_cgroup.h
>> @@ -0,0 +1,79 @@
>> +/*
>> + * Copyright IBM Corporation, 2012
>> + * Author Aneesh Kumar K.V<[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>> + * as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it would be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> + *
>> + */
>> +
>> +#ifndef _LINUX_HUGETLB_CGROUP_H
>> +#define _LINUX_HUGETLB_CGROUP_H
>> +
>> +#include <linux/res_counter.h>
>> +
>> +struct hugetlb_cgroup {
>> + struct cgroup_subsys_state css;
>> + /*
>> + * the counter to account for hugepages from hugetlb.
>> + */
>> + struct res_counter hugepage[HUGE_MAX_HSTATE];
>> +};
>> +
>> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> +static inline bool hugetlb_cgroup_disabled(void)
>> +{
>> + if (hugetlb_subsys.disabled)
>> + return true;
>> + return false;
>> +}
>> +
>> +extern int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup **ptr);
>> +extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg,
>> + struct page *page);
>> +extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>> + struct page *page);
>> +extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg);
>> +#else
>> +static inline bool hugetlb_cgroup_disabled(void)
>> +{
>> + return true;
>> +}
>> +
>> +static inline int
>> +hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup **ptr)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline void
>> +hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg,
>> + struct page *page)
>> +{
>> + return;
>> +}
>> +
>> +static inline void
>> +hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
>> +{
>> + return;
>> +}
>> +
>> +static inline void
>> +hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup *h_cg)
>> +{
>> + return;
>> +}
>> +#endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
>> +#endif
>> diff --git a/init/Kconfig b/init/Kconfig
>> index d07dcf9..b9a0d0a 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -751,6 +751,22 @@ config CGROUP_MEM_RES_CTLR_KMEM
>> the kmem extension can use it to guarantee that no group of processes
>> will ever exhaust kernel resources alone.
>>
>> +config CGROUP_HUGETLB_RES_CTLR
>> + bool "HugeTLB Resource Controller for Control Groups"
>> + depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
>> + select PAGE_CGROUP
>> + default n
>> + help
>> + Provides a simple cgroup Resource Controller for HugeTLB pages.
>
> What do you want to say by adding 'simple' here?
>
removed
>
>> + When you enable this, you can put a per cgroup limit on HugeTLB usage.
>> + The limit is enforced during page fault. Since HugeTLB doesn't
>> + support page reclaim, enforcing the limit at page fault time implies
>> + that, the application will get SIGBUS signal if it tries to access
>> + HugeTLB pages beyond its limit. This requires the application to know
>> + beforehand how much HugeTLB pages it would require for its use. The
>> + control group is tracked in the third page lru pointer. This means
>> + that we cannot use the controller with huge page less than 3 pages.
>> +
>> config CGROUP_PERF
>> bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>> depends on PERF_EVENTS && CGROUPS
>> diff --git a/mm/Makefile b/mm/Makefile
>> index a156285..a8dd8d5 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -48,6 +48,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>> obj-$(CONFIG_QUICKLIST) += quicklist.o
>> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
>> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
>> +obj-$(CONFIG_CGROUP_HUGETLB_RES_CTLR) += hugetlb_cgroup.o
>> obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
>> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> new file mode 100644
>> index 0000000..20a32c5
>> --- /dev/null
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -0,0 +1,114 @@
>> +/*
>> + *
>> + * Copyright IBM Corporation, 2012
>> + * Author Aneesh Kumar K.V<[email protected]>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of version 2.1 of the GNU Lesser General Public License
>> + * as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it would be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>> + *
>> + */
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/slab.h>
>> +#include <linux/hugetlb.h>
>> +#include <linux/hugetlb_cgroup.h>
>> +
>> +struct cgroup_subsys hugetlb_subsys __read_mostly;
>> +struct hugetlb_cgroup *root_h_cgroup __read_mostly;
>> +
>> +static inline
>> +struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
>> +{
>> + if (s)
>> + return container_of(s, struct hugetlb_cgroup, css);
>> + return NULL;
>> +}
>> +
>> +static inline
>> +struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup)
>> +{
>> + return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup,
>> + hugetlb_subsys_id));
>> +}
>> +
>> +static inline
>> +struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task)
>> +{
>> + return hugetlb_cgroup_from_css(task_subsys_state(task,
>> + hugetlb_subsys_id));
>> +}
>> +
>> +static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg)
>> +{
>> + return (h_cg == root_h_cgroup);
>> +}
>> +
>> +static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg)
>> +{
>> + if (!cg->parent)
>> + return NULL;
>> + return hugetlb_cgroup_from_cgroup(cg->parent);
>> +}
>> +
>> +static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg)
>> +{
>> + int idx;
>> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg);
>> +
>> + for (idx = 0; idx < hugetlb_max_hstate; idx++) {
>> + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0)
>> + return 1;
>> + }
>> + return 0;
>> +}
>
> How about using true/false rather than 0/1 ?
done
>
>
>> +
>> +static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup)
>> +{
>> + int idx;
>> + struct cgroup *parent_cgroup;
>> + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup;
>> +
>> + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
>> + if (!h_cgroup)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + parent_cgroup = cgroup->parent;
>> + if (parent_cgroup) {
>> + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup);
>> + for (idx = 0; idx< HUGE_MAX_HSTATE; idx++)
> ^^^
> adding a space is better.
I don't have that in my patch, and I also don't see the original
mail. checkpatch will warn about such errors and I didn't get any such
warnings, so it is most likely an error introduced by your mail client?
>
>> + res_counter_init(&h_cgroup->hugepage[idx],
>> + &parent_h_cgroup->hugepage[idx]);
>> + } else {
>> + root_h_cgroup = h_cgroup;
>> + for (idx = 0; idx< HUGE_MAX_HSTATE; idx++)
> ^^^
>
>> + res_counter_init(&h_cgroup->hugepage[idx], NULL);
>> + }
>> + return&h_cgroup->css;
> ^^^
> ditto.
>
>> +}
>> +
>> +static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
>> +{
>> + struct hugetlb_cgroup *h_cgroup;
>> +
>> + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup);
>> + kfree(h_cgroup);
>> +}
>> +
>> +static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
>> +{
>> + /* We will add the cgroup removal support in later patches */
>> + return -EBUSY;
>> +}
>
> -ENOTSUP rather than -EBUSY :p
A later patch removes this. What I wanted the intermediate patches to
state is that we are not ready to be removed because we still have some
charges.
>
>> +
>> +struct cgroup_subsys hugetlb_subsys = {
>> + .name = "hugetlb",
>> + .create = hugetlb_cgroup_create,
>> + .pre_destroy = hugetlb_cgroup_pre_destroy,
>> + .destroy = hugetlb_cgroup_destroy,
>> + .subsys_id = hugetlb_subsys_id,
>> +};
>
>
> Thanks,
> -Kame
-aneesh
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits
>> the use of the hugetlb cgroup to hugepages with 3 or more
>> normal pages. I guess that is an acceptable limitation.
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>> ---
>> include/linux/hugetlb_cgroup.h | 31 +++++++++++++++++++++++++++++++
>> mm/hugetlb.c | 4 ++++
>> 2 files changed, 35 insertions(+)
>>
>> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> index 5794be4..ceff1d5 100644
>> --- a/include/linux/hugetlb_cgroup.h
>> +++ b/include/linux/hugetlb_cgroup.h
>> @@ -26,6 +26,26 @@ struct hugetlb_cgroup {
>> };
>>
>> #ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> +{
>> + if (!PageHuge(page))
>> + return NULL;
>
> I'm not very sure but....
>
> VM_BUG_ON(!PageHuge(page)) ??
>
>
>
>> + if (compound_order(page) < 3)
>> + return NULL;
>> + return (struct hugetlb_cgroup *)page[2].lru.next;
>> +}
>> +
>> +static inline
>> +int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
>> +{
>> + if (!PageHuge(page))
>> + return -1;
>
> ditto.
>
>> + if (compound_order(page) < 3)
>> + return -1;
>> + page[2].lru.next = (void *)h_cg;
>> + return 0;
>> +}
>> +
done
-aneesh
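For reference, the helpers with the VM_BUG_ON() suggestion folded in
(a sketch of the "done" above, not the reposted patch) might look like:

	/* Sketch only: the hugetlb_cgroup.h helpers with Kame's
	 * VM_BUG_ON() suggestion applied in place of the NULL/-1
	 * early returns for non-huge pages. */
	static inline
	struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
	{
		VM_BUG_ON(!PageHuge(page));

		if (compound_order(page) < 3)
			return NULL;
		return (struct hugetlb_cgroup *)page[2].lru.next;
	}

	static inline
	int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
	{
		VM_BUG_ON(!PageHuge(page));

		if (compound_order(page) < 3)
			return -1;
		page[2].lru.next = (void *)h_cg;
		return 0;
	}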
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> This patch adds the charge and uncharge routines for hugetlb cgroup.
>> This will be used in later patches when we allocate/free HugeTLB
>> pages.
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>
>
> I'm sorry if the following has already been pointed out.
>
>> ---
>> mm/hugetlb_cgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 87 insertions(+)
>>
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> index 20a32c5..48efd5a 100644
>> --- a/mm/hugetlb_cgroup.c
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -105,6 +105,93 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
>> return -EBUSY;
>> }
>>
>> +int hugetlb_cgroup_charge_page(int idx, unsigned long nr_pages,
>> + struct hugetlb_cgroup **ptr)
>> +{
>> + int ret = 0;
>> + struct res_counter *fail_res;
>> + struct hugetlb_cgroup *h_cg = NULL;
>> + unsigned long csize = nr_pages * PAGE_SIZE;
>> +
>> + if (hugetlb_cgroup_disabled())
>> + goto done;
>> + /*
>> + * We don't charge any cgroup if the compound page has fewer
>> + * than 3 pages.
>> + */
>> + if (hstates[idx].order < 2)
>> + goto done;
>> +again:
>> + rcu_read_lock();
>> + h_cg = hugetlb_cgroup_from_task(current);
>> + if (!h_cg)
>> + h_cg = root_h_cgroup;
>> +
>> + if (!css_tryget(&h_cg->css)) {
>> + rcu_read_unlock();
>> + goto again;
>> + }
>> + rcu_read_unlock();
>> +
>> + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
>> + css_put(&h_cg->css);
>> +done:
>> + *ptr = h_cg;
>> + return ret;
>> +}
>> +
>
> Memory cgroup uses a very complicated 'charge' routine for handling pageout...
> which can sleep.
>
> For hugetlbfs, there is no sleep routine, so you can do the charge in a
> simple way. I guess... get/put here is overkill.
>
> For example, h_cg cannot be freed while it has tasks. So, if 'current'
> belongs to the cgroup, it cannot disappear. Then, you don't need get/put,
> which are additional atomic ops for holding the cgroup.
>
> rcu_read_lock();
> h_cg = hugetlb_cgroup_from_task(current);
> ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
> rcu_read_unlock();
>
> return ret;
>
What if the task got moved out of the cgroup and the cgroup got deleted by
an rmdir?
-aneesh
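To make the race behind this question concrete, here is a sketch of the
interleaving it worries about, assuming the simplified charge path above
(no css_tryget) and no other serialization between charge, task-move and
rmdir:

	/*
	 * Sketch of the suspected race with the simplified charge path:
	 *
	 *   faulting task T (in cgroup C)        admin
	 *   -----------------------------        ---------------------------
	 *   rcu_read_lock();
	 *   h_cg = hugetlb_cgroup_from_task(T);
	 *                                        move T out of C
	 *                                        rmdir(C): ->pre_destroy()
	 *                                        sees no usage, no tasks,
	 *                                        and succeeds
	 *   res_counter_charge(&h_cg->hugepage[idx], ...);
	 *   rcu_read_unlock();
	 *                                        ->destroy(): kfree(h_cg)
	 *
	 * rcu_read_lock() only delays ->destroy() until T leaves the
	 * read-side critical section; the charge can still land in a
	 * cgroup that rmdir has already decided is empty.
	 */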
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> This patch add support for cgroup removal. If we don't have parent
>> cgroup, the charges are moved to root cgroup.
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>
> I'm sorry if already pointed out....
>
>> ---
>> mm/hugetlb_cgroup.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 79 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> index 48efd5a..9458fe3 100644
>> --- a/mm/hugetlb_cgroup.c
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -99,10 +99,87 @@ static void hugetlb_cgroup_destroy(struct cgroup *cgroup)
>> kfree(h_cgroup);
>> }
>>
>> +
>> +static int hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
>> + struct page *page)
>> +{
>> + int csize;
>> + struct res_counter *counter;
>> + struct res_counter *fail_res;
>> + struct hugetlb_cgroup *page_hcg;
>> + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
>> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);
>> +
>> + if (!get_page_unless_zero(page))
>> + goto out;
>
> It seems this isn't necessary... this is done under hugetlb_lock.
already updated.
>
>> +
>> + page_hcg = hugetlb_cgroup_from_page(page);
>> + /*
>> + * We can have pages on the active list without any cgroup,
>> + * i.e., hugepages with fewer than 3 normal pages. We can
>> + * safely ignore those pages.
>> + */
>> + if (!page_hcg || page_hcg != h_cg)
>> + goto err_out;
>> +
>> + csize = PAGE_SIZE << compound_order(page);
>> + if (!parent) {
>> + parent = root_h_cgroup;
>> + /* root has no limit */
>> + res_counter_charge_nofail(&parent->hugepage[idx],
>> + csize,&fail_res);
> ^^^
> space ?
I don't have the code this way locally; maybe a mail client error?
>
>> + }
>> + counter = &h_cg->hugepage[idx];
>> + res_counter_uncharge_until(counter, counter->parent, csize);
>> +
>> + set_hugetlb_cgroup(page, parent);
>> +err_out:
>> + put_page(page);
>> +out:
>> + return 0;
>> +}
>> +
>> +/*
>> + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
>> + * the parent cgroup.
>> + */
>> static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
>> {
>> - /* We will add the cgroup removal support in later patches */
>> - return -EBUSY;
>> + struct hstate *h;
>> + struct page *page;
>> + int ret = 0, idx = 0;
>> +
>> + do {
>> + if (cgroup_task_count(cgroup) ||
>> + !list_empty(&cgroup->children)) {
>> + ret = -EBUSY;
>> + goto out;
>> + }
Is this check going to be moved to higher levels? Do we still need
it? Or will that happen when pre_destroy becomes void?
>
>> + /*
>> + * If the task doing the cgroup_rmdir got a signal
>> + * we don't really need to loop till the hugetlb resource
>> + * usage becomes zero.
>> + */
>> + if (signal_pending(current)) {
>> + ret = -EINTR;
>> + goto out;
>> + }
>
> I'll post a patch to remove this check from memcg because memcg's rmdir
> always succeeds now. So, could you remove this?
Will drop this
>
>
>> + for_each_hstate(h) {
>> + spin_lock(&hugetlb_lock);
>> + list_for_each_entry(page, &h->hugepage_activelist, lru) {
>> + ret = hugetlb_cgroup_move_parent(idx, cgroup, page);
>> + if (ret) {
>
> When should 'ret' be non-zero?
> If hugetlb_cgroup_move_parent() always returns 0, the check is not necessary.
>
I will make this a void function.
-aneesh
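Folding the points agreed above together (no get_page_unless_zero() since
the caller holds hugetlb_lock, and a void return), the helper might end up
roughly as below; this is a sketch of where the review points, not the
reposted patch:

	static void hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup,
					       struct page *page)
	{
		int csize;
		struct res_counter *counter;
		struct res_counter *fail_res;
		struct hugetlb_cgroup *page_hcg;
		struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup);
		struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup);

		/* caller holds hugetlb_lock, so the page cannot go away under us */
		page_hcg = hugetlb_cgroup_from_page(page);
		/*
		 * Pages on the active list may carry no cgroup (hugepages made
		 * of fewer than 3 normal pages); skip those, and skip pages
		 * charged to some other cgroup.
		 */
		if (!page_hcg || page_hcg != h_cg)
			return;

		csize = PAGE_SIZE << compound_order(page);
		if (!parent) {
			parent = root_h_cgroup;
			/* root has no limit */
			res_counter_charge_nofail(&parent->hugepage[idx],
						  csize, &fail_res);
		}
		counter = &h_cg->hugepage[idx];
		/* move the charge up to (or uncharge at) the parent */
		res_counter_uncharge_until(counter, counter->parent, csize);

		set_hugetlb_cgroup(page, parent);
	}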
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> Add the control files for hugetlb controller
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>> ---
>> include/linux/hugetlb.h | 5 ++
>> include/linux/hugetlb_cgroup.h | 6 ++
>> mm/hugetlb.c | 8 +++
>> mm/hugetlb_cgroup.c | 130 ++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 149 insertions(+)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 4aca057..9650bb1 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -4,6 +4,7 @@
>> #include <linux/mm_types.h>
>> #include <linux/fs.h>
>> #include <linux/hugetlb_inline.h>
>> +#include <linux/cgroup.h>
>>
>> struct ctl_table;
>> struct user_struct;
>> @@ -221,6 +222,10 @@ struct hstate {
>> unsigned int nr_huge_pages_node[MAX_NUMNODES];
>> unsigned int free_huge_pages_node[MAX_NUMNODES];
>> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
>> +#ifdef CONFIG_CGROUP_HUGETLB_RES_CTLR
>> + /* cgroup control files */
>> + struct cftype cgroup_files[5];
>> +#endif
>> char name[HSTATE_NAME_LEN];
>> };
>>
>> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> index ceff1d5..ba4836f 100644
>> --- a/include/linux/hugetlb_cgroup.h
>> +++ b/include/linux/hugetlb_cgroup.h
>> @@ -62,6 +62,7 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>> struct page *page);
>> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> struct hugetlb_cgroup *h_cg);
>> +extern int hugetlb_cgroup_file_init(int idx) __init;
>> #else
>> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> {
>> @@ -106,5 +107,10 @@ hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> {
>> return;
>> }
>> +
>> +static inline int __init hugetlb_cgroup_file_init(int idx)
>> +{
>> + return 0;
>> +}
>> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
>> #endif
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1ca2d8f..bf79131 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -30,6 +30,7 @@
>> #include <linux/hugetlb.h>
>> #include <linux/hugetlb_cgroup.h>
>> #include <linux/node.h>
>> +#include <linux/hugetlb_cgroup.h>
>> #include "internal.h"
>>
>> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
>> @@ -1916,6 +1917,13 @@ void __init hugetlb_add_hstate(unsigned order)
>> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
>> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>> huge_page_size(h)/1024);
>> + /*
>> + * Add cgroup control files only if the huge page consists
>> + * of more than two normal pages. This is because we use
>> + * page[2].lru.next for storing cgroup details.
>> + */
>> + if (order >= 2)
>> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>>
>
> What happens at hugetlb module exit ? please see hugetlb_exit().
>
> BTW, module unload of hugetlbfs is restricted if hugetlb cgroup is mounted ??
>
hugetlb is a binary config
config HUGETLBFS
bool "HugeTLB file system support"
config HUGETLB_PAGE
def_bool HUGETLBFS
-aneesh
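As a usage note for the control files this patch adds: once the limit file
exists, a job scheduler would set it from userspace before launching the
job. A minimal sketch follows; the cgroupfs mount point and the
hugetlb.2MB.limit_in_bytes file name are assumptions (the cftype names are
not shown in the quoted hunks), so treat them as illustrative only:

	/*
	 * Hypothetical userspace sketch: set a HugeTLB limit for a job
	 * cgroup.  Mount point and control-file name are assumed, not
	 * taken from the posted patches.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char *limit =
			"/sys/fs/cgroup/hugetlb/jobs/hugetlb.2MB.limit_in_bytes";
		const char *val = "1073741824\n";	/* 512 x 2MB pages */
		int fd = open(limit, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, val, strlen(val)) != (ssize_t)strlen(val))
			perror("write");
		close(fd);
		return 0;
	}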
Kamezawa Hiroyuki <[email protected]> writes:
> (2012/06/09 18:00), Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V"<[email protected]>
>>
>> With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
>> destructor. Since we are holding a hugepage reference, we can be sure that
>> the old page won't get uncharged till the last put_page().
>>
>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>
> one comment.
>
>> ---
>> include/linux/hugetlb_cgroup.h | 8 ++++++++
>> mm/hugetlb_cgroup.c | 21 +++++++++++++++++++++
>> mm/migrate.c | 5 +++++
>> 3 files changed, 34 insertions(+)
>>
>> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
>> index ba4836f..b64d067 100644
>> --- a/include/linux/hugetlb_cgroup.h
>> +++ b/include/linux/hugetlb_cgroup.h
>> @@ -63,6 +63,8 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>> extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
>> struct hugetlb_cgroup *h_cg);
>> extern int hugetlb_cgroup_file_init(int idx) __init;
>> +extern void hugetlb_cgroup_migrate(struct page *oldhpage,
>> + struct page *newhpage);
>> #else
>> static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
>> {
>> @@ -112,5 +114,11 @@ static inline int __init hugetlb_cgroup_file_init(int idx)
>> {
>> return 0;
>> }
>> +
>> +static inline void hugetlb_cgroup_migrate(struct page *oldhpage,
>> + struct page *newhpage)
>> +{
>> + return;
>> +}
>> #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */
>> #endif
>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>> index c2b7b8e..2d384fe 100644
>> --- a/mm/hugetlb_cgroup.c
>> +++ b/mm/hugetlb_cgroup.c
>> @@ -394,6 +394,27 @@ int __init hugetlb_cgroup_file_init(int idx)
>> return 0;
>> }
>>
>> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>> +{
>> + struct hugetlb_cgroup *h_cg;
>> +
>> + VM_BUG_ON(!PageHuge(oldhpage));
>> +
>> + if (hugetlb_cgroup_disabled())
>> + return;
>> +
>> + spin_lock(&hugetlb_lock);
>> + h_cg = hugetlb_cgroup_from_page(oldhpage);
>> + set_hugetlb_cgroup(oldhpage, NULL);
>> + cgroup_exclude_rmdir(&h_cg->css);
>> +
>> + /* move the h_cg details to new cgroup */
>> + set_hugetlb_cgroup(newhpage, h_cg);
>> + spin_unlock(&hugetlb_lock);
>> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
>> + return;
>
>
> Why do you need cgroup_exclude/release rmdir here? You hold hugetlb_lock,
> and the charges will not be empty here.
>
But even with a non-empty charge, we can still remove the cgroup, right?
i.e., if we don't have any tasks but some charges remain in the cgroup
because of a shared mmap in hugetlbfs.
-aneesh
(2012/06/12 19:58), Aneesh Kumar K.V wrote:
> Kamezawa Hiroyuki<[email protected]> writes:
>
>> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>>> From: "Aneesh Kumar K.V"<[email protected]>
>>>
>>> Add the control files for hugetlb controller
>>>
>>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>>> [...]
>>> @@ -1916,6 +1917,13 @@ void __init hugetlb_add_hstate(unsigned order)
>>> h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]);
>>> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>>> huge_page_size(h)/1024);
>>> + /*
>>> + * Add cgroup control files only if the huge page consists
>>> + * of more than two normal pages. This is because we use
>>> + * page[2].lru.next for storing cgroup details.
>>> + */
>>> + if (order >= 2)
>>> + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1);
>>>
>>
>> What happens at hugetlb module exit ? please see hugetlb_exit().
>>
>> BTW, module unload of hugetlbfs is restricted if hugetlb cgroup is mounted ??
>>
>
> hugetlb is a binary config
>
> config HUGETLBFS
> bool "HugeTLB file system support"
>
> config HUGETLB_PAGE
> def_bool HUGETLBFS
>
OK, so... hugetlb_exit() is never called?
Thanks,
-Kame
(2012/06/12 19:50), Aneesh Kumar K.V wrote:
> Kamezawa Hiroyuki<[email protected]> writes:
>
>> (2012/06/09 17:59), Aneesh Kumar K.V wrote:
>>> From: "Aneesh Kumar K.V"<[email protected]>
>>>
>>> This patch adds the charge and uncharge routines for hugetlb cgroup.
>>> This will be used in later patches when we allocate/free HugeTLB
>>> pages.
>>>
>>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>>
>>
>> I'm sorry if the following has already been pointed out.
>>
>>> [...]
>>
>> Memory cgroup uses a very complicated 'charge' routine for handling pageout...
>> which can sleep.
>>
>> For hugetlbfs, there is no sleep routine, so you can do the charge in a
>> simple way. I guess... get/put here is overkill.
>>
>> For example, h_cg cannot be freed while it has tasks. So, if 'current'
>> belongs to the cgroup, it cannot disappear. Then, you don't need get/put,
>> which are additional atomic ops for holding the cgroup.
>>
>> rcu_read_lock();
>> h_cg = hugetlb_cgroup_from_task(current);
>> ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res);
>> rcu_read_unlock();
>>
>> return ret;
>>
>
> What if the task got moved out of the cgroup and the cgroup got deleted by
> an rmdir?
>
I think
- yes, the task, 'current', can be moved out of the cgroup.
- rcu_read_lock() prevents ->destroy() of the cgroup.
Then, the concern is that the cgroup may have resource usage even after
->pre_destroy() is called. We don't have any serialization between
charging <-> task_move <-> rmdir().
How about taking

	down_write(&mm->mmap_sem);
	up_write(&mm->mmap_sem);

at task move (->attach())? This will serialize task-move and charging
without any realistic performance impact. If tasks cannot move, rmdir
never happens.
Maybe you can do this later as an optimization. So, please take this as
a suggestion.
Thanks,
-Kame
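A sketch of what that ->attach() serialization could look like follows.
The callback is hypothetical (not part of the posted series);
cgroup_taskset_for_each(), get_task_mm() and the two-argument ->attach()
signature are assumed per the cgroup API of this era, and
down_write/up_write are used because mmap_sem is an rw_semaphore:

	/*
	 * Hypothetical ->attach() callback implementing Kame's suggestion:
	 * take-and-drop mmap_sem for writing so any in-flight hugetlb
	 * fault (and hence any in-flight charge) of the moved tasks has
	 * completed before the move is considered done.
	 */
	static void hugetlb_cgroup_attach(struct cgroup *cgroup,
					  struct cgroup_taskset *tset)
	{
		struct task_struct *task;
		struct mm_struct *mm;

		/* iterate every task in this move (NULL: skip no cgroup) */
		cgroup_taskset_for_each(task, NULL, tset) {
			mm = get_task_mm(task);
			if (!mm)
				continue;
			down_write(&mm->mmap_sem);	/* drain charging faults */
			up_write(&mm->mmap_sem);
			mmput(mm);
		}
	}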
(2012/06/12 20:00), Aneesh Kumar K.V wrote:
> Kamezawa Hiroyuki<[email protected]> writes:
>
>> (2012/06/09 18:00), Aneesh Kumar K.V wrote:
>>> From: "Aneesh Kumar K.V"<[email protected]>
>>>
>>> With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
>>> destructor. Since we are holding a hugepage reference, we can be sure that
>>> the old page won't get uncharged till the last put_page().
>>>
>>> Signed-off-by: Aneesh Kumar K.V<[email protected]>
>>
>> one comment.
>>
>>> [...]
>>> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
>>> index c2b7b8e..2d384fe 100644
>>> --- a/mm/hugetlb_cgroup.c
>>> +++ b/mm/hugetlb_cgroup.c
>>> @@ -394,6 +394,27 @@ int __init hugetlb_cgroup_file_init(int idx)
>>> return 0;
>>> }
>>>
>>> +void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
>>> +{
>>> + struct hugetlb_cgroup *h_cg;
>>> +
>>> + VM_BUG_ON(!PageHuge(oldhpage));
>>> +
>>> + if (hugetlb_cgroup_disabled())
>>> + return;
>>> +
>>> + spin_lock(&hugetlb_lock);
>>> + h_cg = hugetlb_cgroup_from_page(oldhpage);
>>> + set_hugetlb_cgroup(oldhpage, NULL);
>>> + cgroup_exclude_rmdir(&h_cg->css);
>>> +
>>> + /* move the h_cg details to new cgroup */
>>> + set_hugetlb_cgroup(newhpage, h_cg);
>>> + spin_unlock(&hugetlb_lock);
>>> + cgroup_release_and_wakeup_rmdir(&h_cg->css);
>>> + return;
>>
>>
>> Why do you need cgroup_exclude/release rmdir here? You hold hugetlb_lock,
>> and the charges will not be empty here.
>>
>
> But even with a non-empty charge, we can still remove the cgroup, right?
> i.e., if we don't have any tasks but some charges remain in the cgroup
> because of a shared mmap in hugetlbfs.
>
IIUC, cgroup_exclude_rmdir() is for putting the rmdir() task into a sleep
state and avoiding busy retries. Here, the current thread is invoking
rmdir() against the cgroup. From kernel/cgroup.c, the flow is roughly:

 (1) set the RMDIR bit
 (2) mutex_unlock(&cgroup_mutex)
 (3) call ->pre_destroy()
 (4) mutex_lock(&cgroup_mutex)
 (5) if the cgroup still has some refcnt, sleep and wait for an event:
     some thread calling cgroup_release_and_wakeup_rmdir().

So, after the wakeup, the waiter calls ->pre_destroy() again and should
succeed; it is waiting for a wakeup event from
cgroup_release_and_wakeup_rmdir() issued by some other thread holding a
refcnt on the cgroup.

In the memcg case, kswapd or some other holder may hold a reference count
on the memcg, and the wakeup event in (5) will be issued. In this hugetlb
case, that doesn't seem to happen.
Thanks,
-Kame