This effort is the result of a recent bug report [1]. In subsequent
discussions [2], it was deemed necessary to properly fix the hugetlb
put_page path (free_huge_page). This RFC provides a possible way to
address the issue. Comments are welcome/encouraged as several attempts
at this have been made in the past.
This series is based on v5.12-rc3. At a high level, this series does:
- Patches 1 & 2 are cleanups/fixes to existing code in the areas to be
modified.
- Patches 3, 4 & 5 are aimed at reducing lock hold times. To be clear,
the goal is to eliminate single lock hold times of a long duration.
Overall lock hold time is not addressed. In addition, the known
long hold time in hugetlb_cgroup_css_offline still needs to be
addressed (suggestions welcome).
- Patch 6 makes hugetlb_lock and subpool lock IRQ safe. It also reverts
the code which defers calls to a workqueue if !in_task.
- Patch 7 adds code to defer freeing of pages to a workqueue if the freeing
routines could possibly sleep.
- Patch 8 adds a flag to gigantic pages allocated from CMA so that we
only defer freeing those pages.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] http://lkml.kernel.org/r/[email protected]
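For the free_huge_page() path itself, the end result of the series is
roughly the sketch below. It is simplified: the common case of simply
returning the page to the free list, and all surplus/reservation
handling, are omitted.

void free_huge_page(struct page *page)
{
	struct hstate *h = page_hstate(page);
	unsigned long flags;

	spin_lock_irqsave(&hugetlb_lock, flags);	/* patch 6: lock is irq safe */
	remove_hugetlb_page(h, page, false);		/* patch 3: off hugetlb lists, adjust counts */
	spin_unlock_irqrestore(&hugetlb_lock, flags);

	/* patches 7 & 8: hand off to a workqueue if freeing might sleep */
	update_and_free_page_no_sleep(h, page);
}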
Mike Kravetz (8):
hugetlb: add per-hstate mutex to synchronize user adjustments
hugetlb: recompute min_count when dropping hugetlb_lock
hugetlb: create remove_hugetlb_page() to separate functionality
hugetlb: call update_and_free_page without hugetlb_lock
hugetlb: change free_pool_huge_page to remove_pool_huge_page
hugetlb: make free_huge_page irq safe
hugetlb: add update_and_free_page_no_sleep for irq context
hugetlb: track hugetlb pages allocated via cma_alloc
include/linux/hugetlb.h | 20 +-
mm/hugetlb.c | 450 +++++++++++++++++++++++++---------------
mm/hugetlb_cgroup.c | 10 +-
3 files changed, 310 insertions(+), 170 deletions(-)
--
2.30.2
The locks acquired in free_huge_page are irq safe. However, in certain
circumstances the routine update_and_free_page could sleep. Since
free_huge_page can be called from any context, it can not sleep.
Use a workqueue to defer freeing of pages if the operation may sleep. A
new routine update_and_free_page_no_sleep provides this functionality
and is only called from free_huge_page.
Note that any 'pages' sent to the workqueue for deferred freeing have
already been removed from the hugetlb subsystem. What is actually
deferred is returning those base pages to the low level allocator.
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 12 +++++-
mm/hugetlb.c | 86 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 94 insertions(+), 4 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f42d44050548..a81ca39c06be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -666,9 +666,14 @@ static inline unsigned huge_page_shift(struct hstate *h)
return h->order + PAGE_SHIFT;
}
+static inline bool order_is_gigantic(unsigned int order)
+{
+ return order >= MAX_ORDER;
+}
+
static inline bool hstate_is_gigantic(struct hstate *h)
{
- return huge_page_order(h) >= MAX_ORDER;
+ return order_is_gigantic(huge_page_order(h));
}
static inline unsigned int pages_per_huge_page(struct hstate *h)
@@ -942,6 +947,11 @@ static inline unsigned int huge_page_shift(struct hstate *h)
return PAGE_SHIFT;
}
+static inline bool order_is_gigantic(unsigned int order)
+{
+ return false;
+}
+
static inline bool hstate_is_gigantic(struct hstate *h)
{
return false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82614bbe7bb9..b8304b290a73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1351,7 +1351,60 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
h->nr_huge_pages_node[nid]--;
}
-static void update_and_free_page(struct hstate *h, struct page *page)
+/*
+ * free_huge_page() can be called from any context. However, the freeing
+ * of a hugetlb page can potentially sleep. If freeing will sleep, defer
+ * the actual freeing to a workqueue to prevent sleeping in contexts where
+ * sleeping is not allowed.
+ *
+ * Use the page->mapping pointer as a llist_node structure for the lockless
+ * linked list of pages to be freed. free_hpage_workfn() locklessly
+ * retrieves the linked list of pages to be freed and frees them one-by-one.
+ *
+ * The page passed to __free_huge_page is technically not a hugetlb page, so
+ * we can not use interfaces such as page_hstate().
+ */
+static void __free_huge_page(struct page *page)
+{
+ unsigned int order = compound_order(page);
+
+ if (order_is_gigantic(order)) {
+ destroy_compound_gigantic_page(page, order);
+ free_gigantic_page(page, order);
+ } else {
+ __free_pages(page, order);
+ }
+}
+
+static LLIST_HEAD(hpage_freelist);
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+ struct llist_node *node;
+ struct page *page;
+
+ node = llist_del_all(&hpage_freelist);
+
+ while (node) {
+ page = container_of((struct address_space **)node,
+ struct page, mapping);
+ node = node->next;
+ __free_huge_page(page);
+ }
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+static bool free_page_may_sleep(struct hstate *h, struct page *page)
+{
+ /* freeing gigantic pages in CMA may sleep */
+ if (hstate_is_gigantic(h))
+ return true;
+
+ return false;
+}
+
+static void __update_and_free_page(struct hstate *h, struct page *page,
+ bool can_sleep)
{
int i;
struct page *subpage = page;
@@ -1366,6 +1419,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
+
+ if (!can_sleep && free_page_may_sleep(h, page)) {
+ /*
+ * Send page freeing to workqueue
+ *
+ * Only call schedule_work() if hpage_freelist is previously
+ * empty. Otherwise, schedule_work() had been called but the
+ * workfn hasn't retrieved the list yet.
+ */
+ if (llist_add((struct llist_node *)&page->mapping,
+ &hpage_freelist))
+ schedule_work(&free_hpage_work);
+ return;
+ }
+
if (hstate_is_gigantic(h)) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
@@ -1374,6 +1442,18 @@ static void update_and_free_page(struct hstate *h, struct page *page)
}
}
+static void update_and_free_page_no_sleep(struct hstate *h, struct page *page)
+{
+ /* can not sleep */
+ return __update_and_free_page(h, page, false);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page)
+{
+ /* can sleep */
+ return __update_and_free_page(h, page, true);
+}
+
struct hstate *size_to_hstate(unsigned long size)
{
struct hstate *h;
@@ -1436,12 +1516,12 @@ void free_huge_page(struct page *page)
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
spin_unlock_irqrestore(&hugetlb_lock, flags);
- update_and_free_page(h, page);
+ update_and_free_page_no_sleep(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
spin_unlock_irqrestore(&hugetlb_lock, flags);
- update_and_free_page(h, page);
+ update_and_free_page_no_sleep(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
--
2.30.2
free_pool_huge_page was called with hugetlb_lock held. It would remove
a hugetlb page, and then free the corresponding pages to the lower level
allocators such as buddy. free_pool_huge_page was called in a loop to
remove hugetlb pages, and these loops could hold the hugetlb_lock for a
considerable time.
Create a new routine, remove_pool_huge_page, to replace
free_pool_huge_page. remove_pool_huge_page only removes the hugetlb
page and must be called with the hugetlb_lock held. It returns the
removed page, and it is the caller's responsibility to free the page to
the lower level allocators. The hugetlb_lock is dropped before freeing
to these allocators, which results in shorter lock hold times.
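The resulting caller pattern is roughly the sketch below (a simplified,
single-page version of the call sites changed in this patch):

	spin_lock(&hugetlb_lock);
	page = remove_pool_huge_page(h, nodes_allowed, acct_surplus);
	spin_unlock(&hugetlb_lock);

	if (page)
		update_and_free_page(h, page);	/* may sleep; lock not held */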
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 53 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 23 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3028cf10d504..f60a24e326c2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1184,7 +1184,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
}
/*
- * helper for free_pool_huge_page() - return the previously saved
+ * helper for remove_pool_huge_page() - return the previously saved
* node ["this node"] from which to free a huge page. Advance the
* next node id whether or not we find a free huge page to free so
* that the next attempt to free addresses the next node.
@@ -1699,16 +1699,18 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
}
/*
- * Free huge page from pool from next node to free.
- * Attempt to keep persistent huge pages more or less
- * balanced over allowed nodes.
+ * Remove huge page from pool from next node to free. Attempt to keep
+ * persistent huge pages more or less balanced over allowed nodes.
+ * This routine only 'removes' the hugetlb page. The caller must make
+ * an additional call to free the page to low level allocators.
* Called with hugetlb_lock locked.
*/
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
- bool acct_surplus)
+static struct page *remove_pool_huge_page(struct hstate *h,
+ nodemask_t *nodes_allowed,
+ bool acct_surplus)
{
int nr_nodes, node;
- int ret = 0;
+ struct page *page = NULL;
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
/*
@@ -1717,23 +1719,14 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
*/
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(&h->hugepage_freelists[node])) {
- struct page *page =
- list_entry(h->hugepage_freelists[node].next,
+ page = list_entry(h->hugepage_freelists[node].next,
struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
- /*
- * unlock/lock around update_and_free_page is temporary
- * and will be removed with subsequent patch.
- */
- spin_unlock(&hugetlb_lock);
- update_and_free_page(h, page);
- spin_lock(&hugetlb_lock);
- ret = 1;
break;
}
}
- return ret;
+ return page;
}
/*
@@ -2064,6 +2057,7 @@ static void return_unused_surplus_pages(struct hstate *h,
unsigned long unused_resv_pages)
{
unsigned long nr_pages;
+ struct page *page;
/* Cannot return gigantic pages currently */
if (hstate_is_gigantic(h))
@@ -2080,7 +2074,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* evenly across all nodes with memory. Iterate across these nodes
* until we can no longer free unreserved surplus pages. This occurs
* when the nodes with surplus pages have no free pages.
- * free_pool_huge_page() will balance the freed pages across the
+ * remove_pool_huge_page() will balance the freed pages across the
* on-line nodes with memory and will handle the hstate accounting.
*
* Note that we decrement resv_huge_pages as we free the pages. If
@@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
while (nr_pages--) {
h->resv_huge_pages--;
unused_resv_pages--;
- if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
+ page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
+ if (!page)
goto out;
- cond_resched_lock(&hugetlb_lock);
+
+ /* Drop lock and free page to buddy as it could sleep */
+ spin_unlock(&hugetlb_lock);
+ update_and_free_page(h, page);
+ cond_resched();
+ spin_lock(&hugetlb_lock);
}
out:
@@ -2631,6 +2631,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
unsigned long min_count, ret;
+ struct page *page;
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
/*
@@ -2740,9 +2741,15 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
min_count = min_hp_count(h, count);
try_to_free_low(h, count, nodes_allowed);
while (min_count < persistent_huge_pages(h)) {
- if (!free_pool_huge_page(h, nodes_allowed, 0))
+ page = remove_pool_huge_page(h, nodes_allowed, 0);
+ if (!page)
break;
- cond_resched_lock(&hugetlb_lock);
+
+ /* Drop lock as free routines may sleep */
+ spin_unlock(&hugetlb_lock);
+ update_and_free_page(h, page);
+ cond_resched();
+ spin_lock(&hugetlb_lock);
/* Recompute min_count in case hugetlb_lock was dropped */
min_count = min_hp_count(h, count);
--
2.30.2
With the introduction of remove_hugetlb_page(), there is no need for
update_and_free_page to hold the hugetlb lock. Change all callers to
drop the lock before calling.
With additional code modifications, this will allow loops which decrease
the huge page pool to drop the hugetlb_lock with each page to reduce
long hold times.
The ugly unlock/lock cycle in free_pool_huge_page will be removed in
a subsequent patch which restructures free_pool_huge_page.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ae185d3315e0..3028cf10d504 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1362,14 +1362,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
1 << PG_writeback);
}
if (hstate_is_gigantic(h)) {
- /*
- * Temporarily drop the hugetlb_lock, because
- * we might block in free_gigantic_page().
- */
- spin_unlock(&hugetlb_lock);
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
- spin_lock(&hugetlb_lock);
} else {
__free_pages(page, huge_page_order(h));
}
@@ -1435,16 +1429,18 @@ static void __free_huge_page(struct page *page)
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
+ spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
+ spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
+ spin_unlock(&hugetlb_lock);
}
- spin_unlock(&hugetlb_lock);
}
/*
@@ -1725,7 +1721,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
list_entry(h->hugepage_freelists[node].next,
struct page, lru);
remove_hugetlb_page(h, page, acct_surplus);
+ /*
+ * unlock/lock around update_and_free_page is temporary
+ * and will be removed with subsequent patch.
+ */
+ spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+ spin_lock(&hugetlb_lock);
ret = 1;
break;
}
@@ -1794,8 +1796,9 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
+ spin_unlock(&hugetlb_lock);
update_and_free_page(h, head);
- rc = 0;
+ return 0;
}
out:
spin_unlock(&hugetlb_lock);
@@ -2572,7 +2575,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
remove_hugetlb_page(h, page, false);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
+ spin_unlock(&hugetlb_lock);
update_and_free_page(h, page);
+ spin_lock(&hugetlb_lock);
/*
* update_and_free_page could have dropped lock so
--
2.30.2
Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
non-task context") was added to address the issue of free_huge_page
being called from irq context. That commit hands off free_huge_page
processing to a workqueue if !in_task. However, as seen in [1] this
does not cover all cases. Instead, make the locks taken in the
free_huge_page irq safe.
This patch does the following:
- Make hugetlb_lock irq safe. This is mostly a simple process of
changing spin_*lock calls to spin_*lock_irq* calls.
- Make subpool lock irq safe in a similar manner.
- Revert the !in_task check and workqueue handoff.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 206 ++++++++++++++++++++------------------------
mm/hugetlb_cgroup.c | 10 ++-
2 files changed, 100 insertions(+), 116 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f60a24e326c2..82614bbe7bb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -94,9 +94,10 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
return true;
}
-static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
+static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
+ unsigned long irq_flags)
{
- spin_unlock(&spool->lock);
+ spin_unlock_irqrestore(&spool->lock, irq_flags);
/* If no pages are used, and no other handles to the subpool
* remain, give up any reservations based on minimum size and
@@ -135,10 +136,12 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
void hugepage_put_subpool(struct hugepage_subpool *spool)
{
- spin_lock(&spool->lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&spool->lock, flags);
BUG_ON(!spool->count);
spool->count--;
- unlock_or_release_subpool(spool);
+ unlock_or_release_subpool(spool, flags);
}
/*
@@ -153,11 +156,12 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
long delta)
{
long ret = delta;
+ unsigned long flags;
if (!spool)
return ret;
- spin_lock(&spool->lock);
+ spin_lock_irqsave(&spool->lock, flags);
if (spool->max_hpages != -1) { /* maximum size accounting */
if ((spool->used_hpages + delta) <= spool->max_hpages)
@@ -184,7 +188,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
}
unlock_ret:
- spin_unlock(&spool->lock);
+ spin_unlock_irqrestore(&spool->lock, flags);
return ret;
}
@@ -198,11 +202,12 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
long delta)
{
long ret = delta;
+ unsigned long flags;
if (!spool)
return delta;
- spin_lock(&spool->lock);
+ spin_lock_irqsave(&spool->lock, flags);
if (spool->max_hpages != -1) /* maximum size accounting */
spool->used_hpages -= delta;
@@ -223,7 +228,7 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
* If hugetlbfs_put_super couldn't free spool due to an outstanding
* quota reference, free it now.
*/
- unlock_or_release_subpool(spool);
+ unlock_or_release_subpool(spool, flags);
return ret;
}
@@ -1380,7 +1385,7 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
}
-static void __free_huge_page(struct page *page)
+void free_huge_page(struct page *page)
{
/*
* Can't pass hstate in here because it is called from the
@@ -1390,6 +1395,7 @@ static void __free_huge_page(struct page *page)
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
bool restore_reserve;
+ unsigned long flags;
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(page_mapcount(page), page);
@@ -1418,7 +1424,7 @@ static void __free_huge_page(struct page *page)
restore_reserve = true;
}
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
ClearHPageMigratable(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
pages_per_huge_page(h), page);
@@ -1429,80 +1435,34 @@ static void __free_huge_page(struct page *page)
if (HPageTemporary(page)) {
remove_hugetlb_page(h, page, false);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page);
} else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
remove_hugetlb_page(h, page, true);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
- spin_unlock(&hugetlb_lock);
- }
-}
-
-/*
- * As free_huge_page() can be called from a non-task context, we have
- * to defer the actual freeing in a workqueue to prevent potential
- * hugetlb_lock deadlock.
- *
- * free_hpage_workfn() locklessly retrieves the linked list of pages to
- * be freed and frees them one-by-one. As the page->mapping pointer is
- * going to be cleared in __free_huge_page() anyway, it is reused as the
- * llist_node structure of a lockless linked list of huge pages to be freed.
- */
-static LLIST_HEAD(hpage_freelist);
-
-static void free_hpage_workfn(struct work_struct *work)
-{
- struct llist_node *node;
- struct page *page;
-
- node = llist_del_all(&hpage_freelist);
-
- while (node) {
- page = container_of((struct address_space **)node,
- struct page, mapping);
- node = node->next;
- __free_huge_page(page);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
}
-static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
-
-void free_huge_page(struct page *page)
-{
- /*
- * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
- */
- if (!in_task()) {
- /*
- * Only call schedule_work() if hpage_freelist is previously
- * empty. Otherwise, schedule_work() had been called but the
- * workfn hasn't retrieved the list yet.
- */
- if (llist_add((struct llist_node *)&page->mapping,
- &hpage_freelist))
- schedule_work(&free_hpage_work);
- return;
- }
-
- __free_huge_page(page);
-}
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
+ unsigned long flags;
+
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
hugetlb_set_page_subpool(page, NULL);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
ClearHPageFreed(page);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1742,13 +1702,14 @@ static struct page *remove_pool_huge_page(struct hstate *h,
int dissolve_free_huge_page(struct page *page)
{
int rc = -EBUSY;
+ unsigned long flags;
retry:
/* Not to disrupt normal path by vainly holding hugetlb_lock */
if (!PageHuge(page))
return 0;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (!PageHuge(page)) {
rc = 0;
goto out;
@@ -1765,7 +1726,12 @@ int dissolve_free_huge_page(struct page *page)
* when it is dissolved.
*/
if (unlikely(!HPageFreed(head))) {
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+
+ /*
+ * ??? Does this retry make any sense now that
+ * hugetlb_lock is held with irqs disabled ???
+ */
cond_resched();
/*
@@ -1789,12 +1755,12 @@ int dissolve_free_huge_page(struct page *page)
}
remove_hugetlb_page(h, page, false);
h->max_huge_pages--;
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, head);
return 0;
}
out:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return rc;
}
@@ -1832,20 +1798,21 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
struct page *page = NULL;
+ unsigned long flags;
if (hstate_is_gigantic(h))
return NULL;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
goto out_unlock;
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
if (!page)
return NULL;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
/*
* We could have raced with the pool size change.
* Double check that and simply deallocate the new page
@@ -1855,7 +1822,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
*/
if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
SetHPageTemporary(page);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
put_page(page);
return NULL;
} else {
@@ -1864,7 +1831,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
}
out_unlock:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return page;
}
@@ -1914,17 +1881,19 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask)
{
- spin_lock(&hugetlb_lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (h->free_huge_pages - h->resv_huge_pages > 0) {
struct page *page;
page = dequeue_huge_page_nodemask(h, gfp_mask, preferred_nid, nmask);
if (page) {
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return page;
}
}
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}
@@ -1951,7 +1920,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
* Increase the hugetlb pool such that it can accommodate a reservation
* of size 'delta'.
*/
-static int gather_surplus_pages(struct hstate *h, long delta)
+static int gather_surplus_pages(struct hstate *h, long delta,
+ unsigned long *irq_flags)
__must_hold(&hugetlb_lock)
{
struct list_head surplus_list;
@@ -1972,7 +1942,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
ret = -ENOMEM;
retry:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, *irq_flags);
for (i = 0; i < needed; i++) {
page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
NUMA_NO_NODE, NULL);
@@ -1989,7 +1959,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* After retaking hugetlb_lock, we need to recalculate 'needed'
* because either resv_huge_pages or free_huge_pages may have changed.
*/
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, *irq_flags);
needed = (h->resv_huge_pages + delta) -
(h->free_huge_pages + allocated);
if (needed > 0) {
@@ -2029,12 +1999,12 @@ static int gather_surplus_pages(struct hstate *h, long delta)
enqueue_huge_page(h, page);
}
free:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, *irq_flags);
/* Free unnecessary surplus pages to the buddy allocator */
list_for_each_entry_safe(page, tmp, &surplus_list, lru)
put_page(page);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, *irq_flags);
return ret;
}
@@ -2054,7 +2024,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
* number of huge pages we plan to free when dropping the lock.
*/
static void return_unused_surplus_pages(struct hstate *h,
- unsigned long unused_resv_pages)
+ unsigned long unused_resv_pages,
+ unsigned long *irq_flags)
{
unsigned long nr_pages;
struct page *page;
@@ -2089,10 +2060,10 @@ static void return_unused_surplus_pages(struct hstate *h,
goto out;
/* Drop lock and free page to buddy as it could sleep */
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, *irq_flags);
update_and_free_page(h, page);
cond_resched();
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, *irq_flags);
}
out:
@@ -2281,6 +2252,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
int ret, idx;
struct hugetlb_cgroup *h_cg;
bool deferred_reserve;
+ unsigned long flags;
idx = hstate_index(h);
/*
@@ -2332,7 +2304,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
if (ret)
goto out_uncharge_cgroup_reservation;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
/*
* glb_chg is passed to indicate whether or not a page must be taken
* from the global free pool (global change). gbl_chg == 0 indicates
@@ -2340,7 +2312,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
*/
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
if (!page) {
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
if (!page)
goto out_uncharge_cgroup;
@@ -2348,7 +2320,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
SetHPageRestoreReserve(page);
h->resv_huge_pages--;
}
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
list_add(&page->lru, &h->hugepage_activelist);
/* Fall through */
}
@@ -2361,7 +2333,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
h_cg, page);
}
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
hugetlb_set_page_subpool(page, spool);
@@ -2556,7 +2528,8 @@ static inline unsigned long min_hp_count(struct hstate *h, unsigned long count)
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(struct hstate *h, unsigned long count,
- nodemask_t *nodes_allowed)
+ nodemask_t *nodes_allowed,
+ unsigned long *irq_flags)
{
int i;
unsigned long min_count = min_hp_count(h, count);
@@ -2575,9 +2548,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
remove_hugetlb_page(h, page, false);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, *irq_flags);
update_and_free_page(h, page);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, *irq_flags);
/*
* update_and_free_page could have dropped lock so
@@ -2589,7 +2562,8 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
}
#else
static inline void try_to_free_low(struct hstate *h, unsigned long count,
- nodemask_t *nodes_allowed)
+ nodemask_t *nodes_allowed,
+ unsigned long *irq_flags)
{
}
#endif
@@ -2633,6 +2607,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
unsigned long min_count, ret;
struct page *page;
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
+ unsigned long flags;
/*
* Bit mask controlling how hard we retry per-node allocations.
@@ -2646,7 +2621,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
/* mutex prevents concurrent adjustments for the same hstate */
mutex_lock(&h->mutex);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
/*
* Check for a node specific request.
@@ -2677,7 +2652,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
*/
if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
if (count > persistent_huge_pages(h)) {
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
mutex_unlock(&h->mutex);
NODEMASK_FREE(node_alloc_noretry);
return -EINVAL;
@@ -2707,14 +2682,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* page, free_huge_page will handle it by freeing the page
* and reducing the surplus.
*/
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
/* yield cpu to avoid soft lockup */
cond_resched();
ret = alloc_pool_huge_page(h, nodes_allowed,
node_alloc_noretry);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (!ret)
goto out;
@@ -2739,7 +2714,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* sysctls are changed, or the surplus pages go out of use.
*/
min_count = min_hp_count(h, count);
- try_to_free_low(h, count, nodes_allowed);
+ try_to_free_low(h, count, nodes_allowed, &flags);
while (min_count < persistent_huge_pages(h)) {
page = remove_pool_huge_page(h, nodes_allowed, 0);
if (!page)
@@ -2760,7 +2735,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
}
out:
h->max_huge_pages = persistent_huge_pages(h);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
mutex_unlock(&h->mutex);
NODEMASK_FREE(node_alloc_noretry);
@@ -2908,6 +2883,7 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
int err;
unsigned long input;
struct hstate *h = kobj_to_hstate(kobj, NULL);
+ unsigned long flags;
if (hstate_is_gigantic(h))
return -EINVAL;
@@ -2916,9 +2892,9 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
if (err)
return err;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
h->nr_overcommit_huge_pages = input;
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return count;
}
@@ -3490,6 +3466,7 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
struct hstate *h = &default_hstate;
unsigned long tmp;
int ret;
+ unsigned long flags;
if (!hugepages_supported())
return -EOPNOTSUPP;
@@ -3505,9 +3482,9 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
goto out;
if (write) {
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
h->nr_overcommit_huge_pages = tmp;
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
out:
return ret;
@@ -3599,11 +3576,12 @@ unsigned long hugetlb_total_pages(void)
static int hugetlb_acct_memory(struct hstate *h, long delta)
{
int ret = -ENOMEM;
+ unsigned long flags;
if (!delta)
return 0;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
/*
* When cpuset is configured, it breaks the strict hugetlb page
* reservation as the accounting is done on a global variable. Such
@@ -3628,21 +3606,21 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
* above.
*/
if (delta > 0) {
- if (gather_surplus_pages(h, delta) < 0)
+ if (gather_surplus_pages(h, delta, &flags) < 0)
goto out;
if (delta > allowed_mems_nr(h)) {
- return_unused_surplus_pages(h, delta);
+ return_unused_surplus_pages(h, delta, &flags);
goto out;
}
}
ret = 0;
if (delta < 0)
- return_unused_surplus_pages(h, (unsigned long) -delta);
+ return_unused_surplus_pages(h, (unsigned long) -delta, &flags);
out:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return ret;
}
@@ -5654,8 +5632,9 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int fla
bool isolate_huge_page(struct page *page, struct list_head *list)
{
bool ret = true;
+ unsigned long flags;
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (!PageHeadHuge(page) ||
!HPageMigratable(page) ||
!get_page_unless_zero(page)) {
@@ -5665,22 +5644,25 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
ClearHPageMigratable(page);
list_move_tail(&page->lru, list);
unlock:
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return ret;
}
void putback_active_hugepage(struct page *page)
{
- spin_lock(&hugetlb_lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&hugetlb_lock, flags);
SetHPageMigratable(page);
list_move_tail(&page->lru, &(page_hstate(page))->hugepage_activelist);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
put_page(page);
}
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
{
struct hstate *h = page_hstate(oldpage);
+ unsigned long flags;
hugetlb_cgroup_migrate(oldpage, newpage);
set_page_owner_migrate_reason(newpage, reason);
@@ -5702,12 +5684,12 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
SetHPageTemporary(oldpage);
ClearHPageTemporary(newpage);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
if (h->surplus_huge_pages_node[old_nid]) {
h->surplus_huge_pages_node[old_nid]--;
h->surplus_huge_pages_node[new_nid]++;
}
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f68b51fcda3d..b3b5759e44b4 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -200,15 +200,16 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
struct hstate *h;
struct page *page;
int idx;
+ unsigned long flags;
do {
idx = 0;
for_each_hstate(h) {
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry(page, &h->hugepage_activelist, lru)
hugetlb_cgroup_move_parent(idx, h_cg, page);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
idx++;
}
cond_resched();
@@ -774,12 +775,13 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
struct hugetlb_cgroup *h_cg;
struct hugetlb_cgroup *h_cg_rsvd;
struct hstate *h = page_hstate(oldhpage);
+ unsigned long flags;
if (hugetlb_cgroup_disabled())
return;
VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage);
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
h_cg = hugetlb_cgroup_from_page(oldhpage);
h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
set_hugetlb_cgroup(oldhpage, NULL);
@@ -789,7 +791,7 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
set_hugetlb_cgroup(newhpage, h_cg);
set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
list_move(&newhpage->lru, &h->hugepage_activelist);
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
return;
}
--
2.30.2
On 3/19/21 3:42 PM, Mike Kravetz wrote:
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context. That commit hands off free_huge_page
> processing to a workqueue if !in_task. However, as seen in [1] this
> does not cover all cases. Instead, make the locks taken in the
> free_huge_page irq safe.
>
> This patch does the following:
> - Make hugetlb_lock irq safe. This is mostly a simple process of
> changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 206 ++++++++++++++++++++------------------------
> mm/hugetlb_cgroup.c | 10 ++-
> 2 files changed, 100 insertions(+), 116 deletions(-)
I missed the following changes:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5efff5ce337f..13d77d94d185 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2803,10 +2803,10 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
break;
/* Drop lock as free routines may sleep */
- spin_unlock(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page);
cond_resched();
- spin_lock(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
/* Recompute min_count in case hugetlb_lock was dropped */
min_count = min_hp_count(h, count);
On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe. However, in certain
> circumstances the routine update_and_free_page could sleep. Since
> free_huge_page can be called from any context, it can not sleep.
>
> Use a workqueue to defer freeing of pages if the operation may sleep. A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
>
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem. What is actually
> deferred is returning those base pages to the low level allocator.
So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
should be in cma_release().
Also, afaict cma_release() does free_contig_range() *first*, and then
does the 'difficult' bits. So how about you re-order
free_gigantic_page() a bit to make it unconditionally do
free_contig_range() and *then* call into CMA, which can then do a
workqueue thingy if it feels like it.
That way none of the hugetlb accounting is delayed, and only CMA gets to
suffer.
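A very rough sketch of that reordering; cma_defer_release() is purely
hypothetical here (no such interface exists), it just stands in for
whatever CMA-side bookkeeping would be pushed to a workqueue:

static void free_gigantic_page(struct page *page, unsigned int order)
{
	unsigned long pfn = page_to_pfn(page);

	/* return the pages to the allocator immediately; this does not sleep */
	free_contig_range(pfn, 1 << order);

#ifdef CONFIG_CMA
	/*
	 * Hypothetical CMA hook: clear the CMA bitmap later (e.g. from a
	 * workqueue) if the range belongs to the hugetlb CMA area.
	 */
	cma_defer_release(hugetlb_cma[page_to_nid(page)], pfn, 1 << order);
#endif
}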
Greeting,
FYI, we noticed the following commit (built with gcc-9):
commit: cd190f60f91cdd03f61aa8f52b2013ecfbb206be ("[RFC PATCH 6/8] hugetlb: make free_huge_page irq safe")
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/make-hugetlb-put_page-safe-for-all-calling-contexts/20210320-064419
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git a74e6a014c9d4d4161061f770c9b4f98372ac778
in testcase: ltp
version: ltp-x86_64-14c1f76-1_20210320
with following parameters:
disk: 1HDD
fs: btrfs
test: syscalls-03
ucode: 0xe2
test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features.
test-url: http://linux-test-project.github.io/
on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
[ 101.462202] BUG: sleeping function called from invalid context at mm/hugetlb.c:2723
[ 101.469409] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 4012, name: memfd_create03
[ 101.469411] CPU: 1 PID: 4012 Comm: memfd_create03 Tainted: G I 5.12.0-rc2-00302-gcd190f60f91c #1
[ 101.469413] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
[ 101.478536] Call Trace:
[ 101.497105] dump_stack (kbuild/src/consumer/lib/dump_stack.c:122)
[ 101.509261] ___might_sleep.cold (kbuild/src/consumer/kernel/sched/core.c:8331 kbuild/src/consumer/kernel/sched/core.c:8288)
[ 101.513204] __nr_hugepages_store_common (kbuild/src/consumer/include/linux/sched.h:1893 kbuild/src/consumer/mm/hugetlb.c:2723 kbuild/src/consumer/mm/hugetlb.c:2813)
[ 101.521893] ? __do_proc_doulongvec_minmax (kbuild/src/consumer/kernel/sysctl.c:1181)
[ 101.527473] hugetlb_sysctl_handler_common (kbuild/src/consumer/mm/hugetlb.c:3437)
[ 101.527475] ? alloc_huge_page (kbuild/src/consumer/mm/hugetlb.c:3445)
[ 101.537131] proc_sys_call_handler (kbuild/src/consumer/fs/proc/proc_sysctl.c:591)
[ 101.543750] new_sync_write (kbuild/src/consumer/fs/read_write.c:519 (discriminator 1))
[ 101.552890] vfs_write (kbuild/src/consumer/fs/read_write.c:605)
[ 101.552892] ksys_write (kbuild/src/consumer/fs/read_write.c:658)
[ 101.560714] do_syscall_64 (kbuild/src/consumer/arch/x86/entry/common.c:46)
[ 101.560717] entry_SYSCALL_64_after_hwframe (kbuild/src/consumer/arch/x86/entry/entry_64.S:112)
[ 101.566024] RIP: 0033:0x7f85fdfd5504
[ 101.566026] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
All code
========
0: 00 f7 add %dh,%bh
2: d8 64 89 02 fsubs 0x2(%rcx,%rcx,4)
6: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
d: eb b3 jmp 0xffffffffffffffc2
f: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
16: 48 8d 05 f9 61 0d 00 lea 0xd61f9(%rip),%rax # 0xd6216
1d: 8b 00 mov (%rax),%eax
1f: 85 c0 test %eax,%eax
21: 75 13 jne 0x36
23: b8 01 00 00 00 mov $0x1,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 54 ja 0x86
32: c3 retq
33: 0f 1f 00 nopl (%rax)
36: 41 54 push %r12
38: 49 89 d4 mov %rdx,%r12
3b: 55 push %rbp
3c: 48 89 f5 mov %rsi,%rbp
3f: 53 push %rbx
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 54 ja 0x5c
8: c3 retq
9: 0f 1f 00 nopl (%rax)
c: 41 54 push %r12
e: 49 89 d4 mov %rdx,%r12
11: 55 push %rbp
12: 48 89 f5 mov %rsi,%rbp
15: 53 push %rbx
[ 101.572809] RSP: 002b:00007ffcae478b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 101.572811] RAX: ffffffffffffffda RBX: 00007ffcae478b50 RCX: 00007f85fdfd5504
[ 101.577602] RDX: 0000000000000001 RSI: 00007ffcae478b50 RDI: 0000000000000004
[ 101.577603] RBP: 0000000000000004 R08: 0000000000000202 R09: 00007ffcae4788d7
[ 101.577604] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[ 101.577605] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Fri 19-03-21 15:42:05, Mike Kravetz wrote:
> With the introduction of remove_hugetlb_page(), there is no need for
> update_and_free_page to hold the hugetlb lock. Change all callers to
> drop the lock before calling.
>
> With additional code modifications, this will allow loops which decrease
> the huge page pool to drop the hugetlb_lock with each page to reduce
> long hold times.
>
> The ugly unlock/lock cycle in free_pool_huge_page will be removed in
> a subsequent patch which restructures free_pool_huge_page.
>
> Signed-off-by: Mike Kravetz <[email protected]>
Looks good to me. I will not ack it right now though. I am still
crawling through the series and want to get a full picture. So far it
looks promising ;).
> ---
> mm/hugetlb.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ae185d3315e0..3028cf10d504 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1362,14 +1362,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> 1 << PG_writeback);
> }
> if (hstate_is_gigantic(h)) {
> - /*
> - * Temporarily drop the hugetlb_lock, because
> - * we might block in free_gigantic_page().
> - */
> - spin_unlock(&hugetlb_lock);
> destroy_compound_gigantic_page(page, huge_page_order(h));
> free_gigantic_page(page, huge_page_order(h));
> - spin_lock(&hugetlb_lock);
> } else {
> __free_pages(page, huge_page_order(h));
> }
> @@ -1435,16 +1429,18 @@ static void __free_huge_page(struct page *page)
>
> if (HPageTemporary(page)) {
> remove_hugetlb_page(h, page, false);
> + spin_unlock(&hugetlb_lock);
> update_and_free_page(h, page);
> } else if (h->surplus_huge_pages_node[nid]) {
> /* remove the page from active list */
> remove_hugetlb_page(h, page, true);
> + spin_unlock(&hugetlb_lock);
> update_and_free_page(h, page);
> } else {
> arch_clear_hugepage_flags(page);
> enqueue_huge_page(h, page);
> + spin_unlock(&hugetlb_lock);
> }
> - spin_unlock(&hugetlb_lock);
> }
>
> /*
> @@ -1725,7 +1721,13 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> list_entry(h->hugepage_freelists[node].next,
> struct page, lru);
> remove_hugetlb_page(h, page, acct_surplus);
> + /*
> + * unlock/lock around update_and_free_page is temporary
> + * and will be removed with subsequent patch.
> + */
> + spin_unlock(&hugetlb_lock);
> update_and_free_page(h, page);
> + spin_lock(&hugetlb_lock);
> ret = 1;
> break;
> }
> @@ -1794,8 +1796,9 @@ int dissolve_free_huge_page(struct page *page)
> }
> remove_hugetlb_page(h, page, false);
> h->max_huge_pages--;
> + spin_unlock(&hugetlb_lock);
> update_and_free_page(h, head);
> - rc = 0;
> + return 0;
> }
> out:
> spin_unlock(&hugetlb_lock);
> @@ -2572,7 +2575,9 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
> remove_hugetlb_page(h, page, false);
> h->free_huge_pages--;
> h->free_huge_pages_node[page_to_nid(page)]--;
> + spin_unlock(&hugetlb_lock);
> update_and_free_page(h, page);
> + spin_lock(&hugetlb_lock);
>
> /*
> * update_and_free_page could have dropped lock so
> --
> 2.30.2
>
--
Michal Hocko
SUSE Labs
On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
[...]
> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
> while (nr_pages--) {
> h->resv_huge_pages--;
> unused_resv_pages--;
> - if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> + page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> + if (!page)
> goto out;
> - cond_resched_lock(&hugetlb_lock);
> +
> + /* Drop lock and free page to buddy as it could sleep */
> + spin_unlock(&hugetlb_lock);
> + update_and_free_page(h, page);
> + cond_resched();
> + spin_lock(&hugetlb_lock);
> }
>
> out:
This is likely a matter of taste but the repeated pattern of unlock,
update_and_free_page, cond_resched and lock seems rather clumsy.
Would it be slightly better/nicer to remove_pool_huge_page into a
list_head under a single lock invocation and then free up the whole lot
after the lock is dropped?
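Something along the lines of the sketch below, using
return_unused_surplus_pages() as the example (rough sketch only; the
reservation accounting in the loop is omitted, and page->lru can be
reused because remove_pool_huge_page() already takes the page off the
hugetlb lists):

	LIST_HEAD(page_list);
	struct page *page, *next;

	/* collect the pages under a single lock hold */
	spin_lock(&hugetlb_lock);
	while (nr_pages--) {
		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
		if (!page)
			break;
		list_add(&page->lru, &page_list);
	}
	spin_unlock(&hugetlb_lock);

	/* free the whole lot to the low level allocators without the lock */
	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		update_and_free_page(h, page);
		cond_resched();
	}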
--
Michal Hocko
SUSE Labs
On Fri 19-03-21 15:42:07, Mike Kravetz wrote:
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in
> non-task context") was added to address the issue of free_huge_page
> being called from irq context. That commit hands off free_huge_page
> processing to a workqueue if !in_task. However, as seen in [1] this
> does not cover all cases. Instead, make the locks taken in the
> free_huge_page irq safe.
>
> This patch does the following:
> - Make hugetlb_lock irq safe. This is mostly a simple process of
> changing spin_*lock calls to spin_*lock_irq* calls.
> - Make subpool lock irq safe in a similar manner.
> - Revert the !in_task check and workqueue handoff.
This is not sufficient (and 0day bot has captured that already). You
cannot call update_and_free_page from the same context.
--
Michal Hocko
SUSE Labs
On Fri 19-03-21 15:42:08, Mike Kravetz wrote:
> The locks acquired in free_huge_page are irq safe. However, in certain
> circumstances the routine update_and_free_page could sleep. Since
> free_huge_page can be called from any context, it can not sleep.
>
> Use a workqueue to defer freeing of pages if the operation may sleep. A
> new routine update_and_free_page_no_sleep provides this functionality
> and is only called from free_huge_page.
>
> Note that any 'pages' sent to the workqueue for deferred freeing have
> already been removed from the hugetlb subsystem. What is actually
> deferred is returning those base pages to the low level allocator.
This patch or its alternative would need to be applied prior to patch 6
which makes the whole context IRQ safe.
Besides that the changelog doesn't really say anything about changed
user visible behavior change. Now if somebody decreases the GB huge pool
from the userspace the real effect on the freed up memory will be
postponed to some later time. That "later" is unpredictable as it
depends on WQ utilization. We definitely need some sort of
wait_for_inflight pages. One way to do that would be to have a dedicated
WQ and schedule a sync work item after the pool has been shrunk and wait
for that item.
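One minimal way to get that effect with the free_hpage_work item from
the previous patch would be to flush it once the pool has been shrunk
(a dedicated workqueue plus a sync work item, as described above, would
behave similarly); a sketch:

	/* after set_max_huge_pages() has reduced the pool */
	flush_work(&free_hpage_work);	/* wait for deferred frees to finish */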
--
Michal Hocko
SUSE Labs
On Mon 22-03-21 15:42:27, Michal Hocko wrote:
[...]
> Besides that the changelog doesn't really say anything about changed
> user visible behavior change. Now if somebody decreases the GB huge pool
> from the userspace the real effect on the freed up memory will be
> postponed to some later time. That "later" is unpredictable as it
> depends on WQ utilization. We definitely need some sort of
> wait_for_inflight pages. One way to do that would be to have a dedicated
> WQ and schedule a sync work item after the pool has been shrunk and wait
> for that item.
Scratch that. It is not really clear from the patch context but after
looking at the resulting code set_max_huge_pages will use the blockable
update_and_free_page so we should be fine.
Sorry about the noise!
--
Michal Hocko
SUSE Labs
Cc: Roman, Christoph
On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>> The locks acquired in free_huge_page are irq safe. However, in certain
>> circumstances the routine update_and_free_page could sleep. Since
>> free_huge_page can be called from any context, it can not sleep.
>>
>> Use a workqueue to defer freeing of pages if the operation may sleep. A
>> new routine update_and_free_page_no_sleep provides this functionality
>> and is only called from free_huge_page.
>>
>> Note that any 'pages' sent to the workqueue for deferred freeing have
>> already been removed from the hugetlb subsystem. What is actually
>> deferred is returning those base pages to the low level allocator.
>
> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> should be in cma_release().
My thinking (which could be totally wrong) is that cma_release makes no
claims about calling context. From the code, it is pretty clear that it
can only be called from task context with no locks held. Although,
there could be code incorrectly calling it today, as hugetlb does. Since
hugetlb is the only code with this new requirement, it should do the
work.
Wait!!! That made me remember something.
Roman had code to create a non-blocking version of cma_release().
https://lore.kernel.org/linux-mm/[email protected]/
There were no objections, and Christoph even thought there may be
problems with callers of dma_free_contiguous.
Perhaps, we should just move forward with Roman's patches to create
cma_release_nowait() and avoid this workqueue stuff?
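For reference, with the interface from Roman's series the hugetlb side
would look roughly like the sketch below. cma_release_nowait() is from
that (unmerged) series, and its signature is assumed to match
cma_release():

static void free_gigantic_page(struct page *page, unsigned int order)
{
#ifdef CONFIG_CMA
	/*
	 * cma_release_nowait() does not sleep; any bitmap maintenance
	 * that might block is deferred inside CMA itself.
	 */
	if (cma_release_nowait(hugetlb_cma[page_to_nid(page)], page,
			       1 << order))
		return;
#endif
	free_contig_range(page_to_pfn(page), 1 << order);
}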
--
Mike Kravetz
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> Cc: Roman, Christoph
>
> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> > On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >> The locks acquired in free_huge_page are irq safe. However, in certain
> >> circumstances the routine update_and_free_page could sleep. Since
> >> free_huge_page can be called from any context, it can not sleep.
> >>
> >> Use a workqueue to defer freeing of pages if the operation may sleep. A
> >> new routine update_and_free_page_no_sleep provides this functionality
> >> and is only called from free_huge_page.
> >>
> >> Note that any 'pages' sent to the workqueue for deferred freeing have
> >> already been removed from the hugetlb subsystem. What is actually
> >> deferred is returning those base pages to the low level allocator.
> >
> > So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> > should be in cma_release().
>
> My thinking (which could be totally wrong) is that cma_release makes no
> claims about calling context. From the code, it is pretty clear that it
> can only be called from task context with no locks held. Although,
> there could be code incorrectly calling it today, as hugetlb does. Since
> hugetlb is the only code with this new requirement, it should do the
> work.
>
> Wait!!! That made me remember something.
> Roman had code to create a non-blocking version of cma_release().
> https://lore.kernel.org/linux-mm/[email protected]/
>
> There were no objections, and Christoph even thought there may be
> problems with callers of dma_free_contiguous.
>
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?
Sounds good to me. If it's the preferred path, I can rebase and resend
those patches (they have been carried for some time by Zi Yan for his 1GB THP work,
but they are completely independent).
Thanks!
> --
> Mike Kravetz
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> Cc: Roman, Christoph
>
> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> > On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >> The locks acquired in free_huge_page are irq safe. However, in certain
> >> circumstances the routine update_and_free_page could sleep. Since
> >> free_huge_page can be called from any context, it can not sleep.
> >>
> >> Use a waitqueue to defer freeing of pages if the operation may sleep. A
> >> new routine update_and_free_page_no_sleep provides this functionality
> >> and is only called from free_huge_page.
> >>
> >> Note that any 'pages' sent to the workqueue for deferred freeing have
> >> already been removed from the hugetlb subsystem. What is actually
> >> deferred is returning those base pages to the low level allocator.
> >
> > So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> > should be in cma_release().
>
> My thinking (which could be totally wrong) is that cma_release makes no
> claims about calling context. From the code, it is pretty clear that it
> can only be called from task context with no locks held. Although,
> there could be code incorrectly calling it today, as hugetlb does. Since
> hugetlb is the only code with this new requirement, it should do the
> work.
>
> Wait!!! That made me remember something.
> Roman had code to create a non-blocking version of cma_release().
> https://lore.kernel.org/linux-mm/[email protected]/
>
> There were no objections, and Christoph even thought there may be
> problems with callers of dma_free_contiguous.
>
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?
Ha!, that basically does as I suggested. Using that page is unfortunate
in that it will destroy the contig range for allocations until the work
happens, but I'm not sure I see a nice alternative.
On 3/22/21 7:31 AM, Michal Hocko wrote:
> On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
> [...]
>> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>> while (nr_pages--) {
>> h->resv_huge_pages--;
>> unused_resv_pages--;
>> - if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>> + page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>> + if (!page)
>> goto out;
>> - cond_resched_lock(&hugetlb_lock);
>> +
>> + /* Drop lock and free page to buddy as it could sleep */
>> + spin_unlock(&hugetlb_lock);
>> + update_and_free_page(h, page);
>> + cond_resched();
>> + spin_lock(&hugetlb_lock);
>> }
>>
>> out:
>
> This is likely a matter of taste but the repeated pattern of unlock,
> update_and_free_page, cond_resched and lock seems rather clumsy.
> Would it be slightly better/nicer to remove_pool_huge_page into a
> list_head under a single lock invocation and then free up the whole lot
> after the lock is dropped?
Yes, we can certainly do that.
One downside I see is that the list can contain a bunch of pages not
accounted for in hugetlb and not free in buddy (or cma). Ideally, we
would want to keep those in sync if possible. Also, the commit that
added the cond_resched talked about freeing up 12 TB worth of huge pages
and it holding the lock for 150 seconds. The new code is not holding
the lock while calling free to buddy, but I wonder how long it would
take to remove 12 TB worth of huge pages and add them to a separate list?
I do not know how realistic the 12 TB number is. But, I certainly am
aware of pools that are a few TB in size.
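Just so we are looking at the same thing, that batching would be roughly
as below (untested sketch in mm/hugetlb.c terms; the helper name is made
up and the surplus accounting is omitted). The lock is then held only for
the cheap 'remove' phase, and all of the heavy freeing happens after the
lock is dropped:

static void remove_and_free_pool_pages(struct hstate *h, unsigned long nr_pages)
{
	LIST_HEAD(page_list);
	struct page *page, *next;

	/* Phase 1: accounting only, under a single lock hold. */
	spin_lock(&hugetlb_lock);
	while (nr_pages--) {
		page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
		if (!page)
			break;
		/* page is no longer visible to hugetlb; park it locally */
		list_add(&page->lru, &page_list);
	}
	spin_unlock(&hugetlb_lock);

	/* Phase 2: return the whole batch to buddy/CMA without the lock. */
	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		update_and_free_page(h, page);	/* may sleep */
		cond_resched();
	}
}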
--
Mike Kravetz
On Mon 22-03-21 16:28:07, Mike Kravetz wrote:
> On 3/22/21 7:31 AM, Michal Hocko wrote:
> > On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
> > [...]
> >> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
> >> while (nr_pages--) {
> >> h->resv_huge_pages--;
> >> unused_resv_pages--;
> >> - if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
> >> + page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
> >> + if (!page)
> >> goto out;
> >> - cond_resched_lock(&hugetlb_lock);
> >> +
> >> + /* Drop lock and free page to buddy as it could sleep */
> >> + spin_unlock(&hugetlb_lock);
> >> + update_and_free_page(h, page);
> >> + cond_resched();
> >> + spin_lock(&hugetlb_lock);
> >> }
> >>
> >> out:
> >
> > This is likely a matter of taste but the repeated pattern of unlock,
> > update_and_free_page, cond_resched and lock seems rather clumsy.
> > Would it be slightly better/nicer to remove_pool_huge_page into a
> > list_head under a single lock invocation and then free up the whole lot
> > after the lock is dropped?
>
> Yes, we can certainly do that.
> One downside I see is that the list can contain a bunch of pages not
> accounted for in hugetlb and not free in buddy (or cma). Ideally, we
> would want to keep those in sync if possible. Also, the commit that
> added the cond_resched talked about freeing up 12 TB worth of huge pages
> and it holding the lock for 150 seconds. The new code is not holding
> the lock while calling free to buddy, but I wonder how long it would
> take to remove 12 TB worth of huge pages and add them to a separate list?
Well, the remove_pool_huge_page is just an accounting part and that
should be pretty invisible even when the number of pages is large. The
lockless nature (from hugetlb POV) of the final page release is the
heavy weight operation and whether you do it in chunks or in a single go
(with cond_resched) should not be visible either way. We already do the same
thing when uncharging memcg pages (mem_cgroup_uncharge_list).
So I would agree with you that this would be a much bigger problem if
both the hugetlb and freeing path were equally heavy weight and the
delay between the first pages being unaccounted and freed would be noticeable.
But I do not want to push for this. I just hated the hugetlb_lock dances
as this is an ugly and repetitive pattern.
--
Michal Hocko
SUSE Labs
On 3/22/21 11:10 AM, Roman Gushchin wrote:
> On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
>> Cc: Roman, Christoph
>>
>> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
>>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>>>> The locks acquired in free_huge_page are irq safe. However, in certain
>>>> circumstances the routine update_and_free_page could sleep. Since
>>>> free_huge_page can be called from any context, it can not sleep.
>>>>
>>>> Use a waitqueue to defer freeing of pages if the operation may sleep. A
>>>> new routine update_and_free_page_no_sleep provides this functionality
>>>> and is only called from free_huge_page.
>>>>
>>>> Note that any 'pages' sent to the workqueue for deferred freeing have
>>>> already been removed from the hugetlb subsystem. What is actually
>>>> deferred is returning those base pages to the low level allocator.
>>>
>>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
>>> should be in cma_release().
>>
>> My thinking (which could be totally wrong) is that cma_release makes no
>> claims about calling context. From the code, it is pretty clear that it
>> can only be called from task context with no locks held. Although,
>> there could be code incorrectly calling it today, as hugetlb does. Since
>> hugetlb is the only code with this new requirement, it should do the
>> work.
>>
>> Wait!!! That made me remember something.
>> Roman had code to create a non-blocking version of cma_release().
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> There were no objections, and Christoph even thought there may be
>> problems with callers of dma_free_contiguous.
>>
>> Perhaps, we should just move forward with Roman's patches to create
>> cma_release_nowait() and avoid this workqueue stuff?
>
> Sounds good to me. If it's the preferred path, I can rebase and resend
> those patches (they have been carried for some time by Zi Yan for his 1GB THP work,
> but they are completely independent).
Thanks Roman,
Yes, this is the preferred path. If there is a non-blocking version of
cma_release, then it makes the fixup of the hugetlb put_page path much easier.
If you would prefer, I can rebase your patches and send with this series.
--
Mike Kravetz
On Tue, Mar 23, 2021 at 11:51:04AM -0700, Mike Kravetz wrote:
> On 3/22/21 11:10 AM, Roman Gushchin wrote:
> > On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> >> Cc: Roman, Christoph
> >>
> >> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> >>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >>>> The locks acquired in free_huge_page are irq safe. However, in certain
> >>>> circumstances the routine update_and_free_page could sleep. Since
> >>>> free_huge_page can be called from any context, it can not sleep.
> >>>>
> >>>> Use a waitqueue to defer freeing of pages if the operation may sleep. A
> >>>> new routine update_and_free_page_no_sleep provides this functionality
> >>>> and is only called from free_huge_page.
> >>>>
> >>>> Note that any 'pages' sent to the workqueue for deferred freeing have
> >>>> already been removed from the hugetlb subsystem. What is actually
> >>>> deferred is returning those base pages to the low level allocator.
> >>>
> >>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> >>> should be in cma_release().
> >>
> >> My thinking (which could be totally wrong) is that cma_release makes no
> >> claims about calling context. From the code, it is pretty clear that it
> >> can only be called from task context with no locks held. Although,
> >> there could be code incorrectly calling it today, as hugetlb does. Since
> >> hugetlb is the only code with this new requirement, it should do the
> >> work.
> >>
> >> Wait!!! That made me remember something.
> >> Roman had code to create a non-blocking version of cma_release().
> >> https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >> There were no objections, and Christoph even thought there may be
> >> problems with callers of dma_free_contiguous.
> >>
> >> Perhaps, we should just move forward with Roman's patches to create
> >> cma_release_nowait() and avoid this workqueue stuff?
> >
> > Sounds good to me. If it's the preferred path, I can rebase and resend
> > those patches (they have been carried for some time by Zi Yan for his 1GB THP work,
> > but they are completely independent).
>
> Thanks Roman,
>
> Yes, this is the preferred path. If there is a non-blocking version of
> cma_release, then it makes the fixup of the hugetlb put_page path much easier.
>
> If you would prefer, I can rebase your patches and send with this series.
Sounds good! Please, proceed. And, please, let me know if I can help.
Thanks!
On 3/23/21 12:57 AM, Michal Hocko wrote:
> On Mon 22-03-21 16:28:07, Mike Kravetz wrote:
>> On 3/22/21 7:31 AM, Michal Hocko wrote:
>>> On Fri 19-03-21 15:42:06, Mike Kravetz wrote:
>>> [...]
>>>> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate *h,
>>>> while (nr_pages--) {
>>>> h->resv_huge_pages--;
>>>> unused_resv_pages--;
>>>> - if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
>>>> + page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
>>>> + if (!page)
>>>> goto out;
>>>> - cond_resched_lock(&hugetlb_lock);
>>>> +
>>>> + /* Drop lock and free page to buddy as it could sleep */
>>>> + spin_unlock(&hugetlb_lock);
>>>> + update_and_free_page(h, page);
>>>> + cond_resched();
>>>> + spin_lock(&hugetlb_lock);
>>>> }
>>>>
>>>> out:
>>>
>>> This is likely a matter of taste but the repeated pattern of unlock,
>>> update_and_free_page, cond_resched and lock seems rather clumsy.
>>> Would it be slightly better/nicer to remove_pool_huge_page into a
>>> list_head under a single lock invocation and then free up the whole lot
>>> after the lock is dropped?
>>
>> Yes, we can certainly do that.
>> One downside I see is that the list can contain a bunch of pages not
>> accounted for in hugetlb and not free in buddy (or cma). Ideally, we
>> would want to keep those in sync if possible. Also, the commit that
>> added the cond_resched talked about freeing up 12 TB worth of huge pages
>> and it holding the lock for 150 seconds. The new code is not holding
>> the lock while calling free to buddy, but I wonder how long it would
>> take to remove 12 TB worth of huge pages and add them to a separate list?
>
> Well, the remove_pool_huge_page is just an accounting part and that
> should be pretty invisible even when the number of pages is large. The
> lockless nature (from hugetlb POV) of the final page release is the
> heavy weight operation and whether you do it in chunks or in a single go
> (with cond_resched) should not be visible either way. We already do the same
> thing when uncharging memcg pages (mem_cgroup_uncharge_list).
>
> So I would agree with you that this would be a much bigger problem if
> both the hugetlb and freeing path were equally heavy weight and the
> delay between the first pages being unaccounted and freed would be noticeable.
>
> But I do not want to push for this. I just hated the hugetlb_lock dances
> as this is an ugly and repetitive pattern.
As you may have seen in my reply to patch 3, I am going to use this
batching approach for all the places where we remove/free hugetlb pages.
Since you brought up cgroups ... what is your opinion on lock hold time
in hugetlb_cgroup_css_offline? We could potentially be calling
hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
with interrupts disabled.
--
Mike Kravetz
On 3/24/21 1:40 AM, Michal Hocko wrote:
> On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
> [...]
>> Since you brought up cgroups ... what is your opinion on lock hold time
>> in hugetlb_cgroup_css_offline? We could potentially be calling
>> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
>> with interrupts disabled.
>
> I am not familiar with hugetlb cgroup code TBH. But from a quick look
> there is not much of heavy lifting there. If we find out that this is
> really visible we can do the lock dance with cond_resched and retry with
> the iteration again. Or is there any strong reason to process the list
> in a single go?
AFAICT, the primary reason for processing the list in a single go is
that the lock protects the list. If you drop the lock, the list can
change ...
I have come up with a (not so pretty) way of processing the list in
batches of pages. But, I did not want to introduce that if there is no
need. Perhaps just take a wait and see approach for now.
I'll see if I can come up with some timing information to determine
if/when we may have an issue.
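One possible shape for such batching, purely for illustration (untested;
the function name and the batch size of 64 are arbitrary, and the cgroup
check just duplicates the one hugetlb_cgroup_move_parent already does
internally):

static void hugetlb_cgroup_move_all_to_parent(struct hstate *h,
					      struct hugetlb_cgroup *h_cg)
{
	struct page *page;
	long moved;

	do {
		moved = 0;
		spin_lock_irq(&hugetlb_lock);
		list_for_each_entry(page, &h->hugepage_activelist, lru) {
			/* skip pages already moved or owned by someone else */
			if (hugetlb_cgroup_from_page(page) != h_cg)
				continue;
			hugetlb_cgroup_move_parent(hstate_index(h), h_cg, page);
			if (++moved >= 64)
				break;	/* cap time with irqs disabled */
		}
		spin_unlock_irq(&hugetlb_lock);
		cond_resched();
		/* restart the walk; the list may have changed while unlocked */
	} while (moved);
}

Each pass either moves a batch of pages or finds nothing left to move, so
the walk terminates even though it is restarted after every lock drop.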
--
Mike Kravetz
On Wed 24-03-21 09:38:17, Mike Kravetz wrote:
> On 3/24/21 1:40 AM, Michal Hocko wrote:
> > On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
> > [...]
> >> Since you brought up cgroups ... what is your opinion on lock hold time
> >> in hugetlb_cgroup_css_offline? We could potentially be calling
> >> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
> >> with interrupts disabled.
> >
> > I am not familiar with hugetlb cgroup code TBH. But from a quick look
> > there is not much of heavy lifting there. If we find out that this is
> > really visible we can do the lock dance with cond_resched and retry with
> > the iteration again. Or is there any strong reason to process the list
> > in a single go?
>
> AFAICT, the primary reason for processing the list in a single go is
> that the lock protects the list. If you drop the lock, the list can
> change ...
>
> I have come up with a (not so pretty) way of processing the list in
> batches of pages. But, I did not want to introduce that if there is no
> need. Perhaps just take a wait and see approach for now.
>
> I'll see if I can come up with some timing information to determine
> if/when we may have an issue.
I wouldn't bother at this stage. This can be done on top.
--
Michal Hocko
SUSE Labs
On 3/24/21 1:43 AM, Michal Hocko wrote:
> On Tue 23-03-21 11:51:04, Mike Kravetz wrote:
>> On 3/22/21 11:10 AM, Roman Gushchin wrote:
>>> On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
>>>> Cc: Roman, Christoph
>>>>
>>>> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
>>>>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
>>>>>> The locks acquired in free_huge_page are irq safe. However, in certain
>>>>>> circumstances the routine update_and_free_page could sleep. Since
>>>>>> free_huge_page can be called from any context, it can not sleep.
>>>>>>
>>>>>> Use a waitqueue to defer freeing of pages if the operation may sleep. A
>>>>>> new routine update_and_free_page_no_sleep provides this functionality
>>>>>> and is only called from free_huge_page.
>>>>>>
>>>>>> Note that any 'pages' sent to the workqueue for deferred freeing have
>>>>>> already been removed from the hugetlb subsystem. What is actually
>>>>>> deferred is returning those base pages to the low level allocator.
>>>>>
>>>>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
>>>>> should be in cma_release().
>>>>
>>>> My thinking (which could be totally wrong) is that cma_release makes no
>>>> claims about calling context. From the code, it is pretty clear that it
>>>> can only be called from task context with no locks held. Although,
>>>> there could be code incorrectly calling it today, as hugetlb does. Since
>>>> hugetlb is the only code with this new requirement, it should do the
>>>> work.
>>>>
>>>> Wait!!! That made me remember something.
>>>> Roman had code to create a non-blocking version of cma_release().
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> There were no objections, and Christoph even thought there may be
>>>> problems with callers of dma_free_contiguous.
>>>>
>>>> Perhaps, we should just move forward with Roman's patches to create
>>>> cma_release_nowait() and avoid this workqueue stuff?
>>>
>>> Sounds good to me. If it's the preferred path, I can rebase and resend
>>> those patches (they have been carried for some time by Zi Yan for his 1GB THP work,
>>> but they are completely independent).
>>
>> Thanks Roman,
>>
>> Yes, this is the preferred path. If there is a non-blocking version of
>> cma_release, then it makes the fixup of the hugetlb put_page path much easier.
>
> I do not object to the plan; I just want to point out that the sparse
> vmemmap for hugetlb pages will need to recognize sleep/nosleep variants
> of the freeing path as well to handle its vmemmap repopulate games.
>
Yes,
I also commented elsewhere that we will likely want to drop and reacquire
the lock for each page in the looping page free routines when adding the
vmemmap freeing support.
Unless someone thinks otherwise, I still think it is better to first fix
the hugetlb put_page/free_huge_page path with this series. Then move on
to the free vmemmap series.
--
Mike Kravetz
On Tue 23-03-21 18:03:07, Mike Kravetz wrote:
[...]
> Since you brought up cgroups ... what is your opinion on lock hold time
> in hugetlb_cgroup_css_offline? We could potentially be calling
> hugetlb_cgroup_move_parent for every hugetlb page while holding the lock
> with interrupts disabled.
I am not familiar with hugetlb cgroup code TBH. But from a quick look
there is not much of heavy lifting there. If we find out that this is
really visible we can do the lock dance with cond_resched and retry with
the iteration again. Or is there any strong reason to process the list
in a single go?
--
Michal Hocko
SUSE Labs
On Tue 23-03-21 11:51:04, Mike Kravetz wrote:
> On 3/22/21 11:10 AM, Roman Gushchin wrote:
> > On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> >> Cc: Roman, Christoph
> >>
> >> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> >>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >>>> The locks acquired in free_huge_page are irq safe. However, in certain
> >>>> circumstances the routine update_and_free_page could sleep. Since
> >>>> free_huge_page can be called from any context, it can not sleep.
> >>>>
> >>>> Use a waitqueue to defer freeing of pages if the operation may sleep. A
> >>>> new routine update_and_free_page_no_sleep provides this functionality
> >>>> and is only called from free_huge_page.
> >>>>
> >>>> Note that any 'pages' sent to the workqueue for deferred freeing have
> >>>> already been removed from the hugetlb subsystem. What is actually
> >>>> deferred is returning those base pages to the low level allocator.
> >>>
> >>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> >>> should be in cma_release().
> >>
> >> My thinking (which could be totally wrong) is that cma_release makes no
> >> claims about calling context. From the code, it is pretty clear that it
> >> can only be called from task context with no locks held. Although,
> >> there could be code incorrectly calling it today, as hugetlb does. Since
> >> hugetlb is the only code with this new requirement, it should do the
> >> work.
> >>
> >> Wait!!! That made me remember something.
> >> Roman had code to create a non-blocking version of cma_release().
> >> https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >> There were no objections, and Christoph even thought there may be
> >> problems with callers of dma_free_contiguous.
> >>
> >> Perhaps, we should just move forward with Roman's patches to create
> >> cma_release_nowait() and avoid this workqueue stuff?
> >
> > Sounds good to me. If it's the preferred path, I can rebase and resend
> > those patches (they have been carried for some time by Zi Yan for his 1GB THP work,
> > but they are completely independent).
>
> Thanks Roman,
>
> Yes, this is the preferred path. If there is a non-blocking version of
> cma_release, then it makes the fixup of the hugetlb put_page path much easier.
I do not object to the plan; I just want to point out that the sparse
vmemmap for hugetlb pages will need to recognize sleep/nosleep variants
of the freeing path as well to handle its vmemmap repopulate games.
--
Michal Hocko
SUSE Labs
On 3/19/21 6:18 PM, Hillf Danton wrote:
> On Fri, 19 Mar 2021 15:42:08 -0700 Mike Kravetz wrote:
>> +
>> + if (!can_sleep && free_page_may_sleep(h, page)) {
>> + /*
>> + * Send page freeing to workqueue
>> + *
>> + * Only call schedule_work() if hpage_freelist is previously
>> + * empty. Otherwise, schedule_work() had been called but the
>> + * workfn hasn't retrieved the list yet.
>> + */
>> + if (llist_add((struct llist_node *)&page->mapping,
>> + &hpage_freelist))
>> + schedule_work(&free_hpage_work);
>> + return;
>> + }
>
> Queue work on system_unbound_wq instead of system_wq because of blocking work.
>
Thanks Hillf,
I am dropping this patch and going with Roman's patches to create a
version of cma_release that will not sleep. A workqueue handoff like
this may be needed in the vmemmap reduction series, so I will keep it in
mind.
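For reference, Hillf's suggestion amounts to replacing the schedule_work()
call (which targets system_wq) with an explicit queue on the unbound
workqueue, roughly:

	if (llist_add((struct llist_node *)&page->mapping,
		      &hpage_freelist))
		queue_work(system_unbound_wq, &free_hpage_work);

so that a potentially long-running free does not tie up a per-cpu worker.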
--
Mike Kravetz