When memory is hotplug added or removed the min_free_kbytes should be
recalculated based on what is expected by khugepaged. Currently
after hotplug, min_free_kbytes will be set to a lower default and higher
default set when THP enabled is lost. This change restores min_free_kbytes
as expected for THP consumers.
Fixes: f000565adb77 ("thp: set recommended min free kbytes")
Signed-off-by: Vijay Balakrishna <[email protected]>
Cc: [email protected]
Reviewed-by: Pavel Tatashin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
v4 -> v5
- changelog: must -> should [Michal Hocko]
v3 -> v4
- made changes to move khugepaged_min_free_kbytes_update into
init_per_zone_wmark_min and rested changes
[suggestion from Michal Hocko]
[v2 1/2]
- removed symptoms references from changelog
[v2 2/2]
- addressed following issues Michal Hocko raised:
. nr_free_buffer_pages can oveflow in int on very large machines
. min_free_kbytes can decrease the size theoretically
v1 -> v2
--------
- addressed issue Kirill A. Shutemov raised:
. changes would override min_free_kbytes set by user
include/linux/khugepaged.h | 5 +++++
mm/khugepaged.c | 13 +++++++++++--
mm/page_alloc.c | 3 +++
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index bc45ea1efbf7..c941b7377321 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
unsigned long vm_flags);
+extern void khugepaged_min_free_kbytes_update(void);
#ifdef CONFIG_SHMEM
extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
#else
@@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
unsigned long addr)
{
}
+
+static inline void khugepaged_min_free_kbytes_update(void)
+{
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cfa0dba5fd3b..4f7107476a6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -56,6 +56,9 @@ enum scan_result {
#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+
/* default scan 8*512 pte (or vmas) every 30 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
@@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
int start_stop_khugepaged(void)
{
- static struct task_struct *khugepaged_thread __read_mostly;
- static DEFINE_MUTEX(khugepaged_mutex);
int err = 0;
mutex_lock(&khugepaged_mutex);
@@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
mutex_unlock(&khugepaged_mutex);
return err;
}
+
+void khugepaged_min_free_kbytes_update(void)
+{
+ mutex_lock(&khugepaged_mutex);
+ if (khugepaged_enabled() && khugepaged_thread)
+ set_recommended_min_free_kbytes();
+ mutex_unlock(&khugepaged_mutex);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fab5e97dc9ca..ac25d3526fa5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -69,6 +69,7 @@
#include <linux/nmi.h>
#include <linux/psi.h>
#include <linux/padata.h>
+#include <linux/khugepaged.h>
#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -7891,6 +7892,8 @@ int __meminit init_per_zone_wmark_min(void)
setup_min_slab_ratio();
#endif
+ khugepaged_min_free_kbytes_update();
+
return 0;
}
postcore_initcall(init_per_zone_wmark_min)
--
2.28.0
On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes should be
> recalculated based on what is expected by khugepaged. Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This change restores min_free_kbytes
> as expected for THP consumers.
>
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>
> Signed-off-by: Vijay Balakrishna <[email protected]>
> Cc: [email protected]
> Reviewed-by: Pavel Tatashin <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
Sorry for jumping in so late. Should we use this as an opportunity to
also fix up the messages logged when (re)calculating mfk? They are wrong
and could be quite confusing. For example consider the following sequence
of operations and corresponding log messages produced.
Freshly booted VM with 2 nodes and 8GB memory:
# cat /proc/sys/vm/min_free_kbytes
90112
# echo 90000 > /proc/sys/vm/min_free_kbytes
# cat /proc/sys/vm/min_free_kbytes
90000
# echo 0 > /sys/devices/system/node/node1/memory56/online
[ 135.099947] Offlined Pages 32768
[ 135.102362] min_free_kbytes is not updated to 11241 because user defined value 90000 is preferred
[ 135.109070] khugepaged: raising min_free_kbytes from 90000 to 90112 to help t
ransparent hugepage allocations
# cat /proc/sys/vm/min_free_kbytes
90112
# echo 1 > /sys/devices/system/node/node1/memory56/online
[ 231.656075] min_free_kbytes is not updated to 11334 because user defined value 90000 is preferred
# cat /proc/sys/vm/min_free_kbytes
90112
--
Mike Kravetz
On 9/30/2020 11:20 AM, Mike Kravetz wrote:
> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
>> When memory is hotplug added or removed the min_free_kbytes should be
>> recalculated based on what is expected by khugepaged. Currently
>> after hotplug, min_free_kbytes will be set to a lower default and higher
>> default set when THP enabled is lost. This change restores min_free_kbytes
>> as expected for THP consumers.
>>
>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>
>> Signed-off-by: Vijay Balakrishna <[email protected]>
>> Cc: [email protected]
>> Reviewed-by: Pavel Tatashin <[email protected]>
>> Acked-by: Michal Hocko <[email protected]>
>
> Sorry for jumping in so late. Should we use this as an opportunity to
> also fix up the messages logged when (re)calculating mfk? They are wrong
> and could be quite confusing.
Sure. Please share your thoughts regarding appropriate message. Here
is what I'm thinking
pr_warn("min_free_kbytes is not updated to %d because current value %d
is preferred\n", new_min_free_kbytes, min_free_kbytes);
If above message is reasonable I can post a new revision (v6).
Thanks,
Vijay
> For example consider the following sequence
> of operations and corresponding log messages produced.
>
> Freshly booted VM with 2 nodes and 8GB memory:
> # cat /proc/sys/vm/min_free_kbytes
> 90112
> # echo 90000 > /proc/sys/vm/min_free_kbytes
> # cat /proc/sys/vm/min_free_kbytes
> 90000
> # echo 0 > /sys/devices/system/node/node1/memory56/online
> [ 135.099947] Offlined Pages 32768
> [ 135.102362] min_free_kbytes is not updated to 11241 because user defined value 90000 is preferred
> [ 135.109070] khugepaged: raising min_free_kbytes from 90000 to 90112 to help t
> ransparent hugepage allocations
> # cat /proc/sys/vm/min_free_kbytes
> 90112
> # echo 1 > /sys/devices/system/node/node1/memory56/online
> [ 231.656075] min_free_kbytes is not updated to 11334 because user defined value 90000 is preferred
> # cat /proc/sys/vm/min_free_kbytes
> 90112
>
On 9/30/20 1:47 PM, Vijay Balakrishna wrote:
> On 9/30/2020 11:20 AM, Mike Kravetz wrote:
>> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
>>
>> Sorry for jumping in so late. Should we use this as an opportunity to
>> also fix up the messages logged when (re)calculating mfk? They are wrong
>> and could be quite confusing.
>
>
> Sure. Please share your thoughts regarding appropriate message. Here is what I'm thinking
>
> pr_warn("min_free_kbytes is not updated to %d because current value %d is preferred\n", new_min_free_kbytes, min_free_kbytes);
>
> If above message is reasonable I can post a new revision (v6).
Just considering the below example,
>> For example consider the following sequence
>> of operations and corresponding log messages produced.
>>
>> Freshly booted VM with 2 nodes and 8GB memory:
>> # cat /proc/sys/vm/min_free_kbytes
>> 90112
>> # echo 90000 > /proc/sys/vm/min_free_kbytes
>> # cat /proc/sys/vm/min_free_kbytes
>> 90000
>> # echo 0 > /sys/devices/system/node/node1/memory56/online
>> [ 135.099947] Offlined Pages 32768
>> [ 135.102362] min_free_kbytes is not updated to 11241 because user defined value 90000 is preferred
I am not sure if there is any value in printing the above line. Especially
in this context as it becomes obsolete with the printing of the next line.
>> [ 135.109070] khugepaged: raising min_free_kbytes from 90000 to 90112 to help t
>> ransparent hugepage allocations
IMO, the above line is the only one that should be output as a result of the
recalculation.
I guess that brings up the question of 'should we continue to track the user
defined value if we overwrite it?". If we quit tracking it may help with the
next message.
>> # cat /proc/sys/vm/min_free_kbytes
>> 90112
>> # echo 1 > /sys/devices/system/node/node1/memory56/online
>> [ 231.656075] min_free_kbytes is not updated to 11334 because user defined value 90000 is preferred
I do not see any value in this line of output. Neither value (11334 or 90000)
is actually of use. We did not recalculate/change mfk. Perhaps no output is
necessary in this case?
>> # cat /proc/sys/vm/min_free_kbytes
>> 90112
All this may be out of scope for this patch and done with a update. However,
I think it is something that should be considered.
--
Mike Kravetz
On Wed 30-09-20 15:03:11, Mike Kravetz wrote:
> On 9/30/20 1:47 PM, Vijay Balakrishna wrote:
> > On 9/30/2020 11:20 AM, Mike Kravetz wrote:
> >> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
> >>
> >> Sorry for jumping in so late. Should we use this as an opportunity to
> >> also fix up the messages logged when (re)calculating mfk? They are wrong
> >> and could be quite confusing.
> >
> >
> > Sure. Please share your thoughts regarding appropriate message. Here is what I'm thinking
> >
> > pr_warn("min_free_kbytes is not updated to %d because current value %d is preferred\n", new_min_free_kbytes, min_free_kbytes);
> >
> > If above message is reasonable I can post a new revision (v6).
>
> Just considering the below example,
>
> >> For example consider the following sequence
> >> of operations and corresponding log messages produced.
> >>
> >> Freshly booted VM with 2 nodes and 8GB memory:
> >> # cat /proc/sys/vm/min_free_kbytes
> >> 90112
> >> # echo 90000 > /proc/sys/vm/min_free_kbytes
> >> # cat /proc/sys/vm/min_free_kbytes
> >> 90000
> >> # echo 0 > /sys/devices/system/node/node1/memory56/online
> >> [ 135.099947] Offlined Pages 32768
> >> [ 135.102362] min_free_kbytes is not updated to 11241 because user defined value 90000 is preferred
>
> I am not sure if there is any value in printing the above line. Especially
> in this context as it becomes obsolete with the printing of the next line.
The original intention was to make it explicit that auto-tuning is
influenced by the user provided configuration.
> >> [ 135.109070] khugepaged: raising min_free_kbytes from 90000 to 90112 to help t
> >> ransparent hugepage allocations
>
> IMO, the above line is the only one that should be output as a result of the
> recalculation.
Well, but khugepaged could be disabled and then the above might not get
printed. Sure the code could get reorganized and all that but is this
really worth that?
> I guess that brings up the question of 'should we continue to track the user
> defined value if we overwrite it?". If we quit tracking it may help with the
> next message.
Auto tuning and user provided override is quite tricky to get sensible.
Especially in the case here. Admin has provided an override but has the
potential memory hotplug been considered? Or to make it even more
complicated, consider that the hotplug happens without admin involvement
- e.g. memory gets hotremoved due to HW problems. Is the admin provided
value still meaningful? To be honest I do not have a good answer and I
am not sure we should care all that much until we see practical
problems.
--
Michal Hocko
SUSE Labs
On 10/2/20 4:25 AM, Michal Hocko wrote:
> On Wed 30-09-20 15:03:11, Mike Kravetz wrote:
>> On 9/30/20 1:47 PM, Vijay Balakrishna wrote:
>>> On 9/30/2020 11:20 AM, Mike Kravetz wrote:
>>>> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
>>>>
>>>> Sorry for jumping in so late. Should we use this as an opportunity to
>>>> also fix up the messages logged when (re)calculating mfk? They are wrong
>>>> and could be quite confusing.
>>>
>>>
>>> Sure. Please share your thoughts regarding appropriate message. Here is what I'm thinking
>>>
>>> pr_warn("min_free_kbytes is not updated to %d because current value %d is preferred\n", new_min_free_kbytes, min_free_kbytes);
>>>
>>> If above message is reasonable I can post a new revision (v6).
>>
>> Just considering the below example,
>>
>>>> For example consider the following sequence
>>>> of operations and corresponding log messages produced.
>>>>
>>>> Freshly booted VM with 2 nodes and 8GB memory:
>>>> # cat /proc/sys/vm/min_free_kbytes
>>>> 90112
>>>> # echo 90000 > /proc/sys/vm/min_free_kbytes
>>>> # cat /proc/sys/vm/min_free_kbytes
>>>> 90000
>>>> # echo 0 > /sys/devices/system/node/node1/memory56/online
>>>> [ 135.099947] Offlined Pages 32768
>>>> [ 135.102362] min_free_kbytes is not updated to 11241 because user defined value 90000 is preferred
>>
>> I am not sure if there is any value in printing the above line. Especially
>> in this context as it becomes obsolete with the printing of the next line.
>
> The original intention was to make it explicit that auto-tuning is
> influenced by the user provided configuration.
>
>>>> [ 135.109070] khugepaged: raising min_free_kbytes from 90000 to 90112 to help t
>>>> ransparent hugepage allocations
>>
>> IMO, the above line is the only one that should be output as a result of the
>> recalculation.
>
> Well, but khugepaged could be disabled and then the above might not get
> printed. Sure the code could get reorganized and all that but is this
> really worth that?
>
>> I guess that brings up the question of 'should we continue to track the user
>> defined value if we overwrite it?". If we quit tracking it may help with the
>> next message.
>
> Auto tuning and user provided override is quite tricky to get sensible.
> Especially in the case here. Admin has provided an override but has the
> potential memory hotplug been considered? Or to make it even more
> complicated, consider that the hotplug happens without admin involvement
> - e.g. memory gets hotremoved due to HW problems. Is the admin provided
> value still meaningful? To be honest I do not have a good answer and I
> am not sure we should care all that much until we see practical
> problems.
I am not insisting that this be cleaned up. The change in this patch to
ensure THP related calculations are performed during hotplug is the most
important.
I became aware of the logging issues when looking at a customer issue with
an older kernel. The min_free_kbytes setting was integral to the issue we
were investigating, and it was unclear whether or not the customer had
changed the value. I knew the system log should contain evidence of manually
setting min_free_kbytes. However, there was no evidence in the log. Turns
out the customer did not change the value, but it did cause me to do a deep
dive into the logging code.
--
Mike Kravetz