When memory is hotplug added or removed, min_free_kbytes must be
recalculated based on what khugepaged expects. Currently, after hotplug
min_free_kbytes is reset to the lower generic default, and the higher
default set when THP is enabled is lost. This leaves the system with a
small min_free_kbytes, which isn't suitable especially for systems with
network-intensive loads. Typical failure symptoms include HW WATCHDOG
resets, soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM
process kills.
Fixes: f000565adb77 ("thp: set recommended min free kbytes")
Signed-off-by: Vijay Balakrishna <[email protected]>
Cc: [email protected]
---
include/linux/khugepaged.h | 5 +++++
mm/khugepaged.c | 13 +++++++++++--
mm/memory_hotplug.c | 3 +++
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index bc45ea1efbf7..c941b7377321 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
unsigned long vm_flags);
+extern void khugepaged_min_free_kbytes_update(void);
#ifdef CONFIG_SHMEM
extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
#else
@@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
unsigned long addr)
{
}
+
+static inline void khugepaged_min_free_kbytes_update(void)
+{
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cfa0dba5fd3b..4f7107476a6f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -56,6 +56,9 @@ enum scan_result {
#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+
/* default scan 8*512 pte (or vmas) every 30 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
@@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
int start_stop_khugepaged(void)
{
- static struct task_struct *khugepaged_thread __read_mostly;
- static DEFINE_MUTEX(khugepaged_mutex);
int err = 0;
mutex_lock(&khugepaged_mutex);
@@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
mutex_unlock(&khugepaged_mutex);
return err;
}
+
+void khugepaged_min_free_kbytes_update(void)
+{
+ mutex_lock(&khugepaged_mutex);
+ if (khugepaged_enabled() && khugepaged_thread)
+ set_recommended_min_free_kbytes();
+ mutex_unlock(&khugepaged_mutex);
+}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..3e19272c1fad 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -36,6 +36,7 @@
#include <linux/memblock.h>
#include <linux/compaction.h>
#include <linux/rmap.h>
+#include <linux/khugepaged.h>
#include <asm/tlbflush.h>
@@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
zone_pcp_update(zone);
init_per_zone_wmark_min();
+ khugepaged_min_free_kbytes_update();
kswapd_run(nid);
kcompactd_run(nid);
@@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
pgdat_resize_unlock(zone->zone_pgdat, &flags);
init_per_zone_wmark_min();
+ khugepaged_min_free_kbytes_update();
if (!populated_zone(zone)) {
zone_pcp_reset(zone);
--
2.28.0
On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged. Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.
>
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>
> Signed-off-by: Vijay Balakrishna <[email protected]>
> Cc: [email protected]
NAK. It would override min_free_kbytes set by user.
--
Kirill A. Shutemov
Hi Kirill,
On Thu, Sep 10, 2020 at 6:01 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
> > When memory is hotplug added or removed the min_free_kbytes must be
> > recalculated based on what is expected by khugepaged. Currently
> > after hotplug, min_free_kbytes will be set to a lower default and higher
> > default set when THP enabled is lost. This leaves the system with small
> > min_free_kbytes which isn't suitable for systems especially with network
> > intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > kills.
> >
> > Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> >
> > Signed-off-by: Vijay Balakrishna <[email protected]>
> > Cc: [email protected]
>
> NAK. It would override min_free_kbytes set by user.
Hi Kirill,
Thank you for looking into this. How is this different from when
khugepaged modifies it?
echo always >/sys/kernel/mm/transparent_hugepage/enabled
Which results in:
start_stop_khugepaged
set_recommended_min_free_kbytes
Which will also adjust min_free_kbytes according to hugepaged requirement?
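(For illustration only, a throwaway userspace program that demonstrates
the path above; it assumes root and the standard /proc/sys/vm and
/sys/kernel/mm/transparent_hugepage knobs and is a sketch, not code from
this series.)
/*
 * Throwaway check: print min_free_kbytes, enable THP, print it again.
 * Needs root.
 */
#include <stdio.h>
static long read_mfk(void)
{
        long val = -1;
        FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");
        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}
int main(void)
{
        FILE *thp = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");
        printf("min_free_kbytes before enabling THP: %ld\n", read_mfk());
        if (thp) {
                fputs("always", thp);   /* starts khugepaged, which calls   */
                fclose(thp);            /* set_recommended_min_free_kbytes() */
        }
        printf("min_free_kbytes after enabling THP:  %ld\n", read_mfk());
        return 0;
}
On a THP-enabled kernel the second value is typically the khugepaged
recommendation rather than the generic auto-tuned one.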
This bug that Vijay described is another hot-plug related issue that
we have found on our system where we perform memory hot add and hot
remove on every reboot.
Thank you,
Pasha
On 9/10/2020 3:28 PM, Pavel Tatashin wrote:
> Hi Kirill,
>
> On Thu, Sep 10, 2020 at 6:01 PM Kirill A. Shutemov
> <[email protected]> wrote:
>>
>> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
>>> When memory is hotplug added or removed the min_free_kbytes must be
>>> recalculated based on what is expected by khugepaged. Currently
>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>> default set when THP enabled is lost. This leaves the system with small
>>> min_free_kbytes which isn't suitable for systems especially with network
>>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>> kills.
>>>
>>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>>
>>> Signed-off-by: Vijay Balakrishna <[email protected]>
>>> Cc: [email protected]
>>
>> NAK. It would override min_free_kbytes set by user.
>
> Hi Kirill,
>
> Thank you for looking into this. How is this different from when
> khugepaged modifies it?
>
> echo always >/sys/kernel/mm/transparent_hugepage/enabled
>
> Which results in:
>
> start_stop_khugepaged
> set_recommended_min_free_kbytes
>
> Which will also adjust min_free_kbytes according to hugepaged requirement?
>
> This bug that Vijay described is another hot-plug related issue that
> we have found on our system where we perform memory hot add and hot
> remove on every reboot.
>
> Thank you,
> Pasha
>
Thanks Kirill for taking a look and spotting it.
IIUC it is an existing issue; we should fix it while we are at it.
Otherwise my proposed patch introduces a regression for hotplug memory
consumers with THP enabled and min_free_kbytes set by the user higher
than the value calculated for THP.
Vijay
On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged. Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.
Care to explain some more please? The whole point of increasing
min_free_kbytes for THP is to keep more free memory in the hope that
huge pages will be more likely to appear. While this might help other
users that need high-order pages, it is definitely not the primary
reason behind it. Could you provide an example with some more data?
I do see how the inconsistency between the boot and hotplug watermark
setting is not ideal, but I do worry about the interaction with
user-specified values as a potential problem.
set_recommended_min_free_kbytes happens early enough that user space
cannot really interfere, but the hotplug can happen at any time.
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>
> Signed-off-by: Vijay Balakrishna <[email protected]>
> Cc: [email protected]
> ---
> include/linux/khugepaged.h | 5 +++++
> mm/khugepaged.c | 13 +++++++++++--
> mm/memory_hotplug.c | 3 +++
> 3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index bc45ea1efbf7..c941b7377321 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
> extern void __khugepaged_exit(struct mm_struct *mm);
> extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> unsigned long vm_flags);
> +extern void khugepaged_min_free_kbytes_update(void);
> #ifdef CONFIG_SHMEM
> extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> #else
> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> unsigned long addr)
> {
> }
> +
> +static inline void khugepaged_min_free_kbytes_update(void)
> +{
> +}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #endif /* _LINUX_KHUGEPAGED_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfa0dba5fd3b..4f7107476a6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -56,6 +56,9 @@ enum scan_result {
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
>
> +static struct task_struct *khugepaged_thread __read_mostly;
> +static DEFINE_MUTEX(khugepaged_mutex);
> +
> /* default scan 8*512 pte (or vmas) every 30 second */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
> static unsigned int khugepaged_pages_collapsed;
> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>
> int start_stop_khugepaged(void)
> {
> - static struct task_struct *khugepaged_thread __read_mostly;
> - static DEFINE_MUTEX(khugepaged_mutex);
> int err = 0;
>
> mutex_lock(&khugepaged_mutex);
> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
> mutex_unlock(&khugepaged_mutex);
> return err;
> }
> +
> +void khugepaged_min_free_kbytes_update(void)
> +{
> + mutex_lock(&khugepaged_mutex);
> + if (khugepaged_enabled() && khugepaged_thread)
> + set_recommended_min_free_kbytes();
> + mutex_unlock(&khugepaged_mutex);
> +}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..3e19272c1fad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -36,6 +36,7 @@
> #include <linux/memblock.h>
> #include <linux/compaction.h>
> #include <linux/rmap.h>
> +#include <linux/khugepaged.h>
>
> #include <asm/tlbflush.h>
>
> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> zone_pcp_update(zone);
>
> init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
>
> kswapd_run(nid);
> kcompactd_run(nid);
> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
> pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
>
> if (!populated_zone(zone)) {
> zone_pcp_reset(zone);
> --
> 2.28.0
>
--
Michal Hocko
SUSE Labs
On 9/14/2020 7:33 AM, Michal Hocko wrote:
> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>> When memory is hotplug added or removed the min_free_kbytes must be
>> recalculated based on what is expected by khugepaged. Currently
>> after hotplug, min_free_kbytes will be set to a lower default and higher
>> default set when THP enabled is lost. This leaves the system with small
>> min_free_kbytes which isn't suitable for systems especially with network
>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>> kills.
>
> Care to explain some more please? The whole point of increasing
> min_free_kbytes for THP is to get a larger free memory with a hope that
> huge pages will be more likely to appear. While this might help for
> other users that need a high order pages it is definitely not the
> primary reason behind it. Could you provide an example with some more
> data?
Thanks Michal. I haven't looked into THP as part of my investigation,
so I cannot comment.
In our use case we are hotplug removing ~2GB of the 8GB total (on our
SoC) during normal reboot/shutdown. This memory is hot-added back as
movable type via a systemd late service during start-of-day.
In our stress test we first ran into HW WATCHDOG recovery; after
enabling the kernel watchdog we started seeing soft lockup hung task
notices. Failure symptoms varied: stack traces of hung tasks sometimes
trying to allocate GFP_ATOMIC memory, looping in do_notify_resume,
NETDEVICE WATCHDOG timeouts, OOM process kills, etc. During the
investigation we reran the stress test without the hotplug use case.
Surprisingly this run didn't encounter the said problems. This led to
comparing what is different between the two runs, and while looking at
various globals and studying the hotplug code I uncovered the issue of
failing to restore min_free_kbytes. In particular, on our 8GB SoC
min_free_kbytes went down from 22528 to 8703 after hotplug add.
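For what it's worth, both numbers line up with the two tuning formulas
as I understand them from mainline kernels of this era (a
back-of-the-envelope reconstruction, not code from this patch):
init_per_zone_wmark_min() auto-tunes min_free_kbytes to roughly
int_sqrt(lowmem_kbytes * 16), while khugepaged's
set_recommended_min_free_kbytes() asks for roughly
(2 + MIGRATE_PCPTYPES^2) pageblocks per populated zone. Assuming 4K
pages (2MB pageblocks), MIGRATE_PCPTYPES == 3, a single populated zone
at the time khugepaged computed its value, and ~4.7GB of non-movable
lowmem:
/*
 * Back-of-the-envelope reconstruction of the two values seen on our
 * 8GB SoC.  The formulas and constants below are assumptions based on
 * mainline of this era, not taken from the patch.  Link with -lm.
 */
#include <math.h>
#include <stdio.h>
int main(void)
{
        /* assumed: 2MB pageblocks, MIGRATE_PCPTYPES == 3, 1 populated zone */
        long pageblock_kbytes = 2048;
        long pcptypes = 3;
        long nr_zones = 1;
        /* roughly what set_recommended_min_free_kbytes() recommends */
        long thp_min = pageblock_kbytes * nr_zones * (2 + pcptypes * pcptypes);
        /*
         * Roughly what init_per_zone_wmark_min() auto-tunes to; lowmem is
         * assumed ~4.7GB, close to the managed size of Node 0 Normal seen
         * later in this thread.
         */
        long lowmem_kbytes = 4734000;
        long auto_min = (long)sqrt((double)lowmem_kbytes * 16);
        printf("THP recommendation: %ld kB\n", thp_min);        /* 22528 */
        printf("generic auto-tune : %ld kB\n", auto_min);       /* 8703 */
        return 0;
}
If those assumptions hold, the drop from 22528 to 8703 is simply the
generic auto-tune in init_per_zone_wmark_min() overwriting the
khugepaged recommendation on the hotplug path, which is what the patch
tries to address.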
I'm going to send a new patch to fix the issue Kirill A. Shutemov raised:
NAK. It would override min_free_kbytes set by user.
Vijay
>
> I do see how the inconsistency between boot and hotplug watermarks
> setting is not ideal but I do worry about interaction with the user
> specific values as a potential problem. set_recommended_min_free_kbytes
> happens early enough that user space cannot really interfere but the
> hotplug happens at any time.
>
>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>
>> Signed-off-by: Vijay Balakrishna <[email protected]>
>> Cc: [email protected]
>> ---
>> include/linux/khugepaged.h | 5 +++++
>> mm/khugepaged.c | 13 +++++++++++--
>> mm/memory_hotplug.c | 3 +++
>> 3 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>> index bc45ea1efbf7..c941b7377321 100644
>> --- a/include/linux/khugepaged.h
>> +++ b/include/linux/khugepaged.h
>> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>> extern void __khugepaged_exit(struct mm_struct *mm);
>> extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
>> unsigned long vm_flags);
>> +extern void khugepaged_min_free_kbytes_update(void);
>> #ifdef CONFIG_SHMEM
>> extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
>> #else
>> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
>> unsigned long addr)
>> {
>> }
>> +
>> +static inline void khugepaged_min_free_kbytes_update(void)
>> +{
>> +}
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> #endif /* _LINUX_KHUGEPAGED_H */
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index cfa0dba5fd3b..4f7107476a6f 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -56,6 +56,9 @@ enum scan_result {
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/huge_memory.h>
>>
>> +static struct task_struct *khugepaged_thread __read_mostly;
>> +static DEFINE_MUTEX(khugepaged_mutex);
>> +
>> /* default scan 8*512 pte (or vmas) every 30 second */
>> static unsigned int khugepaged_pages_to_scan __read_mostly;
>> static unsigned int khugepaged_pages_collapsed;
>> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>>
>> int start_stop_khugepaged(void)
>> {
>> - static struct task_struct *khugepaged_thread __read_mostly;
>> - static DEFINE_MUTEX(khugepaged_mutex);
>> int err = 0;
>>
>> mutex_lock(&khugepaged_mutex);
>> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
>> mutex_unlock(&khugepaged_mutex);
>> return err;
>> }
>> +
>> +void khugepaged_min_free_kbytes_update(void)
>> +{
>> + mutex_lock(&khugepaged_mutex);
>> + if (khugepaged_enabled() && khugepaged_thread)
>> + set_recommended_min_free_kbytes();
>> + mutex_unlock(&khugepaged_mutex);
>> +}
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9d5ab5d3ca0..3e19272c1fad 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -36,6 +36,7 @@
>> #include <linux/memblock.h>
>> #include <linux/compaction.h>
>> #include <linux/rmap.h>
>> +#include <linux/khugepaged.h>
>>
>> #include <asm/tlbflush.h>
>>
>> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>> zone_pcp_update(zone);
>>
>> init_per_zone_wmark_min();
>> + khugepaged_min_free_kbytes_update();
>>
>> kswapd_run(nid);
>> kcompactd_run(nid);
>> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>> pgdat_resize_unlock(zone->zone_pgdat, &flags);
>>
>> init_per_zone_wmark_min();
>> + khugepaged_min_free_kbytes_update();
>>
>> if (!populated_zone(zone)) {
>> zone_pcp_reset(zone);
>> --
>> 2.28.0
>>
>
On 9/10/2020 3:01 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 10, 2020 at 01:47:39PM -0700, Vijay Balakrishna wrote:
>> When memory is hotplug added or removed the min_free_kbytes must be
>> recalculated based on what is expected by khugepaged. Currently
>> after hotplug, min_free_kbytes will be set to a lower default and higher
>> default set when THP enabled is lost. This leaves the system with small
>> min_free_kbytes which isn't suitable for systems especially with network
>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>> kills.
>>
>> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>>
>> Signed-off-by: Vijay Balakrishna <[email protected]>
>> Cc: [email protected]
>
> NAK. It would override min_free_kbytes set by user.
Hi Kirill,
To fix the issue you raised I just submitted
https://lore.kernel.org/lkml/[email protected]/
Thanks,
Vijay
>
On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>
>
> On 9/14/2020 7:33 AM, Michal Hocko wrote:
> > On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> > > When memory is hotplug added or removed the min_free_kbytes must be
> > > recalculated based on what is expected by khugepaged. Currently
> > > after hotplug, min_free_kbytes will be set to a lower default and higher
> > > default set when THP enabled is lost. This leaves the system with small
> > > min_free_kbytes which isn't suitable for systems especially with network
> > > intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> > > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > > kills.
> >
> > Care to explain some more please? The whole point of increasing
> > min_free_kbytes for THP is to get a larger free memory with a hope that
> > huge pages will be more likely to appear. While this might help for
> > other users that need a high order pages it is definitely not the
> > primary reason behind it. Could you provide an example with some more
> > data?
>
> Thanks Michal. I haven't looked into THP as part of my investigation, so I
> cannot comment.
>
> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
> during normal reboot/shutdown. This memory is hotplug hot-added as movable
> type via systemd late service during start-of-day.
>
> In our stress test first we ran into HW WATCHDOG recovery, on enabling
> kernel watchdog we started seeing soft lockup hung task notices, failure
> symptons varied, where stack trace of hung tasks sometimes trying to
> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
> timeouts, OOM process kills etc., During investigation we reran stress test
> without hotplug use case. Surprisingly this run didn't encounter the said
> problems. This led to comparing what is different between the two runs,
> while looking at various globals, studying hotplug code I uncovered the
> issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
> min_free_kbytes went down to 8703 from 22528 after hotplug add.
Did you try to increase min_free_kbytes manually after hot remove?
Btw., I would consider an oom killer invocation due to min_free_kbytes
really weird behavior. If anything, a higher value would cause more
memory reclaim and potentially oom rather than a smaller one.
--
Michal Hocko
SUSE Labs
On Thu, Sep 10, 2020 at 4:47 PM Vijay Balakrishna
<[email protected]> wrote:
>
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged. Currently
> after hotplug, min_free_kbytes will be set to a lower default and higher
> default set when THP enabled is lost. This leaves the system with small
> min_free_kbytes which isn't suitable for systems especially with network
> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> kills.
>
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
>
> Signed-off-by: Vijay Balakrishna <[email protected]>
> Cc: [email protected]
> ---
> include/linux/khugepaged.h | 5 +++++
> mm/khugepaged.c | 13 +++++++++++--
> mm/memory_hotplug.c | 3 +++
> 3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index bc45ea1efbf7..c941b7377321 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
> extern void __khugepaged_exit(struct mm_struct *mm);
> extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> unsigned long vm_flags);
> +extern void khugepaged_min_free_kbytes_update(void);
> #ifdef CONFIG_SHMEM
> extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr);
> #else
> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct mm_struct *mm,
> unsigned long addr)
> {
> }
> +
> +static inline void khugepaged_min_free_kbytes_update(void)
> +{
> +}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #endif /* _LINUX_KHUGEPAGED_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfa0dba5fd3b..4f7107476a6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -56,6 +56,9 @@ enum scan_result {
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
>
> +static struct task_struct *khugepaged_thread __read_mostly;
> +static DEFINE_MUTEX(khugepaged_mutex);
> +
> /* default scan 8*512 pte (or vmas) every 30 second */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
> static unsigned int khugepaged_pages_collapsed;
> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>
> int start_stop_khugepaged(void)
> {
> - static struct task_struct *khugepaged_thread __read_mostly;
> - static DEFINE_MUTEX(khugepaged_mutex);
> int err = 0;
>
> mutex_lock(&khugepaged_mutex);
> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
> mutex_unlock(&khugepaged_mutex);
> return err;
> }
> +
> +void khugepaged_min_free_kbytes_update(void)
> +{
> + mutex_lock(&khugepaged_mutex);
> + if (khugepaged_enabled() && khugepaged_thread)
> + set_recommended_min_free_kbytes();
> + mutex_unlock(&khugepaged_mutex);
> +}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..3e19272c1fad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -36,6 +36,7 @@
> #include <linux/memblock.h>
> #include <linux/compaction.h>
> #include <linux/rmap.h>
> +#include <linux/khugepaged.h>
>
> #include <asm/tlbflush.h>
>
> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
> zone_pcp_update(zone);
>
> init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
>
> kswapd_run(nid);
> kcompactd_run(nid);
> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
> pgdat_resize_unlock(zone->zone_pgdat, &flags);
>
> init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
The bug report makes sense: min_free_kbytes should be consistent after
reboot or after hot-add, hence
Reviewed-by: Pavel Tatashin <[email protected]>
On 9/15/2020 1:18 AM, Michal Hocko wrote:
> On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>>
>>
>> On 9/14/2020 7:33 AM, Michal Hocko wrote:
>>> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>>>> When memory is hotplug added or removed the min_free_kbytes must be
>>>> recalculated based on what is expected by khugepaged. Currently
>>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>>> default set when THP enabled is lost. This leaves the system with small
>>>> min_free_kbytes which isn't suitable for systems especially with network
>>>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>>> kills.
>>>
>>> Care to explain some more please? The whole point of increasing
>>> min_free_kbytes for THP is to get a larger free memory with a hope that
>>> huge pages will be more likely to appear. While this might help for
>>> other users that need a high order pages it is definitely not the
>>> primary reason behind it. Could you provide an example with some more
>>> data?
>>
>> Thanks Michal. I haven't looked into THP as part of my investigation, so I
>> cannot comment.
>>
>> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
>> during normal reboot/shutdown. This memory is hotplug hot-added as movable
>> type via systemd late service during start-of-day.
>>
>> In our stress test first we ran into HW WATCHDOG recovery, on enabling
>> kernel watchdog we started seeing soft lockup hung task notices, failure
>> symptons varied, where stack trace of hung tasks sometimes trying to
>> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
>> timeouts, OOM process kills etc., During investigation we reran stress test
>> without hotplug use case. Surprisingly this run didn't encounter the said
>> problems. This led to comparing what is different between the two runs,
>> while looking at various globals, studying hotplug code I uncovered the
>> issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
>> min_free_kbytes went down to 8703 from 22528 after hotplug add.
>
> Did you try to increase min_free_kbytes manually after hot remove? Btw.
No, in our use case memory hot remove is done during shutdown.
> I would consider oom killer invocation due to min_free_kbytes really
> weird behavior. If anything the higher value would cause more memory
> reclaim and potentially oom rather than smaller one.
Yes, we wondered about it too. One panic stack trace (after many OOM kills)
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G
O 5.4.51-xxx #1
[330321.196900] Hardware name: Overlake (DT)
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!
Then while searching I came across the documented warning below. In the
above instance the panic after OOM kills happened after 3+ days of a
stress run (a mixture of ttcp, cpuloadgen and fio).
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
Warning
Extreme values can damage your system. Setting min_free_kbytes to an
extremely low value prevents the system from reclaiming memory, which
can result in system hangs and OOM-killing processes. However, setting
min_free_kbytes too high (for example, to 5–10% of total system memory)
causes the system to enter an out-of-memory state immediately, resulting
in the system spending too much time reclaiming memory.
On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:
>
>
> On 9/15/2020 1:18 AM, Michal Hocko wrote:
> > On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
> > >
> > >
> > > On 9/14/2020 7:33 AM, Michal Hocko wrote:
> > > > On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
> > > > > When memory is hotplug added or removed the min_free_kbytes must be
> > > > > recalculated based on what is expected by khugepaged. Currently
> > > > > after hotplug, min_free_kbytes will be set to a lower default and higher
> > > > > default set when THP enabled is lost. This leaves the system with small
> > > > > min_free_kbytes which isn't suitable for systems especially with network
> > > > > intensive loads. Typical failure symptoms include HW WATCHDOG reset,
> > > > > soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
> > > > > kills.
> > > >
> > > > Care to explain some more please? The whole point of increasing
> > > > min_free_kbytes for THP is to get a larger free memory with a hope that
> > > > huge pages will be more likely to appear. While this might help for
> > > > other users that need a high order pages it is definitely not the
> > > > primary reason behind it. Could you provide an example with some more
> > > > data?
> > >
> > > Thanks Michal. I haven't looked into THP as part of my investigation, so I
> > > cannot comment.
> > >
> > > In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
> > > during normal reboot/shutdown. This memory is hotplug hot-added as movable
> > > type via systemd late service during start-of-day.
> > >
> > > In our stress test first we ran into HW WATCHDOG recovery, on enabling
> > > kernel watchdog we started seeing soft lockup hung task notices, failure
> > > symptons varied, where stack trace of hung tasks sometimes trying to
> > > allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
> > > timeouts, OOM process kills etc., During investigation we reran stress test
> > > without hotplug use case. Surprisingly this run didn't encounter the said
> > > problems. This led to comparing what is different between the two runs,
> > > while looking at various globals, studying hotplug code I uncovered the
> > > issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
> > > min_free_kbytes went down to 8703 from 22528 after hotplug add.
> >
> > Did you try to increase min_free_kbytes manually after hot remove? Btw.
>
> No, in our use case memory hot remove done during shutdown.
I do not follow. If you are hotremoving during shutdown then how come
the value of min_free_kbytes matters at all?
> > I would consider oom killer invocation due to min_free_kbytes really
> > weird behavior. If anything the higher value would cause more memory
> > reclaim and potentially oom rather than smaller one.
>
> Yes, we wondered about it too. One panic stack trace (after many OOM kills)
>
> [330321.174240] Out of memory and no killable processes...
> [330321.179658] Kernel panic - not syncing: System is deadlocked on memory
> [330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O
> 5.4.51-xxx #1
> [330321.196900] Hardware name: Overlake (DT)
> [330321.201038] Call trace:
> [330321.203660] dump_backtrace+0x0/0x1d0
> [330321.207533] show_stack+0x20/0x2c
> [330321.211048] dump_stack+0xe8/0x150
> [330321.214656] panic+0x18c/0x3b4
> [330321.217901] out_of_memory+0x4c0/0x6e4
> [330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
> [330321.226722] alloc_pages_current+0x21c/0x2b0
> [330321.231220] alloc_slab_page+0x1e0/0x7d8
> [330321.235361] new_slab+0x2e8/0x2f8
> [330321.238874] ___slab_alloc+0x45c/0x59c
> [330321.242835] kmem_cache_alloc+0x2d4/0x360
> [330321.247065] getname_flags+0x6c/0x2a8
> [330321.250938] user_path_at_empty+0x3c/0x68
> [330321.255168] do_readlinkat+0x7c/0x17c
> [330321.259039] __arm64_sys_readlinkat+0x5c/0x70
> [330321.263627] el0_svc_handler+0x1b8/0x32c
> [330321.267767] el0_svc+0x10/0x14
> [330321.271026] SMP: stopping secondary CPUs
> [330321.275382] Starting crashdump kernel...
> [330321.279526] Bye!
Do you have the full oom splat? The fact that previous oom killer
invocations haven't helped and all the eligible tasks have been killed
and you still hit the oom would suggest there is a lot of memory
allocated without a direct relation to tasks. I fail to see how
min_free_kbytes would be related.
> Then while searching I came across documented warning below. In above
> instance panic after OOM kills happened after 3+ days of stress run (a
> mixure of ttcp, cpuloadgen and fio).
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
>
> Warning
>
> Extreme values can damage your system. Setting min_free_kbytes to an
> extremely low value prevents the system from reclaiming memory, which can
> result in system hangs and OOM-killing processes. However, setting
> min_free_kbytes too high (for example, to 5–10% of total system memory)
> causes the system to enter an out-of-memory state immediately, resulting in
> the system spending too much time reclaiming memory.
The auto tuned value should never end up low enough to cause
problems.
--
Michal Hocko
SUSE Labs
On 9/15/2020 11:53 PM, Michal Hocko wrote:
> On Tue 15-09-20 08:48:08, Vijay Balakrishna wrote:
>>
>>
>> On 9/15/2020 1:18 AM, Michal Hocko wrote:
>>> On Mon 14-09-20 09:57:02, Vijay Balakrishna wrote:
>>>>
>>>>
>>>> On 9/14/2020 7:33 AM, Michal Hocko wrote:
>>>>> On Thu 10-09-20 13:47:39, Vijay Balakrishna wrote:
>>>>>> When memory is hotplug added or removed the min_free_kbytes must be
>>>>>> recalculated based on what is expected by khugepaged. Currently
>>>>>> after hotplug, min_free_kbytes will be set to a lower default and higher
>>>>>> default set when THP enabled is lost. This leaves the system with small
>>>>>> min_free_kbytes which isn't suitable for systems especially with network
>>>>>> intensive loads. Typical failure symptoms include HW WATCHDOG reset,
>>>>>> soft lockup hang notices, NETDEVICE WATCHDOG timeouts, and OOM process
>>>>>> kills.
>>>>>
>>>>> Care to explain some more please? The whole point of increasing
>>>>> min_free_kbytes for THP is to get a larger free memory with a hope that
>>>>> huge pages will be more likely to appear. While this might help for
>>>>> other users that need a high order pages it is definitely not the
>>>>> primary reason behind it. Could you provide an example with some more
>>>>> data?
>>>>
>>>> Thanks Michal. I haven't looked into THP as part of my investigation, so I
>>>> cannot comment.
>>>>
>>>> In our use case we are hotplug removing ~2GB of 8GB total (on our SoC)
>>>> during normal reboot/shutdown. This memory is hotplug hot-added as movable
>>>> type via systemd late service during start-of-day.
>>>>
>>>> In our stress test first we ran into HW WATCHDOG recovery, on enabling
>>>> kernel watchdog we started seeing soft lockup hung task notices, failure
>>>> symptons varied, where stack trace of hung tasks sometimes trying to
>>>> allocate GFP_ATOMIC memory, looping in do_notify_resume, NETDEVICE WATCHDOG
>>>> timeouts, OOM process kills etc., During investigation we reran stress test
>>>> without hotplug use case. Surprisingly this run didn't encounter the said
>>>> problems. This led to comparing what is different between the two runs,
>>>> while looking at various globals, studying hotplug code I uncovered the
>>>> issue of failing to restore min_free_kbytes. In particular on our 8GB SoC
>>>> min_free_kbytes went down to 8703 from 22528 after hotplug add.
>>>
>>> Did you try to increase min_free_kbytes manually after hot remove? Btw.
>>
>> No, in our use case memory hot remove done during shutdown.
>
> I do not follow. If you are hotremoving during shutdown then how come
> the value of min_free_kbytes matter at all?
We are hot adding during boot the same memory that was hot removed
during shutdown; the removed memory is treated as persistent.
>
>>> I would consider oom killer invocation due to min_free_kbytes really
>>> weird behavior. If anything the higher value would cause more memory
>>> reclaim and potentially oom rather than smaller one.
>>
>> Yes, we wondered about it too. One panic stack trace (after many OOM kills)
>>
>> [330321.174240] Out of memory and no killable processes...
>> [330321.179658] Kernel panic - not syncing: System is deadlocked on memory
>> [330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G O
>> 5.4.51-xxx #1
>> [330321.196900] Hardware name: Overlake (DT)
>> [330321.201038] Call trace:
>> [330321.203660] dump_backtrace+0x0/0x1d0
>> [330321.207533] show_stack+0x20/0x2c
>> [330321.211048] dump_stack+0xe8/0x150
>> [330321.214656] panic+0x18c/0x3b4
>> [330321.217901] out_of_memory+0x4c0/0x6e4
>> [330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
>> [330321.226722] alloc_pages_current+0x21c/0x2b0
>> [330321.231220] alloc_slab_page+0x1e0/0x7d8
>> [330321.235361] new_slab+0x2e8/0x2f8
>> [330321.238874] ___slab_alloc+0x45c/0x59c
>> [330321.242835] kmem_cache_alloc+0x2d4/0x360
>> [330321.247065] getname_flags+0x6c/0x2a8
>> [330321.250938] user_path_at_empty+0x3c/0x68
>> [330321.255168] do_readlinkat+0x7c/0x17c
>> [330321.259039] __arm64_sys_readlinkat+0x5c/0x70
>> [330321.263627] el0_svc_handler+0x1b8/0x32c
>> [330321.267767] el0_svc+0x10/0x14
>> [330321.271026] SMP: stopping secondary CPUs
>> [330321.275382] Starting crashdump kernel...
>> [330321.279526] Bye!
>
> Do you have the full oom splat? The fact that previous oom killer
> invocations haven't helped and all the eligible tasks have been killed
> and you still hit the oom would suggest there is a lot of memory
> allocated without a direct relation to tasks. I fail to see how
> min_free_kbytes would be related.
OOM splat below. I see we had kmem leak detection turned on here. We
haven't run the stress test with kmem leak detection since uncovering
the low min_free_kbytes. During the investigation we wanted to make sure
there are no kmem leaks; we didn't find significant leaks.
[330319.234959]
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/dbus-broker.service,task=dbus-broker,pid=541,uid=999
[330319.251380] Out of memory: Killed process 541 (dbus-broker)
total-vm:4400kB, anon-rss:892kB, file-rss:380kB, shmem-rss:0kB, UID:999
pgtables:44kB oom_score_adj:-900
[330319.267587] oom_reaper: reaped process 541 (dbus-broker), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[330319.766059] systemd invoked oom-killer:
gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
[330319.776060] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G
O 5.4.51-xxx #1
[330319.790612] Call trace:
[330319.793240] dump_backtrace+0x0/0x1d0
[330319.797112] show_stack+0x20/0x2c
[330319.800628] dump_stack+0xe8/0x150
[330319.804234] dump_header+0x80/0x494
[330319.807925] out_of_memory+0x480/0x6e4
[330319.811887] __alloc_pages_nodemask+0xbdc/0x1c90
[330319.816745] alloc_pages_current+0x21c/0x2b0
[330319.821244] alloc_slab_page+0x1e0/0x7d8
[330319.825383] new_slab+0x2e8/0x2f8
[330319.828895] ___slab_alloc+0x45c/0x59c
[330319.832854] kmem_cache_alloc+0x2d4/0x360
[330319.837086] getname_flags+0x6c/0x2a8
[330319.840958] user_path_at_empty+0x3c/0x68
[330319.845188] do_readlinkat+0x7c/0x17c
[330319.849059] __arm64_sys_readlinkat+0x5c/0x70
[330319.853648] el0_svc_handler+0x1b8/0x32c
[330319.857788] el0_svc+0x10/0x14
[330319.861064] Mem-Info:
[330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
active_file:6418 inactive_file:3869 isolated_file:2
unevictable:0 dirty:8 writeback:1 unstable:0
slab_reclaimable:34660 slab_unreclaimable:795718
mapped:1256 shmem:165765 pagetables:689 bounce:0
free:340962 free_pcp:4672 free_cma:0
[330319.898873] Node 0 active_anon:242976kB inactive_anon:436904kB
active_file:25672kB inactive_file:15476kB unevictable:0kB
isolated(anon):0kB isolated(file):8kB mapped:5024kB dirty:32kB
writeback:4kB shmem:663060kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 73728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB
high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB
inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB
managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB
bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
[330319.958376] lowmem_reserve[]: 0 15360 15360
[330319.962814] Node 0 Movable free:1351196kB min:2544kB low:4508kB
high:6472kB active_anon:188352kB inactive_anon:376856kB
active_file:26120kB inactive_file:15308kB unevictable:0kB
writepending:32kB present:1966080kB managed:1966080kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:8632kB
local_pcp:1336kB free_cma:0kB
[330319.993157] lowmem_reserve[]: 0 0 0
[330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
[330320.009592] Node 0 Movable: 16382*4kB (M) 2980*8kB (M) 311*16kB (M)
77*32kB (M) 28*64kB (M) 5*128kB (M) 6*256kB (M) 1*512kB (M) 1*1024kB (M)
120*2048kB (M) 245*4096kB (M) = 1351592kB
[330320.026541] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[330320.035631] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=32768kB
[330320.044543] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[330320.053363] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=64kB
[330320.062004] 176215 total pagecache pages
[330320.066165] 2046684 pages RAM
[330320.069339] 0 pages HighMem/MovableOnly
[330320.073410] 367615 pages reserved
[330320.076943] 0 pages hwpoisoned
[330320.080190] Unreclaimable slab info:
[330320.083991] Name Used Total
[330320.090244] bio-3 560KB 672KB
[330320.095747] bio-2 757KB 832KB
[330320.101254] nf-frags 7KB 15KB
[330320.106763] fib6_nodes 16KB 16KB
[330320.112270] ip6_dst_cache 70KB 70KB
[330320.117779] RAWv6 154KB 154KB
[330320.123296] UDPv6 246KB 246KB
[330320.128806] TCPv6 146KB 146KB
[330320.134315] nf_conntrack 84KB 94KB
[330320.139823] io 66KB 80KB
[330320.145332] sd_ext_cdb 3KB 3KB
[330320.150838] virtio_scsi_cmd 16KB 16KB
[330320.156345] sgpool-128 29KB 29KB
[330320.161852] sgpool-64 31KB 31KB
[330320.167359] sgpool-32 31KB 31KB
[330320.172867] sgpool-16 15KB 15KB
[330320.178374] sgpool-8 7KB 7KB
[330320.183880] mqueue_inode_cache 31KB 31KB
[330320.189479] jbd2_inode 35KB 43KB
[330320.194989] ext4_system_zone 21KB 55KB
[330320.200497] ext4_bio_post_read_ctx 15KB 15KB
[330320.206453] kioctx 255KB 255KB
[330320.211960] aio_kiocb 54KB 77KB
[330320.217477] dio 196KB 323KB
[330320.222985] bio-1 7KB 7KB
[330320.228499] UNIX 308KB 369KB
[330320.234009] tcp_bind_bucket 20KB 20KB
[330320.239518] ip_fib_trie 16KB 16KB
[330320.245024] ip_fib_alias 15KB 15KB
[330320.250537] ip_dst_cache 64KB 64KB
[330320.256047] RAW 158KB 158KB
[330320.261553] UDP 247KB 247KB
[330320.267060] tw_sock_TCP 15KB 15KB
[330320.272566] request_sock_TCP 30KB 30KB
[330320.278073] TCP 278KB 278KB
[330320.283580] hugetlbfs_inode_cache 62KB 62KB
[330320.289446] eventpoll_pwq 31KB 31KB
[330320.294953] eventpoll_epi 31KB 31KB
[330320.300460] inotify_inode_mark 31KB 31KB
[330320.306061] request_queue 467KB 499KB
[330320.311568] blkdev_ioc 61KB 61KB
[330320.317087] bio-0 259KB 600KB
[330320.322603] biovec-max 3060KB 3718KB
[330320.328112] biovec-64 252KB 315KB
[330320.333619] biovec-16 62KB 78KB
[330320.339127] khugepaged_mm_slot 31KB 35KB
[330320.344723] user_namespace 122KB 122KB
[330320.350230] uid_cache 64KB 64KB
[330320.355738] iommu_iova 1076KB 1076KB
[330320.361245] dmaengine-unmap-2 4KB 4KB
[330320.366754] skbuff_fclone_cache 128KB 160KB
[330320.374394] skbuff_head_cache 79402KB 106265KB
[330320.379908] file_lock_cache 62KB 92KB
[330320.385416] file_lock_ctx 42KB 47KB
[330320.390924] fsnotify_mark_connector 46KB 51KB
[330320.396969] net_namespace 64KB 64KB
[330320.402477] task_delay_info 66KB 78KB
[330320.407984] taskstats 63KB 63KB
[330320.413491] proc_dir_entry 279KB 289KB
[330320.418998] pde_opener 31KB 31KB
[330320.424507] seq_file 63KB 94KB
[330320.430014] sigqueue 55KB 55KB
[330320.435525] shmem_inode_cache 1086KB 1221KB
[330320.441036] kernfs_iattrs_cache 782KB 826KB
[330320.446746] kernfs_node_cache 7943KB 8221KB
[330320.452306] mnt_cache 1579KB 2756KB
[330320.457813] filp 265KB 265KB
[330320.463321] names_cache 21543KB 21543KB
[330320.468833] hashtab_node 115KB 131KB
[330320.474341] ebitmap_node 626KB 641KB
[330320.479849] avtab_node 1047KB 1063KB
[330320.485361] avc_node 118KB 118KB
[330320.490868] iint_cache 23KB 23KB
[330320.496376] lsm_inode_cache 8578KB 8578KB
[330320.501902] lsm_file_cache 147KB 552KB
[330320.507416] key_jar 157KB 157KB
[330320.512927] nsproxy 31KB 31KB
[330320.518436] vm_area_struct 89KB 127KB
[330320.523944] mm_struct 252KB 252KB
[330320.529451] fs_cache 64KB 64KB
[330320.534959] files_cache 255KB 255KB
[330320.540468] signal_cache 507KB 569KB
[330320.545979] sighand_cache 633KB 841KB
[330320.551488] task_struct 1721KB 1940KB
[330320.556997] cred_jar 119KB 136KB
[330320.562504] anon_vma_chain 35KB 47KB
[330320.568011] anon_vma 73KB 95KB
[330320.573520] pid 101KB 120KB
[330320.579029] numa_policy 3KB 3KB
[330320.584536] trace_event_file 262KB 262KB
[330320.590041] ftrace_event_field 184KB 184KB
[330320.595638] pool_workqueue 128KB 128KB
[330320.601146] task_group 64KB 64KB
[330320.606652] vmap_area 77KB 78KB
[330320.612159] page->ptl 517KB 517KB
[330320.617665] kmemleak_scan_area 47KB 47KB
[330320.623262] kmemleak_object 2449340KB 2449340KB
[330320.628773] kmalloc-8k 4848KB 4928KB
[330320.634423] kmalloc-4k 48944KB 61856KB
[330320.639946] kmalloc-2k 11768KB 12480KB
[330320.645453] kmalloc-1k 10752KB 10752KB
[330320.651049] kmalloc-512 87024KB 94124KB
[330320.656561] kmalloc-256 2433KB 2528KB
[330320.662359] kmalloc-128 24071KB 29104KB
[330320.667869] kmem_cache_node 867KB 896KB
[330320.673377] kmem_cache 2162KB 2171KB
[330320.678881] Tasks state (memory values in pages):
[330320.683848] [ pid ] uid tgid total_vm rss pgtables_bytes
swapents oom_score_adj name
[330320.692867] [ 238] 0 238 3446 1394 61440
0 -1000 systemd-udevd
[330320.702174] [ 20024] 62412 20008 313895 0 176128
0 0 test_ntttcp
[330320.711281] [ 20891] 64250 20885 67810 0 49152
0 0 test_cpuloadgen
[330320.720743] [ 20704] 62689 20704 157371 23 90112
0 0 fio
[330320.729119] [ 20708] 62689 20708 157372 5 90112
0 0 fio
[330320.737496] [ 20709] 62689 20709 157373 6 90112
0 0 fio
[330320.745874] [ 20710] 62689 20710 157374 6 90112
0 0 fio
[330320.754267] [ 4698] 0 4698 2231 0 57344
0 0 umount
[330320.762919] [ 4699] 0 4699 2231 0 57344
0 0 umount
[330320.771565] [ 4700] 0 4700 2231 0 53248
0 0 umount
[330320.780212] [ 4701] 0 4701 2231 0 57344
0 0 umount
[330320.788858] [ 4705] 0 4705 2231 0 53248
0 0 umount
[330320.797505] [ 4706] 0 4706 2231 0 57344
0 0 umount
[330320.806152] [ 4707] 0 4707 2231 0 61440
0 0 umount
[330320.814798] [ 4709] 0 4709 2265 0 57344
0 0 veritysetup
[330320.823891] [ 4715] 0 4715 2231 0 61440
0 0 umount
[330320.832537] [ 4716] 0 4716 2231 0 57344
0 0 umount
[330320.841185] [ 4717] 0 4717 2231 0 57344
0 0 umount
[330320.849832] [ 4721] 0 4721 2231 0 57344
0 0 umount
[330320.858478] [ 4722] 0 4722 2231 0 57344
0 0 umount
[330320.867124] [ 4723] 0 4723 2231 0 57344
0 0 umount
[330320.875770] [ 4728] 0 4728 2231 0 61440
0 0 umount
[330320.884423] [ 4729] 0 4729 2231 0 57344
0 0 umount
[330320.893075] [ 4730] 0 4730 2231 0 57344
0 0 umount
[330320.901722] [ 4731] 0 4731 2231 0 57344
0 0 umount
[330320.910369] [ 4732] 0 4732 2231 0 61440
0 0 umount
[330320.919016] [ 4733] 0 4733 2231 0 57344
0 0 umount
[330320.927662] [ 4735] 0 4735 2231 0 61440
0 0 umount
[330320.936307] [ 4736] 0 4736 2231 0 61440
0 0 umount
[330320.944953] [ 4737] 0 4737 2231 0 57344
0 0 umount
[330320.953599] [ 4738] 0 4738 2231 0 61440
0 0 umount
[330320.962245] [ 4739] 0 4739 2231 0 57344
0 0 umount
[330320.970891] [ 4740] 0 4740 2231 0 53248
0 0 umount
[330320.979536] [ 4744] 0 4744 2231 0 61440
0 0 umount
[330320.988187] [ 4746] 0 4746 2231 0 57344
0 0 umount
[330320.996832] [ 4747] 0 4747 2231 0 61440
0 0 umount
[330321.005479] [ 4757] 0 4757 2265 0 53248
0 0 veritysetup
[330321.014573] [ 4758] 0 4758 2231 0 57344
0 0 umount
[330321.023225] [ 4759] 0 4759 2231 0 57344
0 0 umount
[330321.031872] [ 4760] 0 4760 2231 0 61440
0 0 umount
[330321.040519] [ 5922] 0 5922 3012 0 61440
0 0 systemd-user-ru
[330321.049972] [ 6557] 0 6557 2231 0 61440
0 0 umount
[330321.058618] [ 6558] 0 6558 2231 0 61440
0 0 umount
[330321.067264] [ 6563] 0 6563 2231 0 57344
0 0 umount
[330321.075910] [ 6567] 0 6567 2231 0 57344
0 0 umount
[330321.084556] [ 6569] 0 6569 2231 0 53248
0 0 umount
[330321.093194] [ 6570] 0 6570 2231 0 65536
0 0 umount
[330321.101840] [ 6575] 0 6575 2231 0 57344
0 0 umount
[330321.110485] [ 6578] 0 6578 2231 0 61440
0 0 umount
[330321.119132] [ 6579] 0 6579 2231 0 57344
0 0 umount
[330321.127778] [ 6580] 0 6580 2231 0 61440
0 0 umount
[330321.136425] [ 7215] 0 7215 5087 0 69632
0 0 systemd-journal
[330321.145879] [ 8410] 0 8410 5087 0 65536
0 0 systemd-journal
[330321.155336] [ 9603] 0 9603 5087 0 73728
0 0 systemd-journal
[330321.164790] [ 10366] 0 10366 3012 0 61440
0 0 systemd-user-ru
[330321.174240] Out of memory and no killable processes...
[330321.179658] Kernel panic - not syncing: System is deadlocked on memory
[330321.186489] CPU: 4 PID: 1 Comm: systemd Kdump: loaded Tainted: G
O 5.4.51-xxx #1
[330321.201038] Call trace:
[330321.203660] dump_backtrace+0x0/0x1d0
[330321.207533] show_stack+0x20/0x2c
[330321.211048] dump_stack+0xe8/0x150
[330321.214656] panic+0x18c/0x3b4
[330321.217901] out_of_memory+0x4c0/0x6e4
[330321.221863] __alloc_pages_nodemask+0xbdc/0x1c90
[330321.226722] alloc_pages_current+0x21c/0x2b0
[330321.231220] alloc_slab_page+0x1e0/0x7d8
[330321.235361] new_slab+0x2e8/0x2f8
[330321.238874] ___slab_alloc+0x45c/0x59c
[330321.242835] kmem_cache_alloc+0x2d4/0x360
[330321.247065] getname_flags+0x6c/0x2a8
[330321.250938] user_path_at_empty+0x3c/0x68
[330321.255168] do_readlinkat+0x7c/0x17c
[330321.259039] __arm64_sys_readlinkat+0x5c/0x70
[330321.263627] el0_svc_handler+0x1b8/0x32c
[330321.267767] el0_svc+0x10/0x14
[330321.271026] SMP: stopping secondary CPUs
[330321.275382] Starting crashdump kernel...
[330321.279526] Bye!
>
>> Then while searching I came across documented warning below. In above
>> instance panic after OOM kills happened after 3+ days of stress run (a
>> mixure of ttcp, cpuloadgen and fio).
>>
>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-configuration_tools-configuring_system_memory_capacity
>>
>> Warning
>>
>> Extreme values can damage your system. Setting min_free_kbytes to an
>> extremely low value prevents the system from reclaiming memory, which can
>> result in system hangs and OOM-killing processes. However, setting
>> min_free_kbytes too high (for example, to 5–10% of total system memory)
>> causes the system to enter an out-of-memory state immediately, resulting in
>> the system spending too much time reclaiming memory.
>
> The auto tuned value should never reach such a low value to cause
> problems.
The auto tuned value is incorrect after the hotplug memory operation; in
our use case memory hot add occurs very early during boot.
Thanks,
Vijay
On Wed 16-09-20 11:28:40, Vijay Balakrishna wrote:
[...]
> OOM splat below. I see we had kmem leak detection turned on here. We
> haven't run stress with kmem leak detection since uncovereing low
> min_free_kbytes. During investigation we wanted to make sure there is no
> kmem leaks, we didn't find significant leaks detected.
>
> [330319.766059] systemd invoked oom-killer:
> gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
[...]
> [330319.861064] Mem-Info:
> [330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
> active_file:6418 inactive_file:3869 isolated_file:2
> unevictable:0 dirty:8 writeback:1 unstable:0
> slab_reclaimable:34660 slab_unreclaimable:795718
> mapped:1256 shmem:165765 pagetables:689 bounce:0
> free:340962 free_pcp:4672 free_cma:0
The memory consumption is predominantly in slab (unreclaimable). Only
~8% of the memory is on LRUs (anonymous + file). Slab (both reclaimable
and unreclaimable) is ~40%. So there is still a lot of memory
unaccounted (direct users of the page allocator). This would partially
explain why the oom killer is not able to make progress and eventually
panics because it is the kernel which is blowing the memory consumption.
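(As a quick cross-check of those percentages, using the page counters
from the Mem-Info block above and assuming 4kB pages; a throwaway
calculation, not part of the report:)
/* Rough cross-check of the "~8% on LRUs, ~40% in slab" estimate. */
#include <stdio.h>
int main(void)
{
        long lru = 60744 + 109226 + 6418 + 3869;  /* anon + file LRU pages */
        long slab = 34660 + 795718;               /* reclaimable + unreclaimable */
        long total = 2046684;                     /* "pages RAM" from the splat */
        printf("LRU : %ld pages, %.1f%% of RAM\n", lru, 100.0 * lru / total);
        printf("slab: %ld pages, %.1f%% of RAM (~%.1f GB)\n",
               slab, 100.0 * slab / total, slab * 4.0 / (1024 * 1024));
        return 0;
}
That works out to ~8.8% on the LRUs and ~40.6% (about 3.2GB) in slab,
most of which, per the unreclaimable slab dump in the splat, is
kmemleak_object.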
There is still ~1G free memory but the problem is that this is a
GFP_KERNEL request which is not allowed to consume Movable memory.
Zone normal is depleted and therefore it cannot satisfy this request
even when there are some order-1 pages available.
> [330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB
> high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB
> inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB
> managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB
> bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
[...]
> [330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB
> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
I do not see the state of swap in the oom splat so I assume you have
swap disabled. If that is the case then the memory reclaim cannot really
do much for this request. There is almost no page cache to reclaim.
That being said I do not see how an increased min_free_kbytes could help
for this particular OOM situation. If there is really any relation it is
more of an unintended side effect.
[...]
> > > Extreme values can damage your system. Setting min_free_kbytes to an
> > > extremely low value prevents the system from reclaiming memory, which can
> > > result in system hangs and OOM-killing processes. However, setting
> > > min_free_kbytes too high (for example, to 5–10% of total system memory)
> > > causes the system to enter an out-of-memory state immediately, resulting in
> > > the system spending too much time reclaiming memory.
> >
> > The auto tuned value should never reach such a low value to cause
> > problems.
>
> The auto tuned value is incorrect post hotplug memory operation, in our use
> case memoy hot add occurs very early during boot.
Define incorrect. What are the actual values? Have you tried to increase
the value manually after the hotplug?
--
Michal Hocko
SUSE Labs
On 9/17/2020 5:12 AM, Michal Hocko wrote:
> On Wed 16-09-20 11:28:40, Vijay Balakrishna wrote:
> [...]
>> OOM splat below. I see we had kmem leak detection turned on here. We
>> haven't run stress with kmem leak detection since uncovereing low
>> min_free_kbytes. During investigation we wanted to make sure there is no
>> kmem leaks, we didn't find significant leaks detected.
>>
>> [330319.766059] systemd invoked oom-killer:
>> gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=1, oom_score_adj=0
>
> [...]
>> [330319.861064] Mem-Info:
>> [330319.863519] active_anon:60744 inactive_anon:109226 isolated_anon:0
>> active_file:6418 inactive_file:3869 isolated_file:2
>> unevictable:0 dirty:8 writeback:1 unstable:0
>> slab_reclaimable:34660 slab_unreclaimable:795718
>> mapped:1256 shmem:165765 pagetables:689 bounce:0
>> free:340962 free_pcp:4672 free_cma:0
>
> The memory consumption is predominantly in slab (unreclaimable). Only
> ~8% of the memory is on LRUs (anonymous + file). Slab (both reclaimable
> and unreclaimable) is ~40%. So there is still a lot of memory
> unaccounted for (direct users of the page allocator). This would partially
> explain why the oom killer is not able to make progress and eventually
> panics, because it is the kernel itself that is blowing up the memory
> consumption.
>
> There is still ~1G of free memory, but the problem is that this is a
> GFP_KERNEL request, which is not allowed to consume Movable memory.
> Zone Normal is depleted (free is below its min watermark) and therefore
> cannot satisfy this request even though some order-1 pages are still free.
>
>> [330319.928124] Node 0 Normal free:12652kB min:14344kB low:19092kB
>> high:23840kB active_anon:55340kB inactive_anon:60276kB active_file:60kB
>> inactive_file:128kB unevictable:0kB writepending:4kB present:6220656kB
>> managed:4750196kB mlocked:0kB kernel_stack:9568kB pagetables:2756kB
>> bounce:0kB free_pcp:10056kB local_pcp:1376kB free_cma:0kB
> [...]
>> [330319.996879] Node 0 Normal: 3138*4kB (UME) 38*8kB (UM) 0*16kB 0*32kB
>> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12856kB
>
> I do not see the state of swap in the oom splat so I assume you have
> swap disabled. If that is the case then the memory reclaim cannot really
> do much for this request. There is almost no page cache to reclaim.
No swap configured in our system.
>
> That being said, I do not see how an increased min_free_kbytes could help
> in this particular OOM situation. If there is really any relation, it is
> more of an unintended side effect.
I haven't had a chance to rerun the stress test with kmem leak detection
to see whether we still get OOM kills after restoring min_free_kbytes.
>
> [...]
>>>> Extreme values can damage your system. Setting min_free_kbytes to an
>>>> extremely low value prevents the system from reclaiming memory, which can
>>>> result in system hangs and OOM-killing processes. However, setting
>>>> min_free_kbytes too high (for example, to 5–10% of total system memory)
>>>> causes the system to enter an out-of-memory state immediately, resulting in
>>>> the system spending too much time reclaiming memory.
>>>
>>> The auto tuned value should never reach a value low enough to cause
>>> problems.
>>
>> The auto tuned value is incorrect post hotplug memory operation, in our use
>> case memory hot add occurs very early during boot.
>
> Define incorrect. What are the actual values? Have you tried to increase
> the value manually after the hotplug?
In our case, an SoC with 8GB memory, the system tuned min_free_kbytes
- first to 22528
- we perform memory hot add very early in boot
- now min_free_kbytes is 8703
Before looking at the code, I first manually restored min_free_kbytes soon
after boot, reran the stress test, and didn't notice the symptoms I mentioned
in the change log.
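For reference, those two numbers are consistent with the two tuning paths
involved: khugepaged's set_recommended_min_free_kbytes() and the generic
sqrt-based tune in init_per_zone_wmark_min(), which runs again on hotplug.
Below is a simplified userspace sketch of the arithmetic; the 4K page size,
2MB pageblock and single-populated-zone figures are assumptions that happen
to reproduce the numbers, not values read from the device:

/*
 * Simplified model (userspace, not kernel code) of the two
 * min_free_kbytes calculations discussed in this thread.
 */
#include <math.h>
#include <stdio.h>

#define PAGE_KB			4	/* 4K pages (assumption) */
#define PAGEBLOCK_PAGES		512	/* 2MB pageblocks (assumption) */
#define MIGRATE_PCPTYPES	3

/* khugepaged's recommendation, minus the 5%-of-lowmem cap (no effect here) */
static long thp_recommended_min_kbytes(long nr_populated_zones)
{
	long pages = PAGEBLOCK_PAGES * nr_populated_zones * 2;

	pages += PAGEBLOCK_PAGES * nr_populated_zones *
		 MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
	return pages * PAGE_KB;
}

/* init_per_zone_wmark_min()'s generic tune: ~sqrt(16 * lowmem_kbytes) */
static long generic_min_free_kbytes(long lowmem_kbytes)
{
	return (long)sqrt(16.0 * (double)lowmem_kbytes);
}

int main(void)
{
	/* a single populated zone reproduces the 22528 seen before hot add */
	printf("THP recommendation: %ld kB\n", thp_recommended_min_kbytes(1));
	/* ~4.75GB of managed Normal memory lands near the 8703 seen after */
	printf("generic auto-tune:  %ld kB\n", generic_min_free_kbytes(4750196));
	return 0;
}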
Thanks,
Vijay
On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
[...]
> > > The auto tuned value is incorrect post hotplug memory operation, in our use
> > > case memory hot add occurs very early during boot.
> > Define incorrect. What are the actual values? Have you tried to increase
> > the value manually after the hotplug?
>
> In our case SoC with 8GB memory, system tuned min_free_kbytes
> - first to 22528
> - we perform memory hot add very early in boot
What was the original and after-the-hotplug size of memory and layout?
I suspect that all the hotplugged memory is in Movable zone, right?
> - now min_free_kbytes is 8703
>
> Before looking at code, first I manually restored min_free_kbytes soon after
> boot, reran stress and didn't notice symptoms I mentioned in change log.
This is really surprising, and I strongly suspect that an earlier reclaim
just changed the timing enough that the workload spread the memory
pressure over a longer time, which might have been enough to recycle
some of the unreclaimable memory due to its natural lifetime. But this
is pure speculation; much more data would be needed to analyze this.
In any case, your stress test is overprovisioning your Normal zone and
the increased min_free_kbytes just papers over the sizing problem.
--
Michal Hocko
SUSE Labs
On 9/17/2020 10:45 PM, Michal Hocko wrote:
> On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
> [...]
>>>> The auto tuned value is incorrect post hotplug memory operation, in our use
>>>> case memory hot add occurs very early during boot.
>>> Define incorrect. What are the actual values? Have you tried to increase
>>> the value manually after the hotplug?
>>
>> In our case SoC with 8GB memory, system tuned min_free_kbytes
>> - first to 22528
>> - we perform memory hot add very early in boot
>
> What was the original and after-the-hotplug size of memory and layout?
> I suspect that all the hotplugged memory is in Movable zone, right?
Yes, added ~1.92GB as Movable type, booting with 6GB at start.
>
>> - now min_free_kbytes is 8703
>>
>> Before looking at code, first I manually restored min_free_kbytes soon after
>> boot, reran stress and didn't notice symptoms I mentioned in change log.
>
> This is really surprising, and I strongly suspect that an earlier reclaim
> just changed the timing enough that the workload spread the memory
> pressure over a longer time, which might have been enough to recycle
> some of the unreclaimable memory due to its natural lifetime. But this
> is pure speculation; much more data would be needed to analyze this.
>
> In any case, your stress test is overprovisioning your Normal zone and
> the increased min_free_kbytes just papers over the sizing problem.
>
It is a synthetic workload, likely not sized correctly; I need to check. I
feel that having a higher min_free_kbytes kept GFP_ATOMIC allocations from
failing. I have seen NETDEV WATCHDOG timeouts with a stack trace trying to
allocate memory, looping in the net rx receive path.
Thanks,
Vijay
On Fri 18-09-20 08:32:13, Vijay Balakrishna wrote:
>
>
> On 9/17/2020 10:45 PM, Michal Hocko wrote:
> > On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
> > [...]
> > > > > The auto tuned value is incorrect post hotplug memory operation, in our use
> > > > > case memory hot add occurs very early during boot.
> > > > Define incorrect. What are the actual values? Have you tried to increase
> > > > the value manually after the hotplug?
> > >
> > > In our case SoC with 8GB memory, system tuned min_free_kbytes
> > > - first to 22528
> > > - we perform memory hot add very early in boot
> >
> > What was the original and after-the-hotplug size of memory and layout?
> > I suspect that all the hotplugged memory is in Movable zone, right?
>
> Yes, added ~1.92GB as Movable type, booting with 6GB at start.
>
> >
> > > - now min_free_kbytes is 8703
> > >
> > > Before looking at code, first I manually restored min_free_kbytes soon after
> > > boot, reran stress and didn't notice symptoms I mentioned in change log.
> >
> > This is really surprising, and I strongly suspect that an earlier reclaim
> > just changed the timing enough that the workload spread the memory
> > pressure over a longer time, which might have been enough to recycle
> > some of the unreclaimable memory due to its natural lifetime. But this
> > is pure speculation; much more data would be needed to analyze this.
> >
> > In any case, your stress test is overprovisioning your Normal zone and
> > the increased min_free_kbytes just papers over the sizing problem.
> >
>
> It is a synthetic workload, likely not sized correctly; I need to check. I
> feel that having a higher min_free_kbytes kept GFP_ATOMIC allocations from
> failing.
Yes, a higher min_free_kbytes will help GFP_ATOMIC, but only to some
degree. Nobody should depend on an atomic allocation for correctness;
it is just way too easy for it to fail under higher memory pressure.
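To illustrate, the usual pattern on the consumer side is to treat GFP_ATOMIC
failure as an expected event: drop the frame, bump a counter and move on,
rather than retrying until the watchdog fires. An illustrative rx fragment
(a sketch, not taken from any real driver):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void example_rx_one(struct net_device *dev, void *hw_buf, int len)
{
	struct sk_buff *skb;

	skb = netdev_alloc_skb(dev, len);	/* GFP_ATOMIC under the hood */
	if (unlikely(!skb)) {
		/* atomic allocation is allowed to fail: drop and move on */
		dev->stats.rx_dropped++;
		return;
	}

	skb_put_data(skb, hw_buf, len);
	skb->protocol = eth_type_trans(skb, dev);
	netif_receive_skb(skb);
}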
> I have seen
> NETDEV WATCHDOG timeouts with a stack trace trying to allocate memory,
> looping in the net rx receive path.
You should talk to net folks.
--
Michal Hocko
SUSE Labs