2019-08-23 23:40:28

by Adric Blake

[permalink] [raw]
Subject: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

Synopsis:
A WARN_ON_ONCE is hit twice in set_task_reclaim_state under the
following conditions:
- a memory cgroup has been created and a task assigned it it
- memory.limit_in_bytes has been set
- memory has filled up, likely from cache

In my usage, I create a cgroup under the current session scope and
assign a task to it. I then set memory.limit_in_bytes and
memory.soft_limit_in_bytes for the cgroup to reasonable values, say
1G/512M. The program accesses large files frequently and gradually
fills memory with the page cache. The warnings appears when the
entirety of the system memory is filled, presumably from other
programs.

If I wait until the program has filled the entirety of system memory
with cache and then assign a memory limit, the warnings appear
immediately.

I am building the linux git. I first noticed this issue with the
drm-tip 5.3rc3 and 5.3rc4 kernels, and tested linux master after
5.3rc5 to confirm the bug more resoundingly.

Here are the warnings.

[38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245
set_task_reclaim_state+0x1e/0x40
[38491.963106] Modules linked in: iwlmvm mac80211 libarc4 iwlwifi
cfg80211 xt_comment nls_iso8859_1 nls_cp437 vfat fat xfs jfs btrfs xor
raid6_pq libcrc32c ccm tun rfcomm fuse xt_tcpudp ip6t_REJECT
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_owner
snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter bnep ext4
crc32c_generic mbcache jbd2 snd_hda_codec_realtek
snd_hda_codec_generic snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc
coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel
snd_soc_acpi i915 snd_soc_core kvm snd_compress ac97_bus
snd_pcm_dmaengine snd_hda_intel i2c_algo_bit btusb irqbypass
drm_kms_helper btrtl snd_hda_codec dell_laptop btbcm crct10dif_pclmul
snd_hda_core crc32c_intel btintel iTCO_wdt ghash_clmulni_intel drm
ledtrig_audio aesni_intel iTCO_vendor_support snd_hwdep dell_wmi
rtsx_usb_ms r8169 dell_smbios aes_x86_64 mei_hdcp crypto_simd
intel_gtt bluetooth snd_pcm cryptd dcdbas
[38491.963155] wmi_bmof dell_wmi_descriptor intel_rapl_msr
glue_helper snd_timer joydev intel_cstate snd realtek memstick
dell_smm_hwmon mousedev psmouse input_leds libphy intel_uncore
ecdh_generic ecc crc16 rfkill intel_rapl_perf soundcore i2c_i801
agpgart mei_me tpm_crb syscopyarea sysfillrect sysimgblt mei
intel_xhci_usb_role_switch fb_sys_fops idma64 tpm_tis roles
processor_thermal_device intel_rapl_common i2c_hid tpm_tis_core
int3403_thermal intel_soc_dts_iosf battery wmi intel_lpss_pci
intel_lpss intel_pch_thermal tpm int3400_thermal int3402_thermal
acpi_thermal_rel int340x_thermal_zone rng_core intel_hid ac
sparse_keymap evdev mac_hid crypto_user ip_tables x_tables
hid_multitouch rtsx_usb_sdmmc mmc_core rtsx_usb hid_logitech_hidpp
sr_mod cdrom sd_mod uas usb_storage hid_logitech_dj hid_generic usbhid
hid ahci serio_raw libahci atkbd libps2 libata xhci_pci scsi_mod
xhci_hcd crc32_pclmul i8042 serio f2fs [last unloaded: cfg80211]
[38491.963221] CPU: 7 PID: 175 Comm: kswapd0 Not tainted
5.3.0-rc5+149+gbb7ba8069de9 #1
[38491.963222] Hardware name: Dell Inc. Inspiron 5570/09YTN7, BIOS
1.2.3 05/15/2019
[38491.963226] RIP: 0010:set_task_reclaim_state+0x1e/0x40
[38491.963228] Code: 78 a9 e7 ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 55 48 89 f5 53 48 89 fb 48 85 ed 48 8b 83 08 08 00 00 74 11 48 85
c0 74 02 <0f> 0b 48 89 ab 08 08 00 00 5b 5d c3 48 85 c0 75 f1 0f 0b 48
89 ab
[38491.963229] RSP: 0018:ffff8c898031fc60 EFLAGS: 00010286
[38491.963230] RAX: ffff8c898031fe28 RBX: ffff892aa04ddc40 RCX: 0000000000000000
[38491.963231] RDX: ffff8c898031fc60 RSI: ffff8c898031fcd0 RDI: ffff892aa04ddc40
[38491.963233] RBP: ffff8c898031fcd0 R08: ffff8c898031fd48 R09: ffff89279674b800
[38491.963234] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8c898031fd48
[38491.963235] R13: ffff892a842ef000 R14: ffff892aaf7fc000 R15: 0000000000000000
[38491.963236] FS: 0000000000000000(0000) GS:ffff892aa33c0000(0000)
knlGS:0000000000000000
[38491.963238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38491.963239] CR2: 00007f90628fa000 CR3: 000000027ee0a002 CR4: 00000000003606e0
[38491.963239] Call Trace:
[38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
[38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
[38491.963254] balance_pgdat+0x276/0x540
[38491.963258] kswapd+0x200/0x3f0
[38491.963261] ? wait_woken+0x80/0x80
[38491.963265] kthread+0xfd/0x130
[38491.963267] ? balance_pgdat+0x540/0x540
[38491.963269] ? kthread_park+0x80/0x80
[38491.963273] ret_from_fork+0x35/0x40
[38491.963276] ---[ end trace 727343df67b2398a ]---
[38492.129877] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:248
set_task_reclaim_state+0x2f/0x40
[38492.129879] Modules linked in: iwlmvm mac80211 libarc4 iwlwifi
cfg80211 xt_comment nls_iso8859_1 nls_cp437 vfat fat xfs jfs btrfs xor
raid6_pq libcrc32c ccm tun rfcomm fuse xt_tcpudp ip6t_REJECT
nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_owner
snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter bnep ext4
crc32c_generic mbcache jbd2 snd_hda_codec_realtek
snd_hda_codec_generic snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc
coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel
snd_soc_acpi i915 snd_soc_core kvm snd_compress ac97_bus
snd_pcm_dmaengine snd_hda_intel i2c_algo_bit btusb irqbypass
drm_kms_helper btrtl snd_hda_codec dell_laptop btbcm crct10dif_pclmul
snd_hda_core crc32c_intel btintel iTCO_wdt ghash_clmulni_intel drm
ledtrig_audio aesni_intel iTCO_vendor_support snd_hwdep dell_wmi
rtsx_usb_ms r8169 dell_smbios aes_x86_64 mei_hdcp crypto_simd
intel_gtt bluetooth snd_pcm cryptd dcdbas
[38492.129919] wmi_bmof dell_wmi_descriptor intel_rapl_msr
glue_helper snd_timer joydev intel_cstate snd realtek memstick
dell_smm_hwmon mousedev psmouse input_leds libphy intel_uncore
ecdh_generic ecc crc16 rfkill intel_rapl_perf soundcore i2c_i801
agpgart mei_me tpm_crb syscopyarea sysfillrect sysimgblt mei
intel_xhci_usb_role_switch fb_sys_fops idma64 tpm_tis roles
processor_thermal_device intel_rapl_common i2c_hid tpm_tis_core
int3403_thermal intel_soc_dts_iosf battery wmi intel_lpss_pci
intel_lpss intel_pch_thermal tpm int3400_thermal int3402_thermal
acpi_thermal_rel int340x_thermal_zone rng_core intel_hid ac
sparse_keymap evdev mac_hid crypto_user ip_tables x_tables
hid_multitouch rtsx_usb_sdmmc mmc_core rtsx_usb hid_logitech_hidpp
sr_mod cdrom sd_mod uas usb_storage hid_logitech_dj hid_generic usbhid
hid ahci serio_raw libahci atkbd libps2 libata xhci_pci scsi_mod
xhci_hcd crc32_pclmul i8042 serio f2fs [last unloaded: cfg80211]
[38492.129961] CPU: 7 PID: 175 Comm: kswapd0 Tainted: G W
5.3.0-rc5+149+gbb7ba8069de9 #1
[38492.129962] Hardware name: Dell Inc. Inspiron 5570/09YTN7, BIOS
1.2.3 05/15/2019
[38492.129965] RIP: 0010:set_task_reclaim_state+0x2f/0x40
[38492.129968] Code: 55 48 89 f5 53 48 89 fb 48 85 ed 48 8b 83 08 08
00 00 74 11 48 85 c0 74 02 0f 0b 48 89 ab 08 08 00 00 5b 5d c3 48 85
c0 75 f1 <0f> 0b 48 89 ab 08 08 00 00 5b 5d c3 0f 1f 44 00 00 55 48 89
fd 53
[38492.129969] RSP: 0018:ffff8c898031fd88 EFLAGS: 00010246
[38492.129971] RAX: 0000000000000000 RBX: ffff892aa04ddc40 RCX: 0000000000000000
[38492.129972] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff892aa04ddc40
[38492.129973] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[38492.129974] R10: ffff892aa33d7544 R11: 0000000000000000 R12: ffff8c898031fe40
[38492.129974] R13: 0000000000000200 R14: ffff8c898031fe40 R15: 0000000000000001
[38492.129976] FS: 0000000000000000(0000) GS:ffff892aa33c0000(0000)
knlGS:0000000000000000
[38492.129977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[38492.129978] CR2: 00007fc3a2787010 CR3: 000000035c794001 CR4: 00000000003606e0
[38492.129979] Call Trace:
[38492.129985] balance_pgdat+0x4e6/0x540
[38492.129991] kswapd+0x200/0x3f0
[38492.129994] ? wait_woken+0x80/0x80
[38492.129997] kthread+0xfd/0x130
[38492.130000] ? balance_pgdat+0x540/0x540
[38492.130001] ? kthread_park+0x80/0x80
[38492.130005] ret_from_fork+0x35/0x40
[38492.130008] ---[ end trace 727343df67b2398b ]---

Thanks for reading. Please be gentle.


2019-08-24 01:04:15

by Yang Shi

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage



On 8/23/19 3:00 PM, Adric Blake wrote:
> Synopsis:
> A WARN_ON_ONCE is hit twice in set_task_reclaim_state under the
> following conditions:
> - a memory cgroup has been created and a task assigned it it
> - memory.limit_in_bytes has been set
> - memory has filled up, likely from cache
>
> In my usage, I create a cgroup under the current session scope and
> assign a task to it. I then set memory.limit_in_bytes and
> memory.soft_limit_in_bytes for the cgroup to reasonable values, say
> 1G/512M. The program accesses large files frequently and gradually
> fills memory with the page cache. The warnings appears when the
> entirety of the system memory is filled, presumably from other
> programs.
>
> If I wait until the program has filled the entirety of system memory
> with cache and then assign a memory limit, the warnings appear
> immediately.

It looks the warning is triggered because kswapd set reclaim_state then
the memcg soft limit reclaim in the same kswapd set it again.

But, kswapd and memcg soft limit uses different reclaim_state from
different scan control. It sounds not correct, they should use the same
reclaim_state if they come from the same context if my understanding is
correct.

>
> I am building the linux git. I first noticed this issue with the
> drm-tip 5.3rc3 and 5.3rc4 kernels, and tested linux master after
> 5.3rc5 to confirm the bug more resoundingly.
>
> Here are the warnings.
>
> [38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245
> set_task_reclaim_state+0x1e/0x40
> [38491.963106] Modules linked in: iwlmvm mac80211 libarc4 iwlwifi
> cfg80211 xt_comment nls_iso8859_1 nls_cp437 vfat fat xfs jfs btrfs xor
> raid6_pq libcrc32c ccm tun rfcomm fuse xt_tcpudp ip6t_REJECT
> nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_owner
> snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter bnep ext4
> crc32c_generic mbcache jbd2 snd_hda_codec_realtek
> snd_hda_codec_generic snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
> snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc
> coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel
> snd_soc_acpi i915 snd_soc_core kvm snd_compress ac97_bus
> snd_pcm_dmaengine snd_hda_intel i2c_algo_bit btusb irqbypass
> drm_kms_helper btrtl snd_hda_codec dell_laptop btbcm crct10dif_pclmul
> snd_hda_core crc32c_intel btintel iTCO_wdt ghash_clmulni_intel drm
> ledtrig_audio aesni_intel iTCO_vendor_support snd_hwdep dell_wmi
> rtsx_usb_ms r8169 dell_smbios aes_x86_64 mei_hdcp crypto_simd
> intel_gtt bluetooth snd_pcm cryptd dcdbas
> [38491.963155] wmi_bmof dell_wmi_descriptor intel_rapl_msr
> glue_helper snd_timer joydev intel_cstate snd realtek memstick
> dell_smm_hwmon mousedev psmouse input_leds libphy intel_uncore
> ecdh_generic ecc crc16 rfkill intel_rapl_perf soundcore i2c_i801
> agpgart mei_me tpm_crb syscopyarea sysfillrect sysimgblt mei
> intel_xhci_usb_role_switch fb_sys_fops idma64 tpm_tis roles
> processor_thermal_device intel_rapl_common i2c_hid tpm_tis_core
> int3403_thermal intel_soc_dts_iosf battery wmi intel_lpss_pci
> intel_lpss intel_pch_thermal tpm int3400_thermal int3402_thermal
> acpi_thermal_rel int340x_thermal_zone rng_core intel_hid ac
> sparse_keymap evdev mac_hid crypto_user ip_tables x_tables
> hid_multitouch rtsx_usb_sdmmc mmc_core rtsx_usb hid_logitech_hidpp
> sr_mod cdrom sd_mod uas usb_storage hid_logitech_dj hid_generic usbhid
> hid ahci serio_raw libahci atkbd libps2 libata xhci_pci scsi_mod
> xhci_hcd crc32_pclmul i8042 serio f2fs [last unloaded: cfg80211]
> [38491.963221] CPU: 7 PID: 175 Comm: kswapd0 Not tainted
> 5.3.0-rc5+149+gbb7ba8069de9 #1
> [38491.963222] Hardware name: Dell Inc. Inspiron 5570/09YTN7, BIOS
> 1.2.3 05/15/2019
> [38491.963226] RIP: 0010:set_task_reclaim_state+0x1e/0x40
> [38491.963228] Code: 78 a9 e7 ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00
> 00 55 48 89 f5 53 48 89 fb 48 85 ed 48 8b 83 08 08 00 00 74 11 48 85
> c0 74 02 <0f> 0b 48 89 ab 08 08 00 00 5b 5d c3 48 85 c0 75 f1 0f 0b 48
> 89 ab
> [38491.963229] RSP: 0018:ffff8c898031fc60 EFLAGS: 00010286
> [38491.963230] RAX: ffff8c898031fe28 RBX: ffff892aa04ddc40 RCX: 0000000000000000
> [38491.963231] RDX: ffff8c898031fc60 RSI: ffff8c898031fcd0 RDI: ffff892aa04ddc40
> [38491.963233] RBP: ffff8c898031fcd0 R08: ffff8c898031fd48 R09: ffff89279674b800
> [38491.963234] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8c898031fd48
> [38491.963235] R13: ffff892a842ef000 R14: ffff892aaf7fc000 R15: 0000000000000000
> [38491.963236] FS: 0000000000000000(0000) GS:ffff892aa33c0000(0000)
> knlGS:0000000000000000
> [38491.963238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [38491.963239] CR2: 00007f90628fa000 CR3: 000000027ee0a002 CR4: 00000000003606e0
> [38491.963239] Call Trace:
> [38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
> [38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
> [38491.963254] balance_pgdat+0x276/0x540
> [38491.963258] kswapd+0x200/0x3f0
> [38491.963261] ? wait_woken+0x80/0x80
> [38491.963265] kthread+0xfd/0x130
> [38491.963267] ? balance_pgdat+0x540/0x540
> [38491.963269] ? kthread_park+0x80/0x80
> [38491.963273] ret_from_fork+0x35/0x40
> [38491.963276] ---[ end trace 727343df67b2398a ]---
> [38492.129877] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:248
> set_task_reclaim_state+0x2f/0x40
> [38492.129879] Modules linked in: iwlmvm mac80211 libarc4 iwlwifi
> cfg80211 xt_comment nls_iso8859_1 nls_cp437 vfat fat xfs jfs btrfs xor
> raid6_pq libcrc32c ccm tun rfcomm fuse xt_tcpudp ip6t_REJECT
> nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_owner
> snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter bnep ext4
> crc32c_generic mbcache jbd2 snd_hda_codec_realtek
> snd_hda_codec_generic snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
> snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc
> coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel
> snd_soc_acpi i915 snd_soc_core kvm snd_compress ac97_bus
> snd_pcm_dmaengine snd_hda_intel i2c_algo_bit btusb irqbypass
> drm_kms_helper btrtl snd_hda_codec dell_laptop btbcm crct10dif_pclmul
> snd_hda_core crc32c_intel btintel iTCO_wdt ghash_clmulni_intel drm
> ledtrig_audio aesni_intel iTCO_vendor_support snd_hwdep dell_wmi
> rtsx_usb_ms r8169 dell_smbios aes_x86_64 mei_hdcp crypto_simd
> intel_gtt bluetooth snd_pcm cryptd dcdbas
> [38492.129919] wmi_bmof dell_wmi_descriptor intel_rapl_msr
> glue_helper snd_timer joydev intel_cstate snd realtek memstick
> dell_smm_hwmon mousedev psmouse input_leds libphy intel_uncore
> ecdh_generic ecc crc16 rfkill intel_rapl_perf soundcore i2c_i801
> agpgart mei_me tpm_crb syscopyarea sysfillrect sysimgblt mei
> intel_xhci_usb_role_switch fb_sys_fops idma64 tpm_tis roles
> processor_thermal_device intel_rapl_common i2c_hid tpm_tis_core
> int3403_thermal intel_soc_dts_iosf battery wmi intel_lpss_pci
> intel_lpss intel_pch_thermal tpm int3400_thermal int3402_thermal
> acpi_thermal_rel int340x_thermal_zone rng_core intel_hid ac
> sparse_keymap evdev mac_hid crypto_user ip_tables x_tables
> hid_multitouch rtsx_usb_sdmmc mmc_core rtsx_usb hid_logitech_hidpp
> sr_mod cdrom sd_mod uas usb_storage hid_logitech_dj hid_generic usbhid
> hid ahci serio_raw libahci atkbd libps2 libata xhci_pci scsi_mod
> xhci_hcd crc32_pclmul i8042 serio f2fs [last unloaded: cfg80211]
> [38492.129961] CPU: 7 PID: 175 Comm: kswapd0 Tainted: G W
> 5.3.0-rc5+149+gbb7ba8069de9 #1
> [38492.129962] Hardware name: Dell Inc. Inspiron 5570/09YTN7, BIOS
> 1.2.3 05/15/2019
> [38492.129965] RIP: 0010:set_task_reclaim_state+0x2f/0x40
> [38492.129968] Code: 55 48 89 f5 53 48 89 fb 48 85 ed 48 8b 83 08 08
> 00 00 74 11 48 85 c0 74 02 0f 0b 48 89 ab 08 08 00 00 5b 5d c3 48 85
> c0 75 f1 <0f> 0b 48 89 ab 08 08 00 00 5b 5d c3 0f 1f 44 00 00 55 48 89
> fd 53
> [38492.129969] RSP: 0018:ffff8c898031fd88 EFLAGS: 00010246
> [38492.129971] RAX: 0000000000000000 RBX: ffff892aa04ddc40 RCX: 0000000000000000
> [38492.129972] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff892aa04ddc40
> [38492.129973] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [38492.129974] R10: ffff892aa33d7544 R11: 0000000000000000 R12: ffff8c898031fe40
> [38492.129974] R13: 0000000000000200 R14: ffff8c898031fe40 R15: 0000000000000001
> [38492.129976] FS: 0000000000000000(0000) GS:ffff892aa33c0000(0000)
> knlGS:0000000000000000
> [38492.129977] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [38492.129978] CR2: 00007fc3a2787010 CR3: 000000035c794001 CR4: 00000000003606e0
> [38492.129979] Call Trace:
> [38492.129985] balance_pgdat+0x4e6/0x540
> [38492.129991] kswapd+0x200/0x3f0
> [38492.129994] ? wait_woken+0x80/0x80
> [38492.129997] kthread+0xfd/0x130
> [38492.130000] ? balance_pgdat+0x540/0x540
> [38492.130001] ? kthread_park+0x80/0x80
> [38492.130005] ret_from_fork+0x35/0x40
> [38492.130008] ---[ end trace 727343df67b2398b ]---
>
> Thanks for reading. Please be gentle.

2019-08-24 03:38:29

by Yafang Shao

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Sat, Aug 24, 2019 at 10:57 AM Hillf Danton <[email protected]> wrote:
>
>
> On Fri, 23 Aug 2019 18:00:15 -0400 Adric Blake wrote:
> > Synopsis:
> > A WARN_ON_ONCE is hit twice in set_task_reclaim_state under the
> > following conditions:
> > - a memory cgroup has been created and a task assigned it it
> > - memory.limit_in_bytes has been set
> > - memory has filled up, likely from cache
> >
> Thanks for report.
>
> > In my usage, I create a cgroup under the current session scope and
> > assign a task to it. I then set memory.limit_in_bytes and
> > memory.soft_limit_in_bytes for the cgroup to reasonable values, say
> > 1G/512M. The program accesses large files frequently and gradually
> > fills memory with the page cache. The warnings appears when the
> > entirety of the system memory is filled, presumably from other
> > programs.
> >
> > If I wait until the program has filled the entirety of system memory
> > with cache and then assign a memory limit, the warnings appear
> > immediately.
> >
> > I am building the linux git. I first noticed this issue with the
> > drm-tip 5.3rc3 and 5.3rc4 kernels, and tested linux master after
> > 5.3rc5 to confirm the bug more resoundingly.
> >
> > Here are the warnings.
> >
> > [38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
> > [38491.963106] Modules linked in: iwlmvm mac80211 libarc4 iwlwifi
> > cfg80211 xt_comment nls_iso8859_1 nls_cp437 vfat fat xfs jfs btrfs xor
> > raid6_pq libcrc32c ccm tun rfcomm fuse xt_tcpudp ip6t_REJECT
> > nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_multiport xt_owner
> > snd_hda_codec_hdmi ip6table_filter ip6_tables iptable_filter bnep ext4
> > crc32c_generic mbcache jbd2 snd_hda_codec_realtek
> > snd_hda_codec_generic snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
> > snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp snd_soc_sst_ipc
> > coretemp snd_soc_sst_dsp snd_soc_acpi_intel_match kvm_intel
> > snd_soc_acpi i915 snd_soc_core kvm snd_compress ac97_bus
> > snd_pcm_dmaengine snd_hda_intel i2c_algo_bit btusb irqbypass
> > drm_kms_helper btrtl snd_hda_codec dell_laptop btbcm crct10dif_pclmul
> > snd_hda_core crc32c_intel btintel iTCO_wdt ghash_clmulni_intel drm
> > ledtrig_audio aesni_intel iTCO_vendor_support snd_hwdep dell_wmi
> > rtsx_usb_ms r8169 dell_smbios aes_x86_64 mei_hdcp crypto_simd
> > intel_gtt bluetooth snd_pcm cryptd dcdbas
> > [38491.963155] wmi_bmof dell_wmi_descriptor intel_rapl_msr
> > glue_helper snd_timer joydev intel_cstate snd realtek memstick
> > dell_smm_hwmon mousedev psmouse input_leds libphy intel_uncore
> > ecdh_generic ecc crc16 rfkill intel_rapl_perf soundcore i2c_i801
> > agpgart mei_me tpm_crb syscopyarea sysfillrect sysimgblt mei
> > intel_xhci_usb_role_switch fb_sys_fops idma64 tpm_tis roles
> > processor_thermal_device intel_rapl_common i2c_hid tpm_tis_core
> > int3403_thermal intel_soc_dts_iosf battery wmi intel_lpss_pci
> > intel_lpss intel_pch_thermal tpm int3400_thermal int3402_thermal
> > acpi_thermal_rel int340x_thermal_zone rng_core intel_hid ac
> > sparse_keymap evdev mac_hid crypto_user ip_tables x_tables
> > hid_multitouch rtsx_usb_sdmmc mmc_core rtsx_usb hid_logitech_hidpp
> > sr_mod cdrom sd_mod uas usb_storage hid_logitech_dj hid_generic usbhid
> > hid ahci serio_raw libahci atkbd libps2 libata xhci_pci scsi_mod
> > xhci_hcd crc32_pclmul i8042 serio f2fs [last unloaded: cfg80211]
> > [38491.963221] CPU: 7 PID: 175 Comm: kswapd0 Not tainted 5.3.0-rc5+149+gbb7ba8069de9 #1
> > [38491.963222] Hardware name: Dell Inc. Inspiron 5570/09YTN7, BIOS 1.2.3 05/15/2019
> > [38491.963226] RIP: 0010:set_task_reclaim_state+0x1e/0x40
> > [38491.963228] Code: 78 a9 e7 ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00
> > 00 55 48 89 f5 53 48 89 fb 48 85 ed 48 8b 83 08 08 00 00 74 11 48 85
> > c0 74 02 <0f> 0b 48 89 ab 08 08 00 00 5b 5d c3 48 85 c0 75 f1 0f 0b 48
> > 89 ab
> > [38491.963229] RSP: 0018:ffff8c898031fc60 EFLAGS: 00010286
> > [38491.963230] RAX: ffff8c898031fe28 RBX: ffff892aa04ddc40 RCX: 0000000000000000
> > [38491.963231] RDX: ffff8c898031fc60 RSI: ffff8c898031fcd0 RDI: ffff892aa04ddc40
> > [38491.963233] RBP: ffff8c898031fcd0 R08: ffff8c898031fd48 R09: ffff89279674b800
> > [38491.963234] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8c898031fd48
> > [38491.963235] R13: ffff892a842ef000 R14: ffff892aaf7fc000 R15: 0000000000000000
> > [38491.963236] FS: 0000000000000000(0000) GS:ffff892aa33c0000(0000) knlGS:0000000000000000
> > [38491.963238] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [38491.963239] CR2: 00007f90628fa000 CR3: 000000027ee0a002 CR4: 00000000003606e0
> > [38491.963239] Call Trace:
> > [38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
> > [38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
> > [38491.963254] balance_pgdat+0x276/0x540
> > [38491.963258] kswapd+0x200/0x3f0
> > [38491.963261] ? wait_woken+0x80/0x80
> > [38491.963265] kthread+0xfd/0x130
> > [38491.963267] ? balance_pgdat+0x540/0x540
> > [38491.963269] ? kthread_park+0x80/0x80
> > [38491.963273] ret_from_fork+0x35/0x40
> > [38491.963276] ---[ end trace 727343df67b2398a ]---
>
> Save and restore reclaim state for global reclaimer as it
> can be clobbered by memcg.
>

Hi Hillf,

Thanks for your patch. It could fix this issue.
But I'm wondering if it is proper to place a new scan_control in
mem_cgroup_shrink_node().
Because the page alloction context is stored in the original
scan_control, but this new scan_control beaks it at all.
For example, the sc.nodemask is the page allocation preferred node,
but it is override by the new scan_control, that may cause extra
useless page reclaim, especially in the direct reclaim path.

Thanks
Yafang
> --- a/mm/vmscan.c
> +++ b/bb/vmscan.c
> @@ -253,6 +253,22 @@ static void set_task_reclaim_state(struc
> task->reclaim_state = rs;
> }
>
> +static struct reclaim_state *
> +save_task_reclaim_state(struct task_struct *task)
> +{
> + struct reclaim_state *rs = task->reclaim_state;
> + if (rs)
> + set_task_reclaim_state(task, NULL);
> + return rs;
> +}
> +
> +static void restore_task_reclaim_state(struct task_struct *task,
> + struct reclaim_state *rs)
> +{
> + if (rs)
> + set_task_reclaim_state(task, rs);
> +}
> +
> #ifdef CONFIG_MEMCG
> static bool global_reclaim(struct scan_control *sc)
> {
> @@ -3241,7 +3257,9 @@ unsigned long mem_cgroup_shrink_node(str
> .may_shrinkslab = 1,
> };
> unsigned long lru_pages;
> + struct reclaim_state *rs;
>
> + rs = save_task_reclaim_state(current);
> set_task_reclaim_state(current, &sc.reclaim_state);
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> @@ -3261,6 +3279,7 @@ unsigned long mem_cgroup_shrink_node(str
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> set_task_reclaim_state(current, NULL);
> + restore_task_reclaim_state(current, rs);
> *nr_scanned = sc.nr_scanned;
>
> return sc.nr_reclaimed;
> --
>

2019-08-26 11:34:05

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Fri 23-08-19 18:03:01, Yang Shi wrote:
>
>
> On 8/23/19 3:00 PM, Adric Blake wrote:
> > Synopsis:
> > A WARN_ON_ONCE is hit twice in set_task_reclaim_state under the
> > following conditions:
> > - a memory cgroup has been created and a task assigned it it
> > - memory.limit_in_bytes has been set
> > - memory has filled up, likely from cache
> >
> > In my usage, I create a cgroup under the current session scope and
> > assign a task to it. I then set memory.limit_in_bytes and
> > memory.soft_limit_in_bytes for the cgroup to reasonable values, say
> > 1G/512M. The program accesses large files frequently and gradually
> > fills memory with the page cache. The warnings appears when the
> > entirety of the system memory is filled, presumably from other
> > programs.
> >
> > If I wait until the program has filled the entirety of system memory
> > with cache and then assign a memory limit, the warnings appear
> > immediately.
>
> It looks the warning is triggered because kswapd set reclaim_state then the
> memcg soft limit reclaim in the same kswapd set it again.

Yes, this is indeed the case. The same seems possible from the direct
reclaim AFAICS.

> But, kswapd and memcg soft limit uses different reclaim_state from different
> scan control. It sounds not correct, they should use the same reclaim_state
> if they come from the same context if my understanding is correct.

I haven't checked very closely and I might be wrong but setting the
reclaim state from the mem_cgroup_shrink_node doesn't make any sense in
the current code. The soft limit is always called from the global
reclaim and both kswapd and the direct reclaim already track reclaim
state correctly. We just haven't noticed until now beause the warning is
quite recent and mostly likely only few people tend to use soft limit
these days.

That being said, we should simply do this instead:

From 59d128214a62bf2d83c2a2a9cde887b4817275e7 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 26 Aug 2019 12:43:15 +0200
Subject: [PATCH] mm, memcg: do not set reclaim_state on soft limit reclaim

Adric Blake has noticed the following warning:
[38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
[...]
[38491.963239] Call Trace:
[38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
[38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
[38491.963254] balance_pgdat+0x276/0x540
[38491.963258] kswapd+0x200/0x3f0
[38491.963261] ? wait_woken+0x80/0x80
[38491.963265] kthread+0xfd/0x130
[38491.963267] ? balance_pgdat+0x540/0x540
[38491.963269] ? kthread_park+0x80/0x80
[38491.963273] ret_from_fork+0x35/0x40
[38491.963276] ---[ end trace 727343df67b2398a ]---

which tells us that soft limit reclaim is about to overwrite the
reclaim_state configured up in the call chain (kswapd in this case but
the direct reclaim is equally possible). This means that reclaim stats
would get misleading once the soft reclaim returns and another reclaim
is done.

Fix the warning by dropping set_task_reclaim_state from the soft reclaim
which is always called with reclaim_state set up.

Reported-by: Adric Blake <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c77d1e3761a7..a6c5d0b28321 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3220,6 +3220,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,

#ifdef CONFIG_MEMCG

+/* Only used by soft limit reclaim. Do not reuse for anything else. */
unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
@@ -3235,7 +3236,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
};
unsigned long lru_pages;

- set_task_reclaim_state(current, &sc.reclaim_state);
+ WARN_ON_ONCE(!current->reclaim_state);
+
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);

@@ -3253,7 +3255,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

- set_task_reclaim_state(current, NULL);
*nr_scanned = sc.nr_scanned;

return sc.nr_reclaimed;
--
2.20.1

--
Michal Hocko
SUSE Labs

2019-08-27 10:44:52

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

If there are no objection to the patch I will post it as a standalong
one.

On Mon 26-08-19 12:55:21, Michal Hocko wrote:
> From 59d128214a62bf2d83c2a2a9cde887b4817275e7 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 26 Aug 2019 12:43:15 +0200
> Subject: [PATCH] mm, memcg: do not set reclaim_state on soft limit reclaim
>
> Adric Blake has noticed the following warning:
> [38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
> [...]
> [38491.963239] Call Trace:
> [38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
> [38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
> [38491.963254] balance_pgdat+0x276/0x540
> [38491.963258] kswapd+0x200/0x3f0
> [38491.963261] ? wait_woken+0x80/0x80
> [38491.963265] kthread+0xfd/0x130
> [38491.963267] ? balance_pgdat+0x540/0x540
> [38491.963269] ? kthread_park+0x80/0x80
> [38491.963273] ret_from_fork+0x35/0x40
> [38491.963276] ---[ end trace 727343df67b2398a ]---
>
> which tells us that soft limit reclaim is about to overwrite the
> reclaim_state configured up in the call chain (kswapd in this case but
> the direct reclaim is equally possible). This means that reclaim stats
> would get misleading once the soft reclaim returns and another reclaim
> is done.
>
> Fix the warning by dropping set_task_reclaim_state from the soft reclaim
> which is always called with reclaim_state set up.
>
> Reported-by: Adric Blake <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmscan.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c77d1e3761a7..a6c5d0b28321 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3220,6 +3220,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>
> #ifdef CONFIG_MEMCG
>
> +/* Only used by soft limit reclaim. Do not reuse for anything else. */
> unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> gfp_t gfp_mask, bool noswap,
> pg_data_t *pgdat,
> @@ -3235,7 +3236,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> };
> unsigned long lru_pages;
>
> - set_task_reclaim_state(current, &sc.reclaim_state);
> + WARN_ON_ONCE(!current->reclaim_state);
> +
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>
> @@ -3253,7 +3255,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> - set_task_reclaim_state(current, NULL);
> *nr_scanned = sc.nr_scanned;
>
> return sc.nr_reclaimed;
> --
> 2.20.1
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2019-08-27 11:47:08

by Yafang Shao

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
>
> If there are no objection to the patch I will post it as a standalong
> one.

I have no objection to your patch. It could fix the issue.

I still think that it is not proper to use a new scan_control here as
it breaks the global reclaim context.
This context switch from global reclaim to memcg reclaim is very
subtle change to the subsequent processing, that may cause some
unexpected behavior.
Anyway, we can send this patch as a standalong one.
Feel free to add:

Acked-by: Yafang Shao <[email protected]>

>
> On Mon 26-08-19 12:55:21, Michal Hocko wrote:
> > From 59d128214a62bf2d83c2a2a9cde887b4817275e7 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Mon, 26 Aug 2019 12:43:15 +0200
> > Subject: [PATCH] mm, memcg: do not set reclaim_state on soft limit reclaim
> >
> > Adric Blake has noticed the following warning:
> > [38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
> > [...]
> > [38491.963239] Call Trace:
> > [38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
> > [38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
> > [38491.963254] balance_pgdat+0x276/0x540
> > [38491.963258] kswapd+0x200/0x3f0
> > [38491.963261] ? wait_woken+0x80/0x80
> > [38491.963265] kthread+0xfd/0x130
> > [38491.963267] ? balance_pgdat+0x540/0x540
> > [38491.963269] ? kthread_park+0x80/0x80
> > [38491.963273] ret_from_fork+0x35/0x40
> > [38491.963276] ---[ end trace 727343df67b2398a ]---
> >
> > which tells us that soft limit reclaim is about to overwrite the
> > reclaim_state configured up in the call chain (kswapd in this case but
> > the direct reclaim is equally possible). This means that reclaim stats
> > would get misleading once the soft reclaim returns and another reclaim
> > is done.
> >
> > Fix the warning by dropping set_task_reclaim_state from the soft reclaim
> > which is always called with reclaim_state set up.
> >
> > Reported-by: Adric Blake <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/vmscan.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c77d1e3761a7..a6c5d0b28321 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3220,6 +3220,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >
> > #ifdef CONFIG_MEMCG
> >
> > +/* Only used by soft limit reclaim. Do not reuse for anything else. */
> > unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> > gfp_t gfp_mask, bool noswap,
> > pg_data_t *pgdat,
> > @@ -3235,7 +3236,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> > };
> > unsigned long lru_pages;
> >
> > - set_task_reclaim_state(current, &sc.reclaim_state);
> > + WARN_ON_ONCE(!current->reclaim_state);
> > +
> > sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> > (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> >
> > @@ -3253,7 +3255,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
> >
> > trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
> >
> > - set_task_reclaim_state(current, NULL);
> > *nr_scanned = sc.nr_scanned;
> >
> > return sc.nr_reclaimed;
> > --
> > 2.20.1
> >
> > --
> > Michal Hocko
> > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs

2019-08-27 11:51:37

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue 27-08-19 19:43:49, Yafang Shao wrote:
> On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
> >
> > If there are no objection to the patch I will post it as a standalong
> > one.
>
> I have no objection to your patch. It could fix the issue.
>
> I still think that it is not proper to use a new scan_control here as
> it breaks the global reclaim context.
>
> This context switch from global reclaim to memcg reclaim is very
> subtle change to the subsequent processing, that may cause some
> unexpected behavior.

Why would it break it? Could you be more specific please?

> Anyway, we can send this patch as a standalong one.
> Feel free to add:
>
> Acked-by: Yafang Shao <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs

2019-08-27 11:57:54

by Yafang Shao

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue, Aug 27, 2019 at 7:50 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 27-08-19 19:43:49, Yafang Shao wrote:
> > On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
> > >
> > > If there are no objection to the patch I will post it as a standalong
> > > one.
> >
> > I have no objection to your patch. It could fix the issue.
> >
> > I still think that it is not proper to use a new scan_control here as
> > it breaks the global reclaim context.
> >
> > This context switch from global reclaim to memcg reclaim is very
> > subtle change to the subsequent processing, that may cause some
> > unexpected behavior.
>
> Why would it break it? Could you be more specific please?
>

Hmm, I have explained it when replying to Hillf's patch.
The most suspcious one is settting target_mem_cgroup here, because we
only use it to judge whether it is in global reclaim.
While the memcg softlimit reclaim is really in global reclaims.

Another example the reclaim_idx, if is not same with reclaim_idx in
page allocation context, the reclaimed pages may not be used by the
allocator, especially in the direct reclaim.

And some other things in scan_control.

> > Anyway, we can send this patch as a standalong one.
> > Feel free to add:
> >
> > Acked-by: Yafang Shao <[email protected]>
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

2019-08-27 12:05:05

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue 27-08-19 19:56:16, Yafang Shao wrote:
> On Tue, Aug 27, 2019 at 7:50 PM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 27-08-19 19:43:49, Yafang Shao wrote:
> > > On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > If there are no objection to the patch I will post it as a standalong
> > > > one.
> > >
> > > I have no objection to your patch. It could fix the issue.
> > >
> > > I still think that it is not proper to use a new scan_control here as
> > > it breaks the global reclaim context.
> > >
> > > This context switch from global reclaim to memcg reclaim is very
> > > subtle change to the subsequent processing, that may cause some
> > > unexpected behavior.
> >
> > Why would it break it? Could you be more specific please?
> >
>
> Hmm, I have explained it when replying to Hillf's patch.
> The most suspcious one is settting target_mem_cgroup here, because we
> only use it to judge whether it is in global reclaim.
> While the memcg softlimit reclaim is really in global reclaims.

But we are reclaim the target_mem_cgroup hierarchy. This is the whole
point of the soft reclaim. Push down that hierarchy below the configured
limit. And that is why we absolutely have to switch the reclaim context.

> Another example the reclaim_idx, if is not same with reclaim_idx in
> page allocation context, the reclaimed pages may not be used by the
> allocator, especially in the direct reclaim.

Again, we do not care about that as well. All we care about is to
reclaim _some_ memory to get below the soft limit. This is the semantic
that is not really great but this is how the Soft reclaim has
traditionally worked and why we keep claiming that people shouldn't
really use it. It does lead to over reclaim and that is a design rather
than a bug.

> And some other things in scan_control.

Like?
--
Michal Hocko
SUSE Labs

2019-08-27 12:21:26

by Yafang Shao

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue, Aug 27, 2019 at 8:03 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 27-08-19 19:56:16, Yafang Shao wrote:
> > On Tue, Aug 27, 2019 at 7:50 PM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 27-08-19 19:43:49, Yafang Shao wrote:
> > > > On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > If there are no objection to the patch I will post it as a standalong
> > > > > one.
> > > >
> > > > I have no objection to your patch. It could fix the issue.
> > > >
> > > > I still think that it is not proper to use a new scan_control here as
> > > > it breaks the global reclaim context.
> > > >
> > > > This context switch from global reclaim to memcg reclaim is very
> > > > subtle change to the subsequent processing, that may cause some
> > > > unexpected behavior.
> > >
> > > Why would it break it? Could you be more specific please?
> > >
> >
> > Hmm, I have explained it when replying to Hillf's patch.
> > The most suspcious one is settting target_mem_cgroup here, because we
> > only use it to judge whether it is in global reclaim.
> > While the memcg softlimit reclaim is really in global reclaims.
>
> But we are reclaim the target_mem_cgroup hierarchy. This is the whole
> point of the soft reclaim. Push down that hierarchy below the configured
> limit. And that is why we absolutely have to switch the reclaim context.
>

One obvious issue is the reclaim couters may not correct.
See shrink_inactive_list().
The pages relcaimed in memcg softlimit will not be counted to
PGSCAN_{DIRECT, KSWAPD} and
PGSTEAL_{DIRECT, KSWAPD}.
That may cause some misleading. For example, if these counters are not
changed, we will think that direct relcaim doesn't occur, while it
really occurs.

May issues are also in some other code around the usage of
global_reclaim(). I'm not sure of it.

> > Another example the reclaim_idx, if is not same with reclaim_idx in
> > page allocation context, the reclaimed pages may not be used by the
> > allocator, especially in the direct reclaim.
>
> Again, we do not care about that as well. All we care about is to
> reclaim _some_ memory to get below the soft limit. This is the semantic
> that is not really great but this is how the Soft reclaim has
> traditionally worked and why we keep claiming that people shouldn't
> really use it. It does lead to over reclaim and that is a design rather
> than a bug.
>
> > And some other things in scan_control.
>
> Like?
> --
> Michal Hocko
> SUSE Labs
>

2019-08-27 12:58:08

by Michal Hocko

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage

On Tue 27-08-19 20:19:34, Yafang Shao wrote:
> On Tue, Aug 27, 2019 at 8:03 PM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 27-08-19 19:56:16, Yafang Shao wrote:
> > > On Tue, Aug 27, 2019 at 7:50 PM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Tue 27-08-19 19:43:49, Yafang Shao wrote:
> > > > > On Tue, Aug 27, 2019 at 6:43 PM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > If there are no objection to the patch I will post it as a standalong
> > > > > > one.
> > > > >
> > > > > I have no objection to your patch. It could fix the issue.
> > > > >
> > > > > I still think that it is not proper to use a new scan_control here as
> > > > > it breaks the global reclaim context.
> > > > >
> > > > > This context switch from global reclaim to memcg reclaim is very
> > > > > subtle change to the subsequent processing, that may cause some
> > > > > unexpected behavior.
> > > >
> > > > Why would it break it? Could you be more specific please?
> > > >
> > >
> > > Hmm, I have explained it when replying to Hillf's patch.
> > > The most suspcious one is settting target_mem_cgroup here, because we
> > > only use it to judge whether it is in global reclaim.
> > > While the memcg softlimit reclaim is really in global reclaims.
> >
> > But we are reclaim the target_mem_cgroup hierarchy. This is the whole
> > point of the soft reclaim. Push down that hierarchy below the configured
> > limit. And that is why we absolutely have to switch the reclaim context.
> >
>
> One obvious issue is the reclaim couters may not correct.
> See shrink_inactive_list().
> The pages relcaimed in memcg softlimit will not be counted to
> PGSCAN_{DIRECT, KSWAPD} and PGSTEAL_{DIRECT, KSWAPD}.

And again this a long term behavior so I would be really curious why it
is considered a bug now. Really, the semantic of the soft limit is
weird. It has been grafted into the reclaim while it was doing a
semantically different thing. It doesn't really reflect kswapd or direct
reclaim targets.
--
Michal Hocko
SUSE Labs

2019-08-27 17:13:53

by Yang Shi

[permalink] [raw]
Subject: Re: WARNINGs in set_task_reclaim_state with memory cgroup and full memory usage



On 8/27/19 3:43 AM, Michal Hocko wrote:
> If there are no objection to the patch I will post it as a standalong
> one.
>
> On Mon 26-08-19 12:55:21, Michal Hocko wrote:
>> From 59d128214a62bf2d83c2a2a9cde887b4817275e7 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <[email protected]>
>> Date: Mon, 26 Aug 2019 12:43:15 +0200
>> Subject: [PATCH] mm, memcg: do not set reclaim_state on soft limit reclaim
>>
>> Adric Blake has noticed the following warning:
>> [38491.963105] WARNING: CPU: 7 PID: 175 at mm/vmscan.c:245 set_task_reclaim_state+0x1e/0x40
>> [...]
>> [38491.963239] Call Trace:
>> [38491.963246] mem_cgroup_shrink_node+0x9b/0x1d0
>> [38491.963250] mem_cgroup_soft_limit_reclaim+0x10c/0x3a0
>> [38491.963254] balance_pgdat+0x276/0x540
>> [38491.963258] kswapd+0x200/0x3f0
>> [38491.963261] ? wait_woken+0x80/0x80
>> [38491.963265] kthread+0xfd/0x130
>> [38491.963267] ? balance_pgdat+0x540/0x540
>> [38491.963269] ? kthread_park+0x80/0x80
>> [38491.963273] ret_from_fork+0x35/0x40
>> [38491.963276] ---[ end trace 727343df67b2398a ]---
>>
>> which tells us that soft limit reclaim is about to overwrite the
>> reclaim_state configured up in the call chain (kswapd in this case but
>> the direct reclaim is equally possible). This means that reclaim stats
>> would get misleading once the soft reclaim returns and another reclaim
>> is done.
>>
>> Fix the warning by dropping set_task_reclaim_state from the soft reclaim
>> which is always called with reclaim_state set up.

This is exactly what I thought. Looks good to me. Acked-by: Yang Shi
<[email protected]>

>>
>> Reported-by: Adric Blake <[email protected]>
>> Signed-off-by: Michal Hocko <[email protected]>
>> ---
>> mm/vmscan.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c77d1e3761a7..a6c5d0b28321 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3220,6 +3220,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>
>> #ifdef CONFIG_MEMCG
>>
>> +/* Only used by soft limit reclaim. Do not reuse for anything else. */
>> unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>> gfp_t gfp_mask, bool noswap,
>> pg_data_t *pgdat,
>> @@ -3235,7 +3236,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>> };
>> unsigned long lru_pages;
>>
>> - set_task_reclaim_state(current, &sc.reclaim_state);
>> + WARN_ON_ONCE(!current->reclaim_state);
>> +
>> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>>
>> @@ -3253,7 +3255,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>>
>> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>>
>> - set_task_reclaim_state(current, NULL);
>> *nr_scanned = sc.nr_scanned;
>>
>> return sc.nr_reclaimed;
>> --
>> 2.20.1
>>
>> --
>> Michal Hocko
>> SUSE Labs