It is reported that some user apps echo a huge number to
"/sys/kernel/debug/tracing/buffer_size_kb" regardless of the available
memory, which causes the concurrent page allocations to fail and
introduces an OOM. This commit checks the value against the available
memory first, to avoid the doomed allocation.
Signed-off-by: Zhaoyang Huang <[email protected]>
---
kernel/trace/trace.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2d0ffcc..a4a4237 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -43,6 +43,8 @@
#include <linux/trace.h>
#include <linux/sched/rt.h>
+#include <linux/mm.h>
+#include <linux/swap.h>
#include "trace.h"
#include "trace_output.h"
@@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
return ret;
}
+static long get_available_mem(void)
+{
+ struct sysinfo i;
+ long available;
+ unsigned long pagecache;
+ unsigned long wmark_low = 0;
+ unsigned long pages[NR_LRU_LISTS];
+ struct zone *zone;
+ int lru;
+
+ si_meminfo(&i);
+ si_swapinfo(&i);
+
+ for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+ pages[lru] = global_page_state(NR_LRU_BASE + lru);
+
+ for_each_zone(zone)
+ wmark_low += zone->watermark[WMARK_LOW];
+
+ available = i.freeram - wmark_low;
+
+ pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
+ pagecache -= min(pagecache / 2, wmark_low);
+ available += pagecache;
+
+ available += global_page_state(NR_SLAB_RECLAIMABLE) -
+ min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
+
+ if (available < 0)
+ available = 0;
+ return available;
+}
+
static ssize_t
tracing_entries_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
struct trace_array *tr = inode->i_private;
unsigned long val;
int ret;
+ long available;
+ available = get_available_mem();
ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
if (ret)
return ret;
/* must have at least 1 entry */
- if (!val)
+ if (!val || (val > available))
return -EINVAL;
/* value is in KB */
--
1.9.1
On Thu, 29 Mar 2018 18:41:44 +0800
Zhaoyang Huang <[email protected]> wrote:
> It is reported that some user app would like to echo a huge
> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
> of the available memory, which will cause the coinstantaneous
> page allocation failed and introduce OOM. The commit checking the
> val against the available mem first to avoid the consequence allocation.
>
One of my tests is to stress buffer_size_kb, and it fails nicely if you
try to get too much. It may cause an OOM, but that's expected.
The application should do the test (try "free" on the command line).
This isn't something that the kernel should be responsible for. If
someone wants to allocate all memory for tracing, that's their
prerogative.
-- Steve
On Fri, Mar 30, 2018 at 12:05 AM, Steven Rostedt <[email protected]> wrote:
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> It is reported that some user app would like to echo a huge
>> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
>> of the available memory, which will cause the coinstantaneous
>> page allocation failed and introduce OOM. The commit checking the
>> val against the available mem first to avoid the consequence allocation.
>>
>
> One of my tests is to stress buffer_size_kb, and it fails nicely if you
> try to get too much. Although, it may cause an OOM, but that's expected.
>
> The application should do the test (try "free" on the command line).
> This isn't something that the kernel should be responsible for. If
> someone wants to allocate all memory for tracing, that's their
> prerogative.
>
> -- Steve
Steve, thanks for your quick feedback. The original purpose is to
avoid this kind of OOM, since the kernel cannot control the behavior
of the application. I think there is no need to go through the
alloc->free cycle if we already know the number of pages will
definitely lead to failure.
Furthermore, the app that causes the problem usually escapes the OOM
killer, and an innocent process gets sacrificed instead.
Hi Zhaoyang,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917
config: i386-randconfig-x073-201812 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All error/warnings (new ones prefixed by >>):
kernel/trace/trace.c: In function 'get_available_mem':
>> kernel/trace/trace.c:5992:16: error: implicit declaration of function 'global_page_state'; did you mean 'zone_page_state'? [-Werror=implicit-function-declaration]
pages[lru] = global_page_state(NR_LRU_BASE + lru);
^~~~~~~~~~~~~~~~~
zone_page_state
In file included from include/asm-generic/bug.h:18:0,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from include/linux/ring_buffer.h:5,
from kernel/trace/trace.c:14:
include/linux/kernel.h:793:16: warning: comparison of distinct pointer types lacks a cast
(void) (&min1 == &min2); \
^
include/linux/kernel.h:802:2: note: in expansion of macro '__min'
__min(typeof(x), typeof(y), \
^~~~~
>> kernel/trace/trace.c:6004:3: note: in expansion of macro 'min'
min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
^~~
cc1: some warnings being treated as errors
vim +5992 kernel/trace/trace.c
5977
5978 static long get_available_mem(void)
5979 {
5980 struct sysinfo i;
5981 long available;
5982 unsigned long pagecache;
5983 unsigned long wmark_low = 0;
5984 unsigned long pages[NR_LRU_LISTS];
5985 struct zone *zone;
5986 int lru;
5987
5988 si_meminfo(&i);
5989 si_swapinfo(&i);
5990
5991 for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> 5992 pages[lru] = global_page_state(NR_LRU_BASE + lru);
5993
5994 for_each_zone(zone)
5995 wmark_low += zone->watermark[WMARK_LOW];
5996
5997 available = i.freeram - wmark_low;
5998
5999 pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
6000 pagecache -= min(pagecache / 2, wmark_low);
6001 available += pagecache;
6002
6003 available += global_page_state(NR_SLAB_RECLAIMABLE) -
> 6004 min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
6005
6006 if (available < 0)
6007 available = 0;
6008 return available;
6009 }
6010
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Zhaoyang,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/kernel-trace-check-the-val-against-the-available-mem/20180330-140917
config: x86_64-randconfig-x009-201812 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
kernel//trace/trace.c: In function 'get_available_mem':
>> kernel//trace/trace.c:5992:16: error: implicit declaration of function 'global_page_state'; did you mean 'global_numa_state'? [-Werror=implicit-function-declaration]
pages[lru] = global_page_state(NR_LRU_BASE + lru);
^~~~~~~~~~~~~~~~~
global_numa_state
In file included from include/asm-generic/bug.h:18:0,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from include/linux/ring_buffer.h:5,
from kernel//trace/trace.c:14:
include/linux/kernel.h:793:16: warning: comparison of distinct pointer types lacks a cast
(void) (&min1 == &min2); \
^
include/linux/kernel.h:802:2: note: in expansion of macro '__min'
__min(typeof(x), typeof(y), \
^~~~~
kernel//trace/trace.c:6004:3: note: in expansion of macro 'min'
min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
^~~
vim +5992 kernel//trace/trace.c
5977
5978 static long get_available_mem(void)
5979 {
5980 struct sysinfo i;
5981 long available;
5982 unsigned long pagecache;
5983 unsigned long wmark_low = 0;
5984 unsigned long pages[NR_LRU_LISTS];
5985 struct zone *zone;
5986 int lru;
5987
5988 si_meminfo(&i);
5989 si_swapinfo(&i);
5990
5991 for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> 5992 pages[lru] = global_page_state(NR_LRU_BASE + lru);
5993
5994 for_each_zone(zone)
5995 wmark_low += zone->watermark[WMARK_LOW];
5996
5997 available = i.freeram - wmark_low;
5998
5999 pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
6000 pagecache -= min(pagecache / 2, wmark_low);
6001 available += pagecache;
6002
6003 available += global_page_state(NR_SLAB_RECLAIMABLE) -
6004 min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
6005
6006 if (available < 0)
6007 available = 0;
6008 return available;
6009 }
6010
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Fri, 30 Mar 2018 11:32:22 +0800
Zhaoyang Huang <[email protected]> wrote:
> Steve, thanks for your quick feedback. The original purpose is to
> avoid such kind
> of OOM as kernel can not define the behavior of the application. I
> think it is no need
> to do the alloc->free process if we have known the number of pages
> difinitly lead to failure.
> Furthermore, the app which screw up the thing usually escape the OOM and cause
> the innocent to be sacrificed.
A couple of things. Your patch goes way too deep into coupling with the
memory management subsystem. If I were to accept this patch, Linus
would have a fit. That get_available_mem() does not belong in the
tracing code. If you want that feature, you need to get the mm folks to
accept it, and then tracing could call it.
Next, this may be an issue where GFP_RETRY_MAYFAIL should not trigger
an OOM (although, when it fails, other allocations that happen at that
time, before the ring buffer can clean up, will cause an OOM).
I'll send an email to folks about this.
-- Steve
[ Adding memory management folks to discuss the issue ]
On Thu, 29 Mar 2018 18:41:44 +0800
Zhaoyang Huang <[email protected]> wrote:
> It is reported that some user app would like to echo a huge
> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
> of the available memory, which will cause the coinstantaneous
> page allocation failed and introduce OOM. The commit checking the
> val against the available mem first to avoid the consequence allocation.
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> kernel/trace/trace.c | 39 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2d0ffcc..a4a4237 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -43,6 +43,8 @@
> #include <linux/trace.h>
> #include <linux/sched/rt.h>
>
> +#include <linux/mm.h>
> +#include <linux/swap.h>
> #include "trace.h"
> #include "trace_output.h"
>
> @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> return ret;
> }
>
> +static long get_available_mem(void)
> +{
> + struct sysinfo i;
> + long available;
> + unsigned long pagecache;
> + unsigned long wmark_low = 0;
> + unsigned long pages[NR_LRU_LISTS];
> + struct zone *zone;
> + int lru;
> +
> + si_meminfo(&i);
> + si_swapinfo(&i);
> +
> + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> + pages[lru] = global_page_state(NR_LRU_BASE + lru);
> +
> + for_each_zone(zone)
> + wmark_low += zone->watermark[WMARK_LOW];
> +
> + available = i.freeram - wmark_low;
> +
> + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
> + pagecache -= min(pagecache / 2, wmark_low);
> + available += pagecache;
> +
> + available += global_page_state(NR_SLAB_RECLAIMABLE) -
> + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
> +
> + if (available < 0)
> + available = 0;
> + return available;
> +}
> +
As I stated in my other reply, the above function does not belong in
tracing.
That said, it appears you are having issues that were caused by the
change in commit 848618857d2 ("tracing/ring_buffer: Try harder to
allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
NORETRY was to keep allocations of the tracing ring-buffer from causing
OOMs. But NORETRY was too strong in that case, because there were
those that wanted to allocate large ring buffers and it would fail due
to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
is to allocate with reclaim but still be allowed to fail, and isn't
supposed to trigger an OOM. From my own tests, this is obviously not
the case.
Perhaps this is because the ring buffer allocates one page at a time,
and by doing so, it can get every last available page, and if anything
in the meantime does an allocation without MAYFAIL, it will cause an
OOM. For example, when I stressed this I triggered this:
pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
pool cpuset=/ mems_allowed=0
CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
dump_stack+0x8e/0xce
dump_header.isra.30+0x6e/0x28f
? _raw_spin_unlock_irqrestore+0x30/0x60
oom_kill_process+0x218/0x400
? has_capability_noaudit+0x17/0x20
out_of_memory+0xe3/0x5c0
__alloc_pages_slowpath+0xa8e/0xe50
__alloc_pages_nodemask+0x206/0x220
alloc_pages_current+0x6a/0xe0
__page_cache_alloc+0x6a/0xa0
filemap_fault+0x208/0x5f0
? __might_sleep+0x4a/0x80
ext4_filemap_fault+0x31/0x44
__do_fault+0x20/0xd0
__handle_mm_fault+0xc08/0x1160
handle_mm_fault+0x76/0x110
__do_page_fault+0x299/0x580
do_page_fault+0x2d/0x110
? page_fault+0x2f/0x50
page_fault+0x45/0x50
I wonder if I should have the ring buffer allocate groups of pages, to
avoid this. Or try to allocate with NORETRY, one page at a time, and
when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
may keep it from causing an OOM?
Thoughts?
-- Steve
> static ssize_t
> tracing_entries_write(struct file *filp, const char __user *ubuf,
> size_t cnt, loff_t *ppos)
> @@ -5975,13 +6010,15 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> struct trace_array *tr = inode->i_private;
> unsigned long val;
> int ret;
> + long available;
>
> + available = get_available_mem();
> ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
> if (ret)
> return ret;
>
> /* must have at least 1 entry */
> - if (!val)
> + if (!val || (val > available))
> return -EINVAL;
>
> /* value is in KB */
Hi Steve,
On Fri, Mar 30, 2018 at 7:20 AM, Steven Rostedt <[email protected]> wrote:
>
> [ Adding memory management folks to discuss the issue ]
>
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
>> It is reported that some user app would like to echo a huge
>> number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
>> of the available memory, which will cause the coinstantaneous
>> page allocation failed and introduce OOM. The commit checking the
>> val against the available mem first to avoid the consequence allocation.
>>
>> Signed-off-by: Zhaoyang Huang <[email protected]>
>> ---
>> kernel/trace/trace.c | 39 ++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index 2d0ffcc..a4a4237 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -43,6 +43,8 @@
>> #include <linux/trace.h>
>> #include <linux/sched/rt.h>
>>
>> +#include <linux/mm.h>
>> +#include <linux/swap.h>
>> #include "trace.h"
>> #include "trace_output.h"
>>
>> @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>> return ret;
>> }
>>
>> +static long get_available_mem(void)
>> +{
>> + struct sysinfo i;
>> + long available;
>> + unsigned long pagecache;
>> + unsigned long wmark_low = 0;
>> + unsigned long pages[NR_LRU_LISTS];
>> + struct zone *zone;
>> + int lru;
>> +
>> + si_meminfo(&i);
>> + si_swapinfo(&i);
>> +
>> + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
>> + pages[lru] = global_page_state(NR_LRU_BASE + lru);
>> +
>> + for_each_zone(zone)
>> + wmark_low += zone->watermark[WMARK_LOW];
>> +
>> + available = i.freeram - wmark_low;
>> +
>> + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
>> + pagecache -= min(pagecache / 2, wmark_low);
>> + available += pagecache;
>> +
>> + available += global_page_state(NR_SLAB_RECLAIMABLE) -
>> + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
>> +
>> + if (available < 0)
>> + available = 0;
>> + return available;
>> +}
>> +
>
> As I stated in my other reply, the above function does not belong in
> tracing.
>
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were
Yes, this was discussed with the -mm folks. Basically, the problem we
were seeing was that on devices with tons of memory that was free in
principle but held by the page cache, that memory was not being used,
so the ring buffer allocation was failing unnecessarily on systems
with otherwise plenty of memory.
> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still allow to fail, and isn't suppose
> to trigger an OOM. From my own tests, this is obviously not the case.
>
IIRC, the OOM that my patch was trying to avoid, was being triggered
in the path/context of the write to buffer_size_kb itself (when not
doing the NORETRY), not by other processes.
> Perhaps this is because the ring buffer allocates one page at a time,
> and by doing so, it can get every last available page, and if anything
> in the mean time does an allocation without MAYFAIL, it will cause an
> OOM. For example, when I stressed this I triggered this:
>
> pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> pool cpuset=/ mems_allowed=0
> CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
> Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> Call Trace:
> dump_stack+0x8e/0xce
> dump_header.isra.30+0x6e/0x28f
> ? _raw_spin_unlock_irqrestore+0x30/0x60
> oom_kill_process+0x218/0x400
> ? has_capability_noaudit+0x17/0x20
> out_of_memory+0xe3/0x5c0
> __alloc_pages_slowpath+0xa8e/0xe50
> __alloc_pages_nodemask+0x206/0x220
> alloc_pages_current+0x6a/0xe0
> __page_cache_alloc+0x6a/0xa0
> filemap_fault+0x208/0x5f0
> ? __might_sleep+0x4a/0x80
> ext4_filemap_fault+0x31/0x44
> __do_fault+0x20/0xd0
> __handle_mm_fault+0xc08/0x1160
> handle_mm_fault+0x76/0x110
> __do_page_fault+0x299/0x580
> do_page_fault+0x2d/0x110
> ? page_fault+0x2f/0x50
> page_fault+0x45/0x50
But this OOM is not in the path of the buffer_size_kb write, right? So
then what does it have to do with the buffer_size_kb write failure?
I guess the original issue reported is that the buffer_size_kb write
causes *other* applications to fail allocation. So in that case,
capping the amount that ftrace allocates makes sense. Basically my
point is I don't see how the patch you mentioned introduces the
problem here - in the sense that the patch just makes ftrace allocate
from memory it couldn't before, and to try harder.
>
> I wonder if I should have the ring buffer allocate groups of pages, to
> avoid this. Or try to allocate with NORETRY, one page at a time, and
> when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> may keep it from causing an OOM?
>
I don't see immediately how that can prevent an OOM in other
applications here? If ftrace allocates lots of memory with
RETRY_MAYFAIL, then we would still OOM in other applications if memory
isn't available. Sorry if I missed something.
Thanks,
- Joel
On Fri, 30 Mar 2018 09:37:58 -0700
Joel Fernandes <[email protected]> wrote:
> > That said, it appears you are having issues that were caused by the
> > change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> > NORETRY was to keep allocations of the tracing ring-buffer from causing
> > OOMs. But the RETRY was too strong in that case, because there were
>
> Yes this was discussed with -mm folks. Basically the problem we were
> seeing is devices with tonnes of free memory (but free as in free but
> used by page cache) were not being used so it was unnecessarily
> failing to allocate ring buffer on the system with otherwise lots of
> memory.
Right.
>
> IIRC, the OOM that my patch was trying to avoid, was being triggered
> in the path/context of the write to buffer_size_kb itself (when not
> doing the NORETRY), not by other processes.
Yes, that is correct.
>
> > Perhaps this is because the ring buffer allocates one page at a time,
> > and by doing so, it can get every last available page, and if anything
> > in the mean time does an allocation without MAYFAIL, it will cause an
> > OOM. For example, when I stressed this I triggered this:
> >
> > pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> > pool cpuset=/ mems_allowed=0
> > CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
> > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> > Call Trace:
> > dump_stack+0x8e/0xce
> > dump_header.isra.30+0x6e/0x28f
> > ? _raw_spin_unlock_irqrestore+0x30/0x60
> > oom_kill_process+0x218/0x400
> > ? has_capability_noaudit+0x17/0x20
> > out_of_memory+0xe3/0x5c0
> > __alloc_pages_slowpath+0xa8e/0xe50
> > __alloc_pages_nodemask+0x206/0x220
> > alloc_pages_current+0x6a/0xe0
> > __page_cache_alloc+0x6a/0xa0
> > filemap_fault+0x208/0x5f0
> > ? __might_sleep+0x4a/0x80
> > ext4_filemap_fault+0x31/0x44
> > __do_fault+0x20/0xd0
> > __handle_mm_fault+0xc08/0x1160
> > handle_mm_fault+0x76/0x110
> > __do_page_fault+0x299/0x580
> > do_page_fault+0x2d/0x110
> > ? page_fault+0x2f/0x50
> > page_fault+0x45/0x50
>
> But this OOM is not in the path of the buffer_size_kb write, right? So
> then what does it have to do with buffer_size_kb write failure?
Yep. I'll explain below.
>
> I guess the original issue reported is that the buffer_size_kb write
> causes *other* applications to fail allocation. So in that case,
> capping the amount that ftrace writes makes sense. Basically my point
> is I don't see how the patch you mentioned introduces the problem here
> - in the sense the patch just makes ftrace allocate from memory it
> couldn't before and to try harder.
The issue is that ftrace allocates its ring buffer one page at a time.
Thus, when a RETRY_MAYFAIL allocation succeeds, that memory is taken.
Since it does this one page at a time, even if ftrace does not get all
the memory it needs in the end, it will have taken all the memory in
the system before it finds that out. Then, if something else (like the
above splat) tries to allocate anything, it will fail and trigger an
OOM.
>
> >
> > I wonder if I should have the ring buffer allocate groups of pages, to
> > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > may keep it from causing an OOM?
> >
>
> I don't see immediately how that can prevent an OOM in other
> applications here? If ftrace allocates lots of memory with
> RETRY_MAYFAIL, then we would still OOM in other applications if memory
> isn't available. Sorry if I missed something.
Here's the idea.
Allocate one page at a time with NORETRY. If that fails, then allocate
larger amounts (a higher order of pages) with RETRY_MAYFAIL. Then, if
it can't get all the memory it needs, it won't take up all the memory
in the system before it finds out that it can't have any more.
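As a rough, hypothetical sketch of that fallback (the helper name and
the group order are made up for illustration; this is not the actual
ring buffer code):
/*
 * Hypothetical sketch of the idea above, not the real ring buffer code.
 * Try a cheap single page first with __GFP_NORETRY; if that fails, try
 * one larger, higher-order chunk with __GFP_RETRY_MAYFAIL, so a doomed
 * request fails early instead of nibbling away every last free page.
 * A real implementation would have to track the order of what it got.
 */
#include <linux/gfp.h>
#include <linux/topology.h>

static struct page *rb_alloc_pages_sketch(int cpu, unsigned int group_order)
{
	struct page *page;

	/* Cheap attempt: give up quickly rather than reclaim hard. */
	page = alloc_pages_node(cpu_to_node(cpu),
				GFP_KERNEL | __GFP_NORETRY, 0);
	if (page)
		return page;

	/*
	 * Fall back to one larger allocation: reclaim is allowed, it may
	 * still fail, and it is not supposed to trigger the OOM killer.
	 */
	return alloc_pages_node(cpu_to_node(cpu),
				GFP_KERNEL | __GFP_RETRY_MAYFAIL, group_order);
}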
Or perhaps the memory management system can provide a
get_available_mem() function that ftrace could call before it tries to
increase the ring buffer and take up all the memory of the system
before it realizes that it can't get all the memory it wants.
The main problem I have with Zhaoyang's patch is that
get_available_mem() does not belong in the tracing code. It should be
something that the mm subsystem provides.
-- Steve
Hi Steve,
On Fri, Mar 30, 2018 at 12:10 PM, Steven Rostedt <[email protected]> wrote:
[..]
>> > I wonder if I should have the ring buffer allocate groups of pages, to
>> > avoid this. Or try to allocate with NORETRY, one page at a time, and
>> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
>> > may keep it from causing an OOM?
>> >
>>
>> I don't see immediately how that can prevent an OOM in other
>> applications here? If ftrace allocates lots of memory with
>> RETRY_MAYFAIL, then we would still OOM in other applications if memory
>> isn't available. Sorry if I missed something.
>
> Here's the idea.
>
> Allocate one page at a time with NORETRY. If that fails, then allocate
> larger amounts (higher order of pages) with RETRY_MAYFAIL. Then if it
> can't get all the memory it needs, it wont take up all memory in the
> system before it finds out that it can't have any more.
>
> Or perhaps the memory management system can provide a
> get_available_mem() function that ftrace could call before it tries to
> increase the ring buffer and take up all the memory of the system
> before it realizes that it can't get all the memory it wants.
>
> The main problem I have with Zhaoyang's patch is that
> get_available_mem() does not belong in the tracing code. It should be
> something that the mm subsystem provides.
>
Cool. Personally I like the get-available-memory solution and would
use that, since it's simpler.
MM already provides it through si_mem_available since the commit
"mm/page_alloc.c: calculate 'available' memory in a separate function"
(sha d02bd27b). Maybe we could just use that?
MemAvailable was initially added in commit "/proc/meminfo: provide
estimated available memory" (sha 34e431b0ae39)
thanks,
- Joel
On Fri, Mar 30, 2018 at 10:20:38AM -0400, Steven Rostedt wrote:
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were
> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still allow to fail, and isn't suppose
> to trigger an OOM. From my own tests, this is obviously not the case.
That's not exactly what the comment says in gfp.h:
* __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
* procedures that have previously failed if there is some indication
* that progress has been made else where. It can wait for other
* tasks to attempt high level approaches to freeing memory such as
* compaction (which removes fragmentation) and page-out.
* There is still a definite limit to the number of retries, but it is
* a larger limit than with __GFP_NORETRY.
* Allocations with this flag may fail, but only when there is
* genuinely little unused memory. While these allocations do not
* directly trigger the OOM killer, their failure indicates that
* the system is likely to need to use the OOM killer soon. The
* caller must handle failure, but can reasonably do so by failing
* a higher-level request, or completing it only in a much less
* efficient manner.
* If the allocation does fail, and the caller is in a position to
* free some non-essential memory, doing so could benefit the system
* as a whole.
It seems to me that what you're asking for at the moment is
lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely
clear to me why your allocation is so much more important than other
allocations in the kernel.
Also, the pattern you have is very close to that of vmalloc. You're
allocating one page at a time to satisfy a multi-page request. In lieu
of actually thinking about what you should do, I might recommend using the
same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN
(possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages
when accessed from within the kernel).
On Fri, 30 Mar 2018 13:53:56 -0700
Matthew Wilcox <[email protected]> wrote:
> It seems to me that what you're asking for at the moment is
> lower-likelihood-of-failure-than-GFP_KERNEL, and it's not entirely
> clear to me why your allocation is so much more important than other
> allocations in the kernel.
The ring buffer is for fast tracing and is allocated when a user
requests it. Usually there's plenty of memory, but when a user wants a
lot of tracing events stored, they may increase it themselves.
>
> Also, the pattern you have is very close to that of vmalloc. You're
> allocating one page at a time to satisfy a multi-page request. In lieu
> of actually thinking about what you should do, I might recommend using the
> same GFP flags as vmalloc() which works out to GFP_KERNEL | __GFP_NOWARN
> (possibly | __GFP_HIGHMEM if you can tolerate having to kmap the pages
> when accessed from within the kernel).
When the ring buffer was first created, we couldn't use vmalloc because
vmalloc access wasn't working in NMIs (that has recently changed with
lots of work to handle faults). But the ring buffer is broken up into
pages (that are sent to the user or to the network), and allocating one
page at a time makes everything work fine.
The issue when someone allocates a large ring buffer is that it will
allocate all the memory in the system before it fails. This means
there's a short window during which any other allocation will cause an
OOM (which is what is happening).
I think I agree with Joel and Zhaoyang that we shouldn't allocate a
ring buffer if there's not going to be enough memory to do it. If we
can see the available memory before we start allocating one page at a
time, and the available memory isn't going to be sufficient, there's
no reason to try the allocation; simply send the user -ENOMEM and let
them try something smaller.
I'll take a look at si_mem_available() that Joel suggested and see if
we can make that work.
-- Steve
On Fri, 30 Mar 2018 17:30:31 -0400
Steven Rostedt <[email protected]> wrote:
> I'll take a look at si_mem_available() that Joel suggested and see if
> we can make that work.
Wow, this appears to work great! Joel and Zhaoyang, can you test this?
-- Steve
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a2fd3893cc02..32a803626ee2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
struct buffer_page *bpage, *tmp;
long i;
+ /* Check if the available memory is there first */
+ i = si_mem_available();
+ if (i < nr_pages)
+ return -ENOMEM;
+
for (i = 0; i < nr_pages; i++) {
struct page *page;
/*
Hi Steven,
On Fri, Mar 30, 2018 at 2:42 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 30 Mar 2018 17:30:31 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> I'll take a look at si_mem_available() that Joel suggested and see if
>> we can make that work.
>
> Wow, this appears to work great! Joel and Zhaoyang, can you test this?
>
> -- Steve
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index a2fd3893cc02..32a803626ee2 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> struct buffer_page *bpage, *tmp;
> long i;
>
> + /* Check if the available memory is there first */
> + i = si_mem_available();
> + if (i < nr_pages)
Does it make sense to add a small margin here so that after ftrace
finishes allocating, we still have some memory left for the system?
But then we have to define a magic number :-|
> + return -ENOMEM;
> +
I tested in QEMU with 1GB of memory; I am always able to get the
allocation to fail, even without this patch, without causing an OOM.
Maybe I am not running enough allocations in parallel or something :)
The patch you shared using si_mem_available is working since I'm able
to allocate till the end without a page allocation failure:
bash-4.3# echo 237800 > /d/tracing/buffer_size_kb
bash: echo: write error: Cannot allocate memory
bash-4.3# echo 237700 > /d/tracing/buffer_size_kb
bash-4.3# free -m
total used free shared buffers
Mem: 985 977 7 10 0
-/+ buffers: 977 7
Swap: 0 0 0
bash-4.3#
I think this patch is still good to have, since IMO we should not go
and get a page allocation failure (even if it's a non-OOM) and the
subsequent stack dump from mm's allocator, if we can avoid it.
Tested-by: Joel Fernandes <[email protected]>
thanks,
- Joel
On Fri, 30 Mar 2018 16:38:52 -0700
Joel Fernandes <[email protected]> wrote:
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> > struct buffer_page *bpage, *tmp;
> > long i;
> >
> > + /* Check if the available memory is there first */
> > + i = si_mem_available();
> > + if (i < nr_pages)
>
> Does it make sense to add a small margin here so that after ftrace
> finishes allocating, we still have some memory left for the system?
> But then then we have to define a magic number :-|
I don't think so. The memory is allocated based on a user-defined
number, and the user can run "free" to see what is available. The
original patch from Zhaoyang was due to a script that would just try a
very large number and cause issues.
If the memory is available, I just say let them have it. This is a
borderline user-space issue and not a kernel one.
> > +
>
> I tested in Qemu with 1GB memory, I am always able to get it to fail
> allocation even without this patch without causing an OOM. Maybe I am
> not running enough allocations in parallel or something :)
Try just echoing "1000000" into buffer_size_kb and see what happens.
>
> The patch you shared using si_mem_available is working since I'm able
> to allocate till the end without a page allocation failure:
>
> bash-4.3# echo 237800 > /d/tracing/buffer_size_kb
> bash: echo: write error: Cannot allocate memory
> bash-4.3# echo 237700 > /d/tracing/buffer_size_kb
> bash-4.3# free -m
> total used free shared buffers
> Mem: 985 977 7 10 0
> -/+ buffers: 977 7
> Swap: 0 0 0
> bash-4.3#
>
> I think this patch is still good to have, since IMO we should not go
> and get page allocation failure (even if its a non-OOM) and subsequent
> stack dump from mm's allocator, if we can avoid it.
>
> Tested-by: Joel Fernandes <[email protected]>
Great thanks! I'll make it into a formal patch.
-- Steve
On Fri, Mar 30, 2018 at 09:41:51PM -0400, Steven Rostedt wrote:
> On Fri, 30 Mar 2018 16:38:52 -0700
> Joel Fernandes <[email protected]> wrote:
>
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> > > struct buffer_page *bpage, *tmp;
> > > long i;
> > >
> > > + /* Check if the available memory is there first */
> > > + i = si_mem_available();
> > > + if (i < nr_pages)
> >
> > Does it make sense to add a small margin here so that after ftrace
> > finishes allocating, we still have some memory left for the system?
> > But then then we have to define a magic number :-|
>
> I don't think so. The memory is allocated by user defined numbers. They
> can do "free" to see what is available. The original patch from
> Zhaoyang was due to a script that would just try a very large number
> and cause issues.
>
> If the memory is available, I just say let them have it. This is
> borderline user space issue and not a kernel one.
Again though, this is the same pattern as vmalloc. There are any number
of places where userspace can cause an arbitrarily large vmalloc to be
attempted (grep for kvmalloc_array for a list of promising candidates).
I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
__GFP_NOWARN will give you the exact behaviour that you want with no
need to grub around in the VM to find out if your huge allocation is
likely to succeed.
On Fri, 30 Mar 2018 19:18:57 -0700
Matthew Wilcox <[email protected]> wrote:
> Again though, this is the same pattern as vmalloc. There are any number
> of places where userspace can cause an arbitrarily large vmalloc to be
> attempted (grep for kvmalloc_array for a list of promising candidates).
> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
> __GFP_NOWARN will give you the exact behaviour that you want with no
> need to grub around in the VM to find out if your huge allocation is
> likely to succeed.
Not sure how this helps. Note, I don't care about consecutive pages, so
this is not an array. It's a linked list of thousands of pages. How do
you suggest allocating them? The ring buffer is a linked list of pages.
What I currently do is see how many more pages I need, then allocate
them one at a time and put them on a temporary list. If that succeeds,
I add them to the ring buffer; if not, I free the entire list (it's an
all-or-nothing operation).
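A minimal sketch of that all-or-nothing scheme, assuming made-up
helper names rather than the real __rb_allocate_pages():
/*
 * Minimal sketch of the "all or nothing" scheme described above.
 * Pages are collected on a private list and only handed to the ring
 * buffer once every single allocation has succeeded; on any failure
 * the whole list is freed again and -ENOMEM is returned.
 */
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

static int rb_collect_pages_sketch(long nr_pages, struct list_head *tmp)
{
	long i;

	for (i = 0; i < nr_pages; i++) {
		struct page *page;

		page = alloc_pages(GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
		if (!page)
			goto free_all;		/* all or nothing */
		list_add(&page->lru, tmp);
	}
	return 0;

free_all:
	while (!list_empty(tmp)) {
		struct page *page;

		page = list_first_entry(tmp, struct page, lru);
		list_del(&page->lru);
		__free_page(page);
	}
	return -ENOMEM;
}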
The allocation I'm making doesn't warn. The problem is the
GFP_RETRY_MAYFAIL, which will try to allocate any possible memory in
the system. When it succeeds, the ring buffer allocation logic will
then try to allocate another page. If we add too many pages, we will
allocate all possible pages and then try to allocate more. This
allocation will fail without causing an OOM. That's not the problem.
The problem is if the system is active during this time, and something
else tries to do any allocation, after all memory has been consumed,
that allocation will fail. Then it will trigger an OOM.
I showed this in my Call Trace, that the allocation that failed during
my test was something completely unrelated, and that failure caused an
OOM.
What this last patch does is see if there's space available before it
even starts the process.
Maybe I'm missing something, but I don't see how NOWARN can help. My
allocations are not what is giving the warning.
-- Steve
On Fri, Mar 30, 2018 at 8:07 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 30 Mar 2018 19:18:57 -0700
> Matthew Wilcox <[email protected]> wrote:
>
>> Again though, this is the same pattern as vmalloc. There are any number
>> of places where userspace can cause an arbitrarily large vmalloc to be
>> attempted (grep for kvmalloc_array for a list of promising candidates).
>> I'm pretty sure that just changing your GFP flags to GFP_KERNEL |
>> __GFP_NOWARN will give you the exact behaviour that you want with no
>> need to grub around in the VM to find out if your huge allocation is
>> likely to succeed.
>
> Not sure how this helps. Note, I don't care about consecutive pages, so
> this is not an array. It's a link list of thousands of pages. How do
> you suggest allocating them? The ring buffer is a link list of pages.
Yeah, I didn't understand the suggestion either. If I remember
correctly, not using either NORETRY or RETRY_MAYFAIL, and using just
plain GFP_KERNEL, was precisely what caused the buffer_size_kb write
to trigger an OOM in my testing. So I think Steven's patch does the
right thing in checking in advance.
thanks,
- Joel
On Sat, Mar 31, 2018 at 5:42 AM, Steven Rostedt <[email protected]> wrote:
> On Fri, 30 Mar 2018 17:30:31 -0400
> Steven Rostedt <[email protected]> wrote:
>
>> I'll take a look at si_mem_available() that Joel suggested and see if
>> we can make that work.
>
> Wow, this appears to work great! Joel and Zhaoyang, can you test this?
>
> -- Steve
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index a2fd3893cc02..32a803626ee2 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1164,6 +1164,11 @@ static int __rb_allocate_pages(long nr_pages, struct list_head *pages, int cpu)
> struct buffer_page *bpage, *tmp;
> long i;
>
> + /* Check if the available memory is there first */
> + i = si_mem_available();
> + if (i < nr_pages)
> + return -ENOMEM;
> +
> for (i = 0; i < nr_pages; i++) {
> struct page *page;
> /*
Hi Steve, it works, just as my previous patch did.
On Fri 30-03-18 10:20:38, Steven Rostedt wrote:
>
> [ Adding memory management folks to discuss the issue ]
>
> On Thu, 29 Mar 2018 18:41:44 +0800
> Zhaoyang Huang <[email protected]> wrote:
>
> > It is reported that some user app would like to echo a huge
> > number to "/sys/kernel/debug/tracing/buffer_size_kb" regardless
> > of the available memory, which will cause the coinstantaneous
> > page allocation failed and introduce OOM. The commit checking the
> > val against the available mem first to avoid the consequence allocation.
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
> > ---
> > kernel/trace/trace.c | 39 ++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 38 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 2d0ffcc..a4a4237 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -43,6 +43,8 @@
> > #include <linux/trace.h>
> > #include <linux/sched/rt.h>
> >
> > +#include <linux/mm.h>
> > +#include <linux/swap.h>
> > #include "trace.h"
> > #include "trace_output.h"
> >
> > @@ -5967,6 +5969,39 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> > return ret;
> > }
> >
> > +static long get_available_mem(void)
> > +{
> > + struct sysinfo i;
> > + long available;
> > + unsigned long pagecache;
> > + unsigned long wmark_low = 0;
> > + unsigned long pages[NR_LRU_LISTS];
> > + struct zone *zone;
> > + int lru;
> > +
> > + si_meminfo(&i);
> > + si_swapinfo(&i);
> > +
> > + for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> > + pages[lru] = global_page_state(NR_LRU_BASE + lru);
> > +
> > + for_each_zone(zone)
> > + wmark_low += zone->watermark[WMARK_LOW];
> > +
> > + available = i.freeram - wmark_low;
> > +
> > + pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
> > + pagecache -= min(pagecache / 2, wmark_low);
> > + available += pagecache;
> > +
> > + available += global_page_state(NR_SLAB_RECLAIMABLE) -
> > + min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);
> > +
> > + if (available < 0)
> > + available = 0;
> > + return available;
> > +}
> > +
>
> As I stated in my other reply, the above function does not belong in
> tracing.
>
> That said, it appears you are having issues that were caused by the
> change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> NORETRY was to keep allocations of the tracing ring-buffer from causing
> OOMs. But the RETRY was too strong in that case, because there were
> those that wanted to allocate large ring buffers but it would fail due
> to memory being used that could be reclaimed. Supposedly, RETRY_MAYFAIL
> is to allocate with reclaim but still allow to fail, and isn't suppose
> to trigger an OOM. From my own tests, this is obviously not the case.
>
> Perhaps this is because the ring buffer allocates one page at a time,
> and by doing so, it can get every last available page, and if anything
> in the mean time does an allocation without MAYFAIL, it will cause an
> OOM. For example, when I stressed this I triggered this:
Yes, this is indeed the case.
> pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> pool cpuset=/ mems_allowed=0
> CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
> Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> Call Trace:
> dump_stack+0x8e/0xce
> dump_header.isra.30+0x6e/0x28f
> ? _raw_spin_unlock_irqrestore+0x30/0x60
> oom_kill_process+0x218/0x400
> ? has_capability_noaudit+0x17/0x20
> out_of_memory+0xe3/0x5c0
> __alloc_pages_slowpath+0xa8e/0xe50
> __alloc_pages_nodemask+0x206/0x220
> alloc_pages_current+0x6a/0xe0
> __page_cache_alloc+0x6a/0xa0
> filemap_fault+0x208/0x5f0
> ? __might_sleep+0x4a/0x80
> ext4_filemap_fault+0x31/0x44
> __do_fault+0x20/0xd0
> __handle_mm_fault+0xc08/0x1160
> handle_mm_fault+0x76/0x110
> __do_page_fault+0x299/0x580
> do_page_fault+0x2d/0x110
> ? page_fault+0x2f/0x50
> page_fault+0x45/0x50
>
> I wonder if I should have the ring buffer allocate groups of pages, to
> avoid this. Or try to allocate with NORETRY, one page at a time, and
> when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> may keep it from causing an OOM?
I wonder why it really matters. The interface is root only and we
expect some sanity from an admin, right? So allocating a ring buffer
so large that it sends the system into OOM is a sign that the admin
should be more careful. Balancing on the OOM edge is always a risk,
and the result will highly depend on the workload running in parallel.
--
Michal Hocko
SUSE Labs
On Tue, 3 Apr 2018 13:06:12 +0200
Michal Hocko <[email protected]> wrote:
> > I wonder if I should have the ring buffer allocate groups of pages, to
> > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > may keep it from causing an OOM?
>
> I wonder why it really matters. The interface is root only and we expect
> some sanity from an admin, right? So allocating such a large ring buffer
> that it sends the system to the OOM is a sign that the admin should be
> more careful. Balancing on the OOM edge is always a risk and the result
> will highly depend on the workload running in parallel.
This came up because there are scripts or programs that set the size
of the ring buffer. The complaint was that the application would just
set the size to something bigger than what was available and cause an
OOM, killing other applications. The final solution is to simply check
the available memory before allocating the ring buffer:
/* Check if the available memory is there first */
i = si_mem_available();
if (i < nr_pages)
return -ENOMEM;
And it works well.
-- Steve
On Tue 03-04-18 07:51:58, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 13:06:12 +0200
> Michal Hocko <[email protected]> wrote:
>
> > > I wonder if I should have the ring buffer allocate groups of pages, to
> > > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > > may keep it from causing an OOM?
> >
> > I wonder why it really matters. The interface is root only and we expect
> > some sanity from an admin, right? So allocating such a large ring buffer
> > that it sends the system to the OOM is a sign that the admin should be
> > more careful. Balancing on the OOM edge is always a risk and the result
> > will highly depend on the workload running in parallel.
>
> This came up because there's scripts or programs that set the size of
> the ring buffer. The complaint was that the application would just set
> the size to something bigger than what was available and cause an OOM
> killing other applications. The final solution is to simply check the
> available memory before allocating the ring buffer:
>
> /* Check if the available memory is there first */
> i = si_mem_available();
> if (i < nr_pages)
> return -ENOMEM;
>
> And it works well.
Except that it doesn't work. si_mem_available is not really suitable
for any allocation estimates. Its only purpose is to provide a very
rough estimate for userspace; any other use is basically abuse. The
situation can change really quickly. It is really hard to be clever
here given the volatility that memory allocations can cause.
--
Michal Hocko
SUSE Labs
On Tue, 3 Apr 2018 14:16:14 +0200
Michal Hocko <[email protected]> wrote:
> > This came up because there's scripts or programs that set the size of
> > the ring buffer. The complaint was that the application would just set
> > the size to something bigger than what was available and cause an OOM
> > killing other applications. The final solution is to simply check the
> > available memory before allocating the ring buffer:
> >
> > /* Check if the available memory is there first */
> > i = si_mem_available();
> > if (i < nr_pages)
> > return -ENOMEM;
> >
> > And it works well.
>
> Except that it doesn't work. si_mem_available is not really suitable for
> any allocation estimations. Its only purpose is to provide a very rough
> estimation for userspace. Any other use is basically abuse. The
> situation can change really quickly. Really it is really hard to be
> clever here with the volatility the memory allocations can cause.
OK, then what do you suggest? Because currently, it appears to work. A
rough estimate may be good enough.
If we use NORETRY, then we have those that complain that we do not try
hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
issue of taking up all memory before we get what we want.
Perhaps I should try to allocate a large group of pages with
RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
that the large allocation may reclaim some memory that would allow the
NORETRY to succeed with smaller allocations (one page at a time)?
-- Steve
On Tue 03-04-18 08:23:48, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 14:16:14 +0200
> Michal Hocko <[email protected]> wrote:
>
> > > This came up because there's scripts or programs that set the size of
> > > the ring buffer. The complaint was that the application would just set
> > > the size to something bigger than what was available and cause an OOM
> > > killing other applications. The final solution is to simply check the
> > > available memory before allocating the ring buffer:
> > >
> > > /* Check if the available memory is there first */
> > > i = si_mem_available();
> > > if (i < nr_pages)
> > > return -ENOMEM;
> > >
> > > And it works well.
> >
> > Except that it doesn't work. si_mem_available is not really suitable for
> > any allocation estimations. Its only purpose is to provide a very rough
> > estimation for userspace. Any other use is basically abuse. The
> > situation can change really quickly. Really it is really hard to be
> > clever here with the volatility the memory allocations can cause.
>
> OK, then what do you suggest? Because currently, it appears to work. A
> rough estimate may be good enough.
>
> If we use NORETRY, then we have those that complain that we do not try
> hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
> issue of taking up all memory before we get what we want.
Just try to do what the admin asks for and trust that they will not
shoot themselves in the foot? I mean, there are other ways an admin
can shoot the machine down. Being clever is OK if it doesn't add
tricky code. And relying on si_mem_available is definitely tricky and
obscure.
> Perhaps I should try to allocate a large group of pages with
> RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> that the large allocation may reclaim some memory that would allow the
> NORETRY to succeed with smaller allocations (one page at a time)?
That again relies on subtle dependencies of the current
implementation. So I would rather ask whether this is something that
really deserves special treatment. If the admin asks for a buffer of a
certain size then try to do so. If we get an OOM then bad luck, you cannot
get large memory buffers for free...
--
Michal Hocko
SUSE Labs
On Tue, 3 Apr 2018 14:35:14 +0200
Michal Hocko <[email protected]> wrote:
> > If we use NORETRY, then we have those that complain that we do not try
> > hard enough to reclaim memory. If we use RETRY_MAYFAIL we have this
> > issue of taking up all memory before we get what we want.
>
> Just try to do what admin asks for and trust it will not try to shoot
> his foot? I mean there are other ways admin can shoot the machine down.
Allowing the admin to just shoot herself in the foot is not an option.
Yes there are many ways to bring down a machine, but this shouldn't be
one of them. All one needs to do is echo too big of a number
into /sys/kernel/tracing/buffer_size_kb and OOM may kill a critical
program on a production machine. Tracing is made for production, and
should not allow an easy way to trigger OOM.
> Being clever is OK if it doesn't add a tricky code. And relying on
> si_mem_available is definitely tricky and obscure.
Can we get the mm subsystem to provide a better method to know if an
allocation will possibly succeed or not before trying it? It doesn't
have to be free of races. Just "if I allocate this many pages right
now, will it work?" If that changes from the time it asks to the time
it allocates, that's fine. I'm not trying to prevent OOM from ever
triggering. I just don't want it to trigger consistently.
>
> > Perhaps I should try to allocate a large group of pages with
> > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > that the large allocation may reclaim some memory that would allow the
> > NORETRY to succeed with smaller allocations (one page at a time)?
>
> That again relies on a subtle dependencies of the current
> implementation. So I would rather ask whether this is something that
> really deserves special treatment. If admin asks for a buffer of a
> certain size then try to do so. If we get OOM then bad luck you cannot
> get large memory buffers for free...
That is not acceptable to me nor to the people asking for this.
The problem is known. The ring buffer allocates memory page by page,
and this can allow it to easily take all memory in the system before it
fails to allocate and free everything it had done.
If you don't like the use of si_mem_available() I'll do the larger
pages method. Yes it depends on the current implementation of memory
allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
number of pages, and fail if it can't (leaving memory for other
allocations to succeed).
The allocation of the ring buffer isn't critical. It can fail to
expand, and we can tell the user -ENOMEM. I originally had NORETRY
because I'd rather have it fail than cause an OOM. But there's folks
(like Joel) that want it to succeed when there's available memory in
page caches.
I'm fine if the admin shoots herself in the foot if the ring buffer
gets big enough to start causing OOMs, but I don't want it to cause
OOMs if there's not even enough memory to fulfill the ring buffer size
itself.
-- Steve
On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 14:35:14 +0200
> Michal Hocko <[email protected]> wrote:
[...]
> > Being clever is OK if it doesn't add a tricky code. And relying on
> > si_mem_available is definitely tricky and obscure.
>
> Can we get the mm subsystem to provide a better method to know if an
> allocation will possibly succeed or not before trying it? It doesn't
> have to be free of races. Just "if I allocate this many pages right
> now, will it work?" If that changes from the time it asks to the time
> it allocates, that's fine. I'm not trying to prevent OOM to never
> trigger. I just don't want to to trigger consistently.
How do you do that without an actual allocation request? And more
fundamentally, what if your _particular_ request is just fine but it
will get us so close to the OOM edge that the next legit allocation
request simply goes OOM? There is simply no sane interface I can think
of that would satisfy a safe/sensible "will it cause OOM" semantic.
> > > Perhaps I should try to allocate a large group of pages with
> > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > > that the large allocation may reclaim some memory that would allow the
> > > NORETRY to succeed with smaller allocations (one page at a time)?
> >
> > That again relies on a subtle dependencies of the current
> > implementation. So I would rather ask whether this is something that
> > really deserves special treatment. If admin asks for a buffer of a
> > certain size then try to do so. If we get OOM then bad luck you cannot
> > get large memory buffers for free...
>
> That is not acceptable to me nor to the people asking for this.
>
> The problem is known. The ring buffer allocates memory page by page,
> and this can allow it to easily take all memory in the system before it
> fails to allocate and free everything it had done.
Then do not allow buffers that are too large. How often do you need
buffers that are larger than a few megs or a small % of the available
memory? Consuming an excessive amount of memory just to trace a workload
which will need some memory on its own sounds dubious to me.
> If you don't like the use of si_mem_available() I'll do the larger
> pages method. Yes it depends on the current implementation of memory
> allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
> number of pages, and fail if it can't (leaving memory for other
> allocations to succeed).
>
> The allocation of the ring buffer isn't critical. It can fail to
> expand, and we can tell the user -ENOMEM. I original had NORETRY
> because I rather have it fail than cause an OOM. But there's folks
> (like Joel) that want it to succeed when there's available memory in
> page caches.
Then implement a retry logic on top of NORETRY. You can control how hard
to retry to satisfy the request yourself. You still risk that your
allocation will get us close to OOM for _somebody_ else though.
> I'm fine if the admin shoots herself in the foot if the ring buffer
> gets big enough to start causing OOMs, but I don't want it to cause
> OOMs if there's not even enough memory to fulfill the ring buffer size
> itself.
I simply do not see the difference between the two. Both have the same
deadly effect in the end. The direct OOM has an arguable advantage that
the effect is immediate rather than subtle with potential performance
side effects until the machine OOMs after crawling for quite some time.
--
Michal Hocko
SUSE Labs
On Tue, 3 Apr 2018 15:56:07 +0200
Michal Hocko <[email protected]> wrote:
> On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> > On Tue, 3 Apr 2018 14:35:14 +0200
> > Michal Hocko <[email protected]> wrote:
> [...]
> > > Being clever is OK if it doesn't add a tricky code. And relying on
> > > si_mem_available is definitely tricky and obscure.
> >
> > Can we get the mm subsystem to provide a better method to know if an
> > allocation will possibly succeed or not before trying it? It doesn't
> > have to be free of races. Just "if I allocate this many pages right
> > now, will it work?" If that changes from the time it asks to the time
> > it allocates, that's fine. I'm not trying to prevent OOM to never
> > trigger. I just don't want to to trigger consistently.
>
> How do you do that without an actuall allocation request? And more
> fundamentally, what if your _particular_ request is just fine but it
> will get us so close to the OOM edge that the next legit allocation
> request simply goes OOM? There is simply no sane interface I can think
> of that would satisfy a safe/sensible "will it cause OOM" semantic.
That's where I'm fine with the admin shooting herself in the foot. If
they ask for almost all memory, and then the system needs more, that's
not our problem.
I'm more worried about putting in a couple of extra zeros by mistake,
which will pretty much guarantee an OOM on an active system.
>
> > > > Perhaps I should try to allocate a large group of pages with
> > > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
> > > > that the large allocation may reclaim some memory that would allow the
> > > > NORETRY to succeed with smaller allocations (one page at a time)?
> > >
> > > That again relies on a subtle dependencies of the current
> > > implementation. So I would rather ask whether this is something that
> > > really deserves special treatment. If admin asks for a buffer of a
> > > certain size then try to do so. If we get OOM then bad luck you cannot
> > > get large memory buffers for free...
> >
> > That is not acceptable to me nor to the people asking for this.
> >
> > The problem is known. The ring buffer allocates memory page by page,
> > and this can allow it to easily take all memory in the system before it
> > fails to allocate and free everything it had done.
>
> Then do not allow buffers that are too large. How often do you need
> buffers that are larger than few megs or small % of the available
> memory? Consuming excessive amount of memory just to trace workload
> which will need some memory on its own sounds just dubious to me.
Recording hundreds of millions of events per second requires hundreds
of megs of memory. Large buffers are required for tracing. That's
extremely common.
>
> > If you don't like the use of si_mem_available() I'll do the larger
> > pages method. Yes it depends on the current implementation of memory
> > allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
> > number of pages, and fail if it can't (leaving memory for other
> > allocations to succeed).
> >
> > The allocation of the ring buffer isn't critical. It can fail to
> > expand, and we can tell the user -ENOMEM. I original had NORETRY
> > because I rather have it fail than cause an OOM. But there's folks
> > (like Joel) that want it to succeed when there's available memory in
> > page caches.
>
> Then implement a retry logic on top of NORETRY. You can control how hard
> to retry to satisfy the request yourself. You still risk that your
> allocation will get us close to OOM for _somebody_ else though.
>
> > I'm fine if the admin shoots herself in the foot if the ring buffer
> > gets big enough to start causing OOMs, but I don't want it to cause
> > OOMs if there's not even enough memory to fulfill the ring buffer size
> > itself.
>
> I simply do not see the difference between the two. Both have the same
> deadly effect in the end. The direct OOM has an arguable advantage that
> the effect is immediate rather than subtle with potential performance
> side effects until the machine OOMs after crawling for quite some time.
The difference is whether the allocation succeeds or not. If it doesn't
succeed, we free all the memory that we tried to allocate. If it succeeds
and causes issues, then yes, that's the admin's fault. I'm worried about
accidentally putting in too big of a number, either by an admin by
mistake, or by some stupid script that just thinks the current machine
has terabytes of memory.
I'm under the assumption that if I request an allocation of 32 pages
with RETRY_MAYFAIL, and there's 2 pages available, but not 32, and
while my allocation is reclaiming memory, and another task comes in and
asks for a single page, it can still succeed. This would be why I would
be using RETRY_MAYFAIL with higher orders of pages, that it doesn't
take all memory in the system if it fails. Is this assumption incorrect?
The problem with the current approach of allocating 1 page at a time with
RETRY_MAYFAIL is that it will succeed in getting any pages that are
available, until there are none, and if some unlucky task asks for memory
during that time, it is guaranteed to fail its allocation, triggering an OOM.
I was thinking of doing something like:
	large_pages = nr_pages / 32;
	if (large_pages) {
		pages = alloc_pages_node(cpu_to_node(cpu),
					 GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
		if (pages)
			/* break up pages */
		else
			/* try to allocate with NORETRY */
	}
Now it will allocate memory in 32 page chunks using reclaim. If it
fails to allocate them, it would not have taken up any smaller chunks
that were available, leaving them for other users. It would then go
back to single pages, allocating with NORETRY. Or I could just say screw
it, and make the allocation of the ring buffer always be 32 page chunks
(or at least make it user defined).
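Fleshed out a little more, the chunk path could look something like the
sketch below (untested; rb_add_page() here is just a made-up stand-in for
the real bpage/list bookkeeping):

	#define RB_CHUNK_ORDER	5	/* 2^5 = 32 pages per chunk */

	static int rb_alloc_pages_chunk(int cpu, struct list_head *pages,
					unsigned long *nr_pages)
	{
		struct page *page;
		int i;

		/* One order-5 request lets reclaim happen in a single shot */
		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_RETRY_MAYFAIL,
					RB_CHUNK_ORDER);
		if (page) {
			/* Make the 32 pages individually freeable */
			split_page(page, RB_CHUNK_ORDER);
			for (i = 0; i < (1 << RB_CHUNK_ORDER); i++)
				rb_add_page(pages, page + i);
			*nr_pages -= 1 << RB_CHUNK_ORDER;
			return 0;
		}

		/* The chunk failed: fall back to a single page with NORETRY */
		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_NORETRY, 0);
		if (!page)
			return -ENOMEM;
		rb_add_page(pages, page);
		(*nr_pages)--;
		return 0;
	}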
-- Steve
On Tue 03-04-18 10:17:53, Steven Rostedt wrote:
> On Tue, 3 Apr 2018 15:56:07 +0200
> Michal Hocko <[email protected]> wrote:
[...]
> > I simply do not see the difference between the two. Both have the same
> > deadly effect in the end. The direct OOM has an arguable advantage that
> > the effect is immediate rather than subtle with potential performance
> > side effects until the machine OOMs after crawling for quite some time.
>
> The difference is if the allocation succeeds or not. If it doesn't
> succeed, we free all memory that we tried to allocate. If it succeeds
> and causes issues, then yes, that's the admins fault.
What I am trying to say is that this is so extremely time and workload
sensitive that you can hardly have stable behavior. It becomes
pure luck whether the failure happens.
> I'm worried about
> the accidental putting in too big of a number, either by an admin by
> mistake, or some stupid script that just thinks the current machines
> has terabytes of memory.
I would argue that stupid scripts should have no business calling root
only interfaces which can allocate a lot of memory and cause OOMs.
> I'm under the assumption that if I allocate an allocation of 32 pages
> with RETRY_MAYFAIL, and there's 2 pages available, but not 32, and
> while my allocation is reclaiming memory, and another task comes in and
> asks for a single page, it can still succeed. This would be why I would
> be using RETRY_MAYFAIL with higher orders of pages, that it doesn't
> take all memory in the system if it fails. Is this assumption incorrect?
Yes. There is no guarantee that the allocation will get the memory it
reclaimed in the direct reclaim. Pages are simply freed back into the
pool and it is a matter of timing who gets them.
> The current approach of allocating 1 page at a time with RETRY_MAYFAIL
> is that it will succeed to get any pages that are available, until
> there are none, and if some unlucky task asks for memory during that
> time, it is guaranteed to fail its allocation triggering an OOM.
>
> I was thinking of doing something like:
>
> large_pages = nr_pages / 32;
> if (large_pages) {
> pages = alloc_pages_node(cpu_to_node(cpu),
> GFP_KERNEL | __GFP_RETRY_MAYFAIL, 5);
> if (pages)
> /* break up pages */
> else
> /* try to allocate with NORETRY */
> }
You can do so, of course. In fact it would have some advantages over
single pages because you would fragment the memory less, but this is not
a reliable prevention from OOM killing and complete memory
depletion if you allow arbitrary trace buffer sizes.
> Now it will allocate memory in 32 page chunks using reclaim. If it
> fails to allocate them, it would not have taken up any smaller chunks
> that were available, leaving them for other users. It would then go
> back to singe pages, allocating with RETRY. Or I could just say screw
> it, and make the allocation of the ring buffer always be 32 page chunks
> (or at least make it user defined).
Yes, a fallback is questionable. Whether to make the batch size
configurable is a matter of how many internal details you want to
expose to userspace.
--
Michal Hocko
SUSE Labs
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko <[email protected]> wrote:
> yes a fallback is questionable. Whether to make the batch size
> configuration is a matter of how much internal details you want to
> expose to userspace.
Well, it is tracing the guts of the kernel, so internal details are
generally exposed to userspace ;-)
-- Steve
On Tue, 3 Apr 2018 18:11:19 +0200
Michal Hocko <[email protected]> wrote:
> You can do so, of course. In fact it would have some advantages over
> single pages because you would fragment the memory less but this is not
> a reliable prevention from OOM killing and the complete memory
> depletion if you allow arbitrary trace buffer sizes.
You are right that this doesn't prevent OOM killing. I tried various
methods, and the only thing that currently "works" the way I'm happy
with, is this original patch.
From your earlier email:
> Except that it doesn't work. si_mem_available is not really suitable for
> any allocation estimations. Its only purpose is to provide a very rough
> estimation for userspace. Any other use is basically abuse. The
> situation can change really quickly. Really it is really hard to be
> clever here with the volatility the memory allocations can cause.
Now can you please explain to me why si_mem_available is not suitable
for my purpose. If it's wrong and states there is less memory than
actually exists, we simply fail to increase the buffer.
If it is wrong and states that there is more memory than actually
exists, then we do nothing different than what we do today, and trigger
an OOM.
But for the ring buffer use case, it is "good enough". Can you
please explain to me what issues you see that can happen if we apply
this patch?
-- Steve
On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko <[email protected]> wrote:
> On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> On Tue, 3 Apr 2018 14:35:14 +0200
>> Michal Hocko <[email protected]> wrote:
> [...]
>> > Being clever is OK if it doesn't add a tricky code. And relying on
>> > si_mem_available is definitely tricky and obscure.
>>
>> Can we get the mm subsystem to provide a better method to know if an
>> allocation will possibly succeed or not before trying it? It doesn't
>> have to be free of races. Just "if I allocate this many pages right
>> now, will it work?" If that changes from the time it asks to the time
>> it allocates, that's fine. I'm not trying to prevent OOM to never
>> trigger. I just don't want to to trigger consistently.
>
> How do you do that without an actuall allocation request? And more
> fundamentally, what if your _particular_ request is just fine but it
> will get us so close to the OOM edge that the next legit allocation
> request simply goes OOM? There is simply no sane interface I can think
> of that would satisfy a safe/sensible "will it cause OOM" semantic.
>
The point is that the app which tries to allocate a size over the limit will
escape the OOM and let other innocent processes be sacrificed. However, the one
which you mentioned above will possibly be selected by the OOM killer triggered
by the consequent failed allocation.
>> > > Perhaps I should try to allocate a large group of pages with
>> > > RETRY_MAYFAIL, and if that fails go back to NORETRY, with the thinking
>> > > that the large allocation may reclaim some memory that would allow the
>> > > NORETRY to succeed with smaller allocations (one page at a time)?
>> >
>> > That again relies on a subtle dependencies of the current
>> > implementation. So I would rather ask whether this is something that
>> > really deserves special treatment. If admin asks for a buffer of a
>> > certain size then try to do so. If we get OOM then bad luck you cannot
>> > get large memory buffers for free...
>>
>> That is not acceptable to me nor to the people asking for this.
>>
>> The problem is known. The ring buffer allocates memory page by page,
>> and this can allow it to easily take all memory in the system before it
>> fails to allocate and free everything it had done.
>
> Then do not allow buffers that are too large. How often do you need
> buffers that are larger than few megs or small % of the available
> memory? Consuming excessive amount of memory just to trace workload
> which will need some memory on its own sounds just dubious to me.
>
>> If you don't like the use of si_mem_available() I'll do the larger
>> pages method. Yes it depends on the current implementation of memory
>> allocation. It will depend on RETRY_MAYFAIL trying to allocate a large
>> number of pages, and fail if it can't (leaving memory for other
>> allocations to succeed).
>>
>> The allocation of the ring buffer isn't critical. It can fail to
>> expand, and we can tell the user -ENOMEM. I original had NORETRY
>> because I rather have it fail than cause an OOM. But there's folks
>> (like Joel) that want it to succeed when there's available memory in
>> page caches.
>
> Then implement a retry logic on top of NORETRY. You can control how hard
> to retry to satisfy the request yourself. You still risk that your
> allocation will get us close to OOM for _somebody_ else though.
>
>> I'm fine if the admin shoots herself in the foot if the ring buffer
>> gets big enough to start causing OOMs, but I don't want it to cause
>> OOMs if there's not even enough memory to fulfill the ring buffer size
>> itself.
>
> I simply do not see the difference between the two. Both have the same
> deadly effect in the end. The direct OOM has an arguable advantage that
> the effect is immediate rather than subtle with potential performance
> side effects until the machine OOMs after crawling for quite some time.
>
> --
> Michal Hocko
> SUSE Labs
On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
[...]
> From your earlier email:
>
> > Except that it doesn't work. si_mem_available is not really suitable for
> > any allocation estimations. Its only purpose is to provide a very rough
> > estimation for userspace. Any other use is basically abuse. The
> > situation can change really quickly. Really it is really hard to be
> > clever here with the volatility the memory allocations can cause.
>
> Now can you please explain to me why si_mem_available is not suitable
> for my purpose.
Several problems. It is overly optimistic especially when we are close
to OOM. The available pagecache or slab reclaimable objects might be pinned
long enough that your allocation based on that estimation will just make
the situation worse and result in OOM. More importantly though, your
allocations are GFP_KERNEL, right, that means that such an allocation
will not reach ZONE_MOVABLE or ZONE_HIGHMEM (32b systems) while the
pagecache will. So you will get an overestimate of how much you can
allocate.
Really si_mem_available is for proc/meminfo and a rough estimate of the
free memory because users tend to be confused by seeing MemFree too low
and complaining that the system has eaten all their memory. I have some
skepticism about how useful it is in practice apart from showing it in
top or alike tools. The memory is simply not usable immediately or
without an overall and visible effect on the whole system.
HTH
--
Michal Hocko
SUSE Labs
On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
> >> On Tue, 3 Apr 2018 14:35:14 +0200
> >> Michal Hocko <[email protected]> wrote:
> > [...]
> >> > Being clever is OK if it doesn't add a tricky code. And relying on
> >> > si_mem_available is definitely tricky and obscure.
> >>
> >> Can we get the mm subsystem to provide a better method to know if an
> >> allocation will possibly succeed or not before trying it? It doesn't
> >> have to be free of races. Just "if I allocate this many pages right
> >> now, will it work?" If that changes from the time it asks to the time
> >> it allocates, that's fine. I'm not trying to prevent OOM to never
> >> trigger. I just don't want to to trigger consistently.
> >
> > How do you do that without an actuall allocation request? And more
> > fundamentally, what if your _particular_ request is just fine but it
> > will get us so close to the OOM edge that the next legit allocation
> > request simply goes OOM? There is simply no sane interface I can think
> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
> >
> The point is the app which try to allocate the size over the line will escape
> the OOM and let other innocent to be sacrificed. However, the one which you
> mentioned above will be possibly selected by OOM that triggered by consequnce
> failed allocation.
If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
which will automatically select the current process as an oom victim and
kill it.
--
Michal Hocko
SUSE Labs
On Wed, Apr 4, 2018 at 2:23 PM, Michal Hocko <[email protected]> wrote:
> On Wed 04-04-18 10:58:39, Zhaoyang Huang wrote:
>> On Tue, Apr 3, 2018 at 9:56 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 03-04-18 09:32:45, Steven Rostedt wrote:
>> >> On Tue, 3 Apr 2018 14:35:14 +0200
>> >> Michal Hocko <[email protected]> wrote:
>> > [...]
>> >> > Being clever is OK if it doesn't add a tricky code. And relying on
>> >> > si_mem_available is definitely tricky and obscure.
>> >>
>> >> Can we get the mm subsystem to provide a better method to know if an
>> >> allocation will possibly succeed or not before trying it? It doesn't
>> >> have to be free of races. Just "if I allocate this many pages right
>> >> now, will it work?" If that changes from the time it asks to the time
>> >> it allocates, that's fine. I'm not trying to prevent OOM to never
>> >> trigger. I just don't want to to trigger consistently.
>> >
>> > How do you do that without an actuall allocation request? And more
>> > fundamentally, what if your _particular_ request is just fine but it
>> > will get us so close to the OOM edge that the next legit allocation
>> > request simply goes OOM? There is simply no sane interface I can think
>> > of that would satisfy a safe/sensible "will it cause OOM" semantic.
>> >
>> The point is the app which try to allocate the size over the line will escape
>> the OOM and let other innocent to be sacrificed. However, the one which you
>> mentioned above will be possibly selected by OOM that triggered by consequnce
>> failed allocation.
>
> If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.
But we cannot call the function on behalf of the current process,
which may not want to be killed for memory reasons. It is more proper
to tell it -ENOMEM and let it make further decisions.
> --
> Michal Hocko
> SUSE Labs
On Tue, Apr 3, 2018 at 11:20 PM, Michal Hocko <[email protected]> wrote:
> On Tue 03-04-18 18:56:27, Steven Rostedt wrote:
> [...]
>> From your earlier email:
>>
>> > Except that it doesn't work. si_mem_available is not really suitable for
>> > any allocation estimations. Its only purpose is to provide a very rough
>> > estimation for userspace. Any other use is basically abuse. The
>> > situation can change really quickly. Really it is really hard to be
>> > clever here with the volatility the memory allocations can cause.
>>
>> Now can you please explain to me why si_mem_available is not suitable
>> for my purpose.
>
> Several problems. It is overly optimistic especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be pinned
> long enough that your allocation based on that estimation will just make
> the situation worse and result in OOM. More importantly though, your
> allocations are GFP_KERNEL, right, that means that such an allocation
> will not reach to ZONE_MOVABLE or ZONE_HIGMEM (32b systems) while the
> pagecache will. So you will get an overestimate of how much you can
> allocate.
Yes, but right now it is assumed that there is all the memory in the
world to allocate. Clearly an overestimate is better than that. Right
now ftrace will just allocate memory until it actually fails to
allocate, at which point it may have made other processes
likely to OOM. As Steve pointed out, it doesn't need to be accurate,
but it does solve the problem here (or some other similar method seems
to be needed to solve it).
>
> Really si_mem_available is for proc/meminfo and a rough estimate of the
> free memory because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some
But why is a rough estimate not good in this case? That's what I don't get.
> skepticism about how useful it is in practice apart from showing it in
> top or alike tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.
Sure there can be false positives but it does reduce the problem for
the most part, I feel. Maybe we can use this as an interim solution
(better than leaving the issue hanging)? Or you could propose a
solution of how to get an estimate and prevent other tasks from
experiencing memory pressure for no reason.
Another thing we could do is check how much total memory there is on
the system and cap by that (which will prevent impossibly large
allocations) but that still doesn't address the problem completely.
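Something as simple as the following would at least reject the impossible
requests up front (illustrative only; assume nr_pages is the total number
of buffer pages being requested):

	/* Refuse requests that cannot possibly fit in RAM */
	if (nr_pages > totalram_pages)
		return -ENOMEM;

It obviously says nothing about how much of that RAM is actually free.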
thanks,
- Joel
On Wed, 4 Apr 2018 08:20:39 +0200
Michal Hocko <[email protected]> wrote:
> > Now can you please explain to me why si_mem_available is not suitable
> > for my purpose.
>
> Several problems. It is overly optimistic especially when we are close
> to OOM. The available pagecache or slab reclaimable objects might be pinned
> long enough that your allocation based on that estimation will just make
> the situation worse and result in OOM. More importantly though, your
> allocations are GFP_KERNEL, right, that means that such an allocation
> will not reach to ZONE_MOVABLE or ZONE_HIGMEM (32b systems) while the
> pagecache will. So you will get an overestimate of how much you can
> allocate.
>
> Really si_mem_available is for proc/meminfo and a rough estimate of the
> free memory because users tend to be confused by seeing MemFree too low
> and complaining that the system has eaten all their memory. I have some
> skepticism about how useful it is in practice apart from showing it in
> top or alike tools. The memory is simply not usable immediately or
> without an overall and visible effect on the whole system.
What you are telling me is that this is perfect for my use case.
I'm not looking for a "if this tells me have enough memory, I then have
enough memory". I'm looking for a "If I screwed up and asked for a
magnitude more than I really need, don't OOM the system".
Really, I don't care if the number is not truly accurate. In fact, what
you tell me above is exactly what I wanted. I'm more worried it will
return a smaller number than what is available. I'd much rather have an
overestimate.
This is not about trying to get as much memory for tracing as possible,
where we slowly increase the buffer size till we have pretty much every
page for tracing. If someone does that, then the system should OOM and
become unstable.
This is about doing what I (and others) have done several times,
which is putting in one or two more zeros than I really wanted. Or forgetting
that the number written to buffer_size_kb is the buffer size for each
CPU. Yes, the number you write in there is multiplied by every CPU on
the system. It is easy to over-allocate by mistake.
I'm looking to protect against gross mistakes where it is obvious that
the allocation isn't going to succeed before the allocation even begins. I'm
not looking to be perfect here.
As I've stated before, the current method is to say F*ck You to the
rest of the system and OOM anything else.
If you want, I can change the comment above the code to be:
+ /*
+ * Check if the available memory is there first.
+ * Note, si_mem_available() only gives us a rough estimate of available
+ * memory. It may not be accurate. But we don't care, we just want
+ * to prevent doing any allocation when it is obvious that it is
+ * not going to succeed.
+ */
+ i = si_mem_available();
+ if (i < nr_pages)
+ return -ENOMEM;
+
Better?
-- Steve
On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
[...]
> + /*
> + * Check if the available memory is there first.
> + * Note, si_mem_available() only gives us a rough estimate of available
> + * memory. It may not be accurate. But we don't care, we just want
> + * to prevent doing any allocation when it is obvious that it is
> + * not going to succeed.
> + */
> + i = si_mem_available();
> + if (i < nr_pages)
> + return -ENOMEM;
> +
>
> Better?
I must be really missing something here. How can that work at all for
e.g. the zone_{highmem/movable}. You will get false on the above tests
even when you will have a hard time allocating anything from your
destination zones.
--
Michal Hocko
SUSE Labs
On Wed, 4 Apr 2018 08:23:40 +0200
Michal Hocko <[email protected]> wrote:
> If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
> which will automatically select the current process as an oom victim and
> kill it.
Would it even receive the signal? Does alloc_pages_node() even respond
to signals? Because the OOM happens while the allocation loop is
running.
I tried it out, I did the following:
	set_current_oom_origin();
	for (i = 0; i < nr_pages; i++) {
		struct page *page;
		/*
		 * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
		 * gracefully without invoking oom-killer and the system is not
		 * destabilized.
		 */
		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
				     GFP_KERNEL | __GFP_RETRY_MAYFAIL,
				     cpu_to_node(cpu));
		if (!bpage)
			goto free_pages;

		list_add(&bpage->list, pages);

		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
		if (!page)
			goto free_pages;
		bpage->page = page_address(page);
		rb_init_page(bpage->page);
	}
	clear_current_oom_origin();
The first time I ran my ring buffer memory stress test, it killed the
stress test. The second time I ran it, it killed polkitd.
Still doesn't help as much as the original patch.
You haven't convinced me that using si_mem_available() is a bad idea.
If anything, you've solidified my confidence in it.
-- Steve
On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 08:23:40 +0200
> Michal Hocko <[email protected]> wrote:
>
> > If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
> > which will automatically select the current process as an oom victim and
> > kill it.
>
> Would it even receive the signal? Does alloc_pages_node() even respond
> to signals? Because the OOM happens while the allocation loop is
> running.
Well, you would need to do something like:
>
> I tried it out, I did the following:
>
> set_current_oom_origin();
> for (i = 0; i < nr_pages; i++) {
> struct page *page;
> /*
> * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> * gracefully without invoking oom-killer and the system is not
> * destabilized.
> */
> bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> cpu_to_node(cpu));
> if (!bpage)
> goto free_pages;
>
> list_add(&bpage->list, pages);
>
> page = alloc_pages_node(cpu_to_node(cpu),
> GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> if (!page)
> goto free_pages;
		if (fatal_signal_pending(current))
			goto free_pages;
> bpage->page = page_address(page);
> rb_init_page(bpage->page);
> }
> clear_current_oom_origin();
If you use __GFP_RETRY_MAYFAIL it would have to be somebody else who
triggers the OOM killer and this user context would get killed. If you
drop __GFP_RETRY_MAYFAIL it would be this context that triggers the OOM, but
it would still be the selected victim.
--
Michal Hocko
SUSE Labs
On Wed, 4 Apr 2018 16:10:52 +0200
Michal Hocko <[email protected]> wrote:
> On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> [...]
> > + /*
> > + * Check if the available memory is there first.
> > + * Note, si_mem_available() only gives us a rough estimate of available
> > + * memory. It may not be accurate. But we don't care, we just want
> > + * to prevent doing any allocation when it is obvious that it is
> > + * not going to succeed.
> > + */
> > + i = si_mem_available();
> > + if (i < nr_pages)
> > + return -ENOMEM;
> > +
> >
> > Better?
>
> I must be really missing something here. How can that work at all for
> e.g. the zone_{highmem/movable}. You will get false on the above tests
> even when you will have hard time to allocate anything from your
> destination zones.
You mean we will get true on the above tests? Again, the current
method is to just say screw it and try to allocate.
I originally just used NORETRY which would only allocate memory that is
currently available and not try to reclaim anything. But people like
Joel at Google that required increasing the buffer when memory was
mostly taken up by page cache changed it from NORETRY to RETRY_MAYFAIL.
But this now causes the issue that a large allocation can take up all
memory even when the allocation requested is guaranteed to fail,
because there is not enough memory to pull this off.
We just want a way to say "hey, is there enough memory in the system to
allocate all these pages before we try? We don't need specifics, we
just want to make sure we are not allocating way too much".
The answer I want is "yes there may be enough (but you may not be able
to use it)" or "no, there is definitely not enough for that".
Currently si_mem_available() is the closest thing we have to answering
that question. I'm fine if the answer is "Yes" even if I can't allocate
that memory.
I'm looking for something where "yes" means "there may be enough, but
there may not be, buyer beware", and "no" means "forget it, don't even
start, because you just asked for more than possible".
-- Steve
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko <[email protected]> wrote:
> On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 08:23:40 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
> > > which will automatically select the current process as an oom victim and
> > > kill it.
> >
> > Would it even receive the signal? Does alloc_pages_node() even respond
> > to signals? Because the OOM happens while the allocation loop is
> > running.
>
> Well, you would need to do something like:
>
> >
> > I tried it out, I did the following:
> >
> > set_current_oom_origin();
> > for (i = 0; i < nr_pages; i++) {
> > struct page *page;
> > /*
> > * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > * gracefully without invoking oom-killer and the system is not
> > * destabilized.
> > */
> > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > cpu_to_node(cpu));
> > if (!bpage)
> > goto free_pages;
> >
> > list_add(&bpage->list, pages);
> >
> > page = alloc_pages_node(cpu_to_node(cpu),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > if (!page)
> > goto free_pages;
>
> if (fatal_signal_pending())
> fgoto free_pages;
But wouldn't page be NULL in this case?
>
> > bpage->page = page_address(page);
> > rb_init_page(bpage->page);
> > }
> > clear_current_oom_origin();
>
> If you use __GFP_RETRY_MAYFAIL it would have to be somedy else to
> trigger the OOM killer and this user context would get killed. If you
> drop __GFP_RETRY_MAYFAIL it would be this context to trigger the OOM but
> it would still be the selected victim.
Then we would guarantee killing the process instead of just returning
-ENOMEM, which would change the user space ABI, and that is a NO NO.
Ideally, we want to avoid an OOM. I could add the above as well, for when
si_mem_available() returns something that is greater than what is
available, and at least this is the process that will get the OOM if it
fails to allocate.
Would that work for you?
-- Steve
On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:10:52 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > [...]
> > > + /*
> > > + * Check if the available memory is there first.
> > > + * Note, si_mem_available() only gives us a rough estimate of available
> > > + * memory. It may not be accurate. But we don't care, we just want
> > > + * to prevent doing any allocation when it is obvious that it is
> > > + * not going to succeed.
> > > + */
> > > + i = si_mem_available();
> > > + if (i < nr_pages)
> > > + return -ENOMEM;
> > > +
> > >
> > > Better?
> >
> > I must be really missing something here. How can that work at all for
> > e.g. the zone_{highmem/movable}. You will get false on the above tests
> > even when you will have hard time to allocate anything from your
> > destination zones.
>
> You mean we will get true on the above tests? Again, the current
> method is to just say screw it and try to allocate.
No, you will get false on that test. Say that you have a system with
large ZONE_MOVABLE. Now your kernel allocations can fit only into
!movable zones (say we have 1G for !movable and 3G for movable). Now say
that !movable zones are getting close to the edge while movable zones
are full of reclaimable pages. si_mem_available will tell you there is a
_lot_ of memory available while your GFP_KERNEL request will happily
consume the rest of !movable zones and trigger OOM. See?
[...]
> I'm looking for something where "yes" means "there may be enough, but
> there may not be, buyer beware", and "no" means "forget it, don't even
> start, because you just asked for more than possible".
We do not have _that_ something other than try to opportunistically
allocate and see what happens. Sucks? Maybe yes but I really cannot
think of an interface with sane semantic that would catch all the
different scenarios.
--
Michal Hocko
SUSE Labs
On Wed 04-04-18 10:31:11, Steven Rostedt wrote:
> On Wed, 4 Apr 2018 16:23:29 +0200
> Michal Hocko <[email protected]> wrote:
>
> > On Wed 04-04-18 10:11:49, Steven Rostedt wrote:
> > > On Wed, 4 Apr 2018 08:23:40 +0200
> > > Michal Hocko <[email protected]> wrote:
> > >
> > > > If you are afraid of that then you can have a look at {set,clear}_current_oom_origin()
> > > > which will automatically select the current process as an oom victim and
> > > > kill it.
> > >
> > > Would it even receive the signal? Does alloc_pages_node() even respond
> > > to signals? Because the OOM happens while the allocation loop is
> > > running.
> >
> > Well, you would need to do something like:
> >
> > >
> > > I tried it out, I did the following:
> > >
> > > set_current_oom_origin();
> > > for (i = 0; i < nr_pages; i++) {
> > > struct page *page;
> > > /*
> > > * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > > * gracefully without invoking oom-killer and the system is not
> > > * destabilized.
> > > */
> > > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > > GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > > cpu_to_node(cpu));
> > > if (!bpage)
> > > goto free_pages;
> > >
> > > list_add(&bpage->list, pages);
> > >
> > > page = alloc_pages_node(cpu_to_node(cpu),
> > > GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > > if (!page)
> > > goto free_pages;
> >
> > if (fatal_signal_pending())
> > fgoto free_pages;
>
> But wouldn't page be NULL in this case?
__GFP_RETRY_MAYFAIL itself fails rather than triggers the OOM killer.
You still might get killed from another allocation context which can
trigger the OOM killer though. In any case you would back off and fail,
no?
> > > bpage->page = page_address(page);
> > > rb_init_page(bpage->page);
> > > }
> > > clear_current_oom_origin();
> >
> > If you use __GFP_RETRY_MAYFAIL it would have to be somedy else to
> > trigger the OOM killer and this user context would get killed. If you
> > drop __GFP_RETRY_MAYFAIL it would be this context to trigger the OOM but
> > it would still be the selected victim.
>
> Then we guarantee to kill the process instead of just sending a
> -ENOMEM, which would change user space ABI, and is a NO NO.
I see. Although I would expect it would be echo writing to a file most
of the time. But I am not really familiar with what tracers usually do, so I
will not speculate.
> Ideally, we want to avoid an OOM. I could add the above as well, when
> si_mem_avaiable() returns something that is greater than what is
> available, and at least this is the process that will get the OOM if it
> fails to allocate.
>
> Would that work for you?
I have responded wrt si_mem_available in another email but yes, using the
oom_origin would reduce the immediate damage at least.
--
Michal Hocko
SUSE Labs
On Wed, 4 Apr 2018 16:42:55 +0200
Michal Hocko <[email protected]> wrote:
> On Wed 04-04-18 10:25:27, Steven Rostedt wrote:
> > On Wed, 4 Apr 2018 16:10:52 +0200
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Wed 04-04-18 08:59:01, Steven Rostedt wrote:
> > > [...]
> > > > + /*
> > > > + * Check if the available memory is there first.
> > > > + * Note, si_mem_available() only gives us a rough estimate of available
> > > > + * memory. It may not be accurate. But we don't care, we just want
> > > > + * to prevent doing any allocation when it is obvious that it is
> > > > + * not going to succeed.
> > > > + */
> > > > + i = si_mem_available();
> > > > + if (i < nr_pages)
> > > > + return -ENOMEM;
> > > > +
> > > >
> > > > Better?
> > >
> > > I must be really missing something here. How can that work at all for
> > > e.g. the zone_{highmem/movable}. You will get false on the above tests
> > > even when you will have hard time to allocate anything from your
> > > destination zones.
> >
> > You mean we will get true on the above tests? Again, the current
> > method is to just say screw it and try to allocate.
>
> No, you will get false on that test. Say that you have a system with
Ah, I was thinking backwards. I read false as meaning "not enough
memory", whereas the test (i < nr_pages) being true means not enough memory,
and false means there is enough. OK, we are in agreement.
> large ZONE_MOVABLE. Now your kernel allocations can fit only into
> !movable zones (say we have 1G for !movable and 3G for movable). Now say
> that !movable zones are getting close to the edge while movable zones
> are full of reclaimable pages. si_mem_available will tell you there is a
> _lot_ of memory available while your GFP_KERNEL request will happily
> consume the rest of !movable zones and trigger OOM. See?
Which is still better than what we have today. I'm fine with it.
Really, I am.
>
> [...]
> > I'm looking for something where "yes" means "there may be enough, but
> > there may not be, buyer beware", and "no" means "forget it, don't even
> > start, because you just asked for more than possible".
>
> We do not have _that_ something other than try to opportunistically
> allocate and see what happens. Sucks? Maybe yes but I really cannot
> think of an interface with sane semantic that would catch all the
> different scenarios.
And I'm fine with that too. I don't want to catch all different
scenarios. I want to just catch the crazy ones. Like trying to allocate
gigs of memory when there's only a few megs left. Those can easily
happen with the current interface that can't change.
I'm not looking for perfect. In fact, I love what si_mem_available()
gives me now! Sure, it can say "there's enough memory" even if I can't
use it. Because most of the OOMs that happen when increasing
the size of the ring buffer aren't due to "just enough memory
allocated", but due to "trying to allocate crazy amounts of
memory". That's because it does the allocation one page at a time, and
if you try to allocate crazy amounts of memory, it will allocate all
memory before it fails. I don't want that. I want crazy allocations to
fail from the start. A "maybe this will allocate" is fine even if it
will end up causing an OOM.
-- Steve
On Wed 04-04-18 11:04:42, Steven Rostedt wrote:
[...]
> I'm not looking for perfect. In fact, I love what si_mem_available()
> gives me now! Sure, it can say "there's enough memory" even if I can't
> use it. Because most of the OOM allocations that happen with increasing
> the size of the ring buffer isn't due to "just enough memory
> allocated", but it's due to "trying to allocate crazy amounts of
> memory". That's because it does the allocation one page at a time, and
> if you try to allocate crazy amounts of memory, it will allocate all
> memory before it fails. I don't want that. I want crazy allocations to
> fail from the start. A "maybe this will allocate" is fine even if it
> will end up causing an OOM.
OK, fair enough. It's your code ;) I would recommend using the
oom_origin thingy to reduce the immediate damage and to have a clear
culprit so that I do not have to scratch my head why we see an OOM
report with a lot of unaccounted memory...
I am afraid I cannot help you much more though.
--
Michal Hocko
SUSE Labs
On Wed, 4 Apr 2018 17:27:13 +0200
Michal Hocko <[email protected]> wrote:
> I am afraid I cannot help you much more though.
No, you have actually been a great help. I'm finishing up a patch on
top of this one. I'll be posting it soon.
Thanks for your help and your time!
-- Steve
On Wed, 4 Apr 2018 16:23:29 +0200
Michal Hocko <[email protected]> wrote:
> >
> > I tried it out, I did the following:
> >
> > set_current_oom_origin();
> > for (i = 0; i < nr_pages; i++) {
> > struct page *page;
> > /*
> > * __GFP_RETRY_MAYFAIL flag makes sure that the allocation fails
> > * gracefully without invoking oom-killer and the system is not
> > * destabilized.
> > */
> > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> > cpu_to_node(cpu));
> > if (!bpage)
> > goto free_pages;
> >
> > list_add(&bpage->list, pages);
> >
> > page = alloc_pages_node(cpu_to_node(cpu),
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
> > if (!page)
> > goto free_pages;
>
> if (fatal_signal_pending())
> fgoto free_pages;
I originally was going to remove the RETRY_MAYFAIL, but adding this
check (at the end of the loop though) appears to have OOM consistently
kill this task.
I still like to keep RETRY_MAYFAIL, because it won't trigger OOM if
nothing comes in and tries to do an allocation, but instead will fail
nicely with -ENOMEM.
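For reference, the combination ends up looking roughly like this (just a
sketch; the bpage kzalloc_node() and the error unwinding are the same as
in the loop quoted above):

	set_current_oom_origin();
	for (i = 0; i < nr_pages; i++) {
		/* bpage allocation and list_add() as before */

		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_RETRY_MAYFAIL, 0);
		if (!page)
			goto free_pages;
		bpage->page = page_address(page);
		rb_init_page(bpage->page);

		/*
		 * If some other allocation context triggered the OOM killer
		 * and it picked us (oom_origin), back out instead of
		 * continuing to allocate.
		 */
		if (fatal_signal_pending(current))
			goto free_pages;
	}
	clear_current_oom_origin();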
-- Steve
>
> > bpage->page = page_address(page);
> > rb_init_page(bpage->page);
> > }
> > clear_current_oom_origin();
On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> I originally was going to remove the RETRY_MAYFAIL, but adding this
> check (at the end of the loop though) appears to have OOM consistently
> kill this task.
>
> I still like to keep RETRY_MAYFAIL, because it wont trigger OOM if
> nothing comes in and tries to do an allocation, but instead will fail
> nicely with -ENOMEM.
I still don't get why you want RETRY_MAYFAIL. You know that tries
*harder* to allocate memory than plain GFP_KERNEL does, right? And
that seems like the exact opposite of what you want.
On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
>> I originally was going to remove the RETRY_MAYFAIL, but adding this
>> check (at the end of the loop though) appears to have OOM consistently
>> kill this task.
>>
>> I still like to keep RETRY_MAYFAIL, because it wont trigger OOM if
>> nothing comes in and tries to do an allocation, but instead will fail
>> nicely with -ENOMEM.
>
> I still don't get why you want RETRY_MAYFAIL. You know that tries
> *harder* to allocate memory than plain GFP_KERNEL does, right? And
> that seems like the exact opposite of what you want.
No. We do want it to try harder, but not if it's already set up for failure.
On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox <[email protected]> wrote:
> > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> >> check (at the end of the loop though) appears to have OOM consistently
> >> kill this task.
> >>
> >> I still like to keep RETRY_MAYFAIL, because it wont trigger OOM if
> >> nothing comes in and tries to do an allocation, but instead will fail
> >> nicely with -ENOMEM.
> >
> > I still don't get why you want RETRY_MAYFAIL. You know that tries
> > *harder* to allocate memory than plain GFP_KERNEL does, right? And
> > that seems like the exact opposite of what you want.
>
> No. We do want it to try harder but not if its already setup for failure.
I understand you don't want GFP_NORETRY. But why is it more important for
this allocation to succeed than other normal GFP_KERNEL allocations?
On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox <[email protected]> wrote:
> > > On Wed, Apr 04, 2018 at 11:47:30AM -0400, Steven Rostedt wrote:
> > >> I originally was going to remove the RETRY_MAYFAIL, but adding this
> > >> check (at the end of the loop though) appears to have OOM consistently
> > >> kill this task.
> > >>
> > >> I still like to keep RETRY_MAYFAIL, because it wont trigger OOM if
> > >> nothing comes in and tries to do an allocation, but instead will fail
> > >> nicely with -ENOMEM.
> > >
> > > I still don't get why you want RETRY_MAYFAIL. You know that tries
> > > *harder* to allocate memory than plain GFP_KERNEL does, right? And
> > > that seems like the exact opposite of what you want.
> >
> > No. We do want it to try harder but not if its already setup for failure.
>
> I understand you don't want GFP_NORETRY. But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?
I guess they simply want a failure rather than an OOM even when they can
shoot themselves in the head by using oom_origin. It is still quite ugly
to see OOM report...
--
Michal Hocko
SUSE Labs
On Thu, 5 Apr 2018 07:22:58 -0700
Matthew Wilcox <[email protected]> wrote:
> I understand you don't want GFP_NORETRY. But why is it more important for
> this allocation to succeed than other normal GFP_KERNEL allocations?
Not sure what you mean by "more important"? Does saying "RETRY_MAYFAIL"
make it more important? The difference is, if the GFP_KERNEL allocation fails,
we don't want to trigger an OOM; we simply clean up and report -ENOMEM to
the user. It has nothing to do with being more important than other
allocations.
If there's 100 Megs of memory available, and the user requests a gig of
memory, it's going to fail. Ideally, it doesn't trigger OOM, but
instead simply reports -ENOMEM to the user.
-- Steve
On Thu, 5 Apr 2018 16:27:49 +0200
Michal Hocko <[email protected]> wrote:
> > I understand you don't want GFP_NORETRY. But why is it more important for
> > this allocation to succeed than other normal GFP_KERNEL allocations?
>
> I guess they simply want a failure rather than OOM even when they can
> shoot themselves into head by using oom_origin. It is still quite ugly
> to see OOM report...
Exactly!
-- Steve
On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox <[email protected]> wrote:
> > > > I still don't get why you want RETRY_MAYFAIL. You know that tries
> > > > *harder* to allocate memory than plain GFP_KERNEL does, right? And
> > > > that seems like the exact opposite of what you want.
Argh. The comment confused me. OK, now I've read the source and
understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
as GFP_KERNEL *except* that it won't cause OOM itself. But any other
simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
cause an OOM. (And that's why we're having a conversation)
That's a problem because we have places in the kernel that call
kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
which will do the exact same thing, only it will trigger OOM all by itself
(assuming the largest free chunk of address space in the vmalloc area
is larger than the amount of free memory).
I considered an alloc_page_array(), but that doesn't fit well with the
design of the ring buffer code. We could have:
struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);
and link the allocated pages together through page->lru.
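One possible shape for such a helper, purely illustrative (this does not
exist; the signature is adapted so the caller supplies the list head
instead of getting the first page back):

	static int alloc_page_list_node(int nid, gfp_t gfp_mask,
					unsigned long nr, struct list_head *pages)
	{
		struct page *page, *tmp;
		unsigned long i;

		for (i = 0; i < nr; i++) {
			page = __alloc_pages_node(nid, gfp_mask, 0);
			if (!page)
				goto free;
			/* Chain the pages together through page->lru */
			list_add_tail(&page->lru, pages);
		}
		return 0;
	free:
		/* All or nothing: give everything on the list back on failure */
		list_for_each_entry_safe(page, tmp, pages, lru) {
			list_del(&page->lru);
			__free_page(page);
		}
		return -ENOMEM;
	}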
We could also have a GFP flag that says to only succeed if we're further
above the existing watermark than normal. __GFP_LOW (==ALLOC_LOW),
if you like. That would give us the desired behaviour of trying all of
the reclaim methods that GFP_KERNEL would, but not being able to exhaust
all the memory that GFP_KERNEL allocations would take.
On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 04:27:49PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 07:22:58, Matthew Wilcox wrote:
> > > On Wed, Apr 04, 2018 at 09:12:52PM -0700, Joel Fernandes wrote:
> > > > On Wed, Apr 4, 2018 at 7:58 PM, Matthew Wilcox <[email protected]> wrote:
> > > > > I still don't get why you want RETRY_MAYFAIL. You know that tries
> > > > > *harder* to allocate memory than plain GFP_KERNEL does, right? And
> > > > > that seems like the exact opposite of what you want.
>
> Argh. The comment confused me. OK, now I've read the source and
> understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> as GFP_KERNEL *except* that it won't cause OOM itself. But any other
> simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> cause an OOM. (And that's why we're having a conversation)
Well, I can understand how this can be confusing. All the confusion
boils down to the small-never-fails semantic we have. So all reclaim
modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY) only
modify the _default_ behavior.
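To make the small-never-fails default concrete, a quick illustration
(PAGE_ALLOC_COSTLY_ORDER is 3):

	struct page *small, *big;

	/* order-0, no modifiers: the allocator keeps retrying (and may
	 * OOM-kill) until it succeeds, so this rarely returns NULL */
	small = alloc_page(GFP_KERNEL);

	/* order > PAGE_ALLOC_COSTLY_ORDER: may return NULL without any
	 * modifier flags and without invoking the OOM killer */
	big = alloc_pages(GFP_KERNEL, 4);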
> That's a problem because we have places in the kernel that call
> kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> which will do the exact same thing, only it will trigger OOM all by itself
> (assuming the largest free chunk of address space in the vmalloc area
> is larger than the amount of free memory).
well, hardcoded GFP_KERNEL from vmalloc guts is yet another, ehm,
heritage that you are not so proud of.
> I considered an alloc_page_array(), but that doesn't fit well with the
> design of the ring buffer code. We could have:
>
> struct page *alloc_page_list_node(int nid, gfp_t gfp_mask, unsigned long nr);
>
> and link the allocated pages together through page->lru.
>
> We could also have a GFP flag that says to only succeed if we're further
> above the existing watermark than normal. __GFP_LOW (==ALLOC_LOW),
> if you like. That would give us the desired behaviour of trying all of
> the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> all the memory that GFP_KERNEL allocations would take.
Well, I would be really careful with yet another gfp mask. They are so
incredibly hard to define properly and then people kinda tend to screw
your best intentions with their usecases ;)
Failing on low wmark is very close to __GFP_NORETRY or even
__GFP_NOWAIT, btw. So let's try to not overthink this...
--
Michal Hocko
SUSE Labs
On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > Argh. The comment confused me. OK, now I've read the source and
> > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > as GFP_KERNEL *except* that it won't cause OOM itself. But any other
> > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > cause an OOM. (And that's why we're having a conversation)
>
> Well, I can understand how this can be confusing. All the confusion
> boils down to the small-never-fails semantic we have. So all reclaim
> modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY) only
> modify the _default_ behavior.
Now that I understand the flag, I'll try to write a more clear
explanation.
> > That's a problem because we have places in the kernel that call
> > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > which will do the exact same thing, only it will trigger OOM all by itself
> > (assuming the largest free chunk of address space in the vmalloc area
> > is larger than the amount of free memory).
>
> well, hardcoded GFP_KERNEL from vmalloc guts is yet another, ehm,
> heritage that you are not so proud of.
Certainly not, but that's not what I'm concerned about; I'm concerned
about the allocation of the pages, not the allocation of the array
containing the page pointers.
> > We could also have a GFP flag that says to only succeed if we're further
> > above the existing watermark than normal. __GFP_LOW (==ALLOC_LOW),
> > if you like. That would give us the desired behaviour of trying all of
> > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > all the memory that GFP_KERNEL allocations would take.
>
> Well, I would be really careful with yet another gfp mask. They are so
> incredibly hard to define properly and then people kinda tend to screw
> your best intentions with their usecases ;)
> Failing on low wmark is very close to __GFP_NORETRY or even
> __GFP_NOWAIT, btw. So let's try to not overthink this...
Oh, indeed. We must be able to clearly communicate to users when they
should use this flag. I have in mind something like this:
* __GFP_HIGH indicates that the caller is high-priority and that granting
* the request is necessary before the system can make forward progress.
* For example, creating an IO context to clean pages.
*
* __GFP_LOW indicates that the caller is low-priority and that it should
* not be allocated pages that would cause the system to get into an
* out-of-memory situation. For example, allocating multiple individual
* pages in order to satisfy a larger request.
I think this should actually replace __GFP_RETRY_MAYFAIL. It makes sense
to a user: "This is a low priority GFP_KERNEL allocation". I doubt there's
one kernel hacker in a hundred who could explain what GFP_RETRY_MAYFAIL
does, exactly, and I'm not just saying that because I got it wrong ;-)
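For the record, hypothetical usage if such a flag existed (it does not
today; purely illustrative):

	struct page *page;

	/* low priority: reclaim as GFP_KERNEL would, but refuse to take
	 * the last pages above the watermark; fail instead */
	page = alloc_page(GFP_KERNEL | __GFP_LOW);
	if (!page)
		return -ENOMEM;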
On Thu 05-04-18 09:15:01, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 05:32:40PM +0200, Michal Hocko wrote:
> > On Thu 05-04-18 08:13:59, Matthew Wilcox wrote:
> > > Argh. The comment confused me. OK, now I've read the source and
> > > understand that GFP_KERNEL | __GFP_RETRY_MAYFAIL tries exactly as hard
> > > as GFP_KERNEL *except* that it won't cause OOM itself. But any other
> > > simultaneous GFP_KERNEL allocation without __GFP_RETRY_MAYFAIL will
> > > cause an OOM. (And that's why we're having a conversation)
> >
> > Well, I can understand how this can be confusing. All the confusion
> > boils down to the small-never-fails semantic we have. So all reclaim
> > modifiers (__GFP_NOFAIL, __GFP_RETRY_MAYFAIL, __GFP_NORETRY) only
> > modify the _default_ behavior.
>
> Now that I understand the flag, I'll try to write a more clear
> explanation.
Good luck with that. It took me several iterations to land on the
current state. It is quite hard to be understandable without being misleading.
> > > That's a problem because we have places in the kernel that call
> > > kv[zm]alloc(very_large_size, GFP_KERNEL), and that will turn into vmalloc,
> > > which will do the exact same thing, only it will trigger OOM all by itself
> > > (assuming the largest free chunk of address space in the vmalloc area
> > > is larger than the amount of free memory).
> >
> > well, hardcoded GFP_KERNEL from vmalloc guts is yet another, ehm,
> > heritage that you are not so proud of.
>
> Certainly not, but that's not what I'm concerned about; I'm concerned
> about the allocation of the pages, not the allocation of the array
> containing the page pointers.
Those pages will use the gfp flag you give to vmalloc IIRC. It is page
tables that are GFP_KERNEL unconditionally.
> > > We could also have a GFP flag that says to only succeed if we're further
> > > above the existing watermark than normal. __GFP_LOW (==ALLOC_LOW),
> > > if you like. That would give us the desired behaviour of trying all of
> > > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > > all the memory that GFP_KERNEL allocations would take.
> >
> > Well, I would be really careful with yet another gfp mask. They are so
> > incredibly hard to define properly and then people kinda tend to screw
> > your best intentions with their usecases ;)
> > Failing on low wmark is very close to __GFP_NORETRY or even
> > __GFP_NOWAIT, btw. So let's try to not overthink this...
>
> Oh, indeed. We must be able to clearly communicate to users when they
> should use this flag. I have in mind something like this:
>
> * __GFP_HIGH indicates that the caller is high-priority and that granting
> * the request is necessary before the system can make forward progress.
> * For example, creating an IO context to clean pages.
> *
> * __GFP_LOW indicates that the caller is low-priority and that it should
> * not be allocated pages that would cause the system to get into an
> * out-of-memory situation. For example, allocating multiple individual
> * pages in order to satisfy a larger request.
So how exactly does LOW fit into GFP_NOWAIT, GFP_NORETRY and
GFP_RETRY_MAYFAIL? We _do_ have several levels of how hard to try, and we
have users relying on that. And do not forget about costly vs.
non-costly sizes.
That being said, we should not hijack this thread any more, and further
enhancements should be discussed separately. I am all for making the
whole allocation API less obscure, but keep in mind that we have really
hard historical restrictions.
--
Michal Hocko
SUSE Labs
On Thu, Apr 05, 2018 at 08:54:44PM +0200, Michal Hocko wrote:
> On Thu 05-04-18 09:15:01, Matthew Wilcox wrote:
> > > well, hardcoded GFP_KERNEL from vmalloc guts is yet another, ehm,
> > > heritage that you are not so proud of.
> >
> > Certainly not, but that's not what I'm concerned about; I'm concerned
> > about the allocation of the pages, not the allocation of the array
> > containing the page pointers.
>
> Those pages will use the gfp flag you give to vmalloc IIRC. It is page
> tables that are GFP_KERNEL unconditionally.
Right. But if I call vmalloc(1UL << 40, GFP_KERNEL) on a machine with
only half a terabyte of RAM, it will OOM in the exact same way that
Steven is reporting.
> > > > We could also have a GFP flag that says to only succeed if we're further
> > > > above the existing watermark than normal. __GFP_LOW (==ALLOC_LOW),
> > > > if you like. That would give us the desired behaviour of trying all of
> > > > the reclaim methods that GFP_KERNEL would, but not being able to exhaust
> > > > all the memory that GFP_KERNEL allocations would take.
> > >
> > > Well, I would be really careful with yet another gfp mask. They are so
> > > incredibly hard to define properly and then people kinda tend to screw
> > > your best intentions with their usecases ;)
> > > Failing on low wmark is very close to __GFP_NORETRY or even
> > > __GFP_NOWAIT, btw. So let's try to not overthink this...
> >
> > Oh, indeed. We must be able to clearly communicate to users when they
> > should use this flag. I have in mind something like this:
> >
> > * __GFP_HIGH indicates that the caller is high-priority and that granting
> > * the request is necessary before the system can make forward progress.
> > * For example, creating an IO context to clean pages.
> > *
> > * __GFP_LOW indicates that the caller is low-priority and that it should
> > * not be allocated pages that would cause the system to get into an
> > * out-of-memory situation. For example, allocating multiple individual
> > * pages in order to satisfy a larger request.
>
> So how exactly does LOW fit into GFP_NOWAIT, GFP_NORETRY and
> GFP_RETRY_MAYFAIL? We _do_ have several levels of how hard to try, and we
> have users relying on that. And do not forget about costly vs.
> non-costly sizes.
>
> That being said, we should not hijack this thread any more, and further
> enhancements should be discussed separately. I am all for making the
> whole allocation API less obscure, but keep in mind that we have really
> hard historical restrictions.
Dropping the non-mm participants ...
From a "user guide" perspective:
When allocating memory, you can choose:
- What kind of memory to allocate (DMA, NORMAL, HIGHMEM)
- Where to get the pages from
- Local node only (THISNODE)
- Only in compliance with cpuset policy (HARDWALL)
- Spread the pages between zones (WRITE)
- The movable zone (MOVABLE)
- The reclaimable zone (RECLAIMABLE)
- What you are willing to do if no free memory is available:
- Nothing at all (NOWAIT)
- Use my own time to free memory (DIRECT_RECLAIM)
- But only try once (NORETRY)
- Can call into filesystems (FS)
- Can start I/O (IO)
- Can sleep (!ATOMIC)
- Steal time from other processes to free memory (KSWAPD_RECLAIM)
- Kill other processes to get their memory (!RETRY_MAYFAIL)
- All of the above, and wait forever (NOFAIL)
- Take from emergency reserves (HIGH)
- ... but not the last parts of the regular reserves (LOW)
How does that look as an overview?
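As a worked example, the existing composites read naturally in that
vocabulary (definitions as in include/linux/gfp.h):

	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	/* may sleep, reclaim directly, wake kswapd, start I/O, call into
	 * filesystems, and ultimately OOM-kill to satisfy the request */

	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
	/* do nothing beyond waking kswapd; never sleeps */

	#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
	/* never sleeps, but may dip into the emergency reserves */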
On Thu 05-04-18 13:15:57, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 08:54:44PM +0200, Michal Hocko wrote:
[...]
> From a "user guide" perspective:
>
> When allocating memory, you can choose:
OK, we already split the documentation into these categories. So we got
at least the structure right ;)
> - What kind of memory to allocate (DMA, NORMAL, HIGHMEM)
> - Where to get the pages from
> - Local node only (THISNODE)
> - Only in compliance with cpuset policy (HARDWALL)
> - Spread the pages between zones (WRITE)
> - The movable zone (MOVABLE)
> - The reclaimable zone (RECLAIMABLE)
> - What you are willing to do if no free memory is available:
> - Nothing at all (NOWAIT)
> - Use my own time to free memory (DIRECT_RECLAIM)
> - But only try once (NORETRY)
> - Can call into filesystems (FS)
> - Can start I/O (IO)
> - Can sleep (!ATOMIC)
> - Steal time from other processes to free memory (KSWAPD_RECLAIM)
What does that mean? If I drop the flag, do not steal? Well I do because
they will hit direct reclaim sooner...
> - Kill other processes to get their memory (!RETRY_MAYFAIL)
Not really for costly orders.
> - All of the above, and wait forever (NOFAIL)
> - Take from emergency reserves (HIGH)
> - ... but not the last parts of the regular reserves (LOW)
What does that mean and how it is different from NOWAIT? Is this about
the low watermark and if yes do we want to teach users about this and
make the whole thing even more complicated? Does it wake
kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
RETRY_MAYFAIL, NOFAIL?
--
Michal Hocko
SUSE Labs
On Fri, Apr 06, 2018 at 08:09:53AM +0200, Michal Hocko wrote:
> OK, we already split the documentation into these categories. So we got
> at least the structure right ;)
Yes, this part of the documentation makes sense to me :-)
> > - What kind of memory to allocate (DMA, NORMAL, HIGHMEM)
> > - Where to get the pages from
> > - Local node only (THISNODE)
> > - Only in compliance with cpuset policy (HARDWALL)
> > - Spread the pages between zones (WRITE)
> > - The movable zone (MOVABLE)
> > - The reclaimable zone (RECLAIMABLE)
> > - What you are willing to do if no free memory is available:
> > - Nothing at all (NOWAIT)
> > - Use my own time to free memory (DIRECT_RECLAIM)
> > - But only try once (NORETRY)
> > - Can call into filesystems (FS)
> > - Can start I/O (IO)
> > - Can sleep (!ATOMIC)
> > - Steal time from other processes to free memory (KSWAPD_RECLAIM)
>
> What does that mean? If I drop the flag, do not steal? Well I do because
> they will hit direct reclaim sooner...
If they allocate memory, sure. A process which stays in its working
set won't, unless it's preempted by kswapd.
> > - Kill other processes to get their memory (!RETRY_MAYFAIL)
>
> Not really for costly orders.
Yes, need to be more precise there.
> > - All of the above, and wait forever (NOFAIL)
> > - Take from emergency reserves (HIGH)
> > - ... but not the last parts of the regular reserves (LOW)
>
> What does that mean and how it is different from NOWAIT? Is this about
> the low watermark and if yes do we want to teach users about this and
> make the whole thing even more complicated? Does it wake
> kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
> RETRY_MAYFAIL, NOFAIL?
LOW doesn't quite fit into the eagerness scale with the other flags;
instead it's composable with them. So you can specify NOWAIT | LOW,
NORETRY | LOW, NOFAIL | LOW, etc. All I have in mind is something
like this:
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
+ if (alloc_flags & ALLOC_LOW)
+ min += min / 2;
The idea is that a GFP_KERNEL | __GFP_LOW allocation cannot force a
GFP_KERNEL allocation into an OOM situation because it cannot take
the last pages of memory before the watermark. It can still make a
GFP_KERNEL allocation *more likely* to hit OOM (just like any other kind
of allocation can), but it can't do it by itself.
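For context, roughly where that lands in __zone_watermark_ok() (the
ALLOC_HIGH half exists today; ALLOC_LOW is the hypothetical part):

	long min = mark;	/* the watermark we were asked to meet */

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;		/* __GFP_HIGH: dig deeper into reserves */
	if (alloc_flags & ALLOC_LOW)
		min += min / 2;		/* __GFP_LOW: stop earlier, leave headroom */

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;		/* not enough headroom for this caller */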
---
I've been wondering about combining the DIRECT_RECLAIM, NORETRY,
RETRY_MAYFAIL and NOFAIL flags together into a single field:
0 => RECLAIM_NEVER, /* !DIRECT_RECLAIM */
1 => RECLAIM_ONCE, /* NORETRY */
2 => RECLAIM_PROGRESS, /* RETRY_MAYFAIL */
3 => RECLAIM_FOREVER, /* NOFAIL */
The existence of __GFP_RECLAIM makes this a bit tricky. I honestly don't
know what this code is asking for:
kernel/power/swap.c: __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
but I suspect I'll have to find out. There's about 60 places to look at.
I also want to add __GFP_KILL (to be part of the GFP_KERNEL definition).
That way, each bit that you set in the GFP mask increases the things the
page allocator can do to get memory for you. At the moment, RETRY_MAYFAIL
subtracts the ability to kill other tasks, which is unusual. For example,
this test in kvmalloc_node:
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
doesn't catch RETRY_MAYFAIL being set.
On Sat 07-04-18 21:27:09, Matthew Wilcox wrote:
> On Fri, Apr 06, 2018 at 08:09:53AM +0200, Michal Hocko wrote:
> > OK, we already split the documentation into these categories. So we got
> > at least the structure right ;)
>
> Yes, this part of the documentation makes sense to me :-)
>
> > > - What kind of memory to allocate (DMA, NORMAL, HIGHMEM)
> > > - Where to get the pages from
> > > - Local node only (THISNODE)
> > > - Only in compliance with cpuset policy (HARDWALL)
> > > - Spread the pages between zones (WRITE)
> > > - The movable zone (MOVABLE)
> > > - The reclaimable zone (RECLAIMABLE)
> > > - What you are willing to do if no free memory is available:
> > > - Nothing at all (NOWAIT)
> > > - Use my own time to free memory (DIRECT_RECLAIM)
> > > - But only try once (NORETRY)
> > > - Can call into filesystems (FS)
> > > - Can start I/O (IO)
> > > - Can sleep (!ATOMIC)
> > > - Steal time from other processes to free memory (KSWAPD_RECLAIM)
> >
> > What does that mean? If I drop the flag, do not steal? Well I do because
> > they will hit direct reclaim sooner...
>
> If they allocate memory, sure. A process which stays in its working
> set won't, unless it's preempted by kswapd.
Well, I was probably not clear here. KSWAPD_RECLAIM is not something you
want to drop, because this is a cooperative flag. If you do not use it
then you are effectively pushing others into direct reclaim, because
kswapd won't be woken up and won't do the background work. Your
wording makes it sound like a good thing to drop.
> > > - Kill other processes to get their memory (!RETRY_MAYFAIL)
> >
> > Not really for costly orders.
>
> Yes, need to be more precise there.
>
> > > - All of the above, and wait forever (NOFAIL)
> > > - Take from emergency reserves (HIGH)
> > > - ... but not the last parts of the regular reserves (LOW)
> >
> > What does that mean and how it is different from NOWAIT? Is this about
> > the low watermark and if yes do we want to teach users about this and
> > make the whole thing even more complicated? Does it wake
> > kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
> > RETRY_MAYFAIL, NOFAIL?
>
> LOW doesn't quite fit into the eagerness scale with the other flags;
> instead it's composable with them. So you can specify NOWAIT | LOW,
> NORETRY | LOW, NOFAIL | LOW, etc. All I have in mind is something
> like this:
>
> if (alloc_flags & ALLOC_HIGH)
> min -= min / 2;
> + if (alloc_flags & ALLOC_LOW)
> + min += min / 2;
>
> The idea is that a GFP_KERNEL | __GFP_LOW allocation cannot force a
> GFP_KERNEL allocation into an OOM situation because it cannot take
> the last pages of memory before the watermark.
So what are we going to do if the LOW watermark cannot succeed?
> It can still make a
> GFP_KERNEL allocation *more likely* to hit OOM (just like any other kind
> of allocation can), but it can't do it by itself.
So who would be a user of __GFP_LOW?
> ---
>
> I've been wondering about combining the DIRECT_RECLAIM, NORETRY,
> RETRY_MAYFAIL and NOFAIL flags together into a single field:
> 0 => RECLAIM_NEVER, /* !DIRECT_RECLAIM */
> 1 => RECLAIM_ONCE, /* NORETRY */
> 2 => RECLAIM_PROGRESS, /* RETRY_MAYFAIL */
> 3 => RECLAIM_FOREVER, /* NOFAIL */
>
> The existence of __GFP_RECLAIM makes this a bit tricky. I honestly don't
> know what this code is asking for:
I am not sure I follow here. Is the RECLAIM_ an internal thing to the
allocator?
> kernel/power/swap.c: __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
> but I suspect I'll have to find out. There's about 60 places to look at.
Well, it would be more understandable if this was written as
(GFP_KERNEL | __GFP_HIGH) & ~(__GFP_FS|__GFP_IO)
> I also want to add __GFP_KILL (to be part of the GFP_KERNEL definition).
What does __GFP_KILL means?
> That way, each bit that you set in the GFP mask increases the things the
> page allocator can do to get memory for you. At the moment, RETRY_MAYFAIL
> subtracts the ability to kill other tasks, which is unusual.
Well, it is not all that great because some flags add capabilities while
some remove them but, well, life is hard when you try to extend an
interface which was not all that great from the very beginning.
> For example,
> this test in kvmalloc_node:
>
> WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>
> doesn't catch RETRY_MAYFAIL being set.
It doesn't really have to. We want to catch obviously broken gfp flags
here. That means mostly GFP_NO{FS,IO} because those might simply
deadlock. RETRY_MAYFAIL is even supported to some extent.
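To spell it out with a quick illustration (not actual kernel code):

	gfp_t flags = GFP_KERNEL | __GFP_RETRY_MAYFAIL;

	/* RETRY_MAYFAIL only adds a bit, so every GFP_KERNEL bit is still
	 * present and the check stays silent: */
	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);	/* no warning */

	/* GFP_NOFS clears __GFP_FS, which is what the check is there for: */
	WARN_ON_ONCE((GFP_NOFS & GFP_KERNEL) != GFP_KERNEL);	/* warns */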
--
Michal Hocko
SUSE Labs
On Mon, Apr 09, 2018 at 09:34:07AM +0200, Michal Hocko wrote:
> On Sat 07-04-18 21:27:09, Matthew Wilcox wrote:
> > > > - Steal time from other processes to free memory (KSWAPD_RECLAIM)
> > >
> > > What does that mean? If I drop the flag, do not steal? Well I do because
> > > they will hit direct reclaim sooner...
> >
> > If they allocate memory, sure. A process which stays in its working
> > set won't, unless it's preempted by kswapd.
>
> Well, I was probably not clear here. KSWAPD_RECLAIM is not something you
> want to drop, because this is a cooperative flag. If you do not use it
> then you are effectively pushing others into direct reclaim, because
> kswapd won't be woken up and won't do the background work. Your
> wording makes it sound like a good thing to drop.
If memory is low, *somebody* has to reclaim. As I understand it, kswapd
was originally introduced because networking might do many allocations
from interrupt context, and so was unable to do its own reclaiming. On a
machine which was used only for routing, there was no userspace process to
do the reclaiming, so it ran out of memory. But if you're an HPC person
who's expecting their long-running tasks to be synchronised and not be
unnecessarily disturbed, having kswapd preempting your task is awful.
I'm not arguing in favour of removing kswapd or anything like that,
but if you're not willing/able to reclaim memory yourself, then you're
necessarily stealing time from other tasks in order to have reclaim
happen.
> > > What does that mean and how it is different from NOWAIT? Is this about
> > > the low watermark and if yes do we want to teach users about this and
> > > make the whole thing even more complicated? Does it wake
> > > kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
> > > RETRY_MAYFAIL, NOFAIL?
> >
> > LOW doesn't quite fit into the eagerness scale with the other flags;
> > instead it's composable with them. So you can specify NOWAIT | LOW,
> > NORETRY | LOW, NOFAIL | LOW, etc. All I have in mind is something
> > like this:
> >
> > if (alloc_flags & ALLOC_HIGH)
> > min -= min / 2;
> > + if (alloc_flags & ALLOC_LOW)
> > + min += min / 2;
> >
> > The idea is that a GFP_KERNEL | __GFP_LOW allocation cannot force a
> > GFP_KERNEL allocation into an OOM situation because it cannot take
> > the last pages of memory before the watermark.
>
> So what are we going to do if the LOW watermark cannot succeed?
Depends on the other flags. GFP_NOWAIT | GFP_LOW will just return NULL
(somewhat more readily than a plain GFP_NOWAIT would). GFP_NORETRY |
GFP_LOW will do one pass through reclaim. If it gets enough pages
to drag the zone above the watermark, then it'll succeed, otherwise
return NULL. NOFAIL | LOW will keep retrying forever. GFP_KERNEL |
GFP_LOW ... hmm, that'll OOM-kill another process more eagerly that
a regular GFP_KERNEL allocation would. We'll need a little tweak so
GFP_LOW implies __GFP_RETRY_MAYFAIL.
> > It can still make a
> > GFP_KERNEL allocation *more likely* to hit OOM (just like any other kind
> > of allocation can), but it can't do it by itself.
>
> So who would be a user of __GFP_LOW?
vmalloc and Steven's ringbuffer. If I write a kernel module that tries
to vmalloc 1TB of space, it'll OOM-kill everything on the machine trying
to get enough memory to fill the page array. Probably everyone using
__GFP_RETRY_MAYFAIL today, to be honest. It's more likely to accomplish
what they want -- trying slightly less hard to get memory than GFP_KERNEL
allocations would.
> > I've been wondering about combining the DIRECT_RECLAIM, NORETRY,
> > RETRY_MAYFAIL and NOFAIL flags together into a single field:
> > 0 => RECLAIM_NEVER, /* !DIRECT_RECLAIM */
> > 1 => RECLAIM_ONCE, /* NORETRY */
> > 2 => RECLAIM_PROGRESS, /* RETRY_MAYFAIL */
> > 3 => RECLAIM_FOREVER, /* NOFAIL */
> >
> > The existence of __GFP_RECLAIM makes this a bit tricky. I honestly don't
> > know what this code is asking for:
>
> I am not sure I follow here. Is the RECLAIM_ an internal thing to the
> allocator?
No, I'm talking about changing the __GFP flags like this:
@@ -24,10 +24,8 @@ struct vm_area_struct;
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
+#define ___GFP_ACCOUNT 0x100u
#define ___GFP_NOWARN 0x200u
-#define ___GFP_RETRY_MAYFAIL 0x400u
-#define ___GFP_NOFAIL 0x800u
-#define ___GFP_NORETRY 0x1000u
#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
@@ -35,8 +33,10 @@ struct vm_area_struct;
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_ATOMIC 0x80000u
-#define ___GFP_ACCOUNT 0x100000u
-#define ___GFP_DIRECT_RECLAIM 0x400000u
+#define ___GFP_RECLAIM_NEVER 0x00000u
+#define ___GFP_RECLAIM_ONCE 0x10000u
+#define ___GFP_RECLAIM_PROGRESS 0x20000u
+#define ___GFP_RECLAIM_FOREVER 0x30000u
#define ___GFP_WRITE 0x800000u
#define ___GFP_KSWAPD_RECLAIM 0x1000000u
#ifdef CONFIG_LOCKDEP
> > kernel/power/swap.c: __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
> > but I suspect I'll have to find out. There's about 60 places to look at.
>
> Well, it would be more understandable if this was written as
> (GFP_KERNEL | __GFP_HIGH) & ~(__GFP_FS|__GFP_IO)
Yeah, I think it's really (GFP_NOIO | __GFP_HIGH)
> > I also want to add __GFP_KILL (to be part of the GFP_KERNEL definition).
>
> What does __GFP_KILL means?
Allows OOM killing. So it's the inverse of the GFP_RETRY_MAYFAIL bit.
> > That way, each bit that you set in the GFP mask increases the things the
> > page allocator can do to get memory for you. At the moment, RETRY_MAYFAIL
> > subtracts the ability to kill other tasks, which is unusual.
>
> Well, it is not all that great because some flags add capabilities while
> some remove them but, well, life is hard when you try to extend an
> interface which was not all that great from the very beginning.
That's the story of Linux ;-)
On Mon 09-04-18 08:51:57, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 09:34:07AM +0200, Michal Hocko wrote:
> > On Sat 07-04-18 21:27:09, Matthew Wilcox wrote:
> > > > > - Steal time from other processes to free memory (KSWAPD_RECLAIM)
> > > >
> > > > What does that mean? If I drop the flag, do not steal? Well I do because
> > > > they will hit direct reclaim sooner...
> > >
> > > If they allocate memory, sure. A process which stays in its working
> > > set won't, unless it's preempted by kswapd.
> >
> > Well, I was probably not clear here. KSWAPD_RECLAIM is not something you
> > want to drop, because this is a cooperative flag. If you do not use it
> > then you are effectively pushing others into direct reclaim, because
> > kswapd won't be woken up and won't do the background work. Your
> > wording makes it sound like a good thing to drop.
>
> If memory is low, *somebody* has to reclaim. As I understand it, kswapd
> was originally introduced because networking might do many allocations
> from interrupt context, and so was unable to do its own reclaiming. On a
> machine which was used only for routing, there was no userspace process to
> do the reclaiming, so it ran out of memory. But if you're an HPC person
> who's expecting their long-running tasks to be synchronised and not be
> unnecessarily disturbed, having kswapd preempting your task is awful.
>
> I'm not arguing in favour of removing kswapd or anything like that,
> but if you're not willing/able to reclaim memory yourself, then you're
> necessarily stealing time from other tasks in order to have reclaim
> happen.
Sure, you are right, except that if you do not wake kswapd then other
allocators will pay when they hit the watermark, and that will happen sooner
or later with our caching semantics. The more users we have who drop
KSWAPD_RECLAIM, the more visible this will be. That is all I want
to say.
> > > > What does that mean and how it is different from NOWAIT? Is this about
> > > > the low watermark and if yes do we want to teach users about this and
> > > > make the whole thing even more complicated? Does it wake
> > > > kswapd? What is the eagerness ordering? LOW, NOWAIT, NORETRY,
> > > > RETRY_MAYFAIL, NOFAIL?
> > >
> > > LOW doesn't quite fit into the eagerness scale with the other flags;
> > > instead it's composable with them. So you can specify NOWAIT | LOW,
> > > NORETRY | LOW, NOFAIL | LOW, etc. All I have in mind is something
> > > like this:
> > >
> > > if (alloc_flags & ALLOC_HIGH)
> > > min -= min / 2;
> > > + if (alloc_flags & ALLOC_LOW)
> > > + min += min / 2;
> > >
> > > The idea is that a GFP_KERNEL | __GFP_LOW allocation cannot force a
> > > GFP_KERNEL allocation into an OOM situation because it cannot take
> > > the last pages of memory before the watermark.
> >
> > So what are we going to do if the LOW watermark cannot succeed?
>
> Depends on the other flags. GFP_NOWAIT | GFP_LOW will just return NULL
> (somewhat more readily than a plain GFP_NOWAIT would). GFP_NORETRY |
> GFP_LOW will do one pass through reclaim. If it gets enough pages
> to drag the zone above the watermark, then it'll succeed, otherwise
> return NULL. NOFAIL | LOW will keep retrying forever. GFP_KERNEL |
> GFP_LOW ... hmm, that'll OOM-kill another process more eagerly than
s@more eagerly@less eagerly@? And what does that mean actually?
> a regular GFP_KERNEL allocation would. We'll need a little tweak so
> GFP_LOW implies __GFP_RETRY_MAYFAIL.
I am not convinced we really need to make the current code, which
is already dreadfully complicated, even more complicated. So color me
unconvinced; I do not think we absolutely need yet another
flag with nontrivial semantics. E.g. GFP_KERNEL | __GFP_LOW sounds
quite subtle to me.
> > > It can still make a
> > > GFP_KERNEL allocation *more likely* to hit OOM (just like any other kind
> > > of allocation can), but it can't do it by itself.
> >
> > So who would be a user of __GFP_LOW?
>
> vmalloc and Steven's ringbuffer. If I write a kernel module that tries
> to vmalloc 1TB of space, it'll OOM-kill everything on the machine trying
> to get enough memory to fill the page array.
Yes, we assume certain etiquette and sanity from the code running in
ring 0. I am not really sure we want to make life any easier for code that
thinks "let's allocate a bunch of memory just in case". Sure,
there are places which want to allocate optimistically with some
reasonable fallback, but we _do_ have flags for those.
> Probably everyone using
> __GFP_RETRY_MAYFAIL today, to be honest. It's more likely to accomplish
> what they want -- trying slightly less hard to get memory than GFP_KERNEL
> allocations would.
The main point of __GFP_RETRY_MAYFAIL was to have a consistent failure
semantic regardless of the request size. __GFP_REPEAT was unusable for
that purpose because it was costly order only.
> > > I've been wondering about combining the DIRECT_RECLAIM, NORETRY,
> > > RETRY_MAYFAIL and NOFAIL flags together into a single field:
> > > 0 => RECLAIM_NEVER, /* !DIRECT_RECLAIM */
> > > 1 => RECLAIM_ONCE, /* NORETRY */
> > > 2 => RECLAIM_PROGRESS, /* RETRY_MAYFAIL */
> > > 3 => RECLAIM_FOREVER, /* NOFAIL */
> > >
> > > The existence of __GFP_RECLAIM makes this a bit tricky. I honestly don't
> > > know what this code is asking for:
> >
> > I am not sure I follow here. Is the RECLAIM_ an internal thing to the
> > allocator?
>
> No, I'm talking about changing the __GFP flags like this:
OK. I was just confused because we have ALLOC_* for internal use, so I
thought you wanted to introduce another concept of RECLAIM_*.
Anyway...
>
> @@ -24,10 +24,8 @@ struct vm_area_struct;
> #define ___GFP_HIGH 0x20u
> #define ___GFP_IO 0x40u
> #define ___GFP_FS 0x80u
> +#define ___GFP_ACCOUNT 0x100u
> #define ___GFP_NOWARN 0x200u
> -#define ___GFP_RETRY_MAYFAIL 0x400u
> -#define ___GFP_NOFAIL 0x800u
> -#define ___GFP_NORETRY 0x1000u
> #define ___GFP_MEMALLOC 0x2000u
> #define ___GFP_COMP 0x4000u
> #define ___GFP_ZERO 0x8000u
> @@ -35,8 +33,10 @@ struct vm_area_struct;
> #define ___GFP_HARDWALL 0x20000u
> #define ___GFP_THISNODE 0x40000u
> #define ___GFP_ATOMIC 0x80000u
> -#define ___GFP_ACCOUNT 0x100000u
> -#define ___GFP_DIRECT_RECLAIM 0x400000u
> +#define ___GFP_RECLAIM_NEVER 0x00000u
> +#define ___GFP_RECLAIM_ONCE 0x10000u
> +#define ___GFP_RECLAIM_PROGRESS 0x20000u
> +#define ___GFP_RECLAIM_FOREVER 0x30000u
> #define ___GFP_WRITE 0x800000u
> #define ___GFP_KSWAPD_RECLAIM 0x1000000u
> #ifdef CONFIG_LOCKDEP
... I really do not care about names much. It would be a bit of a pain to do
the flag day and all the renaming, but maybe we do not have hundreds of
those. I haven't checked. But I do not see a good reason to do that, TBH.
I am pretty sure that you will get 10 different opinions when you put
3-5 people in a room and let them discuss for more than 10 minutes.
> > > kernel/power/swap.c: __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
> > > but I suspect I'll have to find out. There's about 60 places to look at.
> >
> > Well, it would be more understandable if this was written as
> > (GFP_KERNEL | __GFP_HIGH) & ~(__GFP_FS|__GFP_IO)
>
> Yeah, I think it's really (GFP_NOIO | __GFP_HIGH)
>
> > > I also want to add __GFP_KILL (to be part of the GFP_KERNEL definition).
> >
> > What does __GFP_KILL means?
>
> Allows OOM killing. So it's the inverse of the GFP_RETRY_MAYFAIL bit.
We have GFP_NORETRY for that, because we absolutely do not want to give
users any control over the OOM killer behavior. This is, and should be, an
implementation detail of the MM implementation. GFP_NORETRY, on the other
hand, is not about the OOM killer. It is about giving up early.
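The typical GFP_NORETRY pattern, for comparison (illustrative only):

	struct page *page;
	unsigned int order = 4;	/* a larger buffer we would like but do not need */

	/* opportunistic high-order attempt: give up early, stay quiet,
	 * and never OOM-kill on its behalf */
	page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, order);
	if (!page)
		page = alloc_page(GFP_KERNEL);	/* modest order-0 fallback */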
--
Michal Hocko
SUSE Labs
On Tue, Apr 10, 2018 at 03:12:37PM +0300, Дмитрий Леонтьев wrote:
> First, I've noticed the network drivers were allocating memory in interrupt
> handlers. That sounds strange to me, because as far as I know, this
> behaviour is discouraged and may lead to a DDoS attack.
Linux supports allocating memory in interrupt context. We also support
allocating memory while holding locks.
Doing it any other way would require the network stack to preallocate
all of the memory it's going to use. You can pop over to the netdev
mailing list and ask them to stop this behaviour, but I don't think
they'll be very sympathetic.
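Which is why such paths allocate atomically and treat failure as a
dropped packet rather than something to reclaim for. A sketch of the
usual shape (example_rx() is made up for this mail):

	static void example_rx(struct net_device *dev, unsigned int len)
	{
		struct sk_buff *skb;

		/* interrupt context: cannot sleep, cannot reclaim, so take
		 * what is free (plus the emergency reserves) or give up */
		skb = alloc_skb(len, GFP_ATOMIC);
		if (!skb) {
			dev->stats.rx_dropped++;	/* drop and move on */
			return;
		}
		/* ... fill the skb and hand it to the stack ... */
	}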