Date: Fri, 30 Mar 2018 15:10:37 -0400
From: Steven Rostedt
To: Joel Fernandes
Cc: Zhaoyang Huang, Ingo Molnar, LKML, kernel-patch-test@lists.linaro.org,
    Andrew Morton, Michal Hocko, "open list:MEMORY MANAGEMENT",
    Vlastimil Babka, Michal Hocko
Subject: Re: [PATCH v1] kernel/trace:check the val against the available mem
Message-ID: <20180330151037.30d2ac6d@gandalf.local.home>
References: <1522320104-6573-1-git-send-email-zhaoyang.huang@spreadtrum.com>
    <20180330102038.2378925b@gandalf.local.home>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 30 Mar 2018 09:37:58 -0700
Joel Fernandes wrote:

> > That said, it appears you are having issues that were caused by the
> > change by commit 848618857d2 ("tracing/ring_buffer: Try harder to
> > allocate"), where we replaced NORETRY with RETRY_MAYFAIL. The point of
> > NORETRY was to keep allocations of the tracing ring-buffer from causing
> > OOMs. But the RETRY was too strong in that case, because there were
>
> Yes this was discussed with -mm folks. Basically the problem we were
> seeing is devices with tonnes of free memory (but free as in free but
> used by page cache) were not being used so it was unnecessarily
> failing to allocate ring buffer on the system with otherwise lots of
> memory.

Right.

>
> IIRC, the OOM that my patch was trying to avoid, was being triggered
> in the path/context of the write to buffer_size_kb itself (when not
> doing the NORETRY), not by other processes.

Yes, that is correct.

>
> > Perhaps this is because the ring buffer allocates one page at a time,
> > and by doing so, it can get every last available page, and if anything
> > in the mean time does an allocation without MAYFAIL, it will cause an
> > OOM. For example, when I stressed this I triggered this:
> >
> > pool invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
> > pool cpuset=/ mems_allowed=0
> > CPU: 7 PID: 1040 Comm: pool Not tainted 4.16.0-rc4-test+ #663
> > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> > Call Trace:
> >  dump_stack+0x8e/0xce
> >  dump_header.isra.30+0x6e/0x28f
> >  ? _raw_spin_unlock_irqrestore+0x30/0x60
> >  oom_kill_process+0x218/0x400
> >  ? has_capability_noaudit+0x17/0x20
> >  out_of_memory+0xe3/0x5c0
> >  __alloc_pages_slowpath+0xa8e/0xe50
> >  __alloc_pages_nodemask+0x206/0x220
> >  alloc_pages_current+0x6a/0xe0
> >  __page_cache_alloc+0x6a/0xa0
> >  filemap_fault+0x208/0x5f0
> >  ? __might_sleep+0x4a/0x80
> >  ext4_filemap_fault+0x31/0x44
> >  __do_fault+0x20/0xd0
> >  __handle_mm_fault+0xc08/0x1160
> >  handle_mm_fault+0x76/0x110
> >  __do_page_fault+0x299/0x580
> >  do_page_fault+0x2d/0x110
> >  ? page_fault+0x2f/0x50
> >  page_fault+0x45/0x50
>
> But this OOM is not in the path of the buffer_size_kb write, right? So
> then what does it have to do with buffer_size_kb write failure?

Yep. I'll explain below.

>
> I guess the original issue reported is that the buffer_size_kb write
> causes *other* applications to fail allocation. So in that case,
> capping the amount that ftrace writes makes sense. Basically my point
> is I don't see how the patch you mentioned introduces the problem here
> - in the sense the patch just makes ftrace allocate from memory it
> couldn't before and to try harder.

The issue is that ftrace allocates its ring buffer one page at a time.
Thus, when a RETRY_MAYFAIL succeeds, that memory is allocated. Since it
does it one page at a time, even if ftrace does not get all the memory
it needs at the end, it will take all memory from the system before it
finds that out. Then, if something else (like the above splat) tries to
allocate anything, it will fail and trigger an OOM.

>
> >
> > I wonder if I should have the ring buffer allocate groups of pages, to
> > avoid this. Or try to allocate with NORETRY, one page at a time, and
> > when that fails, allocate groups of pages with RETRY_MAYFAIL, and that
> > may keep it from causing an OOM?
> >
>
> I don't see immediately how that can prevent an OOM in other
> applications here? If ftrace allocates lots of memory with
> RETRY_MAYFAIL, then we would still OOM in other applications if memory
> isn't available. Sorry if I missed something.

Here's the idea. Allocate one page at a time with NORETRY. If that
fails, then allocate larger amounts (higher order of pages) with
RETRY_MAYFAIL. Then if it can't get all the memory it needs, it won't
take up all memory in the system before it finds out that it can't have
any more.

Or perhaps the memory management system can provide a
get_available_mem() function that ftrace could call before it tries to
increase the ring buffer, so that it doesn't take up all the memory of
the system before it realizes that it can't get all the memory it wants.

The main problem I have with Zhaoyang's patch is that
get_available_mem() does not belong in the tracing code. It should be
something that the mm subsystem provides.

-- Steve
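
As a rough illustration of the difference discussed in this thread, here is a toy userspace model in Python. It is not kernel code and makes no claim about how the ring buffer or the page allocator is actually implemented. ToySystem, resize_page_at_a_time() and resize_with_precheck() are hypothetical names made up for this sketch, and the up-front comparison only stands in for the kind of query a get_available_mem()-style helper might answer. The model just tracks the lowest number of free pages seen while an oversized request is being serviced.

```python
class ToySystem:
    """Toy model of a machine with a fixed number of free pages (not kernel code)."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.min_free_seen = total_pages  # low-water mark observed so far

    def alloc_page(self):
        """Take one page; return False if none are left."""
        if self.free_pages == 0:
            return False
        self.free_pages -= 1
        self.min_free_seen = min(self.min_free_seen, self.free_pages)
        return True

    def free(self, count):
        """Give back `count` pages."""
        self.free_pages += count


def resize_page_at_a_time(system, wanted):
    """Grab pages one by one; on failure give everything back."""
    taken = 0
    while taken < wanted:
        if not system.alloc_page():
            # Roll back, but by now the system has already been drained.
            system.free(taken)
            return False
        taken += 1
    return True


def resize_with_precheck(system, wanted):
    """Refuse up front if the request exceeds what is currently free."""
    # Hypothetical stand-in for asking mm how much memory is available.
    if wanted > system.free_pages:
        return False
    return resize_page_at_a_time(system, wanted)


drained = ToySystem(total_pages=1000)
ok_drained = resize_page_at_a_time(drained, wanted=5000)

checked = ToySystem(total_pages=1000)
ok_checked = resize_with_precheck(checked, wanted=5000)

print(ok_drained, drained.min_free_seen)  # False 0
print(ok_checked, checked.min_free_seen)  # False 1000
```

Both requests fail, but only the page-at-a-time version drives free memory to zero on the way there. That window is where any other allocation in this model would find nothing left.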