Date: Wed, 16 Dec 2015 15:35:13 -0800
From: Andrew Morton
To: Michal Hocko
Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
    Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, LKML
Subject: Re: [PATCH 0/3] OOM detection rework v4
Message-Id: <20151216153513.e432dc70e035e5d07984710c@linux-foundation.org>
In-Reply-To: <1450203586-10959-1-git-send-email-mhocko@kernel.org>

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress which is implemented at the page allocator layer rather spread
> out between the allocator and the reclaim. The more on the implementation
> is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well.  This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything).

It's not so easy in the case of oom-too-late-or-never.
The reporter's machine just hangs or it twiddles thumbs for five minutes
then goes oom.  But there are things we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
  start going wrong and turns on diagnostics (this would need an enable
  knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
  the diagnostics.

- sprinkle tracepoints everywhere and provide a set of
  instructions/scripts so that people who know nothing about kernel
  internals or tracing can easily gather the info we need to understand
  issues.

- add a sysrq key to turn on diagnostics.  Pretty essential when the
  machine is comatose and doesn't respond to keystrokes.

- something else

So... please have a think about it?  What can we add in here to make it
as easy as possible for us (ie: you ;)) to get this code working well?
At this time, too much developer support code will be better than too
little.  We can take it out later on.
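
To make the "instructions/scripts" idea a bit more concrete, here is a
minimal user-space sketch of the sort of thing a reporter could run.  It
is not part of the patch series and uses no tracepoints; it only samples
/proc/vmstat and prints how the reclaim-related counters moved.  The
choice of counter prefixes is an assumption on my part (exact names vary
between kernel versions), and the function names are made up for
illustration.

```python
import time

# Assumption: these prefixes select the reclaim/OOM-related counters.
# Counter names differ between kernel versions, so match by prefix only.
PREFIXES = ("pgscan", "pgsteal", "allocstall", "oom_kill", "compact_")


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly 'name <non-negative integer>'.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def reclaim_delta(before, after, prefixes=PREFIXES):
    """Return the nonzero changes for counters whose name matches a prefix."""
    delta = {}
    for name, value in after.items():
        if name.startswith(prefixes):
            change = value - before.get(name, 0)
            if change:
                delta[name] = change
    return delta


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


def main(interval=1.0, samples=5):
    """Print one line of counter deltas per sampling interval."""
    prev = read_vmstat()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_vmstat()
        changes = reclaim_delta(prev, cur)
        summary = " ".join(f"{k}={v:+d}" for k, v in sorted(changes.items()))
        print(time.strftime("%H:%M:%S"), summary or "(no change)")
        prev = cur
```

Calling main() on the affected machine while the problem is developing
gives a timestamped log that can be pasted into a bug report.  It does
not help once the box is fully comatose, which is where the sysrq
suggestion above comes in.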