Date: Thu, 27 Dec 2018 17:46:08 +0100
From: Michal Hocko
To: Konstantin Khorenko
Cc: Andrew Morton, Andrey Ryabinin, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", Luis Chamberlain, Kees Cook
Subject: Re: [RFC PATCH 0/1] mm: add a warning about high order allocations
Message-ID: <20181227164608.GM16738@dhcp22.suse.cz>
References: <20181225153927.2873-1-khorenko@virtuozzo.com> <20181226083505.GF16738@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 27-12-18 15:18:54, Konstantin Khorenko wrote:
> Hi Michal,
>
> thank you very much for your questions, please see my notes below.
>
> On 12/26/2018 11:35 AM, Michal Hocko wrote:
> > On Tue 25-12-18 18:39:26, Konstantin Khorenko wrote:
> >> Q: Why do we need to bother at all?
> >> A: If a node is highly loaded and its memory is significantly fragmented
> >> (unfortunately almost any node with serious load has highly fragmented memory)
> >> then any high order memory allocation can trigger massive memory shrink and
> >> result in quite a big allocation latency. And the node becomes less responsive
> >> and users don't like it.
> >> The ultimate solution here is to get rid of large allocations, but we need an
> >> instrument to detect them.
> >
> > Can you point to an example of the problem you are referring here? At
> > least for costly orders we do bail out early and try to not cause
> > massive reclaim. So what is the order that you are concerned about?
>
> Well, this is the most difficult question to answer.
> Unfortunately i don't have a reproducer for that, usually we get into situation
> when someone experiences significant node slowdown, nodes most often have a lot of RAM,
> we check what is going on there and see the node is busy with reclaim.
> And almost every time the reason was - fragmented memory and high order allocations.
> Mostly of 2nd and 3rd (which is still considered not costly) order.
>
> Recent related issues we faced were about FUSE dev pipe:
> d6d931adce11 ("fuse: use kvmalloc to allocate array of pipe_buffer structs.")
>
> and about bnx driver + mtu 9000 which for each packet required page of 2nd order
> (and it even failed sometimes, though it was not the root cause):
> kswapd0: page allocation failure: order:2, mode:0x4020
> Call Trace:
> dump_stack+0x19/0x1b
> warn_alloc_failed+0x110/0x180
> __alloc_pages_nodemask+0x7bf/0xc60
> alloc_pages_current+0x98/0x110
> kmalloc_order+0x18/0x40
> kmalloc_order_trace+0x26/0xa0
> __kmalloc+0x279/0x290
> bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
> bnx2x_rx_int+0x227/0x17c0 [bnx2x]
> bnx2x_poll+0x1dd/0x260 [bnx2x]
> net_rx_action+0x179/0x390
> __do_softirq+0x10f/0x2aa
> call_softirq+0x1c/0x30
> do_softirq+0x65/0xa0
> irq_exit+0x105/0x110
> do_IRQ+0x56/0xe0
> common_interrupt+0x6d/0x6d
>
> And as both places were called very often - the system latency was high.
>
> This warning can be also used to catch allocation of 4th order and higher which may
> easily fail. Those places which are ready to get allocation errors and have
> fallbacks are marked with __GFP_NOWARN.

This is not true in general, though.

[...]
> But after it's done and there are no (almost) unmarked high order allocations -
> why not? This will reveal new cases of high order allocations soon.

There will always be legitimate high order allocations. I believe that
for your particular use case it is much better to simply enable the reclaim
and page allocator tracepoints, which will give you not only the source
of the allocation but also a much better picture.

> i think people who run systems with "kernel.panic_on_warn" enabled do care
> about reporting issues.

You surely do not want to put the system down just because of a high
order allocation though, right?

> >> Q: Why compile time config option?
> >> A: In order not to decrease the performance even a bit in case someone does not
> >> want to hunt for large allocations.
> >> In an ideal life i'd prefer this check/warning is enabled by default and may be
> >> even without a config option so it works on every node. Once we find and rework
> >> or mark all large allocations that would be good by default. Until that though
> >> it will be noisy.
> >
> > So who is going to enable this option?
>
> At the beginning - people who want to debug kernel and verify their fallbacks on
> memory allocations failures in the code or just speed up their code on nodes
> with fragmented memory - for 2nd and 3rd orders.
>
> mm performance issues are tough, you know, and this is just another way to
> gain more performance. It won't avoid the necessity of digging mm for sure,
> but might decrease the pressure level.

But the warning alone will not give us useful information, I am afraid.
It will only tell us that there are warnings, not whether those are
actually a problem. So I really believe that using the existing
tracepoints, or adding some to fill the missing gaps, will be much
better long term. And, as a bonus, we do not have to add another config
option and touch the code.
-- 
Michal Hocko
SUSE Labs
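
For illustration only, the tracepoint approach discussed above could be
post-processed with a small script like the sketch below. It assumes the
kmem:mm_page_alloc tracepoint has been enabled and that each event line in
the ftrace text output carries an "order=N" field; the exact line layout
differs between kernel versions, so the regular expression, the sample
lines, and the helper names (count_high_order, order_bytes) are hypothetical
and not part of the thread. The 4 KiB page size is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical sample shaped like ftrace text output for kmem:mm_page_alloc.
# Real lines vary between kernel versions; only "order=N" is relied upon.
SAMPLE_TRACE = """\
         kswapd0-45    [001] ....   100.000001: mm_page_alloc: page=00000000aaaa0001 pfn=4096 order=2 migratetype=0 gfp_flags=GFP_ATOMIC
            bash-900   [000] ....   100.000002: mm_page_alloc: page=00000000aaaa0002 pfn=4100 order=0 migratetype=1 gfp_flags=GFP_KERNEL
         kswapd0-45    [001] ....   100.000003: mm_page_alloc: page=00000000aaaa0003 pfn=4104 order=2 migratetype=0 gfp_flags=GFP_ATOMIC
           nginx-1200  [002] ....   100.000004: mm_page_alloc: page=00000000aaaa0004 pfn=4112 order=3 migratetype=0 gfp_flags=GFP_KERNEL
"""

# Assumed layout: "<comm>-<pid> [<cpu>] ... mm_page_alloc: ... order=<n> ..."
ALLOC_RE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[\d+\].*"
    r"mm_page_alloc:.*\border=(?P<order>\d+)"
)


def order_bytes(order, page_size=4096):
    """Size of an order-N allocation: 2**N contiguous pages (4 KiB pages assumed)."""
    return page_size << order


def count_high_order(lines, min_order=2):
    """Count mm_page_alloc events with order >= min_order, keyed by (comm, order)."""
    counts = Counter()
    for line in lines:
        m = ALLOC_RE.match(line)
        if m is None:
            continue  # not an mm_page_alloc event, or an unexpected layout
        order = int(m.group("order"))
        if order >= min_order:
            counts[(m.group("comm").strip(), order)] += 1
    return counts


# Demo on the hypothetical sample: one summary line per (task, order) pair.
for (comm, order), n in sorted(count_high_order(SAMPLE_TRACE.splitlines()).items()):
    print(f"{comm}: order {order} ({order_bytes(order)} bytes) x{n}")
```

Unlike a one-shot warning, a per-task, per-order tally of this kind shows
how often each high order allocation actually happens, which is the
information needed to judge whether it is a real problem.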