Subject: Re: [PATCH v6 0/7] fs/dcache: Track & limit # of negative dentries
To: Michal Hocko
Cc: Alexander Viro, Jonathan Corbet, "Luis R. Rodriguez", Kees Cook,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-doc@vger.kernel.org, Linus Torvalds,
    Jan Kara, "Paul E. McKenney", Andrew Morton, Ingo Molnar,
    Miklos Szeredi, Matthew Wilcox, Larry Woodman, James Bottomley,
    "Wangkai (Kevin C)"
References: <1530905572-817-1-git-send-email-longman@redhat.com>
    <20180709081920.GD22049@dhcp22.suse.cz>
    <62275711-e01d-7dbe-06f1-bf094b618195@redhat.com>
    <20180710142740.GQ14284@dhcp22.suse.cz>
From: Waiman Long
Organization: Red Hat
Date: Tue, 10 Jul 2018 12:09:17 -0400
In-Reply-To: <20180710142740.GQ14284@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

On 07/10/2018 10:27 AM, Michal Hocko wrote:
> On Mon 09-07-18 12:01:04, Waiman Long wrote:
>> On 07/09/2018 04:19 AM, Michal Hocko wrote:
> [...]
>>> later needs a special treatment while the first one is ok? There are
>>> quite some resources which allow a non privileged user to consume a lot
>>> of memory and the memory controller is the only reliable way to mitigate
>>> the risk.
>> Yes, memory controller is the only reliable way to mitigate the risk,
>> but not all tasks are under the control of a memory controller with
>> kernel memory limit.
> But those which you do not trust should. So why do we need yet another
> mechanism for the reclaim?

Sometimes it could be a programming error in the code.
I have seen a customer report where a bug in their code generated a
large number of negative dentries and caused problems. In such a
controlled environment, they may not want to run their applications
under a memory cgroup because of the overhead involved. So a mechanism
to highlight the problem and notify the administrator is probably good
to have.

>
> [...]
>>>> Patch 1 tracks the number of negative dentries present in the LRU
>>>> lists and reports it in /proc/sys/fs/dentry-state.
>>> If anything I _think_ vmstat would benefit from this because behavior of
>>> the memory reclaim does depend on the amount of neg. dentries.
>>>
>>>> Patch 2 adds a "neg-dentry-pc" sysctl parameter that can be used to to
>>>> specify a soft limit on the number of negative allowed as a percentage
>>>> of total system memory. This parameter is 0 by default which means no
>>>> negative dentry limiting will be performed.
>>> percentage has turned out to be a really wrong unit for many tunables
>>> over time. Even 1% can be just too much on really large machines.
>> Yes, that is true. Do you have any suggestion of what kind of unit
>> should be used? I can scale down the unit to 0.1% of the system memory.
>> Alternatively, one unit can be 10k/cpu thread, so a 20-thread system
>> corresponds to 200k, etc.
> I simply think this is a strange user interface. How much is a
> reasonable number? How can any admin figure that out?

Without the optional enforcement, the limit is essentially just a
notification mechanism: the system signals that something is going wrong
and that the system administrator needs to take a look. So it is
perfectly OK if the limit is set high enough that we normally won't
reach that many negative dentries. The goal is to prevent negative
dentries from consuming a significant portion of the system memory.
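The unit question and the notification-only behaviour discussed above can be made concrete with a small userspace sketch. This is not code from the patch set. It assumes the negative dentry count is the fifth field of /proc/sys/fs/dentry-state (the layout documented for mainline kernels since 5.0; the patch under discussion may differ), and it assumes a dentry costs 192 bytes, which is only a typical x86_64 figure. The function names are hypothetical.

```python
DENTRY_STATE = "/proc/sys/fs/dentry-state"


def parse_dentry_state(text):
    """Return the whitespace-separated integer fields of dentry-state."""
    return [int(tok) for tok in text.split()]


def nr_negative(fields):
    """Assumption: negative dentry count is the fifth field (index 4)."""
    return fields[4]


def limit_in_dentries(total_mem_bytes, units, unit_divisor=1000,
                      dentry_bytes=192):
    """Convert a soft limit of units/unit_divisor of total memory into a
    dentry count. units == 0 means no limiting, as in the patch text.
    dentry_bytes=192 is an assumed per-dentry cost, not a kernel constant."""
    if units == 0:
        return None
    return total_mem_bytes * units // unit_divisor // dentry_bytes


def over_limit(negative_count, limit):
    """Notification only: report the condition, do not reclaim anything."""
    return limit is not None and negative_count > limit


if __name__ == "__main__":
    with open(DENTRY_STATE) as f:
        fields = parse_dentry_state(f.read())
    one_tib = 1 << 40
    # Compare a percent-based unit with a per-mille unit on a 1 TiB machine.
    print("1 unit at 1/100 :", limit_in_dentries(one_tib, 1, unit_divisor=100))
    print("1 unit at 1/1000:", limit_in_dentries(one_tib, 1, unit_divisor=1000))
    if over_limit(nr_negative(fields), limit_in_dentries(one_tib, 1)):
        print("warning: negative dentries exceed the soft limit")
```

On a 1 TiB machine one percent-based unit covers roughly 10 GiB of dentries while one per-mille unit covers roughly 1 GiB, which is the scaling concern raised in the quoted text.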
I am going to reduce each unit to 1/1000 of the total system memory so
that, for large systems with terabytes of memory, a smaller amount of
memory can be specified.

>>>> Patch 3 enables automatic pruning of least recently used negative
>>>> dentries when the total number is close to the preset limit.
>>> Please explain why this cannot be done in a standard dcache shrinking
>>> way. I strongly suspect that you are developing yet another reclaim with
>>> its own sets of tunable and bypassing the existing infrastructure. I
>>> haven't read patches yet but the cover letter doesn't really explain
>>> design much so I am only guessing.
>> The standard dcache shrinking happens when the system is almost running
>> out of free memory.
> Well, the standard reclaim happens when somebody needs memory. We are
> usually quite far away from "almost running out of memory". We do
> reclaim fs metadata including dentries so I really do not see why
> negative ones should be any special here.

That is fine. I can certainly live without the new reclaim mechanism.

>
>> This new shrinker will be turned on when the number
>> of negative dentries is closed to the limit even when there are still
>> plenty of free memory left. It will stop when the number of negative
>> dentries is lowered to a safe level. The new shrinker is designed to
>> impose as little overhead to the currently running tasks. That is not
>> true for the standard shrinker which will have a rather significant
>> performance impact to the currently running tasks.
> Do you have any numbers to back your claim? The memory reclaim is
> usually quite lightweight. Especially when we have a lot of clean
> fs {meta}data

In the case of dentries, it is the lock hold time of the LRU list that
can impact normal filesystem operation. The new shrinker that I added
purposely limits the lock hold time, whereas the standard shrinker can
hold the LRU lock for quite a long time if there are a lot of dentries
to get rid of.
I have some performance numbers about this in the cover letter of this
patch.

>> I can remove the new shrinker if people really don't want to add a new
>> one as long as I can keep the option to kill off newly created negative
>> dentries when the limit is exceeded.
> Please let's not add yet another memory reclaim mechanism. It will just
> backfire sooner or later.

As said above, I am going to remove the new shrinker in the next version
of the patch. We can always add it back later on if we feel there is a
need to do it.

Cheers,
Longman
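For readers unfamiliar with the lock hold time argument made earlier in this message, the following is a toy userspace illustration of pruning an LRU in bounded batches, so that the lock is released between batches and other users can get in. It is not the shrinker from the patch set, whose implementation is not shown here. The class name, the batch size, and the use of an ordered dictionary as the LRU are all assumptions made for illustration.

```python
import threading
from collections import OrderedDict


class ToyNegativeLRU:
    """Toy stand-in for an LRU list of negative dentries (not kernel code)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.entries = OrderedDict()  # oldest entry first

    def add(self, name):
        with self.lock:
            self.entries[name] = True
            self.entries.move_to_end(name)  # most recently used goes last

    def prune(self, target, batch=32):
        """Evict the oldest entries until at most `target` remain.

        At most `batch` entries are removed per lock acquisition, which
        bounds the lock hold time. Returns the number of batches run.
        """
        rounds = 0
        while True:
            with self.lock:
                excess = len(self.entries) - target
                if excess <= 0:
                    return rounds
                for _ in range(min(batch, excess)):
                    self.entries.popitem(last=False)  # drop the oldest
            # The lock is released here, so lookups can interleave.
            rounds += 1


if __name__ == "__main__":
    lru = ToyNegativeLRU()
    for i in range(100):
        lru.add(f"missing-{i}")
    print("batches:", lru.prune(target=10))
    print("remaining:", len(lru.entries))
```

A single-pass variant would hold the lock for the whole eviction. The batched form trades a little extra locking overhead for shorter individual hold times.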