Date: Tue, 27 Aug 2019 07:59:41 +0200
From: Michal Hocko
To: Yang Shi
Cc: kirill.shutemov@linux.intel.com, hannes@cmpxchg.org, vbabka@suse.cz,
    rientjes@google.com, akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable
Message-ID: <20190827055941.GL7538@dhcp22.suse.cz>
References: <1566410125-66011-1-git-send-email-yang.shi@linux.alibaba.com>
 <20190822080434.GF12785@dhcp22.suse.cz>
 <9e4ba38e-0670-7292-ab3a-38af391598ec@linux.alibaba.com>
 <20190826074350.GE7538@dhcp22.suse.cz>
 <416daa85-44d4-1ef9-cc4c-6b91a8354c79@linux.alibaba.com>
In-Reply-To: <416daa85-44d4-1ef9-cc4c-6b91a8354c79@linux.alibaba.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Mon 26-08-19 21:27:38, Yang Shi wrote:
> 
> On 8/26/19 12:43 AM, Michal Hocko wrote:
> > On Thu 22-08-19 08:33:40, Yang Shi wrote:
> > > On 8/22/19 1:04 AM, Michal Hocko wrote:
> > > > On Thu 22-08-19 01:55:25, Yang Shi wrote:
> > [...]
> > > > > And, they seem very common with common workloads when THP is
> > > > > enabled. A simple run of the MariaDB test from mmtests with THP
> > > > > set to "always" shows it can generate over fifteen thousand
> > > > > deferred split THPs (around 30G accumulated over a one-hour run,
> > > > > 75% of my VM's 40G of memory). It looks worth accounting for in
> > > > > MemAvailable.
> > > > OK, this makes sense. But your numbers above are really worrying.
> > > > Accumulating such a large number of pages that are likely not
> > > > going to be used is really bad. They essentially block any
> > > > higher-order allocations and also push the system towards more
> > > > memory pressure.
> > > That is an accumulated number; during the test run some of them
> > > were already freed by the shrinker. IOW, it should not reach that
> > > much at any given time.
> > Then the above description is highly misleading. What is the actual
> > number of lingering THPs waiting for memory pressure at the peak?
> 
> Rerunning the sysbench MariaDB test from mmtests, I didn't see too
> many THPs at the peak. I sometimes saw around 2K THPs on my VM with
> 40G of memory, but they were short-lived (they should be freed when
> the test exits), and the number of accumulated THPs is variable.
> 
> This reminded me to go back and double-check the internal bug report
> that led to the "make deferred split shrinker memcg aware" patchset.
> 
> In that case, a mysql instance under real production load was running
> in a memcg with an ~86G limit, and the deferred split THPs could reach
> ~68G (~34K deferred split THPs) within a few hours. The deferred split
> THP shrinker was not invoked because global memory pressure was still
> fine (the host has 256G of memory), but memcg limit reclaim was
> triggered.
> 
> I can't tell whether all of those deferred split THPs came from mysql,
> since according to the OOM log some other processes were running in
> that container too.
> 
> I will update the commit log with this more solid data from the
> production environment.

This is very useful information. Thanks!

> > > > IIUC deferred splitting is mostly a workaround for nasty locking
> > > > issues during splitting, right? It is not really an optimization
> > > > to cache THPs for reuse or something like that. What is the
> > > > reason this is not done from a worker context? At least THPs
> > > > which would be freed completely sound like good candidates for
> > > > kworker teardown, no?
> > > Yes, according to the documentation, deferred split THP was
> > > introduced to avoid locking issues. Memcg awareness would help
> > > trigger the shrinker more often.
> > > 
> > > I think it could be done in a worker context, but when to trigger
> > > the worker is a subtle problem.
> > Why? What is the problem with triggering it after unmapping a batch
> > worth of THPs?
> 
> This leads to another question: how many THPs are "a batch worth"?

Some arbitrary reasonable number. A few dozen THPs waiting for a split
are no big deal. Going into the GB range, as you pointed out above, is
definitely a problem.
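For illustration, a worker-based variant could look roughly like the
sketch below. To be clear, this is NOT the actual mm/huge_memory.c
code: the queue here is a simplified global one, and the threshold
DEFERRED_SPLIT_BATCH plus the function names maybe_schedule_deferred_split()
and deferred_split_workfn() are made up for this sketch. The only point
is the shape of the idea: count queued THPs, and once a batch worth has
accumulated, split them asynchronously instead of waiting for the
shrinker to run under memory pressure.

	/*
	 * Illustrative sketch only -- not the real deferred-split code.
	 * A real implementation would use the per-node/per-memcg queues
	 * and would requeue pages that failed to split.
	 */
	#include <linux/huge_mm.h>
	#include <linux/list.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/spinlock.h>
	#include <linux/workqueue.h>

	#define DEFERRED_SPLIT_BATCH	64	/* the arbitrary "reasonable number" */

	static DEFINE_SPINLOCK(split_queue_lock);
	static LIST_HEAD(split_queue);
	static unsigned long split_queue_len;

	static void deferred_split_workfn(struct work_struct *work);
	static DECLARE_WORK(deferred_split_work, deferred_split_workfn);

	/* Called after deferred_split_huge_page() queues another THP. */
	static void maybe_schedule_deferred_split(void)
	{
		if (READ_ONCE(split_queue_len) >= DEFERRED_SPLIT_BATCH)
			schedule_work(&deferred_split_work);
	}

	static void deferred_split_workfn(struct work_struct *work)
	{
		struct list_head *pos, *next;
		struct page *page;
		LIST_HEAD(batch);
		unsigned long flags;

		/* Detach the whole queue so the splits run unlocked. */
		spin_lock_irqsave(&split_queue_lock, flags);
		list_splice_init(&split_queue, &batch);
		split_queue_len = 0;
		spin_unlock_irqrestore(&split_queue_lock, flags);

		list_for_each_safe(pos, next, &batch) {
			/* THPs are linked via deferred_list in a tail page. */
			page = list_entry((void *)pos, struct page, deferred_list);
			page = compound_head(page);
			if (!get_page_unless_zero(page))
				continue;	/* already being freed */
			if (trylock_page(page)) {
				/* Frees fully-unmapped tail pages on success. */
				split_huge_page(page);
				unlock_page(page);
			}
			put_page(page);
		}
	}

Detaching the whole queue keeps the lock hold time bounded, and the
threshold just caps how much splittable-but-unsplit memory can linger
before something looks at it.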
-- 
Michal Hocko
SUSE Labs