Date: Sat, 12 Sep 2020 01:38:56 -0700 (PDT)
From: Hugh Dickins
To: Alex Shi
Cc: Hugh Dickins, Andrew Morton, mgorman@techsingularity.net, tj@kernel.org,
    khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com, willy@infradead.org,
    hannes@cmpxchg.org, lkp@intel.com, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, shakeelb@google.com,
    iamjoonsoo.kim@lge.com, richard.weiyang@gmail.com, kirill@shutemov.name,
    alexander.duyck@gmail.com, rong.a.chen@intel.com, mhocko@suse.com,
    vdavydov.dev@gmail.com, shy828301@gmail.com, vbabka@suse.cz,
    minchan@kernel.org, cai@lca.pw
Subject: Re: [PATCH v18 00/32] per memcg lru_lock: reviews
In-Reply-To: <61a42a87-eec9-e300-f710-992756f70de6@linux.alibaba.com>
References: <1598273705-69124-1-git-send-email-alex.shi@linux.alibaba.com>
    <20200824114204.cc796ca182db95809dd70a47@linux-foundation.org>
    <61a42a87-eec9-e300-f710-992756f70de6@linux.alibaba.com>

On Wed, 9 Sep 2020, Alex Shi wrote:

> On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> >
> > The use of lock_page_memcg() in __munlock_pagevec() in 20/32,
> > introduced in patchset v17, looks good but it isn't: I was lucky that
> > systemd at reboot did some munlocking that exposed the problem to
> > lockdep. The first time into the loop, lock_page_memcg() is done
> > before lru_lock (as 06/32 has allowed); but the second time around
> > the loop, it is done while still holding lru_lock.
>
> I don't know the details of the lockdep report. Just wondering, could a
> fixed move_lock/lru_lock ordering be enforced? Or could we try another
> blocking approach, like the one mentioned in commit_charge()?
>
> >
> > lock_page_memcg() really needs to be absorbed into (a variant of)
> > relock_page_lruvec(), and I do have that (it's awkward because of
> > the different ways in which the IRQ flags are handled). And out of
> > curiosity, I've also tried using that in mm/swap.c too, instead of the
> > TestClearPageLRU technique: lockdep is happy, but an update_lru_size()
> > warning showed that it cannot safely be mixed with the TestClearPageLRU
> > technique (that I'd left in isolate_lru_page()). So I'll stash away
> > that relock_page_lruvec(), and consider what's best for mm/mlock.c:
> > now that I've posted these comments so far, that's my priority, then
> > to get the result under testing again, before resuming these comments.
>
> No idea of your solution, but looking forward to your good news! :)

Yes, it is good news, and simpler than anything suggested above. The
main difficulties will probably be to look good in 80 columns (I know
that limit has been lifted recently, but some of us use xterms side by
side), and to explain it.
mm/mlock.c has not been kept up to date very well: in particular, you
have taken too seriously that "Serialize with any parallel
__split_huge_page_refcount()..." comment, which you updated to two
comments "Serialize split tail pages in __split_huge_page_tail()...".
Delete them!

The original comment was by Vlastimil for v3.14 in 2014. But Kirill
redesigned THP refcounting for v4.5 in 2016: that's when
__split_huge_page_refcount() went away. And with the new refcounting,
the THP splitting races that lru_lock protected munlock_vma_page() and
__munlock_pagevec() from have become impossible.

Or maybe there never was such a race in __munlock_pagevec(): you have
added the comment there, assuming lru_lock was for that purpose, but
that was probably just the convenient place to take it, to cover all
the del_page_from_lru()s.

Observe how split_huge_page_to_list() uses unmap_page() to remove all
pmds and all ptes for the huge page being split, and remap_page() only
replaces the migration entries (used for anon but not for shmem or
file) after doing all of the __split_huge_page_tail()s, before
unlocking any of the pages. Recall that munlock_vma_page() and
__munlock_pagevec() are applied to pages found mapped into userspace,
by ptes or pmd: there are none of those while __split_huge_page_tail()
is being used, so there is no race to protect from.

(Could a newly detached tail be freshly faulted into userspace just
before __split_huge_page() has reached the head? Not quite: the fault
has to wait to get the tail's page lock. But even if it could, how
would that be a problem for __munlock_pagevec()?)

There's lots more that could be said: for example, PageMlocked will
always be clear on the THP head during __split_huge_page_tail(),
because the last unmap of a PageMlocked page does clear_page_mlock().
But that's not required to prove the case; it's just another argument
against the "Serialize" comment you have in __munlock_pagevec().

So, no need for the problematic lock_page_memcg(page) there in
__munlock_pagevec(), nor to lock (or relock) the lruvec just below it.
__munlock_pagevec() still needs lru_lock to del_page_from_lru_list(),
of course, but that must be done after your TestClearPageMlocked has
stabilized page->memcg. Use relock_page_lruvec_irq() here? I suppose
that will be easiest, but notice how __munlock_pagevec_fill() has
already made sure that all the pages in the pagevec are from the same
zone (and it cannot do the same for memcg without locking page memcg);
so some of relock's work will be redundant.

Otherwise, I'm much happier with your mm/mlock.c since looking at it
in more detail: a couple of nits, though:

- drop the clear_page_mlock() hunk from 25/32;

- kernel style says to do it the way you are undoing with this hunk:

-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page))
 		putback_lru_page(page);
-	} else {
+	else {

I don't always follow that over-braced style when making changes, but
you should not touch otherwise untouched code just to make it go
against the approved style. And in munlock_vma_page(),

-	if (!TestClearPageMlocked(page)) {
+	if (!TestClearPageMlocked(page))
 		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
-		nr_pages = 1;
-		goto unlock_out;
-	}
+		return 0;

please restore the braces: with that comment line in there, the
compiler does not need the braces, but the human eye does.

Hugh
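
P.S. For concreteness, a rough sketch of the phase-1 ordering described
above: TestClearPageMlocked first, the lruvec lock taken only for the
LRU removal, and no lock_page_memcg() at all. This is only an
illustration, not the actual patch: it assumes the
relock_page_lruvec_irq() and unlock_page_lruvec_irq() helpers from your
series, assumes pvec and zone are __munlock_pagevec()'s parameters, and
elides the isolation-failure and putback bookkeeping that the real loop
has to do.

	struct lruvec *lruvec = NULL;
	int delta_munlocked = -pagevec_count(pvec);
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		if (!TestClearPageMlocked(page)) {
			delta_munlocked++;	/* putback elided */
			continue;
		}
		/*
		 * As argued above: Mlocked is cleared first, so that
		 * page->memcg is stable before this page's lru_lock
		 * is chosen (or switched to).
		 */
		lruvec = relock_page_lruvec_irq(page, lruvec);
		if (PageLRU(page)) {
			ClearPageLRU(page);
			del_page_from_lru_list(page, lruvec,
					       page_lru(page));
		}
		/* isolation-failure handling elided */
	}
	if (lruvec)
		unlock_page_lruvec_irq(lruvec);
	mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);

Whether relock's zone recheck is redundant here, given what
__munlock_pagevec_fill() has already guaranteed, is the question raised
above.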