Subject: Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
From: Alex Shi <alex.shi@linux.alibaba.com>
To: Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, mgorman@techsingularity.net, tj@kernel.org,
	hughd@google.com, khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com,
	yang.shi@linux.alibaba.com, willy@infradead.org, shakeelb@google.com,
	Michal Hocko, Vladimir Davydov, Roman Gushchin, Chris Down,
	Thomas Gleixner, Vlastimil Babka, Qian Cai, Andrey Ryabinin,
	"Kirill A. Shutemov", Jérôme Glisse, Andrea Arcangeli,
	David Rientjes, "Aneesh Kumar K.V", swkhack, "Potyra, Stefan",
	Mike Rapoport, Stephen Rothwell, Colin Ian King, Jason Gunthorpe,
	Mauro Carvalho Chehab, Peng Fan, Nikolay Borisov, Ira Weiny,
	Kirill Tkhai, Yafang Shao
Date: Mon, 13 Apr 2020 18:48:22 +0800
References: <1579143909-156105-1-git-send-email-alex.shi@linux.alibaba.com>
	<1579143909-156105-4-git-send-email-alex.shi@linux.alibaba.com>
	<20200116215222.GA64230@cmpxchg.org>
In-Reply-To: <20200116215222.GA64230@cmpxchg.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.6.0
X-Mailing-List: linux-kernel@vger.kernel.org

> In a previous review, I pointed out the following race condition
> between page charging and compaction:
>
> compaction:				generic_file_buffered_read:
>
> 					page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
>
> 					add_to_page_cache_lru()
> 					  mem_cgroup_commit_charge()
> 					    page->mem_cgroup = foo
> 					  lru_cache_add()
> 					    __pagevec_lru_add()
> 					      SetPageLRU()
>
> if PageLRU(page):
>   __isolate_lru_page()
>
> As far as I can see, you have not addressed this. You have added
> lock_page_memcg(), but that prevents charged pages from moving between
> cgroups, it does not prevent newly allocated pages from being charged.
>
> It doesn't matter how many times you check the lruvec before and after
> locking - if you're looking at a free page, it might get allocated,
> charged and put on a new lruvec after you're done checking, and then
> you isolate a page from an unlocked lruvec.
>
> You simply cannot serialize on page->mem_cgroup->lruvec when
> page->mem_cgroup isn't stable.
> You need to serialize on the page
> itself, one way or another, to make this work.
>
>
> So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
>
> I.e. in compaction, you'd do
>
>	if (!TestClearPageLRU(page))
>		goto isolate_fail;
>	/*
>	 * We isolated the page's LRU state and thereby locked out all
>	 * other isolators, including cgroup page moving, page reclaim,
>	 * page freeing etc. That means page->mem_cgroup is now stable
>	 * and we can safely look up the correct lruvec and take the
>	 * page off its physical LRU list.
>	 */
>	lruvec = mem_cgroup_page_lruvec(page);
>	spin_lock_irq(&lruvec->lru_lock);
>	del_page_from_lru_list(page, lruvec, page_lru(page));
>
> Putback would mostly remain the same (although you could take the
> PageLRU setting out of the list update locked section, as long as it's
> set after the page is physically linked):
>
>	/* LRU isolation pins page->mem_cgroup */
>	lruvec = mem_cgroup_page_lruvec(page)
>	spin_lock_irq(&lruvec->lru_lock);
>	add_page_to_lru_list(...);
>	spin_unlock_irq(&lruvec->lru_lock);
>
>	SetPageLRU(page);
>
> And you'd have to carefully review and rework other sites that rely on
> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
>
> Especially things like activate_page(), which used to only check
> PageLRU to shuffle the page on the LRU list would now have to briefly
> clear PageLRU and then set it again afterwards.
>
> However, aside from a bit more churn in those cases, and the
> unfortunate additional atomic operations, I currently can't think of a
> fundamental reason why this wouldn't work.
>
> Hugh, what do you think?
Hi Johannes,

As to the idea of TestClearPageLRU, we expect the following scenario:

	compaction			commit_charge
	if (TestClearPageLRU)		!TestClearPageLRU
		lock_page_lruvec		goto isolate_fail;
		del_from_lru_list
		unlock_page_lruvec

But there is a difficult situation to handle:

	compaction			commit_charge
	TestClearPageLRU		!TestClearPageLRU

Possible page states here:
	a, reclaiming
	b, moving between lru lists
	c, migrating, like in compaction
	d, mlocking
	e, split_huge_page

If the page's lru bit was cleared in commit_charge with lrucare, we
still have no idea whether the page was isolated for one of the
reasons a~e or was never on an LRU list at all, and telling those
cases apart would be costly.

So for the issue you mentioned above, maybe the better idea is to use
lrucare when doing mem_cgroup_commit_charge(), since charging is not a
frequent action. What's your opinion of this solution?

Thanks
Alex

> compaction:				generic_file_buffered_read:
>
> 					page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
>
> 					add_to_page_cache_lru()
> 					  mem_cgroup_commit_charge()
> 					  //do charge with lrucare
> 					    spin_lock(&lruvec->lru_lock)
> 					    page->mem_cgroup = foo
> 					  lru_cache_add()
> 					    __pagevec_lru_add()
> 					      SetPageLRU()
>
> if PageLRU(page):
>   __isolate_lru_page()