Date: Wed, 19 Oct 2022 09:30:42 +1100
From: Dave Chinner
To: Matthew Wilcox
Cc: Zhaoyang Huang, "zhaoyang.huang", Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, ke.wang@unisoc.com, steve.kang@unisoc.com,
	baocong.liu@unisoc.com, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: move xa forward when run across zombie page
Message-ID: <20221018223042.GJ2703033@dread.disaster.area>
References: <1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com>
In-Reply-To:

On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox wrote:
> > >
> > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox wrote:
> > > > >
> > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > From: Zhaoyang Huang
> > > > > >
> > > > > > The RCU stall below is reported where kswapd gets trapped in a live lock when shrinking
> > > > > > a superblock's inode list. The direct reason is that a zombie page keeps staying in the
> > > > > > xarray's slot and makes the check-and-retry loop spin permanently. The root cause is unknown yet;
> > > > > > it is supposed that it could be an xa update without synchronize_rcu etc. I would like to
> > > > > > suggest skipping this page to break the live lock as a workaround.
> > > > >
> > > > > No, the underlying bug should be fixed.
> > >       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Understand. IMHO, find_get_entry actually works as an open API dealing
> > with different kinds of address_spaces' page caches, which requires high
> > robustness to deal with any corner cases. Take the current problem as
> > an example: the inode with the fault page (refcount=0) could remain on the
> > sb's list without a live lock problem.
>
> But it's a corner case that shouldn't happen!  What else is going on
> at the time?  Can you reproduce this problem easily?  If so, how?

I've been seeing this livelock, too. The reproducer is, unfortunately,
something I can't share - it's a massive program that triggers a data
corruption I'm working on solving. Now that I've mostly fixed the data
corruption, long duration test runs end up livelocking in page cache
lookup after several hours.

The test effectively writes a 100MB file with multiple threads doing
reverse adjacent racing 1MB unaligned writes. Once the file is written,
it is then mmap()d and read back from the filesystem for verification.
This is then run with tens of processes concurrently, and then under a
massively confined memcg (e.g. 32 processes/files are run in a memcg
with only 200MB of memory allowed). This causes writeback, readahead
and memory reclaim to race with incoming mmap read faults and writes.
The livelock occurs on file verification, and it appears to be an
interaction with readahead thrashing.
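To spell out the livelock mechanism being discussed above, here is a
minimal user-space sketch of the check-and-retry pattern used by the
page cache lookup path (roughly the xas_load() + folio_try_get_rcu()
loop). This is not kernel code - the types and helper names below are
invented for illustration and the logic is heavily simplified - it only
shows why a "zombie" entry (refcount already zero, but never deleted
from its slot) turns the lookup into a spin:

/*
 * Minimal user-space model of the page cache lookup retry loop.
 * Build with: cc -std=c11 -O2 zombie-livelock.c
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct folio {
	atomic_int refcount;		/* stand-in for folio->_refcount */
};

/* Stand-in for folio_try_get_rcu(): take a reference only if it isn't zero. */
static bool try_get_ref(struct folio *folio)
{
	int ref = atomic_load(&folio->refcount);

	while (ref > 0) {
		if (atomic_compare_exchange_weak(&folio->refcount, &ref, ref + 1))
			return true;
	}
	return false;			/* refcount == 0: folio is being freed */
}

/*
 * Stand-in for the lookup loop: load the slot, try to take a reference,
 * and if that fails go back and look again.  The freeing side is supposed
 * to remove the entry from the slot, so "try again" normally terminates.
 */
static struct folio *lookup(struct folio **slot, unsigned long max_tries)
{
	for (unsigned long tries = 0; tries < max_tries; tries++) {
		struct folio *folio = *slot;	/* xas_load() */

		if (!folio)
			return NULL;		/* nothing cached at this index */
		if (try_get_ref(folio))
			return folio;		/* got a stable reference */
		/* ref was zero: retry, expecting the slot to have changed */
	}
	return NULL;	/* bail out; the real kernel loop has no such exit */
}

int main(void)
{
	struct folio zombie;
	struct folio *slot = &zombie;

	atomic_init(&zombie.refcount, 0);	/* freed, but still in the slot */

	if (!lookup(&slot, 100000000UL))
		printf("gave up: slot still holds a zero-refcount folio;\n"
		       "the kernel's lookup never bails out and spins here forever\n");
	return 0;
}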
On my test rig, the physical read-to-write ratio is at least 20:1 - with
32 processes running, the 5s IO rates are:

Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
dm-0          63482.00      5728.43       195.74         0.00      28642        978          0

This is reading and writing the same amount of file data at the
application level, but once the data has been written and kicked out of
the page cache it seems to require an awful lot more read IO to get it
back to the application. i.e. this looks like mmap() is readahead
thrashing severely, and eventually it livelocks with this sort of
report:

[175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
[175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
[175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
[175901.995614] Call Trace:
[175901.996090]  <TASK>
[175901.996594]  ? __schedule+0x301/0xa30
[175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
[175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
[175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[175902.000714]  ? xas_start+0x53/0xc0
[175902.001484]  ? xas_load+0x24/0xa0
[175902.002208]  ? xas_load+0x5/0xa0
[175902.002878]  ? __filemap_get_folio+0x87/0x340
[175902.003823]  ? filemap_fault+0x139/0x8d0
[175902.004693]  ? __do_fault+0x31/0x1d0
[175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
[175902.006213]  ? handle_mm_fault+0xd0/0x2a0
[175902.006998]  ? exc_page_fault+0x1d9/0x810
[175902.007789]  ? asm_exc_page_fault+0x22/0x30
[175902.008613]  </TASK>

Given that filemap_fault on XFS is probably trying to map large folios,
I do wonder if this is a result of some kind of race with teardown of a
large folio...

There is a very simple corruption reproducer script that has been
written, but I haven't been using it. I don't know if long term running
of the script here:

https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/

will trigger the livelock, as the verification step is significantly
different, but it will give you insight into the setup of the
environment that leads to the livelock. Maybe you could replace the
md5sum verification with an mmap read via xfs_io to simulate the fault
load that seems to lead to this issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com