Date: Tue, 11 Jun 2024 18:12:14 +0900
From: Byungchul Park
To: Dave Hansen
Cc: David Hildenbrand, Byungchul Park, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kernel_team@skhynix.com, akpm@linux-foundation.org,
	ying.huang@intel.com, vernhao@tencent.com, mgorman@techsingularity.net,
	hughd@google.com, willy@infradead.org, peterz@infradead.org,
	luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, rjgolo@gmail.com
Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
Message-ID: <20240611091214.GA16469@system.software.com>
References: <20240531092001.30428-1-byungchul@sk.com>
	<20240531092001.30428-10-byungchul@sk.com>
	<26dc4594-430b-483c-a26c-7e68bade74b0@redhat.com>
	<20240603093505.GA12549@system.software.com>

On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
> On 6/3/24 02:35, Byungchul Park wrote:
> ...> In luf's point of view, the points where the deferred flush should be
> > performed are simply:
> >
> > 1. when changing the vma maps, that might be luf'ed.
> > 2. when updating data of the pages, that might be luf'ed.
>
> It's simple, but the devil is in the details as always.
>
> > All we need to do is to identify the points:
> >
> > 1. when changing the vma maps, that might be luf'ed.
> >
> >    a) mmap and munmap, i.e. fault handler or unmap_region().
> >    b) permission to writable, i.e. mprotect or fault handler.
> >    c) what I'm missing.
>
> I'd say it even more generally: anything that installs a PTE which is
> inconsistent with the original PTE. That, of course, includes writes.
> But it also includes crazy things that we do like uprobes. Take a look
> at __replace_page().
>
> I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> there. But it needs some really thorough review.
>
> But the bigger concern is that, if there was a problem, I can't think of
> a systematic way to find it.
>
> > 2. when updating data of the pages, that might be luf'ed.
> >
> >    a) updating files through vfs, e.g. file_end_write().
> >    b) updating files through writable maps, i.e. 1-a) or 1-b).
> >    c) what I'm missing.
>
> Filesystems or block devices that change content without a "write" from
> the local system. Network filesystems and block devices come to mind.
> I honestly don't know what all the rules are around these, but they
> could certainly be troublesome.
>
> There appear to be some interactions for NFS between file locking and
> page cache flushing.
>
> But, stepping back ...
>
> I'd honestly be a lot more comfortable if there was even a debugging LUF
> mode that enforced a rule that said:
>
> 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
> 2. A LUF'd page's position in the page cache can't be replaced until
>    after a luf_flush()

I'm thinking of a debug mode doing the following *pseudo* code - please
check the logic only, since the syntax might be wrong:

0-a) Introduce new fields in page_ext:

	#ifdef LUF_DEBUG
	struct list_head __percpu luf_node;
	#endif

0-b) Introduce new fields in struct address_space:

	#ifdef LUF_DEBUG
	struct list_head __percpu luf_node;
	#endif

0-c) Introduce new fields in struct task_struct:

	#ifdef LUF_DEBUG
	cpumask_t luf_pending_cpus;
	#endif

0-d) Define percpu list_heads to link luf'd folios and address spaces:

	#ifdef LUF_DEBUG
	DEFINE_PER_CPU(struct list_head, luf_folios);
	DEFINE_PER_CPU(struct list_head, luf_address_spaces);
	#endif

1) When skipping the tlb flush in reclaim or migration for a folio:

	#ifdef LUF_DEBUG
	ext = get_page_ext_for_luf_debug(folio);
	as = folio_mapping(folio);
	for_each_cpu(cpu, skip_cpus) {
		list_add(per_cpu_ptr(ext->luf_node, cpu),
			 per_cpu_ptr(luf_folios, cpu));
		if (as)
			list_add(per_cpu_ptr(as->luf_node, cpu),
				 per_cpu_ptr(luf_address_spaces, cpu));
	}
	put_page_ext(ext);
	#endif

2) When performing the tlb flush in try_to_unmap_flush():

   Keep in mind that luf only works on unmapping during reclaim and
   migration.

	#ifdef LUF_DEBUG
	for_each_cpu(cpu, now_flushing_cpus) {
		for_each_node_safe(folio, per_cpu_ptr(luf_folios)) {
			ext = get_page_ext_for_luf_debug(folio);
			list_del_init(per_cpu_ptr(ext->luf_node, cpu));
			put_page_ext(ext);
		}
		for_each_node_safe(as, per_cpu_ptr(luf_address_spaces))
			list_del_init(per_cpu_ptr(as->luf_node, cpu));
		cpumask_clear_cpu(cpu, current->luf_pending_cpus);
	}
	#endif

3) In pte_mkwrite():

	#ifdef LUF_DEBUG
	ext = get_page_ext_for_luf_debug(folio);
	for_each_cpu(cpu, online_cpus)
		if (!list_empty(per_cpu_ptr(ext->luf_node, cpu)))
			cpumask_set_cpu(cpu, current->luf_pending_cpus);
	put_page_ext(ext);
	#endif

4) On returning to user:

	#ifdef LUF_DEBUG
	WARN_ON(!cpumask_empty(current->luf_pending_cpus));
	#endif

5) Right after every a_ops->write_end() call:

	#ifdef LUF_DEBUG
	as = get_address_space_to_write_to();
	for_each_cpu(cpu, online_cpus)
		if (!list_empty(per_cpu_ptr(as->luf_node, cpu)))
			cpumask_set_cpu(cpu, current->luf_pending_cpus);
	#endif

	luf_flush_or_its_optimized_version();

	#ifdef LUF_DEBUG
	WARN_ON(!cpumask_empty(current->luf_pending_cpus));
	#endif

I will implement the debug mode this way, with everything serialized.
Do you think it works for what we want?  (A tiny standalone userspace
model of the same bookkeeping is sketched below, after the quoted text.)

	Byungchul

> or *some* other independent set of rules that can tell us when something
> goes wrong. That uprobes code, for instance, seems like it will work.
> But I can also imagine writing it ten other ways where it would break
> when combined with LUF.
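
For illustration only, here is a standalone userspace model of the folio
side of the bookkeeping in 0-d), 1), 2), 3) and 4) above; the
address_space side would be symmetric.  This is not kernel code: NR_CPUS,
struct dbg_folio, the tiny list helpers and the luf_debug_*() names are
all made up for the model, and the per-CPU lists and cpumask are reduced
to plain arrays and a bitmask.  It only shows the intended logic of
tracking skipped flushes per CPU and warning if anything is still pending
when returning to user:

	/* Userspace model of the LUF_DEBUG bookkeeping - NOT kernel code. */
	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CPUS 4			/* made up for the model */

	struct list_node { struct list_node *prev, *next; };

	static void list_init(struct list_node *n) { n->prev = n->next = n; }
	static bool list_empty(struct list_node *n) { return n->next == n; }

	static void list_add(struct list_node *n, struct list_node *head)
	{
		n->next = head->next;
		n->prev = head;
		head->next->prev = n;
		head->next = n;
	}

	static void list_del_init(struct list_node *n)
	{
		n->prev->next = n->next;
		n->next->prev = n->prev;
		list_init(n);
	}

	/* models the per-folio page_ext: one node per CPU */
	struct dbg_folio { struct list_node luf_node[NR_CPUS]; };

	/* models DEFINE_PER_CPU(struct list_head, luf_folios) */
	static struct list_node luf_folios[NR_CPUS];

	/* models current->luf_pending_cpus */
	static unsigned long cur_luf_pending_cpus;

	/* 1) reclaim/migration skipped the tlb flush for @folio on @cpu */
	static void luf_debug_track(struct dbg_folio *folio, int cpu)
	{
		list_add(&folio->luf_node[cpu], &luf_folios[cpu]);
	}

	/* 2) try_to_unmap_flush() actually flushed @cpu */
	static void luf_debug_flushed(int cpu)
	{
		while (!list_empty(&luf_folios[cpu]))
			list_del_init(luf_folios[cpu].next);
		cur_luf_pending_cpus &= ~(1UL << cpu);
	}

	/* 3) pte_mkwrite() hit a folio whose flush may have been skipped */
	static void luf_debug_mkwrite(struct dbg_folio *folio)
	{
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (!list_empty(&folio->luf_node[cpu]))
				cur_luf_pending_cpus |= 1UL << cpu;
	}

	/* 4) on returning to user, nothing may still be pending */
	static void luf_debug_ret_to_user(void)
	{
		assert(cur_luf_pending_cpus == 0);	/* WARN_ON() in the kernel */
	}

	int main(void)
	{
		struct dbg_folio folio;

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			list_init(&luf_folios[cpu]);
			list_init(&folio.luf_node[cpu]);
		}

		luf_debug_track(&folio, 1);	/* flush skipped on CPU 1 */
		luf_debug_mkwrite(&folio);	/* PTE made writable -> pending */
		luf_debug_flushed(1);		/* deferred flush finally done */
		luf_debug_ret_to_user();	/* passes: nothing pending */
		printf("model ok\n");
		return 0;
	}

If luf_debug_ret_to_user() were called before luf_debug_flushed(1), the
assert would fire, which is the condition the WARN_ON() in 4) is meant
to catch.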