Date: Thu, 19 Jan 2023 18:07:08 -0500
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
    Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
    Naoya Horiguchi, "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)",
    Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
References: <6548b3b3-30c9-8f64-7d28-8a434e0a0b80@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 2:23 PM Peter Xu wrote:
> >
> > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > I do not know much about the (primary) live migration use case.  My
> > > guess is that page table lock contention may be an issue?  In this use
> > > case, HGM is only enabled for the duration the live migration operation,
> > > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > > issue during this time, then yes we would need to pass around with
> > > something like hugetlb_pte.
> >
> > I'm not aware of any such contention issue.  IMHO the migration problem is
> > majorly about being too slow transferring a page being so large.  Shrinking
> > the page size should resolve the major problem already here IIUC.
>
> This will be problematic if you scale up VMs to be quite large.

Do you mean that for the postcopy use case one can leverage e.g. 2M
mappings (over 1G) to avoid lock contentions when VM is large?
I agree it should be more efficient than having 512 4K pages installed, but
I think it'll make the page fault resolution slower too if some thread is
only looking for a 4K portion of it.

> Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> the MMU lock for writing in the EPT violation path.  We found that this
> change is required for VMs >200 or so vCPUs to consistently avoid CPU
> soft lockups in the guest.

After the kvm mmu rwlock conversion, it'll allow concurrent page faults even
if only 4K pages are used, so it seems not directly relevant to what we're
discussing here, no?

>
> Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> the same PTL would be problematic in the same way.

Pte-level pgtable lock only covers 2M range, so I think it depends on which
is the address that the vcpu is faulted on?  IIUC the major case should be
that the faulted threads are not falling upon the same 2M range.

>
> >
> > AFAIU 4K-only solution should only reduce any lock contention because locks
> > will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> > the intermediate pgtable entries we can use atomic ops just like generic
> > mm, so no lock needed at all.  With uncertainty on the size of mappings,
> > we'll need to take any of the multiple layers of locks.
> >
>
> Other than taking the HugeTLB VMA lock for reading, walking/allocating
> page tables won't need any additional locking.

Actually when revisiting the locks I'm getting a bit confused on whether
the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
raised a question in the other patch of MADV_COLLAPSE, maybe they're
related questions so we can keep it there.

>
> We take the PTL to allocate the next level down, but so does generic
> mm (look at __pud_alloc, __pmd_alloc for example).  Maybe I am
> misunderstanding.

Sorry you're right, please ignore that.  I don't know why I had that
impression that spinlocks are not needed in that process.
Actually I am also curious why atomics won't work (by holding mmap read
lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries).  I
think it's possible I just missed something else.

--
Peter Xu
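
For reference, the cmpxchg(old_entry=0, new_entry) idea can be modeled in
userspace roughly as below.  This is only a sketch under assumptions:
slot_t, install_if_empty() and get_or_alloc_table() are hypothetical names
made up for illustration, not kernel APIs, and C11 atomics stand in for the
kernel's cmpxchg().  It does not show that the approach is safe in the
kernel; as noted above, generic mm takes the PTL in __pud_alloc and
__pmd_alloc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one page table slot; 0 means "empty". */
typedef _Atomic uintptr_t slot_t;

/*
 * Try to publish new_entry into an empty slot with a single
 * compare-and-swap.  Returns true if this caller installed the entry,
 * false if the slot was already populated (the slot is left untouched).
 */
static bool install_if_empty(slot_t *slot, uintptr_t new_entry)
{
    uintptr_t expected = 0;

    assert(slot != NULL);
    return atomic_compare_exchange_strong(slot, &expected, new_entry);
}

/*
 * Return the lower-level table referenced by the slot, allocating and
 * installing one if the slot is empty.  If another caller wins the race,
 * free our allocation and use the winner's table instead.
 */
static void *get_or_alloc_table(slot_t *slot, size_t table_bytes)
{
    uintptr_t cur = atomic_load(slot);
    void *table;

    if (cur != 0)
        return (void *)cur;

    table = calloc(1, table_bytes);
    if (table == NULL)
        return NULL;

    if (!install_if_empty(slot, (uintptr_t)table)) {
        free(table);                      /* lost the race */
        return (void *)atomic_load(slot); /* use the installed table */
    }
    return table;
}
```

Two callers racing on the same empty slot both end up with the same table:
exactly one compare-and-swap succeeds and the loser frees its allocation.
The model assumes an installed entry is never cleared or freed while it is
in use, which is one of the things a real page table walk has to guarantee
by other means.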