From: riel@surriel.com
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, akpm@linux-foundation.org, muchun.song@linux.dev, mike.kravetz@oracle.com, leit@meta.com, willy@infradead.org
Subject: [PATCH v2 0/3] hugetlbfs: close race between MADV_DONTNEED and page fault
Date: Fri, 22 Sep 2023 15:02:28 -0400
Message-ID: <20230922190552.3963067-1-riel@surriel.com>

v2: fix the locking bug found with the libhugetlbfs tests.

Malloc libraries, like jemalloc and tcmalloc, make decisions on when
to call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages
sitting around to satisfy a page fault. However, with hugetlbfs,
systems often allocate only the exact number of huge pages that the
application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

       CPU 1                            CPU 2

       MADV_DONTNEED
       unmap page
       shoot down TLB entry
                                        page fault
                                        fail to allocate a huge page
                                        killed with SIGBUS
       free page

Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_range_final into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of
the MADV_DONTNEED VMA until the huge pages have actually been freed.

The third patch in the series is more of an RFC.
Using the invalidate_lock instead of the hugetlb_vma_lock greatly simplifies the code, but at the cost of turning a per-VMA lock into a lock per backing hugetlbfs file, which could slow things down when multiple processes are mapping the same hugetlbfs file.