Message-ID: <79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr>
Date: Tue, 18 Jul 2023 17:33:12 +0300
From: Dimitris Siakavaras
Subject: Using userfaultfd with KVM's async page fault handling causes processes to hung waiting for mmap_lock to be released
To: viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

Hi,

This is my first bug report, so I apologise in advance for any missing information or unclear explanations. I am happy to provide any further information or to revise this email as needed.

Problem: Using userfaultfd for a process that uses KVM and triggers asynchronous page fault handling causes processes to hang forever.

Processor: AMD EPYC 7402 24-Core Processor
Kernel version: 5.13 (the problem also occurs on 6.4.3 and 6.5-rc2)

Unfortunately, my execution environment involves a fairly complex set of components, so I cannot easily share code that reproduces the issue. Instead, I will try to explain the problem as clearly as possible.

I have two processes:
1. A firecracker VM process (https://firecracker-microvm.github.io/), which uses KVM.
2. A second process that handles the userfaultfd page faults of the firecracker process.

The race condition involves the released field of the userfaultfd_ctx structure.

More specifically:

* Process 2 invokes the close() system call on the userfaultfd descriptor, which triggers the execution of userfaultfd_release() in the kernel. userfaultfd_release() contains the following lines of code:

    WRITE_ONCE(ctx->released, true);

    if (!mmget_not_zero(mm))
        goto wakeup;

    /*
     * Flush page faults out of all CPUs. NOTE: all page faults
     * must be retried without returning VM_FAULT_SIGBUS if
     * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
     * changes while handle_userfault released the mmap_lock. So
     * it's critical that released is set to true (above), before
     * taking the mmap_lock for writing.
     */
    mmap_write_lock(mm);

* Process 1 takes a page fault while running inside KVM_ENTRY. This triggers the execution of kvm_tdp_page_fault(), and the following function call chain is executed:

    kvm_tdp_page_fault()
      -> direct_page_fault()
      -> try_async_pf()
      -> kvm_arch_setup_async_pf()
      -> kvm_setup_async_pf()

  kvm_setup_async_pf() sets up a work item that runs async_pf_execute() and adds it to the workqueue:

    INIT_WORK(&work->work, async_pf_execute);

  Then, the following function call chain is executed:

    async_pf_execute()
      -> get_user_pages_remote()
      -> __get_user_pages_remote()
      -> __get_user_pages_locked()
      -> __get_user_pages()

  __get_user_pages() is called with the mmap_lock held, and it contains the following code:

    retry:
        /*
         * If we have a pending SIGKILL, don't keep faulting pages and
         * potentially allocating memory.
         */
        if (fatal_signal_pending(current)) {
            ret = -EINTR;
            goto out;
        }
        cond_resched();

        page = follow_page_mask(vma, start, foll_flags, &ctx);
        if (!page) {
            ret = faultin_page(vma, start, &foll_flags, locked);
            switch (ret) {
            case 0:
                goto retry;

  When faultin_page() is called here, it in turn calls the following chain of functions:

    faultin_page()
      -> handle_mm_fault()
      -> __handle_mm_fault()
      -> handle_pte_fault()
      -> do_anonymous_page()
      -> handle_userfault()

  The final function, handle_userfault(), is the one userfaultfd uses to handle the userfault. It contains the following code:

    if (unlikely(READ_ONCE(ctx->released))) {
        /*
         * Don't return VM_FAULT_SIGBUS in this case, so a non
         * cooperative manager can close the uffd after the
         * last UFFDIO_COPY, without risking to trigger an
         * involuntary SIGBUS if the process was starting the
         * userfaultfd while the userfaultfd was still armed
         * (but after the last UFFDIO_COPY). If the uffd
         * wasn't already closed when the userfault reached
         * this point, that would normally be solved by
         * userfaultfd_must_wait returning 'false'.
         *
         * If we were to return VM_FAULT_SIGBUS here, the non
         * cooperative manager would be instead forced to
         * always call UFFDIO_UNREGISTER before it can safely
         * close the uffd.
         */
        ret = VM_FAULT_NOPAGE;
        goto out;
    }

The problem is that once ctx->released has been set to true by userfaultfd_release() (called from Process 2), handle_userfault() returns VM_FAULT_NOPAGE because of the if statement above. As a result, handle_mm_fault() returns VM_FAULT_NOPAGE to faultin_page(), and faultin_page() in turn returns 0.

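To make the return-value chain above concrete, here is a minimal user-space sketch in C. It is not kernel code: every model_* name is hypothetical, the retry cap exists only so that the model terminates, and the assumption that a still-open userfaultfd resolves the fault on the first attempt is a deliberate simplification of what handle_userfault() really does.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's VM_FAULT_NOPAGE; the value is arbitrary. */
enum { MODEL_VM_FAULT_NOPAGE = 1 };

/* Hypothetical mirror of the one userfaultfd_ctx field that matters here. */
struct model_uffd_ctx {
    bool released; /* set to true by the modelled userfaultfd_release() */
};

/*
 * Models handle_userfault(): once released is set, it reports NOPAGE
 * and maps nothing. Assumption: otherwise the fault is resolved at once.
 */
static int model_handle_userfault(const struct model_uffd_ctx *ctx,
                                  bool *page_present)
{
    if (ctx->released)
        return MODEL_VM_FAULT_NOPAGE;
    *page_present = true;
    return 0;
}

/*
 * Models faultin_page(): as described above, VM_FAULT_NOPAGE is not
 * treated as an error, so the caller sees 0 in both cases.
 */
static int model_faultin_page(const struct model_uffd_ctx *ctx,
                              bool *page_present)
{
    (void)model_handle_userfault(ctx, page_present);
    return 0;
}

/*
 * Models the retry loop of __get_user_pages(). Returns the number of
 * retries taken, or -1 if max_retries was reached without a page.
 */
static long model_gup_retries(const struct model_uffd_ctx *ctx,
                              long max_retries)
{
    bool page_present = false;
    long retries = 0;

    assert(max_retries > 0);
    while (!page_present) {            /* follow_page_mask() found no page */
        if (retries == max_retries)
            return -1;                 /* cap is an artifact of the model */
        if (model_faultin_page(ctx, &page_present) == 0)
            retries++;                 /* "case 0: goto retry;" */
    }
    return retries;
}
```

Calling model_gup_retries() with released set to false returns after a single pass, while calling it with released set to true stops only because of the artificial cap: nothing inside the loop ever changes the condition it is waiting on.
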
Back at the call to faultin_page() in __get_user_pages(), the "case 0:" branch sends execution back to the retry label. Since ctx->released never reverts to false, this loop continues forever: Process 1 keeps calling faultin_page(), getting 0 as the return value, and jumping back to retry.

Because Process 1 still holds the mmap_lock and never releases it, Process 2 also hangs in its call to mmap_write_lock(mm). As a result, both processes are stuck: Process 1 spins in the retry loop (a livelock), while Process 2 blocks forever waiting for the lock.

Unfortunately, I have only limited knowledge of the mm kernel subsystem, so I cannot propose a fix, but I hope someone with more kernel development experience can come up with a proper solution.

Thank you very much,
Best Regards,
Dimitris Siakavaras