Date: Fri, 23 Jun 2023 14:18:38 +0800
From: Baolu Lu
To: Jason Gunthorpe
Cc: baolu.lu@linux.intel.com, Kevin Tian, Joerg Roedel, Will Deacon,
    Robin Murphy, Jean-Philippe Brucker, Nicolin Chen, Yi Liu, Jacob Pan,
    iommu@lists.linux.dev, linux-kselftest@vger.kernel.org,
    virtualization@lists.linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space
References: <20230530053724.232765-1-baolu.lu@linux.intel.com>

On 5/31/23 8:33 AM, Jason Gunthorpe wrote:
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should report the page fault handling result back to the device through
>> the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> where it will listen for any bottom-up page fault messages.
>>
>> On a successful allocation of an iopf-capable HWPT, a fault fd will be
>> returned. User space can open and read fault messages from it once the
>> eventfd is signaled.
>
> This is a performance path so we really need to think about this more;
> polling on an eventfd and then reading a different fd is not a good
> design.
>
> What I would like is to have a design from the start that fits into
> io_uring, so we can have pre-posted 'recvs' in io_uring that just get
> completed at high speed when PRIs come in.
>
> This suggests that the PRI should be delivered via read() on a single
> FD and pollability on the single FD without any eventfd.

I will remove the eventfd and provide a single FD for both read() and
write(). User space reads the FD to retrieve fault messages and writes
to the FD to respond to the faults. It could leverage io_uring for
asynchronous I/O. A sample userspace design could look like this:

[pseudo code for discussion only, using liburing-style calls]

        struct io_uring ring;

        io_uring_queue_init(IOPF_ENTRIES, &ring, 0);

        while (1) {
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                void *buf = malloc(IOPF_SIZE);

                /* Pre-post a read for the next fault message. */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, iopf_fd, buf, IOPF_SIZE, 0);
                io_uring_submit(&ring);

                /* Wait for the read to complete. */
                io_uring_wait_cqe(&ring, &cqe);
                if (cqe->res < 0) {
                        io_uring_cqe_seen(&ring, cqe);
                        break;
                }

                /* Handle the page fault. */
                handle_page_fault(buf);

                /* Respond to the fault. */
                void *resp = malloc(IOPF_RESPONSE_SIZE);
                /* ... fill in the response message ... */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_write(sqe, iopf_fd, resp, IOPF_RESPONSE_SIZE, 0);
                io_uring_submit(&ring);

                /* Reap the read cqe; write completions would be reaped
                 * in the same way (omitted here). */
                io_uring_cqe_seen(&ring, cqe);
        }
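Alternatively, since the single FD is pollable, a userspace that does
not use io_uring could consume it with a plain poll()/read()/write()
loop. Below is only a sketch of that model; the message and response
structures and the handler are placeholders for discussion, not a
proposal for the real uAPI layout:

        #include <poll.h>
        #include <stdint.h>
        #include <unistd.h>

        /* Placeholder layouts for discussion; not the real uAPI. */
        struct iopf_fault_msg {
                uint64_t addr;
                uint32_t pasid;
                uint32_t cookie;
        };

        struct iopf_fault_response {
                uint32_t cookie;
                uint32_t code;
        };

        static void iopf_poll_loop(int iopf_fd)
        {
                struct pollfd pfd = { .fd = iopf_fd, .events = POLLIN };

                while (poll(&pfd, 1, -1) > 0) {
                        struct iopf_fault_msg msg;
                        struct iopf_fault_response resp;

                        if (read(iopf_fd, &msg, sizeof(msg)) != sizeof(msg))
                                break;

                        /* Fix up the first-stage page table and build the
                         * response (placeholder helper). */
                        handle_page_fault(&msg, &resp);

                        if (write(iopf_fd, &resp, sizeof(resp)) != sizeof(resp))
                                break;
                }
        }

Either way, the kernel side only needs to implement read(), write() and
poll() on the one FD.

Did I understand you correctly?

Best regards,
baolu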