From: Christian Schoenebeck
Date: Tue, 12 Jul 2022 16:35:54 +0200
Subject: [PATCH v5 00/11] remove msize limit in virtio transport
To: v9fs-developer@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    Dominique Martinet, Eric Van Hensbergen, Latchesar Ionkov,
    Nikolay Kichukov

This series aims to get rid of the current 500k 'msize' limitation in
the 9p virtio transport, which is currently a bottleneck for the
performance of 9p mounts.

To avoid confusion: the series does remove the msize limit for the
virtio transport; at the 9p client level though, the anticipated
milestone for this series is for now a max. 'msize' of 4 MB. See
patch 7 for the reason why.

This is a follow-up of the following series and discussion:
https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/

Latest version of this series:
https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap

OVERVIEW OF PATCHES:

* Patches 1..6 remove the msize limitation from the 'virtio' transport
  (i.e. the 9p 'virtio' transport itself actually supports >4MB now,
  tested successfully with an experimental QEMU version and some dirty
  9p Linux client hacks up to msize=128MB).

* Patch 7 limits msize for all transports to 4 MB for now, as >4MB
  would need more work at the 9p client level (see the commit log of
  patch 7 for details).

* Patches 8..11 tremendously reduce unnecessarily huge 9p message
  sizes and therefore provide a performance gain as well. So far,
  almost all 9p messages simply allocated message buffers exactly
  msize large, even for messages that actually needed just a few
  bytes. These patches therefore make sense on their own, independent
  of this overall series; for this series they matter even more,
  because the larger the msize, the more this issue would otherwise
  have hurt.

PREREQUISITES:

If you are testing with QEMU then please either use QEMU 6.2 or
higher, or at least apply the following patch on the QEMU side:
https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/

That QEMU patch is required if you are using a user space app that
automatically retrieves an optimum I/O block size by obeying stat's
st_blksize, which 'cat' for instance does, e.g.:

  time cat test_rnd.dat > /dev/null

Otherwise please use a user space app for performance testing that
allows you to force a large block size and thereby avoid that QEMU
issue, like 'dd' for instance (see the example below); in that case
you don't need to patch QEMU.
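For example, a dd invocation along these lines forces a large, fixed
block size (the bs=1M value here is just an arbitrary choice for
illustration; any size well above dd's 512 byte default serves the
purpose):

  time dd if=test_rnd.dat of=/dev/null bs=1M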
KNOWN LIMITATION:

With this series applied I can run

  QEMU host <-> 9P virtio <-> Linux guest

with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I
try to run it with exactly 4 MB (4194304) it currently hits a
limitation on the QEMU side:

  qemu-system-x86_64: virtio: too many write descriptors in indirect table

That's because QEMU currently has a hard-coded limit of max. 1024
virtio descriptors per vring slot (i.e. per virtio message), see
to-do (1.) below.

STILL TO DO:

1. Negotiating virtio "Queue Indirect Size" (MANDATORY):

   The QEMU issue described above must be addressed by negotiating the
   maximum length of virtio indirect descriptor tables on virtio
   device initialization. This would not only avoid the QEMU error
   above, but would also allow an msize of >4MB in the future. Before
   that change can be made on the Linux and QEMU sides though, it
   first requires a change to the virtio specs. Work on the virtio
   specs is in progress:

   https://github.com/oasis-tcs/virtio-spec/issues/122

   This is not really an issue for testing this series. Just stick to
   max. msize=4186112 as described above and you will be fine. For the
   final PR, however, this should obviously be addressed in a clean
   way.

2. Reduce readdir buffer sizes (optional - maybe later):

   This series already reduced the message buffers for most 9p message
   types. It does not yet cover Treaddir though, which still simply
   uses msize. It would make sense to benchmark first whether this is
   actually an issue that hurts. If it does, one might use already
   existing vfs knowledge to estimate the Treaddir size, or start with
   some reasonable hard-coded small Treaddir size first and then
   increase it just on the 2nd Treaddir request if there are more
   directory entries to fetch.

3. Add more buffer caches (optional - maybe later):

   p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for
   very large buffers, to reduce the latency of waiting for memory
   allocation to complete. Currently it does that only if the
   requested buffer size is exactly msize large. As patch 10 already
   divided the 9p message types into a few message size categories,
   maybe it would make sense to use e.g. 4 separate caches for those
   message size categories (e.g. 4k, 8k, msize/2, msize); a rough
   sketch follows below. Might be worth a benchmark test.
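To illustrate idea (3.), here is a minimal, hypothetical sketch, not
code from this series; the function names, the four size categories
and the fallback policy are all assumptions:

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/slab.h>

#define P9_NR_BUF_CACHES 4

static struct kmem_cache *p9_buf_caches[P9_NR_BUF_CACHES];
static unsigned int p9_buf_sizes[P9_NR_BUF_CACHES];

/* Create one slab cache per message size category; msize must already
 * have been negotiated at this point. Error unwinding of previously
 * created caches is omitted for brevity.
 */
static int p9_buf_caches_init(unsigned int msize)
{
        char name[32];
        int i;

        p9_buf_sizes[0] = 4096;
        p9_buf_sizes[1] = 8192;
        p9_buf_sizes[2] = msize / 2;
        p9_buf_sizes[3] = msize;

        for (i = 0; i < P9_NR_BUF_CACHES; i++) {
                /* kmem_cache_create() duplicates the name string, so a
                 * stack buffer is fine here */
                snprintf(name, sizeof(name), "9p-buf-%u", p9_buf_sizes[i]);
                p9_buf_caches[i] = kmem_cache_create(name, p9_buf_sizes[i],
                                                     0, 0, NULL);
                if (!p9_buf_caches[i])
                        return -ENOMEM;
        }
        return 0;
}

/* Allocate from the smallest cache whose object size fits the request. */
static void *p9_buf_alloc(unsigned int size, gfp_t gfp)
{
        int i;

        for (i = 0; i < P9_NR_BUF_CACHES; i++)
                if (size <= p9_buf_sizes[i])
                        return kmem_cache_alloc(p9_buf_caches[i], gfp);

        /* larger than msize: caller would fall back to kvmalloc() */
        return NULL;
}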
Testing and feedback appreciated!

v4 -> v5:

* Exclude RDMA transport from buffer size reduction. [patch 11]

Christian Schoenebeck (11):
  9p/trans_virtio: separate allocation of scatter gather list
  9p/trans_virtio: turn amount of sg lists into runtime info
  9p/trans_virtio: introduce struct virtqueue_sg
  net/9p: add trans_maxsize to struct p9_client
  9p/trans_virtio: support larger msize values
  9p/trans_virtio: resize sg lists to whatever is possible
  net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
  net/9p: split message size argument into 't_size' and 'r_size' pair
  9p: add P9_ERRMAX for 9p2000 and 9p2000.u
  net/9p: add p9_msg_buf_size()
  net/9p: allocate appropriate reduced message buffers

 include/net/9p/9p.h     |   3 +
 include/net/9p/client.h |   2 +
 net/9p/client.c         |  68 +++++++--
 net/9p/protocol.c       | 154 ++++++++++++++++++++
 net/9p/protocol.h       |   2 +
 net/9p/trans_virtio.c   | 304 +++++++++++++++++++++++++++++++++++-----
 6 files changed, 484 insertions(+), 49 deletions(-)

-- 
2.30.2