2022-07-12 16:03:54

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 00/11] remove msize limit in virtio transport

This series aims to get rid of the current 500k 'msize' limitation in
the 9p virtio transport, which is currently a bottleneck for the
performance of 9p mounts.

To avoid confusion: this series does remove the msize limit for the virtio
transport; on the 9p client level though, the anticipated milestone for this
series is now a max. 'msize' of 4 MB. See patch 7 for the reason why.

This is a follow-up of the following series and discussion:
https://lore.kernel.org/all/[email protected]/

Latest version of this series:
https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap


OVERVIEW OF PATCHES:

* Patches 1..6 remove the msize limitation from the 'virtio' transport
(i.e. the 9p 'virtio' transport itself now actually supports >4MB; tested
successfully with an experimental QEMU version and some dirty 9p Linux
client hacks up to msize=128MB).

* Patch 7 limits msize to 4 MB for all transports for now, as >4MB would
need more work on the 9p client level (see the commit log of patch 7 for
details).

* Patches 8..11 tremendously reduce unnecessarily large 9p message sizes
and therefore provide a performance gain as well. So far, almost all 9p
messages simply allocated message buffers exactly msize large, even for
messages that actually needed just a few bytes. So these patches make
sense by themselves, independent of this overall series; they matter even
more for this series though, because the larger msize is, the more this
issue would otherwise have hurt.


PREREQUISITES:

If you are testing with QEMU then please either use QEMU 6.2 or higher, or
at least apply the following patch on QEMU side:

https://lore.kernel.org/qemu-devel/[email protected]/

That QEMU patch is required if you are using a user space app that
automatically retrieves an optimal I/O block size by obeying stat's
st_blksize, which 'cat' for instance does, e.g.:

time cat test_rnd.dat > /dev/null

Otherwise please use a user space app for performance testing that allows
you to force a large block size, like 'dd' for instance; that avoids the
QEMU issue, and in that case you don't need to patch QEMU.
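For instance, a benchmark along these lines (the file name is just the
example from above; 'bs' forces dd's read size regardless of the st_blksize
value reported by the 9p mount):

```shell
# create a random test file to read back (name is just an example)
dd if=/dev/urandom of=test_rnd.dat bs=1M count=100

# benchmark sequential reads with a forced large block size,
# independent of the st_blksize reported by the mount
time dd if=test_rnd.dat of=/dev/null bs=4186112
```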


KNOWN LIMITATION:

With this series applied I can run

QEMU host <-> 9P virtio <-> Linux guest

with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I try
to run it with exactly 4 MB (4194304), it currently hits a limitation on
the QEMU side:

qemu-system-x86_64: virtio: too many write descriptors in indirect table

That's because QEMU currently has a hard-coded limit of max. 1024 virtio
descriptors per vring slot (i.e. per virtio message); see TO DO (1.) below.


STILL TO DO:

1. Negotiating virtio "Queue Indirect Size" (MANDATORY):

The QEMU issue described above must be addressed by negotiating the
maximum length of virtio indirect descriptor tables on virtio device
initialization. This would not only avoid the QEMU error above, but would
also allow msize >4MB in the future. Before that change can be made on
the Linux and QEMU sides though, it first requires a change to the virtio
specs. Work on the virtio specs is in progress:

https://github.com/oasis-tcs/virtio-spec/issues/122

This is not really an issue for testing this series. Just stick to max.
msize=4186112 as described above and you will be fine. For the final PR,
however, this should obviously be addressed in a clean way.

2. Reduce readdir buffer sizes (optional - maybe later):

This series already reduced the message buffers for most 9p message
types. It does not include Treaddir yet though, which still simply
uses msize. It would make sense to benchmark first whether this is
actually an issue that hurts. If it does, one might use already
existing vfs knowledge to estimate the Treaddir size, or start with
some reasonable hard-coded small Treaddir size first and then increase
it on the 2nd Treaddir request if there are more directory entries
to fetch.

3. Add more buffer caches (optional - maybe later):

p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for very
large buffers to reduce latency waiting for memory allocation to
complete. Currently it does that only if the requested buffer size is
exactly msize large. As patch 10 already divided the 9p message types
into a few message size categories, maybe it would make sense to use e.g.
4 separate caches for those memory categories (e.g. 4k, 8k, msize/2,
msize). Might be worth a benchmark test.

Testing and feedback appreciated!

v4 -> v5:

* Exclude RDMA transport from buffer size reduction. [patch 11]

Christian Schoenebeck (11):
9p/trans_virtio: separate allocation of scatter gather list
9p/trans_virtio: turn amount of sg lists into runtime info
9p/trans_virtio: introduce struct virtqueue_sg
net/9p: add trans_maxsize to struct p9_client
9p/trans_virtio: support larger msize values
9p/trans_virtio: resize sg lists to whatever is possible
net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
net/9p: split message size argument into 't_size' and 'r_size' pair
9p: add P9_ERRMAX for 9p2000 and 9p2000.u
net/9p: add p9_msg_buf_size()
net/9p: allocate appropriate reduced message buffers

include/net/9p/9p.h | 3 +
include/net/9p/client.h | 2 +
net/9p/client.c | 68 +++++++--
net/9p/protocol.c | 154 ++++++++++++++++++++
net/9p/protocol.h | 2 +
net/9p/trans_virtio.c | 304 +++++++++++++++++++++++++++++++++++-----
6 files changed, 484 insertions(+), 49 deletions(-)

--
2.30.2


2022-07-12 16:31:36

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 02/11] 9p/trans_virtio: turn amount of sg lists into runtime info

The size of the scatter/gather lists used by the virtio transport is
currently hard coded. Refactor this to use a runtime variable instead.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
net/9p/trans_virtio.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 2693e618080c..18bdfa64b934 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,7 @@
#include <linux/virtio_9p.h>
#include "trans_common.h"

-#define VIRTQUEUE_NUM 128
+#define VIRTQUEUE_DEFAULT_NUM 128

/* a single mutex to manage channel initialization and attachment */
static DEFINE_MUTEX(virtio_9p_lock);
@@ -54,6 +54,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
* @vc_wq: wait queue for waiting for thing to be added to ring buf
* @p9_max_pages: maximum number of pinned pages
* @sg: scatter gather list which is used to pack a request (protected?)
+ * @sg_n: amount of elements in sg array
* @chan_list: linked list of channels
*
* We keep all per-channel information in a structure.
@@ -78,6 +79,7 @@ struct virtio_chan {
unsigned long p9_max_pages;
/* Scatterlist: can be too big for stack. */
struct scatterlist *sg;
+ size_t sg_n;
/**
* @tag: name to identify a mount null terminated
*/
@@ -270,12 +272,12 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
out_sgs = in_sgs = 0;
/* Handle out VirtIO ring buffers */
out = pack_sg_list(chan->sg, 0,
- VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+ chan->sg_n, req->tc.sdata, req->tc.size);
if (out)
sgs[out_sgs++] = chan->sg;

in = pack_sg_list(chan->sg, out,
- VIRTQUEUE_NUM, req->rc.sdata, req->rc.capacity);
+ chan->sg_n, req->rc.sdata, req->rc.capacity);
if (in)
sgs[out_sgs + in_sgs++] = chan->sg + out;

@@ -447,14 +449,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,

/* out data */
out = pack_sg_list(chan->sg, 0,
- VIRTQUEUE_NUM, req->tc.sdata, req->tc.size);
+ chan->sg_n, req->tc.sdata, req->tc.size);

if (out)
sgs[out_sgs++] = chan->sg;

if (out_pages) {
sgs[out_sgs++] = chan->sg + out;
- out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
+ out += pack_sg_list_p(chan->sg, out, chan->sg_n,
out_pages, out_nr_pages, offs, outlen);
}

@@ -466,13 +468,13 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
* allocated memory and payload onto the user buffer.
*/
in = pack_sg_list(chan->sg, out,
- VIRTQUEUE_NUM, req->rc.sdata, in_hdr_len);
+ chan->sg_n, req->rc.sdata, in_hdr_len);
if (in)
sgs[out_sgs + in_sgs++] = chan->sg + out;

if (in_pages) {
sgs[out_sgs + in_sgs++] = chan->sg + out + in;
- in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
+ in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
in_pages, in_nr_pages, offs, inlen);
}

@@ -574,13 +576,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
goto fail;
}

- chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+ chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
sizeof(struct scatterlist), GFP_KERNEL);
if (!chan->sg) {
pr_err("Failed to allocate virtio 9P channel\n");
err = -ENOMEM;
goto out_free_chan_shallow;
}
+ chan->sg_n = VIRTQUEUE_DEFAULT_NUM;

chan->vdev = vdev;

@@ -593,7 +596,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
chan->vq->vdev->priv = chan;
spin_lock_init(&chan->lock);

- sg_init_table(chan->sg, VIRTQUEUE_NUM);
+ sg_init_table(chan->sg, chan->sg_n);

chan->inuse = false;
if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
@@ -777,7 +780,7 @@ static struct p9_trans_module p9_virtio_trans = {
* that are not at page boundary, that can result in an extra
* page in zero copy.
*/
- .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
+ .maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
.def = 1,
.owner = THIS_MODULE,
};
--
2.30.2

2022-07-12 16:31:51

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

So far 'msize' was simply used for all 9p message types, which is far
too much and slowed down performance tremendously for large values of
the user-configurable 'msize' option.

Let's stop this waste by using the new p9_msg_buf_size() function for
allocating more appropriate, smaller buffers according to what is
actually sent over the wire.

Only exception: the RDMA transport is currently excluded from this, as
it would not cope with it. [1]

Link: https://lore.kernel.org/all/[email protected]/ [1]
Signed-off-by: Christian Schoenebeck <[email protected]>
---

Is the !strcmp(c->trans_mod->name, "rdma") check in this patch maybe a bit
too hack-ish? Should there rather be a transport API extension instead?

net/9p/client.c | 39 ++++++++++++++++++++++++++++++++-------
1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 56be1658870d..dc1a7b26fab4 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -255,19 +255,35 @@ static struct kmem_cache *p9_req_cache;
* p9_tag_alloc - Allocate a new request.
* @c: Client session.
* @type: Transaction type.
- * @t_size: Buffer size for holding this request.
- * @r_size: Buffer size for holding server's reply on this request.
+ * @t_size: Buffer size for holding this request
+ * (automatic calculation by format template if 0).
+ * @r_size: Buffer size for holding server's reply on this request
+ * (automatic calculation by format template if 0).
+ * @fmt: Format template for assembling 9p request message
+ * (see p9pdu_vwritef).
+ * @ap: Variable arguments to be fed to passed format template
+ * (see p9pdu_vwritef).
*
* Context: Process context.
* Return: Pointer to new request.
*/
static struct p9_req_t *
-p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size)
+p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size,
+ const char *fmt, va_list ap)
{
struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
- int alloc_tsize = min(c->msize, t_size);
- int alloc_rsize = min(c->msize, r_size);
+ int alloc_tsize;
+ int alloc_rsize;
int tag;
+ va_list apc;
+
+ va_copy(apc, ap);
+ alloc_tsize = min_t(size_t, c->msize,
+ t_size ?: p9_msg_buf_size(c, type, fmt, apc));
+ va_end(apc);
+
+ alloc_rsize = min_t(size_t, c->msize,
+ r_size ?: p9_msg_buf_size(c, type + 1, fmt, ap));

if (!req)
return ERR_PTR(-ENOMEM);
@@ -685,6 +701,7 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
{
int err;
struct p9_req_t *req;
+ va_list apc;

p9_debug(P9_DEBUG_MUX, "client %p op %d\n", c, type);

@@ -696,7 +713,9 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
if (c->status == BeginDisconnect && type != P9_TCLUNK)
return ERR_PTR(-EIO);

- req = p9_tag_alloc(c, type, t_size, r_size);
+ va_copy(apc, ap);
+ req = p9_tag_alloc(c, type, t_size, r_size, fmt, apc);
+ va_end(apc);
if (IS_ERR(req))
return req;

@@ -731,9 +750,15 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
int sigpending, err;
unsigned long flags;
struct p9_req_t *req;
+ /* Passing zero to p9_client_prepare_req() tells it to auto determine an
+ * appropriate (small) request/response size according to actual message
+ * data being sent. Currently RDMA transport is excluded from this message
+ * size optimization, as it would not be able to cope with it.
+ */
+ const uint max_size = !strcmp(c->trans_mod->name, "rdma") ? c->msize : 0;

va_start(ap, fmt);
- req = p9_client_prepare_req(c, type, c->msize, c->msize, fmt, ap);
+ req = p9_client_prepare_req(c, type, max_size, max_size, fmt, ap);
va_end(ap);
if (IS_ERR(req))
return req;
--
2.30.2

2022-07-12 16:35:43

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 01/11] 9p/trans_virtio: separate allocation of scatter gather list

The scatter gather list in struct virtio_chan currently resides as a
compile-time, constant-size array directly within struct virtio_chan's
contiguous memory space.

Separate memory space and allocation of the scatter gather list
from memory space and allocation of struct virtio_chan.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
net/9p/trans_virtio.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index b24a4fb0f0a2..2693e618080c 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -77,7 +77,7 @@ struct virtio_chan {
*/
unsigned long p9_max_pages;
/* Scatterlist: can be too big for stack. */
- struct scatterlist sg[VIRTQUEUE_NUM];
+ struct scatterlist *sg;
/**
* @tag: name to identify a mount null terminated
*/
@@ -574,6 +574,14 @@ static int p9_virtio_probe(struct virtio_device *vdev)
goto fail;
}

+ chan->sg = kmalloc_array(VIRTQUEUE_NUM,
+ sizeof(struct scatterlist), GFP_KERNEL);
+ if (!chan->sg) {
+ pr_err("Failed to allocate virtio 9P channel\n");
+ err = -ENOMEM;
+ goto out_free_chan_shallow;
+ }
+
chan->vdev = vdev;

/* We expect one virtqueue, for requests. */
@@ -635,6 +643,8 @@ static int p9_virtio_probe(struct virtio_device *vdev)
out_free_vq:
vdev->config->del_vqs(vdev);
out_free_chan:
+ kfree(chan->sg);
+out_free_chan_shallow:
kfree(chan);
fail:
return err;
@@ -728,6 +738,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
kfree(chan->tag);
kfree(chan->vc_wq);
+ kfree(chan->sg);
kfree(chan);

}
--
2.30.2

2022-07-12 16:36:00

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 04/11] net/9p: add trans_maxsize to struct p9_client

This new field 'trans_maxsize' optionally allows the transport to
update it to reflect the actual maximum msize supported by the
allocated transport channel.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
include/net/9p/client.h | 2 ++
net/9p/client.c | 12 ++++++++++--
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index ec1d1706f43c..f5718057fca4 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -87,6 +87,7 @@ struct p9_req_t {
* struct p9_client - per client instance state
* @lock: protect @fids and @reqs
* @msize: maximum data size negotiated by protocol
+ * @trans_maxsize: actual maximum msize supported by transport channel
* @proto_version: 9P protocol version to use
* @trans_mod: module API instantiated with this client
* @status: connection state
@@ -101,6 +102,7 @@ struct p9_req_t {
struct p9_client {
spinlock_t lock;
unsigned int msize;
+ unsigned int trans_maxsize;
unsigned char proto_version;
struct p9_trans_module *trans_mod;
enum p9_trans_status status;
diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..20054addd81b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1031,6 +1031,14 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
goto free_client;
}

+ /*
+ * transport will get a chance to increase trans_maxsize (if
+ * necessary) and it may update trans_maxsize in create() function
+ * below accordingly to reflect the actual maximum size supported by
+ * the allocated transport channel
+ */
+ clnt->trans_maxsize = clnt->trans_mod->maxsize;
+
p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);

@@ -1038,8 +1046,8 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
if (err)
goto put_trans;

- if (clnt->msize > clnt->trans_mod->maxsize) {
- clnt->msize = clnt->trans_mod->maxsize;
+ if (clnt->msize > clnt->trans_maxsize) {
+ clnt->msize = clnt->trans_maxsize;
pr_info("Limiting 'msize' to %d as this is the maximum "
"supported by transport %s\n",
clnt->msize, clnt->trans_mod->name
--
2.30.2

2022-07-12 16:37:22

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 03/11] 9p/trans_virtio: introduce struct virtqueue_sg

The amount of elements in a scatter/gather list is limited to
approximately 128 elements. To allow going beyond that limit
with subsequent patches, pave the way by turning the one-
dimensional sg list array into a two-dimensional array, i.e:

sg[128]

becomes

sgl[nsgl][SG_MAX_SINGLE_ALLOC]

As the value of 'nsgl' is exactly (still) 1 in this commit
and the compile-time (compiler and architecture dependent)
value of 'SG_MAX_SINGLE_ALLOC' equals approximately the
previous hard coded 128 elements, this commit is therefore
more of a preparatory refactoring then actual behaviour
change.

A custom struct virtqueue_sg is defined instead of using the
shared API struct sg_table, because the latter would not allow
resizing the table after allocation. The sg_append_table API
OTOH would not fit either, because it requires a list of pages
beforehand upon allocation. And both APIs only support
all-or-nothing allocation.

Signed-off-by: Christian Schoenebeck <[email protected]>
---

The question is whether this should really become 9p-specific SG list
code, or whether it should rather be squeezed into the shared SG list
code base. Opinions from maintainers needed.

net/9p/trans_virtio.c | 193 ++++++++++++++++++++++++++++++++----------
1 file changed, 147 insertions(+), 46 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 18bdfa64b934..f63cd1b08bca 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -36,7 +36,31 @@
#include <linux/virtio_9p.h>
#include "trans_common.h"

-#define VIRTQUEUE_DEFAULT_NUM 128
+/**
+ * struct virtqueue_sg - (chained) scatter gather lists for virtqueue data
+ * transmission
+ * @nsgl: amount of elements (in first dimension) of array field @sgl
+ * @sgl: two-dimensional array, i.e. sgl[nsgl][SG_MAX_SINGLE_ALLOC]
+ */
+struct virtqueue_sg {
+ unsigned int nsgl;
+ struct scatterlist *sgl[];
+};
+
+/*
+ * Default value for field nsgl in struct virtqueue_sg, which defines the
+ * initial virtio data transmission capacity when this virtio transport is
+ * probed.
+ */
+#define VIRTQUEUE_SG_NSGL_DEFAULT 1
+
+/* maximum value for field nsgl in struct virtqueue_sg */
+#define VIRTQUEUE_SG_NSGL_MAX \
+ ((PAGE_SIZE - sizeof(struct virtqueue_sg)) / \
+ sizeof(struct scatterlist *)) \
+
+/* last entry per sg list is used for chaining (pointer to next list) */
+#define SG_USER_PAGES_PER_LIST (SG_MAX_SINGLE_ALLOC - 1)

/* a single mutex to manage channel initialization and attachment */
static DEFINE_MUTEX(virtio_9p_lock);
@@ -53,8 +77,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
* @ring_bufs_avail: flag to indicate there is some available in the ring buf
* @vc_wq: wait queue for waiting for thing to be added to ring buf
* @p9_max_pages: maximum number of pinned pages
- * @sg: scatter gather list which is used to pack a request (protected?)
- * @sg_n: amount of elements in sg array
+ * @vq_sg: table of scatter gather lists, which are used to pack a request
* @chan_list: linked list of channels
*
* We keep all per-channel information in a structure.
@@ -77,9 +100,7 @@ struct virtio_chan {
* will be placing it in each channel.
*/
unsigned long p9_max_pages;
- /* Scatterlist: can be too big for stack. */
- struct scatterlist *sg;
- size_t sg_n;
+ struct virtqueue_sg *vq_sg;
/**
* @tag: name to identify a mount null terminated
*/
@@ -96,6 +117,92 @@ static unsigned int rest_of_page(void *data)
return PAGE_SIZE - offset_in_page(data);
}

+/**
+ * vq_sg_page - returns user page for given page index
+ * @vq_sg: scatter gather lists used by this transport
+ * @page: user page index across all scatter gather lists
+ */
+static struct scatterlist *vq_sg_page(struct virtqueue_sg *vq_sg, size_t page)
+{
+ unsigned int node = page / SG_USER_PAGES_PER_LIST;
+ unsigned int leaf = page % SG_USER_PAGES_PER_LIST;
+ BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX);
+ return &vq_sg->sgl[node][leaf];
+}
+
+/**
+ * vq_sg_npages - returns total number of individual user pages in passed
+ * scatter gather lists
+ * @vq_sg: scatter gather lists to be counted
+ */
+static size_t vq_sg_npages(struct virtqueue_sg *vq_sg)
+{
+ return vq_sg->nsgl * SG_USER_PAGES_PER_LIST;
+}
+
+/**
+ * vq_sg_free - free all memory previously allocated for @vq_sg
+ * @vq_sg: scatter gather lists to be freed
+ */
+static void vq_sg_free(struct virtqueue_sg *vq_sg)
+{
+ unsigned int i;
+
+ if (!vq_sg)
+ return;
+
+ for (i = 0; i < vq_sg->nsgl; ++i) {
+ kfree(vq_sg->sgl[i]);
+ }
+ kfree(vq_sg);
+}
+
+/**
+ * vq_sg_alloc - allocates and returns @nsgl scatter gather lists
+ * @nsgl: amount of scatter gather lists to be allocated
+ * If @nsgl is larger than one then chained lists are used if supported by
+ * architecture.
+ */
+static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
+{
+ struct virtqueue_sg *vq_sg;
+ unsigned int i;
+
+ BUG_ON(!nsgl || nsgl > VIRTQUEUE_SG_NSGL_MAX);
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+ if (WARN_ON_ONCE(nsgl > 1))
+ return NULL;
+#endif
+
+ vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+ nsgl * sizeof(struct scatterlist *),
+ GFP_KERNEL);
+
+ if (!vq_sg)
+ return NULL;
+
+ vq_sg->nsgl = nsgl;
+
+ for (i = 0; i < nsgl; ++i) {
+ vq_sg->sgl[i] = kmalloc_array(
+ SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+ GFP_KERNEL
+ );
+ if (!vq_sg->sgl[i]) {
+ vq_sg_free(vq_sg);
+ return NULL;
+ }
+ sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+ if (i) {
+ /* chain the lists */
+ sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+ vq_sg->sgl[i]);
+ }
+ }
+ sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+ return vq_sg;
+}
+
/**
* p9_virtio_close - reclaim resources of a channel
* @client: client instance
@@ -158,9 +265,8 @@ static void req_done(struct virtqueue *vq)

/**
* pack_sg_list - pack a scatter gather list from a linear buffer
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
* @start: which segment of the sg_list to start at
- * @limit: maximum segment to pack data to
* @data: data to pack into scatter/gather list
* @count: amount of data to pack into the scatter/gather list
*
@@ -170,11 +276,12 @@ static void req_done(struct virtqueue *vq)
*
*/

-static int pack_sg_list(struct scatterlist *sg, int start,
- int limit, char *data, int count)
+static int pack_sg_list(struct virtqueue_sg *vq_sg, int start,
+ char *data, int count)
{
int s;
int index = start;
+ size_t limit = vq_sg_npages(vq_sg);

while (count) {
s = rest_of_page(data);
@@ -182,13 +289,13 @@ static int pack_sg_list(struct scatterlist *sg, int start,
s = count;
BUG_ON(index >= limit);
/* Make sure we don't terminate early. */
- sg_unmark_end(&sg[index]);
- sg_set_buf(&sg[index++], data, s);
+ sg_unmark_end(vq_sg_page(vq_sg, index));
+ sg_set_buf(vq_sg_page(vq_sg, index++), data, s);
count -= s;
data += s;
}
if (index-start)
- sg_mark_end(&sg[index - 1]);
+ sg_mark_end(vq_sg_page(vq_sg, index - 1));
return index-start;
}

@@ -208,21 +315,21 @@ static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
/**
* pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
* this takes a list of pages.
- * @sg: scatter/gather list to pack into
+ * @vq_sg: scatter/gather lists to pack into
* @start: which segment of the sg_list to start at
- * @limit: maximum number of pages in sg list.
* @pdata: a list of pages to add into sg.
* @nr_pages: number of pages to pack into the scatter/gather list
* @offs: amount of data in the beginning of first page _not_ to pack
* @count: amount of data to pack into the scatter/gather list
*/
static int
-pack_sg_list_p(struct scatterlist *sg, int start, int limit,
+pack_sg_list_p(struct virtqueue_sg *vq_sg, int start,
struct page **pdata, int nr_pages, size_t offs, int count)
{
int i = 0, s;
int data_off = offs;
int index = start;
+ size_t limit = vq_sg_npages(vq_sg);

BUG_ON(nr_pages > (limit - start));
/*
@@ -235,15 +342,16 @@ pack_sg_list_p(struct scatterlist *sg, int start, int limit,
s = count;
BUG_ON(index >= limit);
/* Make sure we don't terminate early. */
- sg_unmark_end(&sg[index]);
- sg_set_page(&sg[index++], pdata[i++], s, data_off);
+ sg_unmark_end(vq_sg_page(vq_sg, index));
+ sg_set_page(vq_sg_page(vq_sg, index++), pdata[i++], s,
+ data_off);
data_off = 0;
count -= s;
nr_pages--;
}

if (index-start)
- sg_mark_end(&sg[index - 1]);
+ sg_mark_end(vq_sg_page(vq_sg, index - 1));
return index - start;
}

@@ -271,15 +379,13 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)

out_sgs = in_sgs = 0;
/* Handle out VirtIO ring buffers */
- out = pack_sg_list(chan->sg, 0,
- chan->sg_n, req->tc.sdata, req->tc.size);
+ out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
if (out)
- sgs[out_sgs++] = chan->sg;
+ sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);

- in = pack_sg_list(chan->sg, out,
- chan->sg_n, req->rc.sdata, req->rc.capacity);
+ in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, req->rc.capacity);
if (in)
- sgs[out_sgs + in_sgs++] = chan->sg + out;
+ sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);

err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req,
GFP_ATOMIC);
@@ -448,16 +554,15 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
out_sgs = in_sgs = 0;

/* out data */
- out = pack_sg_list(chan->sg, 0,
- chan->sg_n, req->tc.sdata, req->tc.size);
+ out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);

if (out)
- sgs[out_sgs++] = chan->sg;
+ sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);

if (out_pages) {
- sgs[out_sgs++] = chan->sg + out;
- out += pack_sg_list_p(chan->sg, out, chan->sg_n,
- out_pages, out_nr_pages, offs, outlen);
+ sgs[out_sgs++] = vq_sg_page(chan->vq_sg, out);
+ out += pack_sg_list_p(chan->vq_sg, out, out_pages,
+ out_nr_pages, offs, outlen);
}

/*
@@ -467,15 +572,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
* Arrange in such a way that server places header in the
* allocated memory and payload onto the user buffer.
*/
- in = pack_sg_list(chan->sg, out,
- chan->sg_n, req->rc.sdata, in_hdr_len);
+ in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, in_hdr_len);
if (in)
- sgs[out_sgs + in_sgs++] = chan->sg + out;
+ sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);

if (in_pages) {
- sgs[out_sgs + in_sgs++] = chan->sg + out + in;
- in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
- in_pages, in_nr_pages, offs, inlen);
+ sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out + in);
+ in += pack_sg_list_p(chan->vq_sg, out + in, in_pages,
+ in_nr_pages, offs, inlen);
}

BUG_ON(out_sgs + in_sgs > ARRAY_SIZE(sgs));
@@ -576,14 +680,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
goto fail;
}

- chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
- sizeof(struct scatterlist), GFP_KERNEL);
- if (!chan->sg) {
+ chan->vq_sg = vq_sg_alloc(VIRTQUEUE_SG_NSGL_DEFAULT);
+ if (!chan->vq_sg) {
pr_err("Failed to allocate virtio 9P channel\n");
err = -ENOMEM;
goto out_free_chan_shallow;
}
- chan->sg_n = VIRTQUEUE_DEFAULT_NUM;

chan->vdev = vdev;

@@ -596,8 +698,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
chan->vq->vdev->priv = chan;
spin_lock_init(&chan->lock);

- sg_init_table(chan->sg, chan->sg_n);
-
chan->inuse = false;
if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
virtio_cread(vdev, struct virtio_9p_config, tag_len, &tag_len);
@@ -646,7 +746,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
out_free_vq:
vdev->config->del_vqs(vdev);
out_free_chan:
- kfree(chan->sg);
+ vq_sg_free(chan->vq_sg);
out_free_chan_shallow:
kfree(chan);
fail:
@@ -741,7 +841,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
kfree(chan->tag);
kfree(chan->vc_wq);
- kfree(chan->sg);
+ vq_sg_free(chan->vq_sg);
kfree(chan);

}
@@ -780,7 +880,8 @@ static struct p9_trans_module p9_virtio_trans = {
* that are not at page boundary, that can result in an extra
* page in zero copy.
*/
- .maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
+ .maxsize = PAGE_SIZE *
+ ((VIRTQUEUE_SG_NSGL_DEFAULT * SG_USER_PAGES_PER_LIST) - 3),
.def = 1,
.owner = THIS_MODULE,
};
--
2.30.2

2022-07-12 16:38:13

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 09/11] 9p: add P9_ERRMAX for 9p2000 and 9p2000.u

Add a P9_ERRMAX macro to the 9P protocol header which reflects the maximum
error string length of Rerror replies for the 9p2000 and 9p2000.u protocol
versions. Unfortunately a maximum error string length is not defined by
the 9p2000 spec, so pick 128 as the value for now, as this seems to be a
common max. size for POSIX error strings in practice.

The 9p2000.L protocol version uses Rlerror replies instead, which do not
contain an error string.

Signed-off-by: Christian Schoenebeck <[email protected]>
---

This could probably be merged with the next patch; on doubt I posted it
separately, as squashing is easy. The advantage of a separate patch is
making the discussion of the chosen value of max. 128 bytes more
prominent.

include/net/9p/9p.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/net/9p/9p.h b/include/net/9p/9p.h
index 24a509f559ee..13abe013af21 100644
--- a/include/net/9p/9p.h
+++ b/include/net/9p/9p.h
@@ -331,6 +331,9 @@ enum p9_qid_t {
/* size of header for zero copy read/write */
#define P9_ZC_HDR_SZ 4096

+/* maximum length of an error string */
+#define P9_ERRMAX 128
+
/**
* struct p9_qid - file system entity information
* @type: 8-bit type &p9_qid_t
--
2.30.2

2022-07-12 16:38:13

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 06/11] 9p/trans_virtio: resize sg lists to whatever is possible

Right now vq_sg_resize() uses a lazy implementation following
the all-or-nothing principle: it either resizes exactly to the
requested new number of sg lists, or it does not resize at all.

The problem with this is that if a user supplies a very large
msize value, the resize would simply fail and the user would be
stuck with the default maximum msize supported by the virtio
transport.

To resolve this potential issue, change vq_sg_resize() to resize
the passed sg list to whatever is possible on the machine.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
net/9p/trans_virtio.c | 68 ++++++++++++++++++++++++++++++++++++++-----
1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 51c48741ff20..e436d748abe0 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -208,24 +208,67 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
* amount of lists
* @_vq_sg: scatter/gather lists to be resized
* @nsgl: new amount of scatter/gather lists
+ *
+ * Old scatter/gather lists are retained. Only growing the size is supported.
+ * If the requested amount cannot be satisfied, then lists are increased to
+ * whatever is possible.
*/
static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
{
struct virtqueue_sg *vq_sg;
+ unsigned int i;
+ size_t sz;
+ int ret = 0;

BUG_ON(!_vq_sg || !nsgl);
vq_sg = *_vq_sg;
+ if (nsgl > VIRTQUEUE_SG_NSGL_MAX)
+ nsgl = VIRTQUEUE_SG_NSGL_MAX;
if (vq_sg->nsgl == nsgl)
return 0;
+ if (vq_sg->nsgl > nsgl)
+ return -ENOTSUPP;
+
+ vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
+ nsgl * sizeof(struct scatterlist *),
+ GFP_KERNEL);

- /* lazy resize implementation for now */
- vq_sg = vq_sg_alloc(nsgl);
if (!vq_sg)
return -ENOMEM;

+ /* copy over old scatter gather lists */
+ sz = sizeof(struct virtqueue_sg) +
+ (*_vq_sg)->nsgl * sizeof(struct scatterlist *);
+ memcpy(vq_sg, *_vq_sg, sz);
+
+ vq_sg->nsgl = nsgl;
+
+ for (i = (*_vq_sg)->nsgl; i < nsgl; ++i) {
+ vq_sg->sgl[i] = kmalloc_array(
+ SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
+ GFP_KERNEL
+ );
+ /*
+ * handle failed allocation as soft error, we take whatever
+ * we get
+ */
+ if (!vq_sg->sgl[i]) {
+ ret = -ENOMEM;
+ vq_sg->nsgl = nsgl = i;
+ break;
+ }
+ sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
+ if (i) {
+ /* chain the lists */
+ sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
+ vq_sg->sgl[i]);
+ }
+ }
+ sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
+
kfree(*_vq_sg);
*_vq_sg = vq_sg;
- return 0;
+ return ret;
}

/**
@@ -846,12 +889,21 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
if (nsgl > chan->vq_sg->nsgl) {
/*
* if resize fails, no big deal, then just
- * continue with default msize instead
+ * continue with whatever we got
+ */
+ vq_sg_resize(&chan->vq_sg, nsgl);
+ /*
+ * actual allocation size might be less than
+ * requested, so use vq_sg->nsgl instead of nsgl
*/
- if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
- client->trans_maxsize =
- PAGE_SIZE *
- ((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+ client->trans_maxsize =
+ PAGE_SIZE * ((chan->vq_sg->nsgl *
+ SG_USER_PAGES_PER_LIST) - 3);
+ if (nsgl > chan->vq_sg->nsgl) {
+ pr_info("limiting 'msize' to %d as only %d "
+ "of %zu SG lists could be allocated",
+ client->trans_maxsize,
+ chan->vq_sg->nsgl, nsgl);
}
}
#endif /* CONFIG_ARCH_NO_SG_CHAIN */
--
2.30.2

2022-07-12 16:38:18

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 08/11] net/9p: split message size argument into 't_size' and 'r_size' pair

Refactor the 'max_size' argument of p9_tag_alloc() and the 'req_size'
argument of p9_client_prepare_req() into a pair of arguments 't_size'
and 'r_size' respectively, to allow handling the buffer sizes for
request and reply separately from each other.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
net/9p/client.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index fab939541c81..56be1658870d 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -255,24 +255,26 @@ static struct kmem_cache *p9_req_cache;
* p9_tag_alloc - Allocate a new request.
* @c: Client session.
* @type: Transaction type.
- * @max_size: Maximum packet size for this request.
+ * @t_size: Buffer size for holding this request.
+ * @r_size: Buffer size for holding server's reply on this request.
*
* Context: Process context.
* Return: Pointer to new request.
*/
static struct p9_req_t *
-p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
+p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size)
{
struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
- int alloc_msize = min(c->msize, max_size);
+ int alloc_tsize = min(c->msize, t_size);
+ int alloc_rsize = min(c->msize, r_size);
int tag;

if (!req)
return ERR_PTR(-ENOMEM);

- if (p9_fcall_init(c, &req->tc, alloc_msize))
+ if (p9_fcall_init(c, &req->tc, alloc_tsize))
goto free_req;
- if (p9_fcall_init(c, &req->rc, alloc_msize))
+ if (p9_fcall_init(c, &req->rc, alloc_rsize))
goto free;

p9pdu_reset(&req->tc);
@@ -678,7 +680,7 @@ static int p9_client_flush(struct p9_client *c, struct p9_req_t *oldreq)
}

static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
- int8_t type, int req_size,
+ int8_t type, uint t_size, uint r_size,
const char *fmt, va_list ap)
{
int err;
@@ -694,7 +696,7 @@ static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
if (c->status == BeginDisconnect && type != P9_TCLUNK)
return ERR_PTR(-EIO);

- req = p9_tag_alloc(c, type, req_size);
+ req = p9_tag_alloc(c, type, t_size, r_size);
if (IS_ERR(req))
return req;

@@ -731,7 +733,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
struct p9_req_t *req;

va_start(ap, fmt);
- req = p9_client_prepare_req(c, type, c->msize, fmt, ap);
+ req = p9_client_prepare_req(c, type, c->msize, c->msize, fmt, ap);
va_end(ap);
if (IS_ERR(req))
return req;
@@ -829,7 +831,7 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
/* We allocate a inline protocol data of only 4k bytes.
* The actual content is passed in zero-copy fashion.
*/
- req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, fmt, ap);
+ req = p9_client_prepare_req(c, type, P9_ZC_HDR_SZ, P9_ZC_HDR_SZ, fmt, ap);
va_end(ap);
if (IS_ERR(req))
return req;
--
2.30.2

2022-07-12 16:38:19

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 10/11] net/9p: add p9_msg_buf_size()

This new function calculates a buffer size suitable for holding the
intended 9p request or response. For rather small message types (which
actually applies to almost all 9p message types) it simply uses hard
coded values. For some variable-length and potentially large message
types it calculates a more precise value according to what data is
actually transmitted, to avoid unnecessarily huge buffers.

Signed-off-by: Christian Schoenebeck <[email protected]>
---
net/9p/protocol.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++
net/9p/protocol.h | 2 +
2 files changed, 156 insertions(+)

diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 3754c33e2974..49939e8cde2a 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -23,6 +23,160 @@

#include <trace/events/9p.h>

+/* len[2] text[len] */
+#define P9_STRLEN(s) \
+ (2 + min_t(size_t, s ? strlen(s) : 0, USHRT_MAX))
+
+/**
+ * p9_msg_buf_size - Returns a buffer size sufficiently large to hold the
+ * intended 9p message.
+ * @c: client
+ * @type: message type
+ * @fmt: format template for assembling request message
+ * (see p9pdu_vwritef)
+ * @ap: variable arguments to be fed to passed format template
+ * (see p9pdu_vwritef)
+ *
+ * Note: Even for response types (P9_R*) the format template and variable
+ * arguments must always be for the originating request type (P9_T*).
+ */
+size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
+ const char *fmt, va_list ap)
+{
+ /* size[4] type[1] tag[2] */
+ const int hdr = 4 + 1 + 2;
+ /* ename[s] errno[4] */
+ const int rerror_size = hdr + P9_ERRMAX + 4;
+ /* ecode[4] */
+ const int rlerror_size = hdr + 4;
+ const int err_size =
+ c->proto_version == p9_proto_2000L ? rlerror_size : rerror_size;
+
+ switch (type) {
+
+ /* message types not used at all */
+ case P9_TERROR:
+ case P9_TLERROR:
+ case P9_TAUTH:
+ case P9_RAUTH:
+ BUG();
+
+ /* variable length & potentially large message types */
+ case P9_TATTACH:
+ BUG_ON(strcmp("ddss?u", fmt));
+ va_arg(ap, int32_t);
+ va_arg(ap, int32_t);
+ {
+ const char *uname = va_arg(ap, const char *);
+ const char *aname = va_arg(ap, const char *);
+ /* fid[4] afid[4] uname[s] aname[s] n_uname[4] */
+ return hdr + 4 + 4 + P9_STRLEN(uname) + P9_STRLEN(aname) + 4;
+ }
+ case P9_TWALK:
+ BUG_ON(strcmp("ddT", fmt));
+ va_arg(ap, int32_t);
+ va_arg(ap, int32_t);
+ {
+ uint i, nwname = max(va_arg(ap, int), 0);
+ size_t wname_all;
+ const char **wnames = va_arg(ap, const char **);
+ for (i = 0, wname_all = 0; i < nwname; ++i) {
+ wname_all += P9_STRLEN(wnames[i]);
+ }
+ /* fid[4] newfid[4] nwname[2] nwname*(wname[s]) */
+ return hdr + 4 + 4 + 2 + wname_all;
+ }
+ case P9_RWALK:
+ BUG_ON(strcmp("ddT", fmt));
+ va_arg(ap, int32_t);
+ va_arg(ap, int32_t);
+ {
+ uint nwname = va_arg(ap, int);
+ /* nwqid[2] nwqid*(wqid[13]) */
+ return max_t(size_t, hdr + 2 + nwname * 13, err_size);
+ }
+ case P9_TCREATE:
+ BUG_ON(strcmp("dsdb?s", fmt));
+ va_arg(ap, int32_t);
+ {
+ const char *name = va_arg(ap, const char *);
+ if ((c->proto_version != p9_proto_2000u) &&
+ (c->proto_version != p9_proto_2000L))
+ /* fid[4] name[s] perm[4] mode[1] */
+ return hdr + 4 + P9_STRLEN(name) + 4 + 1;
+ {
+ va_arg(ap, int32_t);
+ va_arg(ap, int);
+ {
+ const char *ext = va_arg(ap, const char *);
+ /* fid[4] name[s] perm[4] mode[1] extension[s] */
+ return hdr + 4 + P9_STRLEN(name) + 4 + 1 + P9_STRLEN(ext);
+ }
+ }
+ }
+ case P9_TLCREATE:
+ BUG_ON(strcmp("dsddg", fmt));
+ va_arg(ap, int32_t);
+ {
+ const char *name = va_arg(ap, const char *);
+ /* fid[4] name[s] flags[4] mode[4] gid[4] */
+ return hdr + 4 + P9_STRLEN(name) + 4 + 4 + 4;
+ }
+ case P9_RREAD:
+ case P9_RREADDIR:
+ BUG_ON(strcmp("dqd", fmt));
+ va_arg(ap, int32_t);
+ va_arg(ap, int64_t);
+ {
+ const int32_t count = va_arg(ap, int32_t);
+ /* count[4] data[count] */
+ return max_t(size_t, hdr + 4 + count, err_size);
+ }
+ case P9_TWRITE:
+ BUG_ON(strcmp("dqV", fmt));
+ va_arg(ap, int32_t);
+ va_arg(ap, int64_t);
+ {
+ const int32_t count = va_arg(ap, int32_t);
+ /* fid[4] offset[8] count[4] data[count] */
+ return hdr + 4 + 8 + 4 + count;
+ }
+ case P9_TRENAMEAT:
+ BUG_ON(strcmp("dsds", fmt));
+ va_arg(ap, int32_t);
+ {
+ const char *oldname = va_arg(ap, const char *);
+ va_arg(ap, int32_t);
+ {
+ const char *newname = va_arg(ap, const char *);
+ /* olddirfid[4] oldname[s] newdirfid[4] newname[s] */
+ return hdr + 4 + P9_STRLEN(oldname) + 4 + P9_STRLEN(newname);
+ }
+ }
+ case P9_RERROR:
+ return rerror_size;
+ case P9_RLERROR:
+ return rlerror_size;
+
+ /* small message types */
+ case P9_TSTAT:
+ case P9_RSTAT:
+ case P9_TSYMLINK:
+ case P9_RREADLINK:
+ case P9_TXATTRWALK:
+ case P9_TXATTRCREATE:
+ case P9_TLINK:
+ case P9_TMKDIR:
+ case P9_TUNLINKAT:
+ return 8 * 1024;
+
+ /* tiny message types */
+ default:
+ return 4 * 1024;
+
+ }
+}
+
static int
p9pdu_writef(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);

diff --git a/net/9p/protocol.h b/net/9p/protocol.h
index 6d719c30331a..ad2283d1f96b 100644
--- a/net/9p/protocol.h
+++ b/net/9p/protocol.h
@@ -8,6 +8,8 @@
* Copyright (C) 2008 by IBM, Corp.
*/

+size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
+ const char *fmt, va_list ap);
int p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
va_list ap);
int p9pdu_readf(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);
--
2.30.2

2022-07-12 17:07:36

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 05/11] 9p/trans_virtio: support larger msize values

The virtio transport supports by default a 9p 'msize' of up to
approximately 500 kB. This patch adds support for larger 'msize'
values by resizing the number of scatter/gather lists if required.

Signed-off-by: Christian Schoenebeck <[email protected]>
---

I am not sure whether the way SG lists are resized here is safe. I
"think" Dominique said before that there should be no concurrency
here, but this probably deserves a revisit.

net/9p/trans_virtio.c | 61 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index f63cd1b08bca..51c48741ff20 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -203,6 +203,31 @@ static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
return vq_sg;
}

+/**
+ * vq_sg_resize - resize passed virtqueue scatter/gather lists to the passed
+ * amount of lists
+ * @_vq_sg: scatter/gather lists to be resized
+ * @nsgl: new amount of scatter/gather lists
+ */
+static int vq_sg_resize(struct virtqueue_sg **_vq_sg, unsigned int nsgl)
+{
+ struct virtqueue_sg *vq_sg;
+
+ BUG_ON(!_vq_sg || !nsgl);
+ vq_sg = *_vq_sg;
+ if (vq_sg->nsgl == nsgl)
+ return 0;
+
+ /* lazy resize implementation for now */
+ vq_sg = vq_sg_alloc(nsgl);
+ if (!vq_sg)
+ return -ENOMEM;
+
+ kfree(*_vq_sg);
+ *_vq_sg = vq_sg;
+ return 0;
+}
+
/**
* p9_virtio_close - reclaim resources of a channel
* @client: client instance
@@ -774,6 +799,10 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
struct virtio_chan *chan;
int ret = -ENOENT;
int found = 0;
+#if !defined(CONFIG_ARCH_NO_SG_CHAIN)
+ size_t npages;
+ size_t nsgl;
+#endif

if (devname == NULL)
return -EINVAL;
@@ -796,6 +825,38 @@ p9_virtio_create(struct p9_client *client, const char *devname, char *args)
return ret;
}

+ /*
+ * if user supplied an 'msize' option that's larger than what this
+ * transport supports by default, then try to allocate more sg lists
+ */
+ if (client->msize > client->trans_maxsize) {
+#ifdef CONFIG_ARCH_NO_SG_CHAIN
+ pr_info("limiting 'msize' to %d because architecture does not "
+ "support chained scatter gather lists\n",
+ client->trans_maxsize);
+#else
+ npages = DIV_ROUND_UP(client->msize, PAGE_SIZE);
+ if (npages > chan->p9_max_pages) {
+ npages = chan->p9_max_pages;
+ pr_info("limiting 'msize' as it would exceed the max. "
+ "of %lu pages allowed on this system\n",
+ chan->p9_max_pages);
+ }
+ nsgl = DIV_ROUND_UP(npages, SG_USER_PAGES_PER_LIST);
+ if (nsgl > chan->vq_sg->nsgl) {
+ /*
+ * if resize fails, no big deal, then just
+ * continue with default msize instead
+ */
+ if (!vq_sg_resize(&chan->vq_sg, nsgl)) {
+ client->trans_maxsize =
+ PAGE_SIZE *
+ ((nsgl * SG_USER_PAGES_PER_LIST) - 3);
+ }
+ }
+#endif /* CONFIG_ARCH_NO_SG_CHAIN */
+ }
+
client->trans = (void *)chan;
client->status = Connected;
chan->client = client;
--
2.30.2

2022-07-12 17:07:46

by Christian Schoenebeck

[permalink] [raw]
Subject: [PATCH v5 07/11] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports

This 9p client implementation is still using linear message buffers for
most message types, i.e. they use kmalloc() et al. for allocating
physically contiguous memory pages, which is usually limited to 4 MB
buffers. Use KMALLOC_MAX_SIZE instead of a hard coded 4 MB for
constraining this more safely.

Unfortunately we cannot simply replace the existing kmalloc() calls by
vmalloc() ones, because that would yield non-logical kernel addresses
(for any vmalloc(>4MB) that is) which are in general not accessible by
hosts like QEMU.

In the future we will replace those linear buffers by scatter/gather
lists to eventually get rid of this limit (struct p9_fcall's sdata
member allocated by p9_fcall_init() and struct p9_fid's rdir member by
v9fs_alloc_rdir_buf()).

Signed-off-by: Christian Schoenebeck <[email protected]>
---

Hmm, that's a bit too simple, as we also need a bit of headroom for
transport specific overhead. So maybe this has to be handled by each
transport appropriately instead?

net/9p/client.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/net/9p/client.c b/net/9p/client.c
index 20054addd81b..fab939541c81 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1042,6 +1042,17 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);

+ /*
+ * due to linear message buffers being used by client ATM
+ */
+ if (clnt->msize > KMALLOC_MAX_SIZE) {
+ clnt->msize = KMALLOC_MAX_SIZE;
+ pr_info("Limiting 'msize' to %zu as this is the maximum "
+ "supported by this client version.\n",
+ (size_t) KMALLOC_MAX_SIZE
+ );
+ }
+
err = clnt->trans_mod->create(clnt, dev_name, options);
if (err)
goto put_trans;
--
2.30.2

2022-07-12 19:46:37

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:36PM +0200:
> So far 'msize' was simply used for all 9p message types, which is far
> too much and slowed down performance tremendously with large values
> for user configurable 'msize' option.
>
> Let's stop this waste by using the new p9_msg_buf_size() function for
> allocating more appropriate, smaller buffers according to what is
> actually sent over the wire.
>
> Only exception: RDMA transport is currently excluded from this, as
> it would not cope with it. [1]
>
> Link: https://lore.kernel.org/all/[email protected]/ [1]
> Signed-off-by: Christian Schoenebeck <[email protected]>
> ---
>
> Is the !strcmp(c->trans_mod->name, "rdma") check in this patch maybe a bit
> too hack-ish? Should there rather be transport API extension instead?

hmm yeah that doesn't feel great, let's add a flag to struct
p9_trans_module

--
Dominique

2022-07-12 20:53:41

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v5 03/11] 9p/trans_virtio: introduce struct virtqueue_sg

Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:16PM +0200:
> The amount of elements in a scatter/gather list is limited to
> approximately 128 elements. To allow going beyond that limit
> with subsequent patches, pave the way by turning the one-
> dimensional sg list array into a two-dimensional array, i.e:
>
> sg[128]
>
> becomes
>
> sgl[nsgl][SG_MAX_SINGLE_ALLOC]
>
> As the value of 'nsgl' is exactly (still) 1 in this commit
> and the compile-time (compiler and architecture dependent)
> value of 'SG_MAX_SINGLE_ALLOC' equals approximately the
> previous hard coded 128 elements, this commit is therefore
> more of a preparatory refactoring than an actual behaviour
> change.
>
> A custom struct virtqueue_sg is defined instead of using
> shared API struct sg_table, because the latter would not
> allow to resize the table after allocation. sg_append_table
> API OTOH would not fit either, because it requires a list
> of pages beforehand upon allocation. And both APIs only
> support all-or-nothing allocation.
>
> Signed-off-by: Christian Schoenebeck <[email protected]>
> ---
>
> The question is whether that should really become 9p specifc SG list
> code, or whether it should rather be squeezed into shared SG list code
> base. Opinions by maintainers needed.

hmm from the 9p side I'd say the type is simple enough that we can just
keep it here; most people don't want to resize these lists...

How much do you care about the all-or-nothing case you described in this
commit message? From the look of it, patch 6 -- at what point did you
actually see this being useful?

> net/9p/trans_virtio.c | 193 ++++++++++++++++++++++++++++++++----------
> 1 file changed, 147 insertions(+), 46 deletions(-)
>
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 18bdfa64b934..f63cd1b08bca 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -36,7 +36,31 @@
> #include <linux/virtio_9p.h>
> #include "trans_common.h"
>
> -#define VIRTQUEUE_DEFAULT_NUM 128
> +/**
> + * struct virtqueue_sg - (chained) scatter gather lists for virtqueue data
> + * transmission
> + * @nsgl: amount of elements (in first dimension) of array field @sgl
> + * @sgl: two-dimensional array, i.e. sgl[nsgl][SG_MAX_SINGLE_ALLOC]
> + */
> +struct virtqueue_sg {
> + unsigned int nsgl;
> + struct scatterlist *sgl[];
> +};
> +
> +/*
> + * Default value for field nsgl in struct virtqueue_sg, which defines the
> + * initial virtio data transmission capacity when this virtio transport is
> + * probed.
> + */
> +#define VIRTQUEUE_SG_NSGL_DEFAULT 1
> +
> +/* maximum value for field nsgl in struct virtqueue_sg */
> +#define VIRTQUEUE_SG_NSGL_MAX \
> + ((PAGE_SIZE - sizeof(struct virtqueue_sg)) / \
> + sizeof(struct scatterlist *)) \
> +
> +/* last entry per sg list is used for chaining (pointer to next list) */
> +#define SG_USER_PAGES_PER_LIST (SG_MAX_SINGLE_ALLOC - 1)
>
> /* a single mutex to manage channel initialization and attachment */
> static DEFINE_MUTEX(virtio_9p_lock);
> @@ -53,8 +77,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
> * @ring_bufs_avail: flag to indicate there is some available in the ring buf
> * @vc_wq: wait queue for waiting for thing to be added to ring buf
> * @p9_max_pages: maximum number of pinned pages
> - * @sg: scatter gather list which is used to pack a request (protected?)
> - * @sg_n: amount of elements in sg array
> + * @vq_sg: table of scatter gather lists, which are used to pack a request
> * @chan_list: linked list of channels
> *
> * We keep all per-channel information in a structure.
> @@ -77,9 +100,7 @@ struct virtio_chan {
> * will be placing it in each channel.
> */
> unsigned long p9_max_pages;
> - /* Scatterlist: can be too big for stack. */
> - struct scatterlist *sg;
> - size_t sg_n;
> + struct virtqueue_sg *vq_sg;
> /**
> * @tag: name to identify a mount null terminated
> */
> @@ -96,6 +117,92 @@ static unsigned int rest_of_page(void *data)
> return PAGE_SIZE - offset_in_page(data);
> }
>
> +/**
> + * vq_sg_page - returns user page for given page index
> + * @vq_sg: scatter gather lists used by this transport
> + * @page: user page index across all scatter gather lists
> + */
> +static struct scatterlist *vq_sg_page(struct virtqueue_sg *vq_sg, size_t page)
> +{
> + unsigned int node = page / SG_USER_PAGES_PER_LIST;
> + unsigned int leaf = page % SG_USER_PAGES_PER_LIST;
> + BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX);

probably want to check with vq_sg->nsgl instead?
(we already check nsgl <= MAX on alloc)


> + return &vq_sg->sgl[node][leaf];
> +}
> +
> +/**
> + * vq_sg_npages - returns total number of individual user pages in passed
> + * scatter gather lists
> + * @vq_sg: scatter gather lists to be counted
> + */
> +static size_t vq_sg_npages(struct virtqueue_sg *vq_sg)
> +{
> + return vq_sg->nsgl * SG_USER_PAGES_PER_LIST;
> +}
> +
> +/**
> + * vq_sg_free - free all memory previously allocated for @vq_sg
> + * @vq_sg: scatter gather lists to be freed
> + */
> +static void vq_sg_free(struct virtqueue_sg *vq_sg)
> +{
> + unsigned int i;
> +
> + if (!vq_sg)
> + return;
> +
> + for (i = 0; i < vq_sg->nsgl; ++i) {
> + kfree(vq_sg->sgl[i]);
> + }
> + kfree(vq_sg);
> +}
> +
> +/**
> + * vq_sg_alloc - allocates and returns @nsgl scatter gather lists
> + * @nsgl: amount of scatter gather lists to be allocated
> + * If @nsgl is larger than one then chained lists are used if supported by
> + * architecture.
> + */
> +static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
> +{
> + struct virtqueue_sg *vq_sg;
> + unsigned int i;
> +
> + BUG_ON(!nsgl || nsgl > VIRTQUEUE_SG_NSGL_MAX);
> +#ifdef CONFIG_ARCH_NO_SG_CHAIN
> + if (WARN_ON_ONCE(nsgl > 1))
> + return NULL;
> +#endif
> +
> + vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
> + nsgl * sizeof(struct scatterlist *),
> + GFP_KERNEL);
> +
> + if (!vq_sg)
> + return NULL;
> +
> + vq_sg->nsgl = nsgl;
> +
> + for (i = 0; i < nsgl; ++i) {
> + vq_sg->sgl[i] = kmalloc_array(
> + SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
> + GFP_KERNEL
> + );
> + if (!vq_sg->sgl[i]) {
> + vq_sg_free(vq_sg);
> + return NULL;
> + }
> + sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
> + if (i) {
> + /* chain the lists */
> + sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
> + vq_sg->sgl[i]);
> + }
> + }
> + sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
> + return vq_sg;
> +}
> +
> /**
> * p9_virtio_close - reclaim resources of a channel
> * @client: client instance
> @@ -158,9 +265,8 @@ static void req_done(struct virtqueue *vq)
>
> /**
> * pack_sg_list - pack a scatter gather list from a linear buffer
> - * @sg: scatter/gather list to pack into
> + * @vq_sg: scatter/gather lists to pack into
> * @start: which segment of the sg_list to start at
> - * @limit: maximum segment to pack data to
> * @data: data to pack into scatter/gather list
> * @count: amount of data to pack into the scatter/gather list
> *
> @@ -170,11 +276,12 @@ static void req_done(struct virtqueue *vq)
> *
> */
>
> -static int pack_sg_list(struct scatterlist *sg, int start,
> - int limit, char *data, int count)
> +static int pack_sg_list(struct virtqueue_sg *vq_sg, int start,
> + char *data, int count)
> {
> int s;
> int index = start;
> + size_t limit = vq_sg_npages(vq_sg);
>
> while (count) {
> s = rest_of_page(data);
> @@ -182,13 +289,13 @@ static int pack_sg_list(struct scatterlist *sg, int start,
> s = count;
> BUG_ON(index >= limit);
> /* Make sure we don't terminate early. */
> - sg_unmark_end(&sg[index]);
> - sg_set_buf(&sg[index++], data, s);
> + sg_unmark_end(vq_sg_page(vq_sg, index));
> + sg_set_buf(vq_sg_page(vq_sg, index++), data, s);
> count -= s;
> data += s;
> }
> if (index-start)
> - sg_mark_end(&sg[index - 1]);
> + sg_mark_end(vq_sg_page(vq_sg, index - 1));
> return index-start;
> }
>
> @@ -208,21 +315,21 @@ static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
> /**
> * pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
> * this takes a list of pages.
> - * @sg: scatter/gather list to pack into
> + * @vq_sg: scatter/gather lists to pack into
> * @start: which segment of the sg_list to start at
> - * @limit: maximum number of pages in sg list.
> * @pdata: a list of pages to add into sg.
> * @nr_pages: number of pages to pack into the scatter/gather list
> * @offs: amount of data in the beginning of first page _not_ to pack
> * @count: amount of data to pack into the scatter/gather list
> */
> static int
> -pack_sg_list_p(struct scatterlist *sg, int start, int limit,
> +pack_sg_list_p(struct virtqueue_sg *vq_sg, int start,
> struct page **pdata, int nr_pages, size_t offs, int count)
> {
> int i = 0, s;
> int data_off = offs;
> int index = start;
> + size_t limit = vq_sg_npages(vq_sg);
>
> BUG_ON(nr_pages > (limit - start));
> /*
> @@ -235,15 +342,16 @@ pack_sg_list_p(struct scatterlist *sg, int start, int limit,
> s = count;
> BUG_ON(index >= limit);
> /* Make sure we don't terminate early. */
> - sg_unmark_end(&sg[index]);
> - sg_set_page(&sg[index++], pdata[i++], s, data_off);
> + sg_unmark_end(vq_sg_page(vq_sg, index));
> + sg_set_page(vq_sg_page(vq_sg, index++), pdata[i++], s,
> + data_off);
> data_off = 0;
> count -= s;
> nr_pages--;
> }
>
> if (index-start)
> - sg_mark_end(&sg[index - 1]);
> + sg_mark_end(vq_sg_page(vq_sg, index - 1));
> return index - start;
> }
>
> @@ -271,15 +379,13 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
>
> out_sgs = in_sgs = 0;
> /* Handle out VirtIO ring buffers */
> - out = pack_sg_list(chan->sg, 0,
> - chan->sg_n, req->tc.sdata, req->tc.size);
> + out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
> if (out)
> - sgs[out_sgs++] = chan->sg;
> + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
>
> - in = pack_sg_list(chan->sg, out,
> - chan->sg_n, req->rc.sdata, req->rc.capacity);
> + in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, req->rc.capacity);
> if (in)
> - sgs[out_sgs + in_sgs++] = chan->sg + out;
> + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
>
> err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req,
> GFP_ATOMIC);
> @@ -448,16 +554,15 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
> out_sgs = in_sgs = 0;
>
> /* out data */
> - out = pack_sg_list(chan->sg, 0,
> - chan->sg_n, req->tc.sdata, req->tc.size);
> + out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
>
> if (out)
> - sgs[out_sgs++] = chan->sg;
> + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
>
> if (out_pages) {
> - sgs[out_sgs++] = chan->sg + out;
> - out += pack_sg_list_p(chan->sg, out, chan->sg_n,
> - out_pages, out_nr_pages, offs, outlen);
> + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, out);
> + out += pack_sg_list_p(chan->vq_sg, out, out_pages,
> + out_nr_pages, offs, outlen);
> }
>
> /*
> @@ -467,15 +572,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
> * Arrange in such a way that server places header in the
> * allocated memory and payload onto the user buffer.
> */
> - in = pack_sg_list(chan->sg, out,
> - chan->sg_n, req->rc.sdata, in_hdr_len);
> + in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, in_hdr_len);
> if (in)
> - sgs[out_sgs + in_sgs++] = chan->sg + out;
> + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
>
> if (in_pages) {
> - sgs[out_sgs + in_sgs++] = chan->sg + out + in;
> - in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
> - in_pages, in_nr_pages, offs, inlen);
> + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out + in);
> + in += pack_sg_list_p(chan->vq_sg, out + in, in_pages,
> + in_nr_pages, offs, inlen);
> }
>
> BUG_ON(out_sgs + in_sgs > ARRAY_SIZE(sgs));
> @@ -576,14 +680,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> goto fail;
> }
>
> - chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
> - sizeof(struct scatterlist), GFP_KERNEL);
> - if (!chan->sg) {
> + chan->vq_sg = vq_sg_alloc(VIRTQUEUE_SG_NSGL_DEFAULT);
> + if (!chan->vq_sg) {
> pr_err("Failed to allocate virtio 9P channel\n");
> err = -ENOMEM;
> goto out_free_chan_shallow;
> }
> - chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
>
> chan->vdev = vdev;
>
> @@ -596,8 +698,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> chan->vq->vdev->priv = chan;
> spin_lock_init(&chan->lock);
>
> - sg_init_table(chan->sg, chan->sg_n);
> -
> chan->inuse = false;
> if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
> virtio_cread(vdev, struct virtio_9p_config, tag_len, &tag_len);
> @@ -646,7 +746,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> out_free_vq:
> vdev->config->del_vqs(vdev);
> out_free_chan:
> - kfree(chan->sg);
> + vq_sg_free(chan->vq_sg);
> out_free_chan_shallow:
> kfree(chan);
> fail:
> @@ -741,7 +841,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
> kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
> kfree(chan->tag);
> kfree(chan->vc_wq);
> - kfree(chan->sg);
> + vq_sg_free(chan->vq_sg);
> kfree(chan);
>
> }
> @@ -780,7 +880,8 @@ static struct p9_trans_module p9_virtio_trans = {
> * that are not at page boundary, that can result in an extra
> * page in zero copy.
> */
> - .maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
> + .maxsize = PAGE_SIZE *
> + ((VIRTQUEUE_SG_NSGL_DEFAULT * SG_USER_PAGES_PER_LIST) - 3),
> .def = 1,
> .owner = THIS_MODULE,
> };

2022-07-12 20:56:45

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v5 07/11] net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports

Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:26PM +0200:
> This 9p client implementation is still using linear message buffers for
> most message types, i.e. they use kmalloc() et al. for allocating
> physically contiguous memory pages, which is usually limited to 4 MB
> buffers. Use KMALLOC_MAX_SIZE instead of a hard coded 4 MB for
> constraining this more safely.
>
> Unfortunately we cannot simply replace the existing kmalloc() calls by
> vmalloc() ones, because that would yield non-logical kernel addresses
> (for any vmalloc(>4MB) that is) which are in general not accessible by
> hosts like QEMU.
>
> In the future we will replace those linear buffers by scatter/gather
> lists to eventually get rid of this limit (struct p9_fcall's sdata
> member allocated by p9_fcall_init() and struct p9_fid's rdir member by
> v9fs_alloc_rdir_buf()).
>
> Signed-off-by: Christian Schoenebeck <[email protected]>
> ---
>
> Hmm, that's a bit too simple, as we also need a bit of headroom for
> transport specific overhead. So maybe this has to be handled by each
> transport appropriately instead?

hm yes I'd say it's redundant with each transports max size already --
let's just keep appropriate max values in each transport.

>
> net/9p/client.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 20054addd81b..fab939541c81 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1042,6 +1042,17 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
> p9_debug(P9_DEBUG_MUX, "clnt %p trans %p msize %d protocol %d\n",
> clnt, clnt->trans_mod, clnt->msize, clnt->proto_version);
>
> + /*
> + * due to linear message buffers being used by client ATM
> + */
> + if (clnt->msize > KMALLOC_MAX_SIZE) {
> + clnt->msize = KMALLOC_MAX_SIZE;
> + pr_info("Limiting 'msize' to %zu as this is the maximum "
> + "supported by this client version.\n",
> + (size_t) KMALLOC_MAX_SIZE
> + );
> + }
> +
> err = clnt->trans_mod->create(clnt, dev_name, options);
> if (err)
> goto put_trans;

2022-07-12 21:17:47

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v5 00/11] remove msize limit in virtio transport

Alright; anything I didn't reply to looks good to me.

Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:35:54PM +0200:
> OVERVIEW OF PATCHES:
>
> * Patches 1..6 remove the msize limitation from the 'virtio' transport
> (i.e. the 9p 'virtio' transport itself actually supports >4MB now, tested
> successfully with an experimental QEMU version and some dirty 9p Linux
> client hacks up to msize=128MB).

I have no problem with this except for the small nitpicks I gave, but
would be tempted to delay this part for one more cycle as it's really
independent -- what do you think?


> * Patch 7 limits msize for all transports to 4 MB for now as >4MB would need
> more work on 9p client level (see commit log of patch 7 for details).
>
> * Patches 8..11 tremendously reduce unnecessarily huge 9p message sizes and
> therefore provide performance gain as well. So far, almost all 9p messages
> simply allocated message buffers exactly msize large, even for messages
> that actually just needed a few bytes. So these patches make sense by
> themselves, independent of this overall series, however for this series
> even more, because the larger msize, the more this issue would have hurt
> otherwise.

time-wise we're getting close to the merge window already (probably in 2
weeks); how confident are you in this?
I can take patches 8..11 in -next now and probably find some time to
test over next weekend, are we good?

--
Dominique

2022-07-12 21:24:59

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

Dominique Martinet wrote on Wed, Jul 13, 2022 at 04:33:35AM +0900:
> Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:36PM +0200:
> > So far 'msize' was simply used for all 9p message types, which is far
> > too much and slowed down performance tremendously with large values
> > for user configurable 'msize' option.
> >
> > Let's stop this waste by using the new p9_msg_buf_size() function for
> > allocating more appropriate, smaller buffers according to what is
> > actually sent over the wire.
> >
> > Only exception: RDMA transport is currently excluded from this, as
> > it would not cope with it. [1]

Thinking back on RDMA:
- vs. one or two buffers as discussed in another thread, rdma will still
require two buffers, we post the receive buffer before sending as we
could otherwise be raced (reply from server during the time it'd take to
recycle the send buffer)
In practice the recv buffers should act like a fifo and we might be able
to post the buffer we're about to send for recv before sending it and it
shouldn't be overwritten until it's sent, but that doesn't look quite good.

- for this particular patch, we can still allocate smaller short buffers
for requests, so we should probably keep tsize to 0.
rsize there really isn't much we can do without a protocol change
though...

--
Dominique

2022-07-13 09:25:16

by Christian Schoenebeck

[permalink] [raw]
Subject: Re: [PATCH v5 00/11] remove msize limit in virtio transport

On Tuesday, 12 July 2022 23:13:16 CEST Dominique Martinet wrote:
> Alright; anything I didn't reply to looks good to me.
>
> Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:35:54PM +0200:
> > OVERVIEW OF PATCHES:
> >
> > * Patches 1..6 remove the msize limitation from the 'virtio' transport
> > (i.e. the 9p 'virtio' transport itself actually supports >4MB now, tested
> > successfully with an experimental QEMU version and some dirty 9p Linux
> > client hacks up to msize=128MB).
>
> I have no problem with this except for the small nitpicks I gave, but
> would be tempted to delay this part for one more cycle as it's really
> independant -- what do you think?

Yes, I would also postpone the virtio patches towards subsequent release
cycle.

> > * Patch 7 limits msize for all transports to 4 MB for now as >4MB would need
> > more work on 9p client level (see commit log of patch 7 for details).
> >
> > * Patches 8..11 tremendously reduce unnecessarily huge 9p message sizes and
> > therefore provide performance gain as well. So far, almost all 9p messages
> > simply allocated message buffers exactly msize large, even for messages
> > that actually just needed a few bytes. So these patches make sense by
> > themselves, independent of this overall series, however for this series
> > even more, because the larger msize, the more this issue would have hurt
> > otherwise.
>
> time-wise we're getting close to the merge window already (probably in 2
> weeks), how confident are you in this?
> I can take patches 8..11 in -next now and probably find some time to
> test over next weekend, are we good?

Well, I have tested them thoroughly, but nevertheless IMO someone other than
me should review patch 10 as well, and review whether the calculations for the
individual message types are correct. That's a bit of spec dictionary lookup.

Best regards,
Christian Schoenebeck


2022-07-13 09:29:59

by Christian Schoenebeck

[permalink] [raw]
Subject: Re: [PATCH v5 03/11] 9p/trans_virtio: introduce struct virtqueue_sg

On Tuesday, 12 July 2022 22:33:34 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:16PM +0200:
> > The amount of elements in a scatter/gather list is limited to
> > approximately 128 elements. To allow going beyond that limit
> > with subsequent patches, pave the way by turning the one-
> > dimensional sg list array into a two-dimensional array, i.e.:
> > sg[128]
> >
> > becomes
> >
> > sgl[nsgl][SG_MAX_SINGLE_ALLOC]
> >
> > As the value of 'nsgl' is exactly (still) 1 in this commit
> > and the compile-time (compiler and architecture dependent)
> > value of 'SG_MAX_SINGLE_ALLOC' equals approximately the
> > previous hard coded 128 elements, this commit is therefore
> > more of a preparatory refactoring than an actual behaviour
> > change.
> >
> > A custom struct virtqueue_sg is defined instead of using
> > shared API struct sg_table, because the latter would not
> > allow resizing the table after allocation. sg_append_table
> > API OTOH would not fit either, because it requires a list
> > of pages beforehand upon allocation. And both APIs only
> > support all-or-nothing allocation.
> >
> > Signed-off-by: Christian Schoenebeck <[email protected]>
> > ---
> >
> > The question is whether that should really become 9p specifc SG list
> > code, or whether it should rather be squeezed into shared SG list code
> > base. Opinions by maintainers needed.
>
> hmm from the 9p side I'd say the type is simple enough that we can just
> keep it here; most people don't want to resize these lists...

OK, then I'll retain it as a 9p-specific SG struct.

> How much do you care about the all-or-nothing case you described in this
> commit message? From the look of it, patch 6 -- at what point did you
> actually see this being useful?

Patch 6 is probably a case of over-engineering, so if you want I can also drop
patch 6 now, or retain it and you'll just eventually ignore it. Because if
someone already has trouble allocating the SG lists, then allocation of their
actual bulk data pages is likely to become much more troublesome.

> > net/9p/trans_virtio.c | 193 ++++++++++++++++++++++++++++++++----------
> > 1 file changed, 147 insertions(+), 46 deletions(-)
> >
> > diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> > index 18bdfa64b934..f63cd1b08bca 100644
> > --- a/net/9p/trans_virtio.c
> > +++ b/net/9p/trans_virtio.c
> > @@ -36,7 +36,31 @@
> >
> > #include <linux/virtio_9p.h>
> > #include "trans_common.h"
> >
> > -#define VIRTQUEUE_DEFAULT_NUM 128
> > +/**
> > + * struct virtqueue_sg - (chained) scatter gather lists for virtqueue data
> > + * transmission
> > + * @nsgl: amount of elements (in first dimension) of array field @sgl
> > + * @sgl: two-dimensional array, i.e. sgl[nsgl][SG_MAX_SINGLE_ALLOC]
> > + */
> > +struct virtqueue_sg {
> > + unsigned int nsgl;
> > + struct scatterlist *sgl[];
> > +};
> > +
> > +/*
> > + * Default value for field nsgl in struct virtqueue_sg, which defines the
> > + * initial virtio data transmission capacity when this virtio transport is
> > + * probed.
> > + */
> > +#define VIRTQUEUE_SG_NSGL_DEFAULT 1
> > +
> > +/* maximum value for field nsgl in struct virtqueue_sg */
> > +#define VIRTQUEUE_SG_NSGL_MAX \
> > + ((PAGE_SIZE - sizeof(struct virtqueue_sg)) / \
> > + sizeof(struct scatterlist *))
> > +
> > +/* last entry per sg list is used for chaining (pointer to next list) */
> > +#define SG_USER_PAGES_PER_LIST (SG_MAX_SINGLE_ALLOC - 1)
> >
> > /* a single mutex to manage channel initialization and attachment */
> > static DEFINE_MUTEX(virtio_9p_lock);
> >
> > @@ -53,8 +77,7 @@ static atomic_t vp_pinned = ATOMIC_INIT(0);
> >
> > * @ring_bufs_avail: flag to indicate there is some available in the ring buf
> > * @vc_wq: wait queue for waiting for thing to be added to ring buf
> > * @p9_max_pages: maximum number of pinned pages
> >
> > - * @sg: scatter gather list which is used to pack a request (protected?)
> > - * @sg_n: amount of elements in sg array
> > + * @vq_sg: table of scatter gather lists, which are used to pack a request
> >
> > * @chan_list: linked list of channels
> > *
> > * We keep all per-channel information in a structure.
> >
> > @@ -77,9 +100,7 @@ struct virtio_chan {
> >
> > * will be placing it in each channel.
> > */
> >
> > unsigned long p9_max_pages;
> >
> > - /* Scatterlist: can be too big for stack. */
> > - struct scatterlist *sg;
> > - size_t sg_n;
> > + struct virtqueue_sg *vq_sg;
> >
> > /**
> >
> > * @tag: name to identify a mount null terminated
> > */
> >
> > @@ -96,6 +117,92 @@ static unsigned int rest_of_page(void *data)
> >
> > return PAGE_SIZE - offset_in_page(data);
> >
> > }
> >
> > +/**
> > + * vq_sg_page - returns user page for given page index
> > + * @vq_sg: scatter gather lists used by this transport
> > + * @page: user page index across all scatter gather lists
> > + */
> > +static struct scatterlist *vq_sg_page(struct virtqueue_sg *vq_sg, size_t page)
> > +{
> > + unsigned int node = page / SG_USER_PAGES_PER_LIST;
> > + unsigned int leaf = page % SG_USER_PAGES_PER_LIST;
> > + BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX);
>
> probably want to check against vq_sg->nsgl instead?
> (we already check nsgl <= MAX on alloc)

Right, that makes sense!

> > + return &vq_sg->sgl[node][leaf];
> > +}
> > +
> > +/**
> > + * vq_sg_npages - returns total number of individual user pages in passed
> > + * scatter gather lists
> > + * @vq_sg: scatter gather lists to be counted
> > + */
> > +static size_t vq_sg_npages(struct virtqueue_sg *vq_sg)
> > +{
> > + return vq_sg->nsgl * SG_USER_PAGES_PER_LIST;
> > +}
> > +
> > +/**
> > + * vq_sg_free - free all memory previously allocated for @vq_sg
> > + * @vq_sg: scatter gather lists to be freed
> > + */
> > +static void vq_sg_free(struct virtqueue_sg *vq_sg)
> > +{
> > + unsigned int i;
> > +
> > + if (!vq_sg)
> > + return;
> > +
> > + for (i = 0; i < vq_sg->nsgl; ++i) {
> > + kfree(vq_sg->sgl[i]);
> > + }
> > + kfree(vq_sg);
> > +}
> > +
> > +/**
> > + * vq_sg_alloc - allocates and returns @nsgl scatter gather lists
> > + * @nsgl: amount of scatter gather lists to be allocated
> > + * If @nsgl is larger than one then chained lists are used if supported by
> > + * architecture.
> > + */
> > +static struct virtqueue_sg *vq_sg_alloc(unsigned int nsgl)
> > +{
> > + struct virtqueue_sg *vq_sg;
> > + unsigned int i;
> > +
> > + BUG_ON(!nsgl || nsgl > VIRTQUEUE_SG_NSGL_MAX);
> > +#ifdef CONFIG_ARCH_NO_SG_CHAIN
> > + if (WARN_ON_ONCE(nsgl > 1))
> > + return NULL;
> > +#endif
> > +
> > + vq_sg = kzalloc(sizeof(struct virtqueue_sg) +
> > + nsgl * sizeof(struct scatterlist *),
> > + GFP_KERNEL);
> > +
> > + if (!vq_sg)
> > + return NULL;
> > +
> > + vq_sg->nsgl = nsgl;
> > +
> > + for (i = 0; i < nsgl; ++i) {
> > + vq_sg->sgl[i] = kmalloc_array(
> > + SG_MAX_SINGLE_ALLOC, sizeof(struct scatterlist),
> > + GFP_KERNEL
> > + );
> > + if (!vq_sg->sgl[i]) {
> > + vq_sg_free(vq_sg);
> > + return NULL;
> > + }
> > + sg_init_table(vq_sg->sgl[i], SG_MAX_SINGLE_ALLOC);
> > + if (i) {
> > + /* chain the lists */
> > + sg_chain(vq_sg->sgl[i - 1], SG_MAX_SINGLE_ALLOC,
> > + vq_sg->sgl[i]);
> > + }
> > + }
> > + sg_mark_end(&vq_sg->sgl[nsgl - 1][SG_MAX_SINGLE_ALLOC - 1]);
> > + return vq_sg;
> > +}
> > +
> >
> > /**
> >
> > * p9_virtio_close - reclaim resources of a channel
> > * @client: client instance
> >
> > @@ -158,9 +265,8 @@ static void req_done(struct virtqueue *vq)
> >
> > /**
> >
> > * pack_sg_list - pack a scatter gather list from a linear buffer
> >
> > - * @sg: scatter/gather list to pack into
> > + * @vq_sg: scatter/gather lists to pack into
> >
> > * @start: which segment of the sg_list to start at
> >
> > - * @limit: maximum segment to pack data to
> >
> > * @data: data to pack into scatter/gather list
> > * @count: amount of data to pack into the scatter/gather list
> > *
> >
> > @@ -170,11 +276,12 @@ static void req_done(struct virtqueue *vq)
> >
> > *
> > */
> >
> > -static int pack_sg_list(struct scatterlist *sg, int start,
> > - int limit, char *data, int count)
> > +static int pack_sg_list(struct virtqueue_sg *vq_sg, int start,
> > + char *data, int count)
> >
> > {
> >
> > int s;
> > int index = start;
> >
> > + size_t limit = vq_sg_npages(vq_sg);
> >
> > while (count) {
> >
> > s = rest_of_page(data);
> >
> > @@ -182,13 +289,13 @@ static int pack_sg_list(struct scatterlist *sg, int start,
> > s = count;
> >
> > BUG_ON(index >= limit);
> > /* Make sure we don't terminate early. */
> >
> > - sg_unmark_end(&sg[index]);
> > - sg_set_buf(&sg[index++], data, s);
> > + sg_unmark_end(vq_sg_page(vq_sg, index));
> > + sg_set_buf(vq_sg_page(vq_sg, index++), data, s);
> >
> > count -= s;
> > data += s;
> >
> > }
> > if (index-start)
> >
> > - sg_mark_end(&sg[index - 1]);
> > + sg_mark_end(vq_sg_page(vq_sg, index - 1));
> >
> > return index-start;
> >
> > }
> >
> > @@ -208,21 +315,21 @@ static int p9_virtio_cancelled(struct p9_client *client, struct p9_req_t *req)
> > /**
> >
> > * pack_sg_list_p - Just like pack_sg_list. Instead of taking a buffer,
> > * this takes a list of pages.
> >
> > - * @sg: scatter/gather list to pack into
> > + * @vq_sg: scatter/gather lists to pack into
> >
> > * @start: which segment of the sg_list to start at
> >
> > - * @limit: maximum number of pages in sg list.
> >
> > * @pdata: a list of pages to add into sg.
> > * @nr_pages: number of pages to pack into the scatter/gather list
> > * @offs: amount of data in the beginning of first page _not_ to pack
> > * @count: amount of data to pack into the scatter/gather list
> > */
> >
> > static int
> >
> > -pack_sg_list_p(struct scatterlist *sg, int start, int limit,
> > +pack_sg_list_p(struct virtqueue_sg *vq_sg, int start,
> >
> > struct page **pdata, int nr_pages, size_t offs, int count)
> >
> > {
> >
> > int i = 0, s;
> > int data_off = offs;
> > int index = start;
> >
> > + size_t limit = vq_sg_npages(vq_sg);
> >
> > BUG_ON(nr_pages > (limit - start));
> > /*
> >
> > @@ -235,15 +342,16 @@ pack_sg_list_p(struct scatterlist *sg, int start, int limit,
> > s = count;
> >
> > BUG_ON(index >= limit);
> > /* Make sure we don't terminate early. */
> >
> > - sg_unmark_end(&sg[index]);
> > - sg_set_page(&sg[index++], pdata[i++], s, data_off);
> > + sg_unmark_end(vq_sg_page(vq_sg, index));
> > + sg_set_page(vq_sg_page(vq_sg, index++), pdata[i++], s,
> > + data_off);
> >
> > data_off = 0;
> > count -= s;
> > nr_pages--;
> >
> > }
> >
> > if (index-start)
> >
> > - sg_mark_end(&sg[index - 1]);
> > + sg_mark_end(vq_sg_page(vq_sg, index - 1));
> >
> > return index - start;
> >
> > }
> >
> > @@ -271,15 +379,13 @@ p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
> > out_sgs = in_sgs = 0;
> > /* Handle out VirtIO ring buffers */
> >
> > - out = pack_sg_list(chan->sg, 0,
> > - chan->sg_n, req->tc.sdata, req->tc.size);
> > + out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
> >
> > if (out)
> >
> > - sgs[out_sgs++] = chan->sg;
> > + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
> >
> > - in = pack_sg_list(chan->sg, out,
> > - chan->sg_n, req->rc.sdata, req->rc.capacity);
> > + in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, req->rc.capacity);
> >
> > if (in)
> >
> > - sgs[out_sgs + in_sgs++] = chan->sg + out;
> > + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
> >
> > err = virtqueue_add_sgs(chan->vq, sgs, out_sgs, in_sgs, req,
> >
> > GFP_ATOMIC);
> >
> > @@ -448,16 +554,15 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
> > out_sgs = in_sgs = 0;
> >
> > /* out data */
> >
> > - out = pack_sg_list(chan->sg, 0,
> > - chan->sg_n, req->tc.sdata, req->tc.size);
> > + out = pack_sg_list(chan->vq_sg, 0, req->tc.sdata, req->tc.size);
> >
> > if (out)
> >
> > - sgs[out_sgs++] = chan->sg;
> > + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, 0);
> >
> > if (out_pages) {
> >
> > - sgs[out_sgs++] = chan->sg + out;
> > - out += pack_sg_list_p(chan->sg, out, chan->sg_n,
> > - out_pages, out_nr_pages, offs, outlen);
> > + sgs[out_sgs++] = vq_sg_page(chan->vq_sg, out);
> > + out += pack_sg_list_p(chan->vq_sg, out, out_pages,
> > + out_nr_pages, offs, outlen);
> >
> > }
> >
> > /*
> >
> > @@ -467,15 +572,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
> > * Arrange in such a way that server places header in the
> > * allocated memory and payload onto the user buffer.
> > */
> >
> > - in = pack_sg_list(chan->sg, out,
> > - chan->sg_n, req->rc.sdata, in_hdr_len);
> > + in = pack_sg_list(chan->vq_sg, out, req->rc.sdata, in_hdr_len);
> >
> > if (in)
> >
> > - sgs[out_sgs + in_sgs++] = chan->sg + out;
> > + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out);
> >
> > if (in_pages) {
> >
> > - sgs[out_sgs + in_sgs++] = chan->sg + out + in;
> > - in += pack_sg_list_p(chan->sg, out + in, chan->sg_n,
> > - in_pages, in_nr_pages, offs, inlen);
> > + sgs[out_sgs + in_sgs++] = vq_sg_page(chan->vq_sg, out + in);
> > + in += pack_sg_list_p(chan->vq_sg, out + in, in_pages,
> > + in_nr_pages, offs, inlen);
> >
> > }
> >
> > BUG_ON(out_sgs + in_sgs > ARRAY_SIZE(sgs));
> >
> > @@ -576,14 +680,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> > goto fail;
> >
> > }
> >
> > - chan->sg = kmalloc_array(VIRTQUEUE_DEFAULT_NUM,
> > - sizeof(struct scatterlist), GFP_KERNEL);
> > - if (!chan->sg) {
> > + chan->vq_sg = vq_sg_alloc(VIRTQUEUE_SG_NSGL_DEFAULT);
> > + if (!chan->vq_sg) {
> >
> > pr_err("Failed to allocate virtio 9P channel\n");
> > err = -ENOMEM;
> > goto out_free_chan_shallow;
> >
> > }
> >
> > - chan->sg_n = VIRTQUEUE_DEFAULT_NUM;
> >
> > chan->vdev = vdev;
> >
> > @@ -596,8 +698,6 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> >
> > chan->vq->vdev->priv = chan;
> > spin_lock_init(&chan->lock);
> >
> > - sg_init_table(chan->sg, chan->sg_n);
> > -
> >
> > chan->inuse = false;
> > if (virtio_has_feature(vdev, VIRTIO_9P_MOUNT_TAG)) {
> >
> > virtio_cread(vdev, struct virtio_9p_config, tag_len, &tag_len);
> >
> > @@ -646,7 +746,7 @@ static int p9_virtio_probe(struct virtio_device *vdev)
> >
> > out_free_vq:
> > vdev->config->del_vqs(vdev);
> >
> > out_free_chan:
> > - kfree(chan->sg);
> > + vq_sg_free(chan->vq_sg);
> >
> > out_free_chan_shallow:
> > kfree(chan);
> >
> > fail:
> > @@ -741,7 +841,7 @@ static void p9_virtio_remove(struct virtio_device *vdev)
> > kobject_uevent(&(vdev->dev.kobj), KOBJ_CHANGE);
> > kfree(chan->tag);
> > kfree(chan->vc_wq);
> >
> > - kfree(chan->sg);
> > + vq_sg_free(chan->vq_sg);
> >
> > kfree(chan);
> >
> > }
> >
> > @@ -780,7 +880,8 @@ static struct p9_trans_module p9_virtio_trans = {
> >
> > * that are not at page boundary, that can result in an extra
> > * page in zero copy.
> > */
> >
> > - .maxsize = PAGE_SIZE * (VIRTQUEUE_DEFAULT_NUM - 3),
> > + .maxsize = PAGE_SIZE *
> > + ((VIRTQUEUE_SG_NSGL_DEFAULT * SG_USER_PAGES_PER_LIST) - 3),
> >
> > .def = 1,
> > .owner = THIS_MODULE,
> >
> > };




2022-07-13 09:45:53

by Christian Schoenebeck

[permalink] [raw]
Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

On Tuesday, 12 July 2022 23:11:42 CEST Dominique Martinet wrote:
> Dominique Martinet wrote on Wed, Jul 13, 2022 at 04:33:35AM +0900:
> > Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:36PM +0200:
> > > So far 'msize' was simply used for all 9p message types, which is far
> > > too much and slowed down performance tremendously with large values
> > > for user configurable 'msize' option.
> > >
> > > Let's stop this waste by using the new p9_msg_buf_size() function for
> > > allocating more appropriate, smaller buffers according to what is
> > > actually sent over the wire.
> > >
> > > Only exception: RDMA transport is currently excluded from this, as
> > > it would not cope with it. [1]
>
> Thinking back on RDMA:
> - vs. one or two buffers as discussed in another thread, rdma will still
> require two buffers, we post the receive buffer before sending as we
> could otherwise be raced (reply from server during the time it'd take to
> recycle the send buffer)
> In practice the recv buffers should act like a fifo and we might be able
> to post the buffer we're about to send for recv before sending it and it
> shouldn't be overwritten until it's sent, but that doesn't look quite good.
>
> - for this particular patch, we can still allocate smaller short buffers
> for requests, so we should probably keep tsize to 0.
> rsize there really isn't much we can do without a protocol change
> though...

Good to know! I don't have any RDMA setup here to test, so I rely on what you
say and adjust this in v6 accordingly, along with the strcmp -> flag change of
course.

As this flag is going to be very RDMA-transport specific, I'm still scratching
my head for a good name though.

Best regards,
Christian Schoenebeck


2022-07-13 09:47:12

by Christian Schoenebeck

[permalink] [raw]
Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

On Wednesday, 13 July 2022 11:19:48 CEST Christian Schoenebeck wrote:
> On Tuesday, 12 July 2022 23:11:42 CEST Dominique Martinet wrote:
> > Dominique Martinet wrote on Wed, Jul 13, 2022 at 04:33:35AM +0900:
> > > Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:36PM +0200:
> > > > So far 'msize' was simply used for all 9p message types, which is far
> > > > too much and slowed down performance tremendously with large values
> > > > for user configurable 'msize' option.
> > > >
> > > > Let's stop this waste by using the new p9_msg_buf_size() function for
> > > > allocating more appropriate, smaller buffers according to what is
> > > > actually sent over the wire.
> > > >
> > > > Only exception: RDMA transport is currently excluded from this, as
> > > > it would not cope with it. [1]
> >
> > Thinking back on RDMA:
> > - vs. one or two buffers as discussed in another thread, rdma will still
> > require two buffers, we post the receive buffer before sending as we
> > could otherwise be raced (reply from server during the time it'd take to
> > recycle the send buffer)
> > In practice the recv buffers should act like a fifo and we might be able
> > to post the buffer we're about to send for recv before sending it and it
> > shouldn't be overwritten until it's sent, but that doesn't look quite
> > good.
> >
> > - for this particular patch, we can still allocate smaller short buffers
> > for requests, so we should probably keep tsize to 0.
> > rsize there really isn't much we can do without a protocol change
> > though...
>
> Good to know! I don't have any RDMA setup here to test, so I rely on what
> you say and adjust this in v6 accordingly, along with the strcmp -> flag
> change of course.
>
> As this flag is going to be very RDMA-transport specific, I'm still
> scratching my head for a good name though.

Or, instead of inventing some exotic flag name, maybe introducing an enum for
the individual 9p transport types?

Best regards,
Christian Schoenebeck



2022-07-13 09:47:24

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

Christian Schoenebeck wrote on Wed, Jul 13, 2022 at 11:19:48AM +0200:
> > - for this particular patch, we can still allocate smaller short buffers
> > for requests, so we should probably keep tsize to 0.
> > rsize there really isn't much we can do without a protocol change
> > though...
>
> Good to know! I don't have any RDMA setup here to test, so I rely on what you
> say and adjust this in v6 accordingly, along with the strcmp -> flag change of
> course.

Yeah... I've got a connect-x 3 (mlx4, got a cheap old one) card laying
around, I need to find somewhere to plug it in and actually run some
validation again at some point.
Haven't used 9p/RDMA since I left my previous work in 2020...

I'll try to find time for that before the merge


> As this flag is going to be very RDMA-transport specific, I'm still scratching
> my head for a good name though.

The actual limitation is that receive buffers are pooled, so something
like pooled_rcv_buffers or shared_rcv_buffers, or anything along those
lines?

--
Dominique

2022-07-13 10:31:16

by Dominique Martinet

[permalink] [raw]
Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

Christian Schoenebeck wrote on Wed, Jul 13, 2022 at 11:29:13AM +0200:
> > As this flag is going to be very RDMA-transport specific, I'm still
> > scratching my head for a good name though.
>
> Or, instead of inventing some exotic flag name, maybe introducing an enum for
> the individual 9p transport types?

That works for me as well

--
Dominique

2022-07-13 10:32:58

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v5 10/11] net/9p: add p9_msg_buf_size()

Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:33PM +0200:
> This new function calculates a buffer size suitable for holding the
> intended 9p request or response. For rather small message types (which
> applies to almost all 9p message types actually) simply use hard coded
> values. For some variable-length and potentially large message types
> calculate a more precise value according to what data is actually
> transmitted to avoid unnecessarily huge buffers.
>
> Signed-off-by: Christian Schoenebeck <[email protected]>

Overall I'd already checked this a few times, but this time I also went
through the client.c va argument lists; just a few nitpicks.


> ---
> net/9p/protocol.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++
> net/9p/protocol.h | 2 +
> 2 files changed, 156 insertions(+)
>
> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> index 3754c33e2974..49939e8cde2a 100644
> --- a/net/9p/protocol.c
> +++ b/net/9p/protocol.c
> @@ -23,6 +23,160 @@
>
> #include <trace/events/9p.h>
>
> +/* len[2] text[len] */
> +#define P9_STRLEN(s) \
> + (2 + min_t(size_t, s ? strlen(s) : 0, USHRT_MAX))
> +
> +/**
> + * p9_msg_buf_size - Returns a buffer size sufficiently large to hold the
> + * intended 9p message.
> + * @c: client
> + * @type: message type
> + * @fmt: format template for assembling request message
> + * (see p9pdu_vwritef)
> + * @ap: variable arguments to be fed to passed format template
> + * (see p9pdu_vwritef)
> + *
> + * Note: Even for response types (P9_R*) the format template and variable
> + * arguments must always be for the originating request type (P9_T*).
> + */
> +size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
> + const char *fmt, va_list ap)
> +{
> + /* size[4] type[1] tag[2] */
> + const int hdr = 4 + 1 + 2;
> + /* ename[s] errno[4] */
> + const int rerror_size = hdr + P9_ERRMAX + 4;
> + /* ecode[4] */
> + const int rlerror_size = hdr + 4;
> + const int err_size =
> + c->proto_version == p9_proto_2000L ? rlerror_size : rerror_size;
> +
> + switch (type) {
> +
> + /* message types not used at all */
> + case P9_TERROR:
> + case P9_TLERROR:
> + case P9_TAUTH:
> + case P9_RAUTH:
> + BUG();
> +
> + /* variable length & potentially large message types */
> + case P9_TATTACH:
> + BUG_ON(strcmp("ddss?u", fmt));
> + va_arg(ap, int32_t);
> + va_arg(ap, int32_t);
> + {
> + const char *uname = va_arg(ap, const char *);
> + const char *aname = va_arg(ap, const char *);
> + /* fid[4] afid[4] uname[s] aname[s] n_uname[4] */
> + return hdr + 4 + 4 + P9_STRLEN(uname) + P9_STRLEN(aname) + 4;
> + }
> + case P9_TWALK:
> + BUG_ON(strcmp("ddT", fmt));
> + va_arg(ap, int32_t);
> + va_arg(ap, int32_t);
> + {
> + uint i, nwname = max(va_arg(ap, int), 0);

I was about to say that the max is useless as the for loop would be cut
short, but these are unsigned... So the code in protocol.c p9pdu_vwritef
'T' has a bug (int cast directly to uint16): do you want to fix it or
shall I go ahead?

> + size_t wname_all;
> + const char **wnames = va_arg(ap, const char **);
> + for (i = 0, wname_all = 0; i < nwname; ++i) {
> + wname_all += P9_STRLEN(wnames[i]);
> + }
> + /* fid[4] newfid[4] nwname[2] nwname*(wname[s]) */
> + return hdr + 4 + 4 + 2 + wname_all;
> + }
> + case P9_RWALK:
> + BUG_ON(strcmp("ddT", fmt));
> + va_arg(ap, int32_t);
> + va_arg(ap, int32_t);
> + {
> + uint nwname = va_arg(ap, int);
> + /* nwqid[2] nwqid*(wqid[13]) */
> + return max_t(size_t, hdr + 2 + nwname * 13, err_size);
> + }
> + case P9_TCREATE:
> + BUG_ON(strcmp("dsdb?s", fmt));
> + va_arg(ap, int32_t);
> + {
> + const char *name = va_arg(ap, const char *);
> + if ((c->proto_version != p9_proto_2000u) &&
> + (c->proto_version != p9_proto_2000L))

(I don't think 9p2000.L can call TCREATE, but it doesn't really hurt
either)

> + /* fid[4] name[s] perm[4] mode[1] */
> + return hdr + 4 + P9_STRLEN(name) + 4 + 1;
> + {
> + va_arg(ap, int32_t);
> + va_arg(ap, int);
> + {
> + const char *ext = va_arg(ap, const char *);
> + /* fid[4] name[s] perm[4] mode[1] extension[s] */
> + return hdr + 4 + P9_STRLEN(name) + 4 + 1 + P9_STRLEN(ext);
> + }
> + }
> + }
> + case P9_TLCREATE:
> + BUG_ON(strcmp("dsddg", fmt));
> + va_arg(ap, int32_t);
> + {
> + const char *name = va_arg(ap, const char *);
> + /* fid[4] name[s] flags[4] mode[4] gid[4] */
> + return hdr + 4 + P9_STRLEN(name) + 4 + 4 + 4;
> + }
> + case P9_RREAD:
> + case P9_RREADDIR:
> + BUG_ON(strcmp("dqd", fmt));
> + va_arg(ap, int32_t);
> + va_arg(ap, int64_t);
> + {
> + const int32_t count = va_arg(ap, int32_t);
> + /* count[4] data[count] */
> + return max_t(size_t, hdr + 4 + count, err_size);
> + }
> + case P9_TWRITE:
> + BUG_ON(strcmp("dqV", fmt));
> + va_arg(ap, int32_t);
> + va_arg(ap, int64_t);
> + {
> + const int32_t count = va_arg(ap, int32_t);
> + /* fid[4] offset[8] count[4] data[count] */
> + return hdr + 4 + 8 + 4 + count;
> + }
> + case P9_TRENAMEAT:

if we have trenameat we probably want trename, tunlinkat as well?
What's your criterion for counting individually vs slapping 8k at it?

In this particular case, oldname/newname are single component names
within a directory so this is capped at 2*(4+256), that could easily fit
in 4k without bothering.

> + BUG_ON(strcmp("dsds", fmt));
> + va_arg(ap, int32_t);
> + {
> + const char *oldname = va_arg(ap, const char *);
> + va_arg(ap, int32_t);
> + {
> + const char *newname = va_arg(ap, const char *);

(style nitpick) I don't see the point of nesting another level of
indentation here, it feels cleaner to declare oldname/newname at the
start of the block and be done with it.
Doesn't really matter but it was a bit confusing with the if for TCREATE
earlier.

> + /* olddirfid[4] oldname[s] newdirfid[4] newname[s] */
> + return hdr + 4 + P9_STRLEN(oldname) + 4 + P9_STRLEN(newname);
> + }
> + }
> + case P9_RERROR:
> + return rerror_size;
> + case P9_RLERROR:
> + return rlerror_size;
> +
> + /* small message types */

ditto: what's your criteria for 4k vs 8k?

> + case P9_TSTAT:

this is just fid[4], so 4k is more than enough

> + case P9_RSTAT:

also fixed size 4+4+8+8+8+8+8+8+4 -- fits in 4k.

> + case P9_TSYMLINK:

that one has a symlink target which can be arbitrarily long (filesystem
specific; 4k is the usual limit for Linux, but some filesystem I don't
know might handle more) -- it might be worth going through the trouble of
measuring it.

Ah, we can't support an arbitrary length as we won't know the size for
rreadlink before the reply comes, so we have to set some arbitrary
max. Okay for 8k.

> + case P9_RREADLINK:

Ok as above.

> + case P9_TXATTRWALK:

xattr names seem capped at 256, could fit 4k but ok for 8.

> + case P9_TXATTRCREATE:

same, ok for either 4 or 8.

> + case P9_TLINK:

name is component name inside directory so capped at 256, but ok.

> + case P9_TMKDIR:

same

> + case P9_TUNLINKAT:

same

> + return 8 * 1024;
> +
> + /* tiny message types */
> + default:

I went through things we didn't list:
mknod has a name, but it's also a component within a directory, so it
should be consistent with mkdir/unlinkat.
trename, same.

tlock contains client_id which comes from hostname.. I think that's
capped at 256 as well? so ok for 4k.

rest all looks ok to me.


--
Dominique


> + return 4 * 1024;
> +
> + }
> +}
> +
> static int
> p9pdu_writef(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);
>
> diff --git a/net/9p/protocol.h b/net/9p/protocol.h
> index 6d719c30331a..ad2283d1f96b 100644
> --- a/net/9p/protocol.h
> +++ b/net/9p/protocol.h
> @@ -8,6 +8,8 @@
> * Copyright (C) 2008 by IBM, Corp.
> */
>
> +size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
> + const char *fmt, va_list ap);
> int p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
> va_list ap);
> int p9pdu_readf(struct p9_fcall *pdu, int proto_version, const char *fmt, ...);

2022-07-13 10:46:12

by Christian Schoenebeck

Subject: Re: [V9fs-developer] [PATCH v5 11/11] net/9p: allocate appropriate reduced message buffers

On Wednesday, 13 July 2022 11:29:57 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Wed, Jul 13, 2022 at 11:19:48AM +0200:
> > > - for this particular patch, we can still allocate smaller short buffers
> > > for requests, so we should probably keep tsize to 0.
> > > rsize there really isn't much we can do without a protocol change
> > > though...
> >
> > Good to know! I don't have any RDMA setup here to test, so I rely on what
> > you say and adjust this in v6 accordingly, along with the strcmp -> flag
> > change of course.
>
> Yeah... I've got a connect-x 3 (mlx4, got a cheap old one) card laying
> around, I need to find somewhere to plug it in and actually run some
> validation again at some point.
> Haven't used 9p/RDMA since I left my previous work in 2020...
>
> I'll try to find time for that before the merge
>
> > As this flag is going to be very RDMA-transport specific, I'm still
> > scratching my head for a good name though.
>
> The actual limitation is that receive buffers are pooled, so something
> to like pooled_rcv_buffers or shared_rcv_buffers or anything along that
> line?

OK, I'll go this way then, as it's the easiest to do, can easily be refactored
in the future if someone really cares, and it feels less like a hack than
injecting "if transport == rdma" into client code directly.
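
For what it's worth, a minimal standalone sketch of how such a capability flag could look; the struct and field names below (especially pooled_rcv_buffers) are assumptions drawn from this thread, not the actual kernel definitions in include/net/9p/transport.h:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch mirroring the idea of struct p9_trans_module; names here are
 * illustrative assumptions, not the real kernel fields. */
struct trans_module_sketch {
	const char *name;		/* e.g. "rdma", "virtio", "tcp" */
	bool pooled_rcv_buffers;	/* RDMA: receive buffers come from a
					 * shared pool, so replies cannot be
					 * sized per request */
};

/* Client code then tests the capability instead of the transport name:
 * pooled receive buffers force full-msize reply allocations. */
static size_t reply_buf_size(const struct trans_module_sketch *t,
			     size_t wanted, size_t msize)
{
	return t->pooled_rcv_buffers ? msize : wanted;
}
```

The point of the flag is exactly the one made above: client code stays transport-agnostic, and a future transport with the same pooling behavior just sets the flag.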

Best regards,
Christian Schoenebeck


2022-07-13 13:27:32

by Christian Schoenebeck

Subject: Re: [PATCH v5 10/11] net/9p: add p9_msg_buf_size()

On Wednesday, 13 July 2022 12:29:31 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Tue, Jul 12, 2022 at 04:31:33PM +0200:
> > This new function calculates a buffer size suitable for holding the
> > intended 9p request or response. For rather small message types (which
> > applies to almost all 9p message types actually) simply use hard coded
> > values. For some variable-length and potentially large message types
> > calculate a more precise value according to what data is actually
> > transmitted to avoid unnecessarily huge buffers.
> >
> > Signed-off-by: Christian Schoenebeck <[email protected]>
>
> Overally already had checked a few times but I just went through
> client.c va argument lists as well this time, just a few nitpicks.
>
> > ---
> >
> > net/9p/protocol.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++
> > net/9p/protocol.h | 2 +
> > 2 files changed, 156 insertions(+)
> >
> > diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> > index 3754c33e2974..49939e8cde2a 100644
> > --- a/net/9p/protocol.c
> > +++ b/net/9p/protocol.c
> > @@ -23,6 +23,160 @@
> >
> > #include <trace/events/9p.h>
> >
> > +/* len[2] text[len] */
> > +#define P9_STRLEN(s) \
> > + (2 + min_t(size_t, s ? strlen(s) : 0, USHRT_MAX))
> > +
> > +/**
> > + * p9_msg_buf_size - Returns a buffer size sufficiently large to hold the
> > + * intended 9p message.
> > + * @c: client
> > + * @type: message type
> > + * @fmt: format template for assembling request message
> > + * (see p9pdu_vwritef)
> > + * @ap: variable arguments to be fed to passed format template
> > + * (see p9pdu_vwritef)
> > + *
> > + * Note: Even for response types (P9_R*) the format template and variable
> > + * arguments must always be for the originating request type (P9_T*).
> > + */
> > +size_t p9_msg_buf_size(struct p9_client *c, enum p9_msg_t type,
> > + const char *fmt, va_list ap)
> > +{
> > + /* size[4] type[1] tag[2] */
> > + const int hdr = 4 + 1 + 2;
> > + /* ename[s] errno[4] */
> > + const int rerror_size = hdr + P9_ERRMAX + 4;
> > + /* ecode[4] */
> > + const int rlerror_size = hdr + 4;
> > + const int err_size =
> > + c->proto_version == p9_proto_2000L ? rlerror_size : rerror_size;
> > +
> > + switch (type) {
> > +
> > + /* message types not used at all */
> > + case P9_TERROR:
> > + case P9_TLERROR:
> > + case P9_TAUTH:
> > + case P9_RAUTH:
> > + BUG();
> > +
> > + /* variable length & potentially large message types */
> > + case P9_TATTACH:
> > + BUG_ON(strcmp("ddss?u", fmt));
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int32_t);
> > + {
> > + const char *uname = va_arg(ap, const char *);
> > + const char *aname = va_arg(ap, const char *);
> > + /* fid[4] afid[4] uname[s] aname[s] n_uname[4] */
> > + return hdr + 4 + 4 + P9_STRLEN(uname) + P9_STRLEN(aname) + 4;
> > + }
> > + case P9_TWALK:
> > + BUG_ON(strcmp("ddT", fmt));
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int32_t);
> > + {
> > + uint i, nwname = max(va_arg(ap, int), 0);
>
> I was about to say that the max is useless as for loop would be cut
> short, but these are unsigned... So the code in protocol.c p9pdu_vwritef
> 'T' has a bug (int cast directly to uint16): do you want to fix it or
> shall I go ahead?

I can either send a separate patch today fixing 'T', or if you want to handle it yourself, just go ahead.

> > + size_t wname_all;
> > + const char **wnames = va_arg(ap, const char **);
> > + for (i = 0, wname_all = 0; i < nwname; ++i) {
> > + wname_all += P9_STRLEN(wnames[i]);
> > + }
> > + /* fid[4] newfid[4] nwname[2] nwname*(wname[s]) */
> > + return hdr + 4 + 4 + 2 + wname_all;
> > + }
> > + case P9_RWALK:
> > + BUG_ON(strcmp("ddT", fmt));
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int32_t);
> > + {
> > + uint nwname = va_arg(ap, int);
> > + /* nwqid[2] nwqid*(wqid[13]) */
> > + return max_t(size_t, hdr + 2 + nwname * 13, err_size);
> > + }
> > + case P9_TCREATE:
> > + BUG_ON(strcmp("dsdb?s", fmt));
> > + va_arg(ap, int32_t);
> > + {
> > + const char *name = va_arg(ap, const char *);
> > + if ((c->proto_version != p9_proto_2000u) &&
> > + (c->proto_version != p9_proto_2000L))
>
> (I don't think 9p2000.L can call TCREATE, but it doesn't really hurt
> either)

Yes, Tcreate exists only in 9p2000 and 9p2000.u. Semantically this particular check here means "if proto == 9p2000". I can't remember anymore why I came up with this inverted form here. I'll change it to "if (c->proto_version == p9_proto_legacy)".
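
A quick standalone illustration of that equivalence; the enum values and the helper below are simplified stand-ins for the net/9p definitions, not the kernel code itself:

```c
#include <stddef.h>

/* Simplified stand-ins for the net/9p protocol version constants. */
enum proto_sketch { proto_legacy, proto_2000u, proto_2000L };

/* Tcreate buffer size: the extension[s] field only exists in 9p2000.u,
 * so testing for legacy 9p2000 directly has the same effect as the
 * inverted two-way check in the patch. */
static size_t tcreate_size(enum proto_sketch v, size_t name_len,
			   size_t ext_len)
{
	const size_t hdr = 4 + 1 + 2;	/* size[4] type[1] tag[2] */
	const size_t s = 2;		/* len[2] prefix of each string */

	if (v == proto_legacy)
		/* fid[4] name[s] perm[4] mode[1] */
		return hdr + 4 + (s + name_len) + 4 + 1;
	/* fid[4] name[s] perm[4] mode[1] extension[s] */
	return hdr + 4 + (s + name_len) + 4 + 1 + (s + ext_len);
}
```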

> > + /* fid[4] name[s] perm[4] mode[1] */
> > + return hdr + 4 + P9_STRLEN(name) + 4 + 1;
> > + {
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int);
> > + {
> > + const char *ext = va_arg(ap, const char *);
> > + /* fid[4] name[s] perm[4] mode[1] extension[s] */
> > + return hdr + 4 + P9_STRLEN(name) + 4 + 1 + P9_STRLEN(ext);
> > + }
> > + }
> > + }
> > + case P9_TLCREATE:
> > + BUG_ON(strcmp("dsddg", fmt));
> > + va_arg(ap, int32_t);
> > + {
> > + const char *name = va_arg(ap, const char *);
> > + /* fid[4] name[s] flags[4] mode[4] gid[4] */
> > + return hdr + 4 + P9_STRLEN(name) + 4 + 4 + 4;
> > + }
> > + case P9_RREAD:
> > + case P9_RREADDIR:
> > + BUG_ON(strcmp("dqd", fmt));
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int64_t);
> > + {
> > + const int32_t count = va_arg(ap, int32_t);
> > + /* count[4] data[count] */
> > + return max_t(size_t, hdr + 4 + count, err_size);
> > + }
> > + case P9_TWRITE:
> > + BUG_ON(strcmp("dqV", fmt));
> > + va_arg(ap, int32_t);
> > + va_arg(ap, int64_t);
> > + {
> > + const int32_t count = va_arg(ap, int32_t);
> > + /* fid[4] offset[8] count[4] data[count] */
> > + return hdr + 4 + 8 + 4 + count;
> > + }
>
> > + case P9_TRENAMEAT:
> if we have trenameat we probably want trename, tunlinkat as well?
> What's your criteria for counting individually vs slapping 8k at it?
>
> In this particular case, oldname/newname are single component names
> within a directory so this is capped at 2*(4+256), that could easily fit
> in 4k without bothering.

I have not taken the Linux kernel's current filename limit NAME_MAX (255) as the basis, in which case you would be right. Instead I looked up the maximum filename length among file systems in general, and saw that ReiserFS apparently supports up to slightly below 4k. So I took 4k as the basis for the calculation used here, with the intention of making this code more future proof, because revisiting this code later on always takes quite some time and always has a certain potential to miss details.

However if you want this to be based on what the Linux kernel currently supports, then I can also adjust this code to 255 being the basis.

Independent of the decision; additionally it might make sense to add something like:

#if NAME_MAX > 255
# error p9_msg_buf_size() needs adjustments
#endif

> > + BUG_ON(strcmp("dsds", fmt));
> > + va_arg(ap, int32_t);
> > + {
> > + const char *oldname = va_arg(ap, const char *);
> > + va_arg(ap, int32_t);
> > + {
> > + const char *newname = va_arg(ap, const char *);
>
> (style nitpick) I don't see the point of nesting another level of
> indentation here, it feels cleaner to declare oldname/newname at the
> start of the block and be done with it.

Because va_arg(ap, int32_t); must remain between those two declarations, and I think either the compiler or the style check script was barking at me. But I will recheck; if possible, I will remove the additional block scope here.

> > + /* olddirfid[4] oldname[s] newdirfid[4] newname[s] */
> > + return hdr + 4 + P9_STRLEN(oldname) + 4 + P9_STRLEN(newname);
> > + }
> > + }
> > + case P9_RERROR:
> > + return rerror_size;
> > + case P9_RLERROR:
> > + return rlerror_size;
> > +
> > + /* small message types */
>
> ditto: what's your criteria for 4k vs 8k?

As above, 4k being the basis for directory entry names, plus PATH_MAX (4k) as basis for maximum path length.

However, looking at it again, if NAME_MAX == 4k were assumed exactly, then Tsymlink would have the potential to exceed 8k, as it has name[s] and symtgt[s] plus the other fields.
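
The overflow is easy to check with the wire layout; the helper below is just illustrative arithmetic over the assumed 9p2000.L field sizes (Tsymlink: fid[4] name[s] symtgt[s] gid[4]):

```c
/* Worst-case Tsymlink size for a given maximum name/target length.
 * Layout assumed from 9p2000.L: fid[4] name[s] symtgt[s] gid[4],
 * each string prefixed by len[2], plus the size[4] type[1] tag[2]
 * message header. */
static int tsymlink_max(int name_max)
{
	const int hdr = 4 + 1 + 2;	/* size[4] type[1] tag[2] */
	const int str = 2 + name_max;	/* len[2] text[len] */

	return hdr + 4 + str + str + 4;
}
```

With name_max = 4096 this yields 8211 bytes, just past the 8k bucket, while name_max = 255 stays comfortably inside 4k.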

> > + case P9_TSTAT:
> this is just fid[4], so 4k is more than enough

I guess that was a typo and should have been Twstat instead?

> > + case P9_RSTAT:
> also fixed size 4+4+8+8+8+8+8+8+4 -- fits in 4k.

Rstat contains stat[n], which in turn contains variable-length string fields (filename, owner name, group name).
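
A rough upper-bound sketch under the legacy 9p2000 stat layout (field sizes taken from the Plan 9 wire format; the helper name is invented for illustration):

```c
#include <stddef.h>

/* Upper bound for an Rstat reply. stat[n] embeds four variable-length
 * strings -- name[s] uid[s] gid[s] muid[s] -- the rest is fixed size. */
static size_t rstat_max(size_t name_max)
{
	const size_t hdr = 4 + 1 + 2;	/* size[4] type[1] tag[2] */
	/* stat fixed part: size[2] type[2] dev[4] qid[13] mode[4]
	 * atime[4] mtime[4] length[8] */
	const size_t stat_fixed = 2 + 2 + 4 + 13 + 4 + 4 + 4 + 8;
	const size_t str = 2 + name_max;	/* len[2] text[len] */

	return hdr + 2 /* outer stat size[2] */ + stat_fixed + 4 * str;
}
```

With 255-byte names that is about 1k, but with a 4k assumption the four strings alone push it well past 8k -- the same reason Rstat can't live in the 4k bucket.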

> > + case P9_TSYMLINK:
> that one has symlink target which can be arbitrarily long (filesystem
> specific, 4k is the usual limit for linux but some filesystem I don't
> know might handle more -- it might be worth going through the trouble of
> going through it.

As mentioned above, if NAME_MAX == 4k were assumed exactly, then Tsymlink could even exceed 8k.

> Ah, we can't support an arbitrary length as we won't know the size for
> rreadlink before the reply comes, so we have to set some arbitrary
> max. Okay for 8k.
>
> > + case P9_RREADLINK:
> Ok as above.
>
> > + case P9_TXATTRWALK:
> xattr names seem capped at 256, could fit 4k but ok for 8.

Again current NAME_MAX (255) vs. assumed max. supported (4k) by filesystems.

> > + case P9_TXATTRCREATE:
> same, ok for either 4 or 8.
>
> > + case P9_TLINK:
> name is component name inside directory so capped at 256, but ok.
>
> > + case P9_TMKDIR:
> same
>
> > + case P9_TUNLINKAT:
> same
>
> > + return 8 * 1024;
> > +
> > + /* tiny message types */
>
> > + default:
> I went through things we didn't list:
> mknod has a name but it's also component within directory, so should be
> consistent with mkdir/unlinkat
> trename, same.

If 255 is assumed then it would be fine not to list it here; otherwise, with a max. name of 4k, I should rather list it as an 8k message.

> tlock contains client_id which comes from hostname.. I think that's
> capped at 256 as well? so ok for 4k.

Looks like that slipped through on my side completely. Again, with 255 as the basis it could be skipped; with 4k it would then need to be listed as an 8k message.

> rest all looks ok to me.

Thanks for the review! I know, that's really a dry patch to look at. :)

Best regards,
Christian Schoenebeck


2022-07-13 21:11:33

by Dominique Martinet

Subject: Re: [PATCH v5 10/11] net/9p: add p9_msg_buf_size()

Christian Schoenebeck wrote on Wed, Jul 13, 2022 at 03:06:01PM +0200:
> > > + case P9_TWALK:
> > > + BUG_ON(strcmp("ddT", fmt));
> > > + va_arg(ap, int32_t);
> > > + va_arg(ap, int32_t);
> > > + {
> > > + uint i, nwname = max(va_arg(ap, int), 0);
> >
> > I was about to say that the max is useless as for loop would be cut
> > short, but these are unsigned... So the code in protocol.c p9pdu_vwritef
> > 'T' has a bug (int cast directly to uint16): do you want to fix it or
> > shall I go ahead?
>
> I'd either send a separate patch today for fixing 'T', or if you want
> to handle it by yourself, then just go ahead.

I'd appreciate if you have time, doesn't make much difference though

> > > + case P9_TCREATE:
> > > + BUG_ON(strcmp("dsdb?s", fmt));
> > > + va_arg(ap, int32_t);
> > > + {
> > > + const char *name = va_arg(ap, const char *);
> > > + if ((c->proto_version != p9_proto_2000u) &&
> > > + (c->proto_version != p9_proto_2000L))
> >
> > (I don't think 9p2000.L can call TCREATE, but it doesn't really hurt
> > either)
>
> Yes, Tcreate is only 9p2000 and 9p2000.u. Semantically this particular
> check here means "if proto == 9p.2000". I can't remember anymore why I
> came up with this inverted form here. I'll change it to "if
> (c->proto_version == p9_proto_legacy)".

Sounds good.

> > > + case P9_TRENAMEAT:
> > if we have trenameat we probably want trename, tunlinkat as well?
> > What's your criteria for counting individually vs slapping 8k at it?
> >
> > In this particular case, oldname/newname are single component names
> > within a directory so this is capped at 2*(4+256), that could easily fit
> > in 4k without bothering.
>
> I have not taken the Linux kernel's current filename limit NAME_MAX
> (255) as basis, in that case you would be right. Instead I looked up
> what the maximum filename length among file systems in general was,
> and saw that ReiserFS supports up to slightly below 4k? So I took 4k
> as basis for the calculation used here, and the intention was to make
> this code more future proof. Because revisiting this code later on
> always takes quite some time and always has this certain potential to
> miss out details.

hmm, that's pretty deeply ingrained into the VFS but I guess it might
change eventually, yes.

I don't mind as long as we're consistent (cf. unlink/mkdir below), in
practice measuring doesn't cost much.

> Independent of the decision; additionally it might make sense to add
> something like:
>
> #if NAME_MAX > 255
> # error p9_msg_buf_size() needs adjustments
> #endif

That's probably an understatement but I don't mind either way, it
doesn't hurt.


> > > + BUG_ON(strcmp("dsds", fmt));
> > > + va_arg(ap, int32_t);
> > > + {
> > > + const char *oldname = va_arg(ap, const char *);
> > > + va_arg(ap, int32_t);
> > > + {
> > > + const char *newname = va_arg(ap, const char *);
> >
> > (style nitpick) I don't see the point of nesting another level of
> > indentation here, it feels cleaner to declare oldname/newname at the
> > start of the block and be done with it.
>
> Because va_arg(ap, int32_t); must remain between those two
> declarations, and I think either the compiler or style check script
> was barking at me. But I will recheck, if possible I will remove the
> additional block scope here.

Yes, I think it'd need to look like this:

case foo:
BUG_ON(...)
va_arg(ap, int32_t);
{
const char *oldname = va_arg(ap, const char *);
const char *newname;
va_arg(ap, int32_t);
newname = va_arg(ap, const char *);
...
}
or
{
const char *oldname, *newname;
oldname = va_arg(ap, const char *);
va_arg(ap, int32_t);
newname = va_arg(ap, const char *);
...
}

I guess the later is slightly easier on the eyes


> > > + /* small message types */
> >
> > ditto: what's your criteria for 4k vs 8k?
>
> As above, 4k being the basis for directory entry names, plus PATH_MAX
> (4k) as basis for maximum path length.
>
> However looking at it again, if NAME_MAX == 4k was assumed exactly,
> then Tsymlink would have the potential to exceed 8k, as it has name[s]
> and symtgt[s] plus the other fields.

yes.


> > > + case P9_TSTAT:
> > this is just fid[4], so 4k is more than enough
>
> I guess that was a typo and should have been Twstat instead?

Ah, had missed this because 9p2000.L's version of stat[n] is fixed size.
Sounds good.

> > > + case P9_RSTAT:
> > also fixed size 4+4+8+8+8+8+8+8+4 -- fits in 4k.
>
> Rstat contains stat[n] which in turn contains variable-length string
> fields (filename, owner name, group name)

Right, same mistake.

>
> > > + case P9_TSYMLINK:
> > that one has symlink target which can be arbitrarily long (filesystem
> > specific, 4k is the usual limit for linux but some filesystem I don't
> > know might handle more -- it might be worth going through the trouble of
> > going through it.
>
> Like mentioned above, if exactly NAME_MAX == 4k was assumed, then
> Tsymlink may even be >8k.

And all the other remarks are 'yes if we assume bigger NAME_MAX' -- I'm
happy either way.


> > rest all looks ok to me.
>
> Thanks for the review! I know, that's really a dry patch to look
> at. :)

Thanks for writing it in the first place ;)

--
Dominique

2022-07-14 13:22:47

by Christian Schoenebeck

Subject: Re: [PATCH v5 10/11] net/9p: add p9_msg_buf_size()

On Wednesday, 13 July 2022 22:52:56 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Wed, Jul 13, 2022 at 03:06:01PM +0200:
> > > > + case P9_TWALK:
> > > > + BUG_ON(strcmp("ddT", fmt));
> > > > + va_arg(ap, int32_t);
> > > > + va_arg(ap, int32_t);
> > > > + {
> > > > + uint i, nwname = max(va_arg(ap, int), 0);
> > >
> > > I was about to say that the max is useless as for loop would be cut
> > > short, but these are unsigned... So the code in protocol.c p9pdu_vwritef
> > > 'T' has a bug (int cast directly to uint16): do you want to fix it or
> > > shall I go ahead?
> >
> > I'd either send a separate patch today for fixing 'T', or if you want
> > to handle it by yourself, then just go ahead.
>
> I'd appreciate if you have time, doesn't make much difference though

Looking closer at this separate issue, there is probably nothing to fix. 'T'
(and 'R') in p9pdu_vwritef() pulls an 'int' argument from the stack, but the
actual variable is passed here:

struct p9_fid *p9_client_walk(struct p9_fid *oldfid, uint16_t nwname,
const unsigned char * const *wnames, int clone)
{
...
req = p9_client_rpc(clnt, P9_TWALK, "ddT", oldfid->fid, fid->fid,
nwname, wnames);
...
}

So the variable being passed was already uint16_t, which reminded me why this
works anyway: C and C++ apply default argument promotion, so any variadic
integer argument smaller than 'int' is automatically converted to 'int'
before being passed.
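
A tiny self-contained demo of that promotion rule (nothing 9p-specific): a uint16_t handed to a variadic function arrives as an int, which is also why va_arg must read it as int -- reading a type narrower than int there is undefined behavior.

```c
#include <stdarg.h>
#include <stdint.h>

/* The uint16_t argument undergoes default argument promotion to int,
 * so va_arg(ap, int) is the correct (and only well-defined) read. */
static int first_vararg(int count, ...)
{
	va_list ap;
	int v;

	va_start(ap, count);
	v = va_arg(ap, int);	/* NOT va_arg(ap, uint16_t): that's UB */
	va_end(ap);
	return v;
}
```

So p9_client_walk() passing a uint16_t nwname through "ddT" works exactly because of this promotion; first_vararg(1, (uint16_t)7) returns a plain int 7.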

I mean I could clamp the pulled argument like:

diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 3754c33e2974..5fd1e948c86a 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -441,7 +441,7 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
}
break;
case 'T':{
- uint16_t nwname = va_arg(ap, int);
+ uint16_t nwname = clamp_t(int, va_arg(ap, int), 0, USHRT_MAX);
const char **wnames = va_arg(ap, const char **);

errcode = p9pdu_writef(pdu, proto_version, "w",
@@ -462,7 +462,7 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
}
break;
case 'R':{
- uint16_t nwqid = va_arg(ap, int);
+ uint16_t nwqid = clamp_t(int, va_arg(ap, int), 0, USHRT_MAX);
struct p9_qid *wqids =
va_arg(ap, struct p9_qid *);

But it's pretty much pointless overhead. Another option would be to change
va_arg(ap, int) -> va_arg(ap, uint16_t), just to make it clearer what was
pushed from the other side.

Which probably also means I can simply drop the max() call in this patch 10
here as well.

For the 'R' case: I haven't found the spot where this is actually used.

> > > > + case P9_TCREATE:
> > > > + BUG_ON(strcmp("dsdb?s", fmt));
> > > > + va_arg(ap, int32_t);
> > > > + {
> > > > + const char *name = va_arg(ap, const char *);
> > > > + if ((c->proto_version != p9_proto_2000u) &&
> > > > + (c->proto_version != p9_proto_2000L))
> > >
> > > (I don't think 9p2000.L can call TCREATE, but it doesn't really hurt
> > > either)
> >
> > Yes, Tcreate is only 9p2000 and 9p2000.u. Semantically this particular
> > check here means "if proto == 9p.2000". I can't remember anymore why I
> > came up with this inverted form here. I'll change it to "if
> > (c->proto_version == p9_proto_legacy)".
>
> Sounds good.
>
> > > > + case P9_TRENAMEAT:
> > > if we have trenameat we probably want trename, tunlinkat as well?
> > > What's your criteria for counting individually vs slapping 8k at it?
> > >
> > > In this particular case, oldname/newname are single component names
> > > within a directory so this is capped at 2*(4+256), that could easily fit
> > > in 4k without bothering.
> >
> > I have not taken the Linux kernel's current filename limit NAME_MAX
> > (255) as basis, in that case you would be right. Instead I looked up
> > what the maximum filename length among file systems in general was,
> > and saw that ReiserFS supports up to slightly below 4k? So I took 4k
> > as basis for the calculation used here, and the intention was to make
> > this code more future proof. Because revisiting this code later on
> > always takes quite some time and always has this certain potential to
> > miss out details.
>
> hmm, that's pretty deeply engrained into the VFS but I guess it might
> change eventually, yes.
>
> I don't mind as long as we're consistent (cf. unlink/mkdir below), in
> practice measuring doesn't cost much.

OK, I'll then also make it clearer in the commit log that 4k was taken as the
basis, and why.

> > Independent of the decision; additionally it might make sense to add
> > something like:
> >
> > #if NAME_MAX > 255
> > # error p9_msg_buf_size() needs adjustments
> > #endif
>
> That's probably an understatement but I don't mind either way, it
> doesn't hurt.
>
> > > > + BUG_ON(strcmp("dsds", fmt));
> > > > + va_arg(ap, int32_t);
> > > > + {
> > > > + const char *oldname = va_arg(ap, const char *);
> > > > + va_arg(ap, int32_t);
> > > > + {
> > > > + const char *newname = va_arg(ap, const char *);
> > >
> > > (style nitpick) I don't see the point of nesting another level of
> > > indentation here, it feels cleaner to declare oldname/newname at the
> > > start of the block and be done with it.
> >
> > Because va_arg(ap, int32_t); must remain between those two
> > declarations, and I think either the compiler or style check script
> > was barking at me. But I will recheck, if possible I will remove the
> > additional block scope here.
>
> Yes, I think it'd need to look like this:
>
> case foo:
> BUG_ON(...)
> va_arg(ap, int32_t);
> {
> const char *oldname = va_arg(ap, const char *);
> const char *newname;
> va_arg(ap, int32_t);
> newname = va_arg(ap, const_char *);
> ...
> }
> or
> {
> const char *oldname, *newname;
> oldname = va_arg(ap, const char *);
> va_arg(ap, int32_t)
> newname = va_arg(ap, const char *);
> ...
> }
>
> I guess the later is slightly easier on the eyes

Ah yes, that's your win there.

> > > > + /* small message types */
> > >
> > > ditto: what's your criteria for 4k vs 8k?
> >
> > As above, 4k being the basis for directory entry names, plus PATH_MAX
> > (4k) as basis for maximum path length.
> >
> > However looking at it again, if NAME_MAX == 4k was assumed exactly,
> > then Tsymlink would have the potential to exceed 8k, as it has name[s]
> > and symtgt[s] plus the other fields.
>
> yes.
>
> > > > + case P9_TSTAT:
> > > this is just fid[4], so 4k is more than enough
> >
> > I guess that was a typo and should have been Twstat instead?
>
> Ah, had missed this because 9p2000.L's version of stat[n] is fixed size.
> Sounds good.
>
> > > > + case P9_RSTAT:
> > > also fixed size 4+4+8+8+8+8+8+8+4 -- fits in 4k.
> >
> > Rstat contains stat[n] which in turn contains variable-length string
> > fields (filename, owner name, group name)
>
> Right, same mistake.
>
> > > > + case P9_TSYMLINK:
> > > that one has symlink target which can be arbitrarily long (filesystem
> > > specific, 4k is the usual limit for linux but some filesystem I don't
> > > know might handle more -- it might be worth going through the trouble of
> > > going through it.
> >
> > Like mentioned above, if exactly NAME_MAX == 4k was assumed, then
> > Tsymlink may even be >8k.
>
> And all the other remarks are 'yes if we assume bigger NAME_MAX' -- I'm
> happy either way.
>
> > > rest all looks ok to me.
> >
> > Thanks for the review! I know, that's really a dry patch to look
> > at. :)
>
> Thanks for writing it in the first place ;)
>
> --
> Dominique