2020-07-05 18:52:46

by Kanchan Joshi

Subject: [PATCH v3 0/4] zone-append support in io-uring and aio

Changes since v2:
- Use the file-append infra (O_APPEND/RWF_APPEND) to trigger zone-append
(Christoph, Wilcox)
- Added block I/O path changes (Damien). Avoided splitting an append into
multiple bios.
- Added a patch to extend zone-append in the block layer to support a bvec
iov_iter. This enables append using io-uring fixed buffers.
- Made the io-uring support code more concise, and applied the changes
suggested by Pavel.

v2: https://lore.kernel.org/io-uring/[email protected]/

Changes since v1:
- No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead.
- linux-aio changes go away since there is no new opcode
- Fixed the overflow and other issues mentioned by Damien
- Simplified uring support code, fixed the issues mentioned by Pavel
- Added error checks for io-uring fixed-buffer and sync kiocb

v1: https://lore.kernel.org/io-uring/[email protected]/

Cover letter (updated):

This patchset enables zone-append using io-uring/linux-aio on the block I/O
path. The purpose is to provide zone-append capability to applications that
use a zoned block device directly.
An application can send a write with the existing O_APPEND/RWF_APPEND; on a
zoned block device this triggers zone-append, while on a regular block device
the existing behavior is retained. The infra, however, allows zone-append to
be triggered on any file if FMODE_ZONE_APPEND (a new kernel-only fmode) is set
during open.

With zone-append, the written location within the zone is known only after
completion. So apart from the usual return value of write, an additional means
is needed to obtain the actual written location.

In aio, this is returned to the application using the res2 field of io_event -

struct io_event {
	__u64	data;	/* the data field from the iocb */
	__u64	obj;	/* what iocb this event came from */
	__s64	res;	/* result code for this event */
	__s64	res2;	/* secondary result */
};
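
For illustration, a minimal sketch of consuming this from user-space is below
(it assumes libaio, assumes res2 carries the zone-relative offset in 512b
sectors as with the option picked further down, and assumes the application
tracked zone_start_bytes at submission time):

#include <libaio.h>

/* combine the zone-relative offset reported in res2 with the zone start
 * remembered at submission time to get the absolute written location
 */
static unsigned long long aio_append_location(const struct io_event *ev,
					      unsigned long long zone_start_bytes)
{
	return zone_start_bytes + ((unsigned long long)ev->res2 << 9);
}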

In io-uring, cqe->flags is repurposed to carry the zone-append result.

struct io_uring_cqe {
	__u64	user_data;	/* sqe->data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
};

The 32-bit flags field is not sufficient to cover the zone size represented by
chunk_sectors. Discussion on LKML led to the following ways to go about it -
Option 1: Return zone-relative offset in sector/512b unit
Option 2: Return zone-relative offset in bytes

With option #1, io-uring changes remain minimal and relatively clean, and
extra checks and conversions are avoided in the I/O path. A change to the
ki_complete interface is also avoided (its last parameter, ret2, is of type
long, which cannot hold the return value in bytes). The bad part of this
choice is that the return value is in 512b units and not in bytes. To hide
that, a wrapper needs to be written in user-space that converts the
cqe->flags value to bytes and combines it with the zone start.

Option #2 requires pulling some bits from cqe->res and combining them with
cqe->flags to store the result in bytes. This bitwise scattering needs to be
done by the kernel in the I/O path, and the application still needs a
relatively heavyweight wrapper to assemble the pieces so that both cqe->res
and the append location are derived correctly.

Patchset picks option #1.
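
For reference, the user-space wrapper mentioned above can be a one-liner; a
sketch (assuming liburing, and a zone_start_bytes value that the application
tracked at submission time):

#include <liburing.h>

/* convert the zone-relative offset in cqe->flags (512b sectors) into an
 * absolute byte offset within the device
 */
static unsigned long long cqe_append_location(const struct io_uring_cqe *cqe,
					      unsigned long long zone_start_bytes)
{
	return zone_start_bytes + ((unsigned long long)cqe->flags << 9);
}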

Kanchan Joshi (2):
fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND
block: enable zone-append for iov_iter of bvec type

Selvakumar S (2):
block: add zone append handling for direct I/O path
io_uring: add support for zone-append

block/bio.c | 31 ++++++++++++++++++++++++++++---
fs/block_dev.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
fs/io_uring.c | 21 +++++++++++++++++++--
include/linux/fs.h | 14 ++++++++++++--
4 files changed, 99 insertions(+), 16 deletions(-)

--
2.7.4


2020-07-05 18:52:56

by Kanchan Joshi

Subject: [PATCH v3 1/4] fs: introduce FMODE_ZONE_APPEND and IOCB_ZONE_APPEND

Enable zone-append using the existing O_APPEND and RWF_APPEND infra.
Unlike file-append, zone-append requires a bit of additional processing in the
common path to send the completion result to the upper layer. To skip that
for non-zoned block devices/files, introduce FMODE_ZONE_APPEND and
IOCB_ZONE_APPEND.
When a file is opened, it can subscribe to zone-append by setting
FMODE_ZONE_APPEND. Add IOCB_ZONE_APPEND, which is set in kiocb->ki_flags if
the write meets the existing file-append criteria and the file has subscribed
to zone-append.

Signed-off-by: Selvakumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
include/linux/fs.h | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c4ab4d..ef13df4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)0x20000000)

+/* File can support zone-append */
+#define FMODE_ZONE_APPEND ((__force fmode_t)0x40000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
@@ -315,6 +318,7 @@ enum rw_hint {
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
#define IOCB_NOWAIT (1 << 7)
+#define IOCB_ZONE_APPEND (1 << 8)

struct kiocb {
struct file *ki_filp;
@@ -3427,8 +3431,11 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma)
static inline int iocb_flags(struct file *file)
{
int res = 0;
- if (file->f_flags & O_APPEND)
+ if (file->f_flags & O_APPEND) {
res |= IOCB_APPEND;
+ if (file->f_mode & FMODE_ZONE_APPEND)
+ res |= IOCB_ZONE_APPEND;
+ }
if (file->f_flags & O_DIRECT)
res |= IOCB_DIRECT;
if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
@@ -3454,8 +3461,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
ki->ki_flags |= IOCB_DSYNC;
if (flags & RWF_SYNC)
ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
- if (flags & RWF_APPEND)
+ if (flags & RWF_APPEND) {
ki->ki_flags |= IOCB_APPEND;
+ if (ki->ki_filp->f_mode & FMODE_ZONE_APPEND)
+ ki->ki_flags |= IOCB_ZONE_APPEND;
+ }
return 0;
}

--
2.7.4

2020-07-05 18:54:21

by Kanchan Joshi

Subject: [PATCH v3 4/4] io_uring: add support for zone-append

From: Selvakumar S <[email protected]>

For zone-append, the block layer will return the zone-relative offset via ret2
of the ki_complete interface. Make changes to collect it, and send it to
user-space using cqe->flags.

Signed-off-by: Selvakumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
fs/io_uring.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..cbde4df 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -402,6 +402,8 @@ struct io_rw {
struct kiocb kiocb;
u64 addr;
u64 len;
+ /* zone-relative offset for append, in sectors */
+ u32 append_offset;
};

struct io_connect {
@@ -541,6 +543,7 @@ enum {
REQ_F_NO_FILE_TABLE_BIT,
REQ_F_QUEUE_TIMEOUT_BIT,
REQ_F_WORK_INITIALIZED_BIT,
+ REQ_F_ZONE_APPEND_BIT,

/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -598,6 +601,8 @@ enum {
REQ_F_QUEUE_TIMEOUT = BIT(REQ_F_QUEUE_TIMEOUT_BIT),
/* io_wq_work is initialized */
REQ_F_WORK_INITIALIZED = BIT(REQ_F_WORK_INITIALIZED_BIT),
+ /* to return zone-relative offset for zone append */
+ REQ_F_ZONE_APPEND = BIT(REQ_F_ZONE_APPEND_BIT),
};

struct async_poll {
@@ -1745,6 +1750,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,

if (req->flags & REQ_F_BUFFER_SELECTED)
cflags = io_put_kbuf(req);
+ if (req->flags & REQ_F_ZONE_APPEND)
+ cflags = req->rw.append_offset;

__io_cqring_fill_event(req, req->result, cflags);
(*nr_events)++;
@@ -1943,7 +1950,7 @@ static inline void req_set_fail_links(struct io_kiocb *req)
req->flags |= REQ_F_FAIL_LINK;
}

-static void io_complete_rw_common(struct kiocb *kiocb, long res)
+static void io_complete_rw_common(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
int cflags = 0;
@@ -1955,6 +1962,10 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
req_set_fail_links(req);
if (req->flags & REQ_F_BUFFER_SELECTED)
cflags = io_put_kbuf(req);
+ /* use cflags to return zone append completion result */
+ if (req->flags & REQ_F_ZONE_APPEND)
+ cflags = res2;
+
__io_cqring_add_event(req, res, cflags);
}

@@ -1962,7 +1973,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);

- io_complete_rw_common(kiocb, res);
+ io_complete_rw_common(kiocb, res, res2);
io_put_req(req);
}

@@ -1975,6 +1986,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)

if (res != req->result)
req_set_fail_links(req);
+ if (req->flags & REQ_F_ZONE_APPEND)
+ req->rw.append_offset = res2;
+
req->result = res;
if (res != -EAGAIN)
WRITE_ONCE(req->iopoll_completed, 1);
@@ -2739,6 +2753,9 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
SB_FREEZE_WRITE);
}
kiocb->ki_flags |= IOCB_WRITE;
+ /* zone-append requires a few extra steps during completion */
+ if (kiocb->ki_flags & IOCB_ZONE_APPEND)
+ req->flags |= REQ_F_ZONE_APPEND;

if (!force_nonblock)
current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize;
--
2.7.4

2020-07-05 18:54:41

by Kanchan Joshi

Subject: [PATCH v3 3/4] block: enable zone-append for iov_iter of bvec type

Zone-append with a bvec iov_iter gives a WARN_ON and returns -EINVAL.
Add a new helper to process such an iov_iter and add pages to the bio
honoring zone-append specific constraints.
This is used to enable zone-append with io-uring fixed buffers.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Selvakumar S <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
block/bio.c | 31 ++++++++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0cecdbc..ade9da7 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -975,6 +975,30 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_advance(iter, size);
return 0;
}
+static int __bio_iov_bvec_append_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+ const struct bio_vec *bv = iter->bvec;
+ unsigned int len;
+ size_t size;
+ struct request_queue *q = bio->bi_disk->queue;
+ unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+ bool same_page = false;
+
+ if (WARN_ON_ONCE(!max_append_sectors))
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+ return -EINVAL;
+
+ len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+ size = bio_add_hw_page(q, bio, bv->bv_page, len,
+ bv->bv_offset + iter->iov_offset,
+ max_append_sectors, &same_page);
+ if (unlikely(size != len))
+ return -EINVAL;
+ iov_iter_advance(iter, size);
+ return 0;
+}

#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))

@@ -1105,9 +1129,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)

do {
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
- if (WARN_ON_ONCE(is_bvec))
- return -EINVAL;
- ret = __bio_iov_append_get_pages(bio, iter);
+ if (is_bvec)
+ ret = __bio_iov_bvec_append_add_pages(bio, iter);
+ else
+ ret = __bio_iov_append_get_pages(bio, iter);
} else {
if (is_bvec)
ret = __bio_iov_bvec_add_pages(bio, iter);
--
2.7.4

2020-07-05 18:55:10

by Kanchan Joshi

Subject: [PATCH v3 2/4] block: add zone append handling for direct I/O path

From: Selvakumar S <[email protected]>

For a zoned block device, subscribe to zone-append by setting
FMODE_ZONE_APPEND during open. Make the direct I/O submission path use
IOCB_ZONE_APPEND to send a bio with the append op. Make direct I/O completion
return the zone-relative offset, in sector units, to the upper layer using
the kiocb->ki_complete interface.
Return failure if the write is larger than the max append limit and would
therefore require multiple bios.

Signed-off-by: Selvakumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Arnav Dawn <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
fs/block_dev.c | 49 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 40 insertions(+), 9 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e5..941fb22 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -178,10 +178,19 @@ static struct inode *bdev_file_inode(struct file *file)
return file->f_mapping->host;
}

-static unsigned int dio_bio_write_op(struct kiocb *iocb)
+static unsigned int dio_bio_op(bool is_read, struct kiocb *iocb)
{
- unsigned int op = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
+ unsigned int op;

+ if (is_read)
+ return REQ_OP_READ;
+
+ if (iocb->ki_flags & IOCB_ZONE_APPEND)
+ op = REQ_OP_ZONE_APPEND;
+ else
+ op = REQ_OP_WRITE;
+
+ op |= REQ_SYNC | REQ_IDLE;
/* avoid the need for a I/O completion work item */
if (iocb->ki_flags & IOCB_DSYNC)
op |= REQ_FUA;
@@ -207,6 +216,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
loff_t pos = iocb->ki_pos;
bool should_dirty = false;
+ bool is_read = (iov_iter_rw(iter) == READ);
struct bio bio;
ssize_t ret;
blk_qc_t qc;
@@ -231,18 +241,17 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
bio.bi_ioprio = iocb->ki_ioprio;
+ bio.bi_opf = dio_bio_op(is_read, iocb);

ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
goto out;
ret = bio.bi_iter.bi_size;

- if (iov_iter_rw(iter) == READ) {
- bio.bi_opf = REQ_OP_READ;
+ if (is_read) {
if (iter_is_iovec(iter))
should_dirty = true;
} else {
- bio.bi_opf = dio_bio_write_op(iocb);
task_io_account_write(ret);
}
if (iocb->ki_flags & IOCB_HIPRI)
@@ -295,6 +304,16 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
}

+static inline long blkdev_bio_res2(struct kiocb *iocb, struct bio *bio)
+{
+ /* calculate zone relative offset for zone append */
+ if (op_is_write(bio_op(bio)) && iocb->ki_flags & IOCB_ZONE_APPEND) {
+ sector_t zone_sec = blk_queue_zone_sectors(bio->bi_disk->queue);
+ return bio->bi_iter.bi_sector & (zone_sec - 1);
+ }
+ return 0;
+}
+
static void blkdev_bio_end_io(struct bio *bio)
{
struct blkdev_dio *dio = bio->bi_private;
@@ -307,15 +326,17 @@ static void blkdev_bio_end_io(struct bio *bio)
if (!dio->is_sync) {
struct kiocb *iocb = dio->iocb;
ssize_t ret;
+ long res2;

if (likely(!dio->bio.bi_status)) {
ret = dio->size;
iocb->ki_pos += ret;
+ res2 = blkdev_bio_res2(iocb, bio);
} else {
ret = blk_status_to_errno(dio->bio.bi_status);
}

- dio->iocb->ki_complete(iocb, ret, 0);
+ dio->iocb->ki_complete(iocb, ret, res2);
if (dio->multi_bio)
bio_put(&dio->bio);
} else {
@@ -382,6 +403,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
+ bio->bi_opf = dio_bio_op(is_read, iocb);

ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
@@ -391,11 +413,9 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
}

if (is_read) {
- bio->bi_opf = REQ_OP_READ;
if (dio->should_dirty)
bio_set_pages_dirty(bio);
} else {
- bio->bi_opf = dio_bio_write_op(iocb);
task_io_account_write(bio->bi_iter.bi_size);
}

@@ -419,6 +439,12 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
}

if (!dio->multi_bio) {
+ /* zone-append cannot work with multi-bio */
+ if (!is_read && iocb->ki_flags & IOCB_ZONE_APPEND) {
+ bio->bi_status = BLK_STS_IOERR;
+ bio_endio(bio);
+ break;
+ }
/*
* AIO needs an extra reference to ensure the dio
* structure which is embedded into the first bio
@@ -1841,6 +1867,7 @@ EXPORT_SYMBOL(blkdev_get_by_dev);
static int blkdev_open(struct inode * inode, struct file * filp)
{
struct block_device *bdev;
+ int ret;

/*
* Preserve backwards compatibility and allow large file access
@@ -1866,7 +1893,11 @@ static int blkdev_open(struct inode * inode, struct file * filp)
filp->f_mapping = bdev->bd_inode->i_mapping;
filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);

- return blkdev_get(bdev, filp->f_mode, filp);
+ ret = blkdev_get(bdev, filp->f_mode, filp);
+ if (blk_queue_is_zoned(bdev->bd_disk->queue))
+ filp->f_mode |= FMODE_ZONE_APPEND;
+
+ return ret;
}

static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
--
2.7.4

2020-07-05 21:01:36

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/5/20 12:47 PM, Kanchan Joshi wrote:
> From: Selvakumar S <[email protected]>
>
> For zone-append, block-layer will return zone-relative offset via ret2
> of ki_complete interface. Make changes to collect it, and send to
> user-space using cqe->flags.
>
> Signed-off-by: Selvakumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/io_uring.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 155f3d8..cbde4df 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -402,6 +402,8 @@ struct io_rw {
> struct kiocb kiocb;
> u64 addr;
> u64 len;
> + /* zone-relative offset for append, in sectors */
> + u32 append_offset;
> };

I don't like this very much at all. As it stands, the first cacheline
of io_kiocb is set aside for request-private data. io_rw is already
exactly 64 bytes, which means that you're now growing io_rw beyond
a cacheline and increasing the size of io_kiocb as a whole.

Maybe you can reuse io_rw->len for this, as that is only used on the
submission side of things.

--
Jens Axboe

2020-07-05 22:04:10

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
> > From: Selvakumar S <[email protected]>
> >
> > For zone-append, block-layer will return zone-relative offset via ret2
> > of ki_complete interface. Make changes to collect it, and send to
> > user-space using cqe->flags.
> >
> > Signed-off-by: Selvakumar S <[email protected]>
> > Signed-off-by: Kanchan Joshi <[email protected]>
> > Signed-off-by: Nitesh Shetty <[email protected]>
> > Signed-off-by: Javier Gonzalez <[email protected]>
> > ---
> > fs/io_uring.c | 21 +++++++++++++++++++--
> > 1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 155f3d8..cbde4df 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -402,6 +402,8 @@ struct io_rw {
> > struct kiocb kiocb;
> > u64 addr;
> > u64 len;
> > + /* zone-relative offset for append, in sectors */
> > + u32 append_offset;
> > };
>
> I don't like this very much at all. As it stands, the first cacheline
> of io_kiocb is set aside for request-private data. io_rw is already
> exactly 64 bytes, which means that you're now growing io_rw beyond
> a cacheline and increasing the size of io_kiocb as a whole.
>
> Maybe you can reuse io_rw->len for this, as that is only used on the
> submission side of things.

I'm surprised you aren't more upset by the abuse of cqe->flags for the
address.

What do you think to my idea of interpreting the user_data as being a
pointer to somewhere to store the address? Obviously other things
can be stored after the address in the user_data.

Or we could have a separate flag to indicate that is how to interpret
the user_data.

2020-07-05 22:09:26

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/5/20 3:09 PM, Matthew Wilcox wrote:
> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
>>> From: Selvakumar S <[email protected]>
>>>
>>> For zone-append, block-layer will return zone-relative offset via ret2
>>> of ki_complete interface. Make changes to collect it, and send to
>>> user-space using cqe->flags.
>>>
>>> Signed-off-by: Selvakumar S <[email protected]>
>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>> ---
>>> fs/io_uring.c | 21 +++++++++++++++++++--
>>> 1 file changed, 19 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 155f3d8..cbde4df 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -402,6 +402,8 @@ struct io_rw {
>>> struct kiocb kiocb;
>>> u64 addr;
>>> u64 len;
>>> + /* zone-relative offset for append, in sectors */
>>> + u32 append_offset;
>>> };
>>
>> I don't like this very much at all. As it stands, the first cacheline
>> of io_kiocb is set aside for request-private data. io_rw is already
>> exactly 64 bytes, which means that you're now growing io_rw beyond
>> a cacheline and increasing the size of io_kiocb as a whole.
>>
>> Maybe you can reuse io_rw->len for this, as that is only used on the
>> submission side of things.
>
> I'm surprised you aren't more upset by the abuse of cqe->flags for the
> address.

Yeah, it's not great either, but we have less leeway there in terms of
how much space is available to pass back extra data.

> What do you think to my idea of interpreting the user_data as being a
> pointer to somewhere to store the address? Obviously other things
> can be stored after the address in the user_data.

I don't like that at all, as all other commands just pass user_data
through. This means the application would have to treat this very
differently, and potentially not have a way to store any data for
locating the original command on the user side.

> Or we could have a separate flag to indicate that is how to interpret
> the user_data.

I'd be vehemently against changing user_data in any shape or form.
It's to be passed through from sqe to cqe, that's how the command flow
works. It's never kernel generated, and it's also used as a key for
command lookup.

--
Jens Axboe

2020-07-06 14:13:37

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
> > On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
> >> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
> >>> From: Selvakumar S <[email protected]>
> >>>
> >>> For zone-append, block-layer will return zone-relative offset via ret2
> >>> of ki_complete interface. Make changes to collect it, and send to
> >>> user-space using cqe->flags.

> > I'm surprised you aren't more upset by the abuse of cqe->flags for the
> > address.
>
> Yeah, it's not great either, but we have less leeway there in terms of
> how much space is available to pass back extra data.
>
> > What do you think to my idea of interpreting the user_data as being a
> > pointer to somewhere to store the address? Obviously other things
> > can be stored after the address in the user_data.
>
> I don't like that at all, as all other commands just pass user_data
> through. This means the application would have to treat this very
> differently, and potentially not have a way to store any data for
> locating the original command on the user side.

I think you misunderstood me. You seem to have thought I meant
"use the user_data field to return the address" when I actually meant
"interpret the user_data field as a pointer to where userspace
wants the address stored".

2020-07-06 14:29:15

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/6/20 8:10 AM, Matthew Wilcox wrote:
> On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
>> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
>>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
>>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
>>>>> From: Selvakumar S <[email protected]>
>>>>>
>>>>> For zone-append, block-layer will return zone-relative offset via ret2
>>>>> of ki_complete interface. Make changes to collect it, and send to
>>>>> user-space using cqe->flags.
>
>>> I'm surprised you aren't more upset by the abuse of cqe->flags for the
>>> address.
>>
>> Yeah, it's not great either, but we have less leeway there in terms of
>> how much space is available to pass back extra data.
>>
>>> What do you think to my idea of interpreting the user_data as being a
>>> pointer to somewhere to store the address? Obviously other things
>>> can be stored after the address in the user_data.
>>
>> I don't like that at all, as all other commands just pass user_data
>> through. This means the application would have to treat this very
>> differently, and potentially not have a way to store any data for
>> locating the original command on the user side.
>
> I think you misunderstood me. You seem to have thought I meant
> "use the user_data field to return the address" when I actually meant
> "interpret the user_data field as a pointer to where userspace
> wants the address stored".

It's still somewhat weird to have user_data have special meaning, you're
now having the kernel interpret it while every other command it's just
an opaque that is passed through.

But it could of course work, and the app could embed the necessary
u32/u64 in some other structure that's persistent across IO. If it
doesn't have that, then it'd need to now have one allocated and freed
across the lifetime of the IO.

If we're going that route, it'd be better to define the write such that
you're passing in the necessary information upfront. In syscall terms,
then that'd be something ala:

ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt,
			off_t *offset, int flags);

where *offset is copied out when the write completes. That removes the
need to abuse user_data, with just providing the storage pointer for the
offset upfront.

--
Jens Axboe

2020-07-06 14:33:14

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Mon, Jul 06, 2020 at 08:27:17AM -0600, Jens Axboe wrote:
> On 7/6/20 8:10 AM, Matthew Wilcox wrote:
> > On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
> >> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
> >>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
> >>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
> >>>>> From: Selvakumar S <[email protected]>
> >>>>>
> >>>>> For zone-append, block-layer will return zone-relative offset via ret2
> >>>>> of ki_complete interface. Make changes to collect it, and send to
> >>>>> user-space using cqe->flags.
> >
> >>> I'm surprised you aren't more upset by the abuse of cqe->flags for the
> >>> address.
> >>
> >> Yeah, it's not great either, but we have less leeway there in terms of
> >> how much space is available to pass back extra data.
> >>
> >>> What do you think to my idea of interpreting the user_data as being a
> >>> pointer to somewhere to store the address? Obviously other things
> >>> can be stored after the address in the user_data.
> >>
> >> I don't like that at all, as all other commands just pass user_data
> >> through. This means the application would have to treat this very
> >> differently, and potentially not have a way to store any data for
> >> locating the original command on the user side.
> >
> > I think you misunderstood me. You seem to have thought I meant
> > "use the user_data field to return the address" when I actually meant
> > "interpret the user_data field as a pointer to where userspace
> > wants the address stored".
>
> It's still somewhat weird to have user_data have special meaning, you're
> now having the kernel interpret it while every other command it's just
> an opaque that is passed through.
>
> But it could of course work, and the app could embed the necessary
> u32/u64 in some other structure that's persistent across IO. If it
> doesn't have that, then it'd need to now have one allocated and freed
> across the lifetime of the IO.
>
> If we're going that route, it'd be better to define the write such that
> you're passing in the necessary information upfront. In syscall terms,
> then that'd be something ala:
>
> ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt,
> off_t *offset, int flags);
>
> where *offset is copied out when the write completes. That removes the
> need to abuse user_data, with just providing the storage pointer for the
> offset upfront.

That works for me! In io_uring terms, would you like to see that done
as adding:

union {
	__u64	off;	/* offset into file */
+	__u64	*offp;	/* appending writes */
	__u64	addr2;
};

2020-07-06 14:34:41

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/6/20 8:32 AM, Matthew Wilcox wrote:
> On Mon, Jul 06, 2020 at 08:27:17AM -0600, Jens Axboe wrote:
>> On 7/6/20 8:10 AM, Matthew Wilcox wrote:
>>> On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
>>>> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
>>>>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
>>>>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>
>>>>>>> For zone-append, block-layer will return zone-relative offset via ret2
>>>>>>> of ki_complete interface. Make changes to collect it, and send to
>>>>>>> user-space using cqe->flags.
>>>
>>>>> I'm surprised you aren't more upset by the abuse of cqe->flags for the
>>>>> address.
>>>>
>>>> Yeah, it's not great either, but we have less leeway there in terms of
>>>> how much space is available to pass back extra data.
>>>>
>>>>> What do you think to my idea of interpreting the user_data as being a
>>>>> pointer to somewhere to store the address? Obviously other things
>>>>> can be stored after the address in the user_data.
>>>>
>>>> I don't like that at all, as all other commands just pass user_data
>>>> through. This means the application would have to treat this very
>>>> differently, and potentially not have a way to store any data for
>>>> locating the original command on the user side.
>>>
>>> I think you misunderstood me. You seem to have thought I meant
>>> "use the user_data field to return the address" when I actually meant
>>> "interpret the user_data field as a pointer to where userspace
>>> wants the address stored".
>>
>> It's still somewhat weird to have user_data have special meaning, you're
>> now having the kernel interpret it while every other command it's just
>> an opaque that is passed through.
>>
>> But it could of course work, and the app could embed the necessary
>> u32/u64 in some other structure that's persistent across IO. If it
>> doesn't have that, then it'd need to now have one allocated and freed
>> across the lifetime of the IO.
>>
>> If we're going that route, it'd be better to define the write such that
>> you're passing in the necessary information upfront. In syscall terms,
>> then that'd be something ala:
>>
>> ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt,
>> off_t *offset, int flags);
>>
>> where *offset is copied out when the write completes. That removes the
>> need to abuse user_data, with just providing the storage pointer for the
>> offset upfront.
>
> That works for me! In io_uring terms, would you like to see that done
> as adding:
>
> union {
> __u64 off; /* offset into file */
> + __u64 *offp; /* appending writes */
> __u64 addr2;
> };
>

Either that, or just use addr2 for it directly. I consider the appending
writes a marginal enough use case that it doesn't really warrant adding
a specially named field for that.

--
Jens Axboe

2020-07-07 15:53:33

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Tue, Jul 07, 2020 at 08:41:05PM +0530, Kanchan Joshi wrote:
> On Mon, Jul 06, 2020 at 03:32:08PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 06, 2020 at 08:27:17AM -0600, Jens Axboe wrote:
> > > On 7/6/20 8:10 AM, Matthew Wilcox wrote:
> > > > On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
> > > >> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
> > > >>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
> > > >>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
> > > >>>>> From: Selvakumar S <[email protected]>
> > > >>>>>
> > > >>>>> For zone-append, block-layer will return zone-relative offset via ret2
> > > >>>>> of ki_complete interface. Make changes to collect it, and send to
> > > >>>>> user-space using cqe->flags.
> > > >
> > > >>> I'm surprised you aren't more upset by the abuse of cqe->flags for the
> > > >>> address.
>
> Documentation (https://kernel.dk/io_uring.pdf) mentioned cqe->flags can carry
> the metadata for the operation. I wonder if this should be called abuse.
>
> > > >> Yeah, it's not great either, but we have less leeway there in terms of
> > > >> how much space is available to pass back extra data.
> > > >>
> > > >>> What do you think to my idea of interpreting the user_data as being a
> > > >>> pointer to somewhere to store the address? Obviously other things
> > > >>> can be stored after the address in the user_data.
> > > >>
> > > >> I don't like that at all, as all other commands just pass user_data
> > > >> through. This means the application would have to treat this very
> > > >> differently, and potentially not have a way to store any data for
> > > >> locating the original command on the user side.
> > > >
> > > > I think you misunderstood me. You seem to have thought I meant
> > > > "use the user_data field to return the address" when I actually meant
> > > > "interpret the user_data field as a pointer to where userspace
> > > > wants the address stored".
> > >
> > > It's still somewhat weird to have user_data have special meaning, you're
> > > now having the kernel interpret it while every other command it's just
> > > an opaque that is passed through.
> > >
> > > But it could of course work, and the app could embed the necessary
> > > u32/u64 in some other structure that's persistent across IO. If it
> > > doesn't have that, then it'd need to now have one allocated and freed
> > > across the lifetime of the IO.
> > >
> > > If we're going that route, it'd be better to define the write such that
> > > you're passing in the necessary information upfront. In syscall terms,
> > > then that'd be something ala:
> > >
> > > ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt,
> > > off_t *offset, int flags);
> > >
> > > where *offset is copied out when the write completes. That removes the
> > > need to abuse user_data, with just providing the storage pointer for the
> > > offset upfront.
> >
> > That works for me! In io_uring terms, would you like to see that done
> > as adding:
> >
> > union {
> > __u64 off; /* offset into file */
> > + __u64 *offp; /* appending writes */
> > __u64 addr2;
> > };
> But there are performance implications of this approach?
> If I got it right, the workflow is: - Application allocates 64bit of space,
> writes "off" into it and pass it
> in the sqe->addr2
> - Kernel first reads sqe->addr2, reads the value to know the intended
> write-location, and stores the address somewhere (?) to be used during
> completion. Storing this address seems tricky as this may add one more
> cacheline (in io_kiocb->rw)?

io_kiocb is:
/* size: 232, cachelines: 4, members: 19 */
/* forced alignments: 1 */
/* last cacheline: 40 bytes */
so we have another 24 bytes before io_kiocb takes up another cacheline.
If that's a serious problem, I have an idea about how to shrink struct
kiocb by 8 bytes so struct io_rw would have space to store another
pointer.

> - During completion cqe res/flags are written as before, but extra step
> to copy the append-completion-result into that user-space address.
> Extra steps are due to the pointer indirection.

... we've just done an I/O. Concern about an extra pointer access
seems misplaced?

> And it seems application needs to be careful about managing this 64bit of
> space for a cluster of writes, especially if it wants to reuse the sqe
> before the completion.
> New one can handle 64bit result cleanly, but seems slower than current
> one.

But userspace has to _do_ something with that information anyway. So
it must already have somewhere to put that information.

I do think that interpretation of that field should be a separate flag
from WRITE_APPEND so apps which genuinely don't care about where the I/O
ended up don't have to allocate some temporary storage. eg a logging
application which just needs to know that it managed to append to the
end of the log and doesn't need to do anything if it's successful.

2020-07-07 16:03:38

by Christoph Hellwig

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Tue, Jul 07, 2020 at 04:52:37PM +0100, Matthew Wilcox wrote:
> But userspace has to _do_ something with that information anyway. So
> it must already have somewhere to put that information.
>
> I do think that interpretation of that field should be a separate flag
> from WRITE_APPEND so apps which genuinely don't care about where the I/O
> ended up don't have to allocate some temporary storage. eg a logging
> application which just needs to know that it managed to append to the
> end of the log and doesn't need to do anything if it's successful.

I agree with the concept of a flag for just returning the write
location. Because if you don't need that O_APPEND is all you need
anyway.

2020-07-07 20:40:53

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/7/20 2:23 PM, Kanchan Joshi wrote:
> On Tue, Jul 07, 2020 at 04:52:37PM +0100, Matthew Wilcox wrote:
>> On Tue, Jul 07, 2020 at 08:41:05PM +0530, Kanchan Joshi wrote:
>>> On Mon, Jul 06, 2020 at 03:32:08PM +0100, Matthew Wilcox wrote:
>>>> On Mon, Jul 06, 2020 at 08:27:17AM -0600, Jens Axboe wrote:
>>>>> On 7/6/20 8:10 AM, Matthew Wilcox wrote:
>>>>>> On Sun, Jul 05, 2020 at 03:12:50PM -0600, Jens Axboe wrote:
>>>>>>> On 7/5/20 3:09 PM, Matthew Wilcox wrote:
>>>>>>>> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
>>>>>>>>> On 7/5/20 12:47 PM, Kanchan Joshi wrote:
>>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>>
>>>>>>>>>> For zone-append, block-layer will return zone-relative offset via ret2
>>>>>>>>>> of ki_complete interface. Make changes to collect it, and send to
>>>>>>>>>> user-space using cqe->flags.
>>>>>>
>>>>>>>> I'm surprised you aren't more upset by the abuse of cqe->flags for the
>>>>>>>> address.
>>>
>>> Documentation (https://kernel.dk/io_uring.pdf) mentioned cqe->flags can carry
>>> the metadata for the operation. I wonder if this should be called abuse.
>>>
>>>>>>> Yeah, it's not great either, but we have less leeway there in terms of
>>>>>>> how much space is available to pass back extra data.
>>>>>>>
>>>>>>>> What do you think to my idea of interpreting the user_data as being a
>>>>>>>> pointer to somewhere to store the address? Obviously other things
>>>>>>>> can be stored after the address in the user_data.
>>>>>>>
>>>>>>> I don't like that at all, as all other commands just pass user_data
>>>>>>> through. This means the application would have to treat this very
>>>>>>> differently, and potentially not have a way to store any data for
>>>>>>> locating the original command on the user side.
>>>>>>
>>>>>> I think you misunderstood me. You seem to have thought I meant
>>>>>> "use the user_data field to return the address" when I actually meant
>>>>>> "interpret the user_data field as a pointer to where userspace
>>>>>> wants the address stored".
>>>>>
>>>>> It's still somewhat weird to have user_data have special meaning, you're
>>>>> now having the kernel interpret it while every other command it's just
>>>>> an opaque that is passed through.
>>>>>
>>>>> But it could of course work, and the app could embed the necessary
>>>>> u32/u64 in some other structure that's persistent across IO. If it
>>>>> doesn't have that, then it'd need to now have one allocated and freed
>>>>> across the lifetime of the IO.
>>>>>
>>>>> If we're going that route, it'd be better to define the write such that
>>>>> you're passing in the necessary information upfront. In syscall terms,
>>>>> then that'd be something ala:
>>>>>
>>>>> ssize_t my_append_write(int fd, const struct iovec *iov, int iovcnt,
>>>>> off_t *offset, int flags);
>>>>>
>>>>> where *offset is copied out when the write completes. That removes the
>>>>> need to abuse user_data, with just providing the storage pointer for the
>>>>> offset upfront.
>>>>
>>>> That works for me! In io_uring terms, would you like to see that done
>>>> as adding:
>>>>
>>>> union {
>>>> __u64 off; /* offset into file */
>>>> + __u64 *offp; /* appending writes */
>>>> __u64 addr2;
>>>> };
>>> But there are performance implications of this approach?
>>> If I got it right, the workflow is: - Application allocates 64bit of space,
>>> writes "off" into it and pass it
>>> in the sqe->addr2
>>> - Kernel first reads sqe->addr2, reads the value to know the intended
>>> write-location, and stores the address somewhere (?) to be used during
>>> completion. Storing this address seems tricky as this may add one more
>>> cacheline (in io_kiocb->rw)?
>>
>> io_kiocb is:
>> /* size: 232, cachelines: 4, members: 19 */
>> /* forced alignments: 1 */
>> /* last cacheline: 40 bytes */
>> so we have another 24 bytes before io_kiocb takes up another cacheline.
>> If that's a serious problem, I have an idea about how to shrink struct
>> kiocb by 8 bytes so struct io_rw would have space to store another
>> pointer.
> Yes, io_kiocb has room. Cache-locality wise whether that is fine or
> it must be placed within io_rw - I'll come to know once I get to
> implement this. Please share the idea you have, it can come handy.

Except it doesn't, I'm not interested in adding per-request type fields
to the generic part of it. Before we know it, we'll blow past the next
cacheline.

If we can find space in the kiocb, that'd be much better. Note that once
the async buffered bits go in for 5.9, then there's no longer a 4-byte
hole in struct kiocb.

>> ... we've just done an I/O. Concern about an extra pointer access
>> seems misplaced?
>
> I was thinking about both read-from (submission) and write-to
> (completion) from user-space pointer, and all those checks APIs
> (get_user, copy_from_user) perform.....but when seen against I/O (that
> too direct), it does look small. Down the line it may matter for cached-IO
> but I get your point.

Really don't think that matters at all.

--
Jens Axboe

2020-07-07 22:51:27

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Tue, Jul 07, 2020 at 02:40:06PM -0600, Jens Axboe wrote:
> >> so we have another 24 bytes before io_kiocb takes up another cacheline.
> >> If that's a serious problem, I have an idea about how to shrink struct
> >> kiocb by 8 bytes so struct io_rw would have space to store another
> >> pointer.
> > Yes, io_kiocb has room. Cache-locality wise whether that is fine or
> > it must be placed within io_rw - I'll come to know once I get to
> > implement this. Please share the idea you have, it can come handy.
>
> Except it doesn't, I'm not interested in adding per-request type fields
> to the generic part of it. Before we know it, we'll blow past the next
> cacheline.
>
> If we can find space in the kiocb, that'd be much better. Note that once
> the async buffered bits go in for 5.9, then there's no longer a 4-byte
> hole in struct kiocb.

Well, poot, I was planning on using that. OK, how about this:

+#define IOCB_NO_CMPL	(15 << 28)

 struct kiocb {
 [...]
-	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
+	loff_t __user *ki_uposp;
-	int ki_flags;
+	unsigned int ki_flags;

+typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
+static ki_cmpl *ki_cmpls[15];

+void ki_complete(struct kiocb *iocb, long ret, long ret2)
+{
+	unsigned int id = iocb->ki_flags >> 28;
+
+	if (id < 15)
+		ki_cmpls[id](iocb, ret, ret2);
+}

+int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
+{
+	int i;
+
+	for (i = 0; i < 15; i++) {
+		if (ki_cmpls[i])
+			continue;
+		ki_cmpls[i] = cb;
+		return i;
+	}
+	WARN_ON_ONCE(1);
+	return -1;
+}

... etc, also need an unregister.

2020-07-07 22:52:14

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/7/20 4:18 PM, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 02:40:06PM -0600, Jens Axboe wrote:
>>>> so we have another 24 bytes before io_kiocb takes up another cacheline.
>>>> If that's a serious problem, I have an idea about how to shrink struct
>>>> kiocb by 8 bytes so struct io_rw would have space to store another
>>>> pointer.
>>> Yes, io_kiocb has room. Cache-locality wise whether that is fine or
>>> it must be placed within io_rw - I'll come to know once I get to
>>> implement this. Please share the idea you have, it can come handy.
>>
>> Except it doesn't, I'm not interested in adding per-request type fields
>> to the generic part of it. Before we know it, we'll blow past the next
>> cacheline.
>>
>> If we can find space in the kiocb, that'd be much better. Note that once
>> the async buffered bits go in for 5.9, then there's no longer a 4-byte
>> hole in struct kiocb.
>
> Well, poot, I was planning on using that. OK, how about this:

Figured you might have had your sights set on that one, which is why I
wanted to bring it up upfront :-)

> +#define IOCB_NO_CMPL (15 << 28)
>
> struct kiocb {
> [...]
> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
> + loff_t __user *ki_uposp;
> - int ki_flags;
> + unsigned int ki_flags;
>
> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
> +static ki_cmpl * const ki_cmpls[15];
>
> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
> +{
> + unsigned int id = iocb->ki_flags >> 28;
> +
> + if (id < 15)
> + ki_cmpls[id](iocb, ret, ret2);
> +}
>
> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
> +{
> + for (i = 0; i < 15; i++) {
> + if (ki_cmpls[id])
> + continue;
> + ki_cmpls[id] = cb;
> + return id;
> + }
> + WARN();
> + return -1;
> +}

That could work, we don't really have a lot of different completion
types in the kernel.

--
Jens Axboe

2020-07-08 14:25:39

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Wed, Jul 08, 2020 at 06:28:05PM +0530, Kanchan Joshi wrote:
> The last thing is about the flag used to trigger this processing. Will it be
> fine to introduce new flag (RWF_APPEND2 or RWF_APPEND_OFFSET)
> instead of using RWF_APPEND?
>
> New flag will do what RWF_APPEND does and will also return the
> written-location (and therefore expects pointer setup in application).

I think it's simpler to understand if it's called RWF_INDIRECT_OFFSET.
Then it'd look like:

+	rwf_t rwf = READ_ONCE(sqe->rw_flags);
...
-	iocb->ki_pos = READ_ONCE(sqe->off);
+	if (rwf & RWF_INDIRECT_OFFSET) {
+		loff_t __user *loffp = u64_to_user_ptr(sqe->addr2);
+
+		if (get_user(iocb->ki_pos, loffp))
+			return -EFAULT;
+		iocb->ki_loffp = loffp;
+	} else {
+		iocb->ki_pos = READ_ONCE(sqe->off);
+	}
...
-	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	ret = kiocb_set_rw_flags(kiocb, rwf);

2020-07-08 14:55:05

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/8/20 6:58 AM, Kanchan Joshi wrote:
> On Tue, Jul 07, 2020 at 04:37:55PM -0600, Jens Axboe wrote:
>> On 7/7/20 4:18 PM, Matthew Wilcox wrote:
>>> On Tue, Jul 07, 2020 at 02:40:06PM -0600, Jens Axboe wrote:
>>>>>> so we have another 24 bytes before io_kiocb takes up another cacheline.
>>>>>> If that's a serious problem, I have an idea about how to shrink struct
>>>>>> kiocb by 8 bytes so struct io_rw would have space to store another
>>>>>> pointer.
>>>>> Yes, io_kiocb has room. Cache-locality wise whether that is fine or
>>>>> it must be placed within io_rw - I'll come to know once I get to
>>>>> implement this. Please share the idea you have, it can come handy.
>>>>
>>>> Except it doesn't, I'm not interested in adding per-request type fields
>>>> to the generic part of it. Before we know it, we'll blow past the next
>>>> cacheline.
>>>>
>>>> If we can find space in the kiocb, that'd be much better. Note that once
>>>> the async buffered bits go in for 5.9, then there's no longer a 4-byte
>>>> hole in struct kiocb.
>>>
>>> Well, poot, I was planning on using that. OK, how about this:
>>
>> Figured you might have had your sights set on that one, which is why I
>> wanted to bring it up upfront :-)
>>
>>> +#define IOCB_NO_CMPL (15 << 28)
>>>
>>> struct kiocb {
>>> [...]
>>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>> + loff_t __user *ki_uposp;
>>> - int ki_flags;
>>> + unsigned int ki_flags;
>>>
>>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
>>> +static ki_cmpl * const ki_cmpls[15];
>>>
>>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
>>> +{
>>> + unsigned int id = iocb->ki_flags >> 28;
>>> +
>>> + if (id < 15)
>>> + ki_cmpls[id](iocb, ret, ret2);
>>> +}
>>>
>>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
>>> +{
>>> + for (i = 0; i < 15; i++) {
>>> + if (ki_cmpls[id])
>>> + continue;
>>> + ki_cmpls[id] = cb;
>>> + return id;
>>> + }
>>> + WARN();
>>> + return -1;
>>> +}
>>
>> That could work, we don't really have a lot of different completion
>> types in the kernel.
>
> Thanks, this looks sorted.

Not really, someone still needs to do that work. I took a quick look, and
most of it looks straight forward. The only potential complication is
ocfs2, which does a swap of the completion for the kiocb. That would just
turn into an upper flag swap. And potential sync kiocb with NULL
ki_complete. The latter should be fine, I think we just need to reserve
completion nr 0 for being that.

--
Jens Axboe

2020-07-08 15:00:21

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Wed, Jul 08, 2020 at 08:54:07AM -0600, Jens Axboe wrote:
> On 7/8/20 6:58 AM, Kanchan Joshi wrote:
> >>> +#define IOCB_NO_CMPL (15 << 28)
> >>>
> >>> struct kiocb {
> >>> [...]
> >>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
> >>> + loff_t __user *ki_uposp;
> >>> - int ki_flags;
> >>> + unsigned int ki_flags;
> >>>
> >>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
> >>> +static ki_cmpl * const ki_cmpls[15];
> >>>
> >>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
> >>> +{
> >>> + unsigned int id = iocb->ki_flags >> 28;
> >>> +
> >>> + if (id < 15)
> >>> + ki_cmpls[id](iocb, ret, ret2);
> >>> +}
> >>>
> >>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
> >>> +{
> >>> + for (i = 0; i < 15; i++) {
> >>> + if (ki_cmpls[id])
> >>> + continue;
> >>> + ki_cmpls[id] = cb;
> >>> + return id;
> >>> + }
> >>> + WARN();
> >>> + return -1;
> >>> +}
> >>
> >> That could work, we don't really have a lot of different completion
> >> types in the kernel.
> >
> > Thanks, this looks sorted.
>
> Not really, someone still needs to do that work. I took a quick look, and
> most of it looks straight forward. The only potential complication is
> ocfs2, which does a swap of the completion for the kiocb. That would just
> turn into an upper flag swap. And potential sync kiocb with NULL
> ki_complete. The latter should be fine, I think we just need to reserve
> completion nr 0 for being that.

I was reserving completion 15 for that ;-)

+#define IOCB_NO_CMPL (15 << 28)
...
+ if (id < 15)
+ ki_cmpls[id](iocb, ret, ret2);

Saves us one pointer in the array ...

2020-07-08 15:03:31

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/8/20 8:58 AM, Matthew Wilcox wrote:
> On Wed, Jul 08, 2020 at 08:54:07AM -0600, Jens Axboe wrote:
>> On 7/8/20 6:58 AM, Kanchan Joshi wrote:
>>>>> +#define IOCB_NO_CMPL (15 << 28)
>>>>>
>>>>> struct kiocb {
>>>>> [...]
>>>>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>>>> + loff_t __user *ki_uposp;
>>>>> - int ki_flags;
>>>>> + unsigned int ki_flags;
>>>>>
>>>>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
>>>>> +static ki_cmpl * const ki_cmpls[15];
>>>>>
>>>>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
>>>>> +{
>>>>> + unsigned int id = iocb->ki_flags >> 28;
>>>>> +
>>>>> + if (id < 15)
>>>>> + ki_cmpls[id](iocb, ret, ret2);
>>>>> +}
>>>>>
>>>>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
>>>>> +{
>>>>> + for (i = 0; i < 15; i++) {
>>>>> + if (ki_cmpls[id])
>>>>> + continue;
>>>>> + ki_cmpls[id] = cb;
>>>>> + return id;
>>>>> + }
>>>>> + WARN();
>>>>> + return -1;
>>>>> +}
>>>>
>>>> That could work, we don't really have a lot of different completion
>>>> types in the kernel.
>>>
>>> Thanks, this looks sorted.
>>
>> Not really, someone still needs to do that work. I took a quick look, and
>> most of it looks straight forward. The only potential complication is
>> ocfs2, which does a swap of the completion for the kiocb. That would just
>> turn into an upper flag swap. And potential sync kiocb with NULL
>> ki_complete. The latter should be fine, I think we just need to reserve
>> completion nr 0 for being that.
>
> I was reserving completion 15 for that ;-)
>
> +#define IOCB_NO_CMPL (15 << 28)
> ...
> + if (id < 15)
> + ki_cmpls[id](iocb, ret, ret2);
>
> Saves us one pointer in the array ...

That works. Are you going to turn this into an actual series of patches,
adding the functionality and converting users?

--
Jens Axboe

2020-07-08 15:04:36

by Matthew Wilcox

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Wed, Jul 08, 2020 at 08:59:50AM -0600, Jens Axboe wrote:
> On 7/8/20 8:58 AM, Matthew Wilcox wrote:
> > On Wed, Jul 08, 2020 at 08:54:07AM -0600, Jens Axboe wrote:
> >> On 7/8/20 6:58 AM, Kanchan Joshi wrote:
> >>>>> +#define IOCB_NO_CMPL (15 << 28)
> >>>>>
> >>>>> struct kiocb {
> >>>>> [...]
> >>>>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
> >>>>> + loff_t __user *ki_uposp;
> >>>>> - int ki_flags;
> >>>>> + unsigned int ki_flags;
> >>>>>
> >>>>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
> >>>>> +static ki_cmpl * const ki_cmpls[15];
> >>>>>
> >>>>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
> >>>>> +{
> >>>>> + unsigned int id = iocb->ki_flags >> 28;
> >>>>> +
> >>>>> + if (id < 15)
> >>>>> + ki_cmpls[id](iocb, ret, ret2);
> >>>>> +}
> >>>>>
> >>>>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
> >>>>> +{
> >>>>> + for (i = 0; i < 15; i++) {
> >>>>> + if (ki_cmpls[id])
> >>>>> + continue;
> >>>>> + ki_cmpls[id] = cb;
> >>>>> + return id;
> >>>>> + }
> >>>>> + WARN();
> >>>>> + return -1;
> >>>>> +}
> >>>>
> >>>> That could work, we don't really have a lot of different completion
> >>>> types in the kernel.
> >>>
> >>> Thanks, this looks sorted.
> >>
> >> Not really, someone still needs to do that work. I took a quick look, and
> >> most of it looks straight forward. The only potential complication is
> >> ocfs2, which does a swap of the completion for the kiocb. That would just
> >> turn into an upper flag swap. And potential sync kiocb with NULL
> >> ki_complete. The latter should be fine, I think we just need to reserve
> >> completion nr 0 for being that.
> >
> > I was reserving completion 15 for that ;-)
> >
> > +#define IOCB_NO_CMPL (15 << 28)
> > ...
> > + if (id < 15)
> > + ki_cmpls[id](iocb, ret, ret2);
> >
> > Saves us one pointer in the array ...
>
> That works. Are you going to turn this into an actual series of patches,
> adding the functionality and converting users?

I was under the impression Kanchan was going to do that, but I can run it
off quickly ...

2020-07-08 15:08:01

by Jens Axboe

Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/8/20 9:02 AM, Matthew Wilcox wrote:
> On Wed, Jul 08, 2020 at 08:59:50AM -0600, Jens Axboe wrote:
>> On 7/8/20 8:58 AM, Matthew Wilcox wrote:
>>> On Wed, Jul 08, 2020 at 08:54:07AM -0600, Jens Axboe wrote:
>>>> On 7/8/20 6:58 AM, Kanchan Joshi wrote:
>>>>>>> +#define IOCB_NO_CMPL (15 << 28)
>>>>>>>
>>>>>>> struct kiocb {
>>>>>>> [...]
>>>>>>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>>>>>> + loff_t __user *ki_uposp;
>>>>>>> - int ki_flags;
>>>>>>> + unsigned int ki_flags;
>>>>>>>
>>>>>>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
>>>>>>> +static ki_cmpl * const ki_cmpls[15];
>>>>>>>
>>>>>>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
>>>>>>> +{
>>>>>>> + unsigned int id = iocb->ki_flags >> 28;
>>>>>>> +
>>>>>>> + if (id < 15)
>>>>>>> + ki_cmpls[id](iocb, ret, ret2);
>>>>>>> +}
>>>>>>>
>>>>>>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
>>>>>>> +{
>>>>>>> + for (i = 0; i < 15; i++) {
>>>>>>> + if (ki_cmpls[id])
>>>>>>> + continue;
>>>>>>> + ki_cmpls[id] = cb;
>>>>>>> + return id;
>>>>>>> + }
>>>>>>> + WARN();
>>>>>>> + return -1;
>>>>>>> +}
>>>>>>
>>>>>> That could work, we don't really have a lot of different completion
>>>>>> types in the kernel.
>>>>>
>>>>> Thanks, this looks sorted.
>>>>
>>>> Not really, someone still needs to do that work. I took a quick look, and
>>>> most of it looks straight forward. The only potential complication is
>>>> ocfs2, which does a swap of the completion for the kiocb. That would just
>>>> turn into an upper flag swap. And potential sync kiocb with NULL
>>>> ki_complete. The latter should be fine, I think we just need to reserve
>>>> completion nr 0 for being that.
>>>
>>> I was reserving completion 15 for that ;-)
>>>
>>> +#define IOCB_NO_CMPL (15 << 28)
>>> ...
>>> + if (id < 15)
>>> + ki_cmpls[id](iocb, ret, ret2);
>>>
>>> Saves us one pointer in the array ...
>>
>> That works. Are you going to turn this into an actual series of patches,
>> adding the functionality and converting users?
>
> I was under the impression Kanchan was going to do that, but I can run it
> off quickly ...

I just wanted to get clarification there, because to me it sounded like
you expected Kanchan to do it, and Kanchan assuming it "was sorted". I'd
consider that a prerequisite for the append series as far as io_uring is
concerned, hence _someone_ needs to actually do it ;-)

--
Jens Axboe

2020-07-08 16:08:44

by Javier González

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append


> On 8 Jul 2020, at 17.06, Jens Axboe <[email protected]> wrote:
>
> On 7/8/20 9:02 AM, Matthew Wilcox wrote:
>>> On Wed, Jul 08, 2020 at 08:59:50AM -0600, Jens Axboe wrote:
>>> On 7/8/20 8:58 AM, Matthew Wilcox wrote:
>>>> On Wed, Jul 08, 2020 at 08:54:07AM -0600, Jens Axboe wrote:
>>>>> On 7/8/20 6:58 AM, Kanchan Joshi wrote:
>>>>>>>> +#define IOCB_NO_CMPL (15 << 28)
>>>>>>>>
>>>>>>>> struct kiocb {
>>>>>>>> [...]
>>>>>>>> - void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
>>>>>>>> + loff_t __user *ki_uposp;
>>>>>>>> - int ki_flags;
>>>>>>>> + unsigned int ki_flags;
>>>>>>>>
>>>>>>>> +typedef void ki_cmpl(struct kiocb *, long ret, long ret2);
>>>>>>>> +static ki_cmpl * const ki_cmpls[15];
>>>>>>>>
>>>>>>>> +void ki_complete(struct kiocb *iocb, long ret, long ret2)
>>>>>>>> +{
>>>>>>>> + unsigned int id = iocb->ki_flags >> 28;
>>>>>>>> +
>>>>>>>> + if (id < 15)
>>>>>>>> + ki_cmpls[id](iocb, ret, ret2);
>>>>>>>> +}
>>>>>>>>
>>>>>>>> +int kiocb_cmpl_register(void (*cb)(struct kiocb *, long, long))
>>>>>>>> +{
>>>>>>>> + for (i = 0; i < 15; i++) {
>>>>>>>> + if (ki_cmpls[id])
>>>>>>>> + continue;
>>>>>>>> + ki_cmpls[id] = cb;
>>>>>>>> + return id;
>>>>>>>> + }
>>>>>>>> + WARN();
>>>>>>>> + return -1;
>>>>>>>> +}
>>>>>>>
>>>>>>> That could work, we don't really have a lot of different completion
>>>>>>> types in the kernel.
>>>>>>
>>>>>> Thanks, this looks sorted.
>>>>>
>>>>> Not really, someone still needs to do that work. I took a quick look, and
>>>>> most of it looks straight forward. The only potential complication is
>>>>> ocfs2, which does a swap of the completion for the kiocb. That would just
>>>>> turn into an upper flag swap. And potential sync kiocb with NULL
>>>>> ki_complete. The latter should be fine, I think we just need to reserve
>>>>> completion nr 0 for being that.
>>>>
>>>> I was reserving completion 15 for that ;-)
>>>>
>>>> +#define IOCB_NO_CMPL (15 << 28)
>>>> ...
>>>> + if (id < 15)
>>>> + ki_cmpls[id](iocb, ret, ret2);
>>>>
>>>> Saves us one pointer in the array ...
>>>
>>> That works. Are you going to turn this into an actual series of patches,
>>> adding the functionality and converting users?
>>
>> I was under the impression Kanchan was going to do that, but I can run it
>> off quickly ...
>
> I just wanted to get clarification there, because to me it sounded like
> you expected Kanchan to do it, and Kanchan assuming it "was sorted". I'd
> consider that a prerequisite for the append series as far as io_uring is
> concerned, hence _someone_ needs to actually do it ;-)
>

I believe Kanchan meant that the trade-off we were asked to clear up is now sorted.

We will send a new version shortly for the current functionality - we can see what we are missing once the uring interface is clear.

We really want this to be stable, as a lot of other things depend on it (e.g., fio patches).

Javier

2020-07-08 16:35:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Wed, Jul 08, 2020 at 06:08:12PM +0200, Javier González wrote:
> > I just wanted to get clarification there, because to me it sounded like
> > you expected Kanchan to do it, and Kanchan assuming it "was sorted". I'd
> > consider that a prerequisite for the append series as far as io_uring is
> > concerned, hence _someone_ needs to actually do it ;-)

I don't know that it's a prerequisite in the sense that the patches actually
depend on it. I appreciate that you want it first to ensure that we don't bloat
the kiocb.

> I believe Kanchan meant that now the trade-off we were asking to clear out is sorted.
>
> We will send a new version shortly for the current functionality - we can see what we are missing on when the uring interface is clear.

I've started work on a patch series for this. Mostly just waiting for
compilation now ... should be done in the next few hours.

2020-07-08 16:40:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/8/20 10:33 AM, Matthew Wilcox wrote:
> On Wed, Jul 08, 2020 at 06:08:12PM +0200, Javier González wrote:
>>> I just wanted to get clarification there, because to me it sounded like
>>> you expected Kanchan to do it, and Kanchan assuming it "was sorted". I'd
>>> consider that a prerequisite for the append series as far as io_uring is
>>> concerned, hence _someone_ needs to actually do it ;-)
>
> I don't know that it's a prerequisite in terms of the patches actually
> depend on it. I appreciate you want it first to ensure that we don't bloat
> the kiocb.

Maybe not for the series, but for the io_uring addition it is.

>> I believe Kanchan meant that now the trade-off we were asking to
>> clear out is sorted.
>>
>> We will send a new version shortly for the current functionality - we
>> can see what we are missing on when the uring interface is clear.
>
> I've started work on a patch series for this. Mostly just waiting for
> compilation now ... should be done in the next few hours.

Great!

--
Jens Axboe

2020-07-08 16:44:20

by Javier González

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append


> On 8 Jul 2020, at 18.34, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Jul 08, 2020 at 06:08:12PM +0200, Javier González wrote:
>>> I just wanted to get clarification there, because to me it sounded like
>>> you expected Kanchan to do it, and Kanchan assuming it "was sorted". I'd
>>> consider that a prerequisite for the append series as far as io_uring is
>>> concerned, hence _someone_ needs to actually do it ;-)
>
> I don't know that it's a prerequisite in terms of the patches actually
> depend on it. I appreciate you want it first to ensure that we don't bloat
> the kiocb.
>
>> I believe Kanchan meant that now the trade-off we were asking to clear out is sorted.
>>
>> We will send a new version shortly for the current functionality - we can see what we are missing on when the uring interface is clear.
>
> I've started work on a patch series for this. Mostly just waiting for
> compilation now ... should be done in the next few hours.


Awesome!

2020-07-09 10:18:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 155f3d8..cbde4df 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -402,6 +402,8 @@ struct io_rw {
> > struct kiocb kiocb;
> > u64 addr;
> > u64 len;
> > + /* zone-relative offset for append, in sectors */
> > + u32 append_offset;
> > };
>
> I don't like this very much at all. As it stands, the first cacheline
> of io_kiocb is set aside for request-private data. io_rw is already
> exactly 64 bytes, which means that you're now growing io_rw beyond
> a cacheline and increasing the size of io_kiocb as a whole.
>
> Maybe you can reuse io_rw->len for this, as that is only used on the
> submission side of things.

We don't actually need any new field at all. By the time the write has
returned, ki_pos contains the offset after the write, and the res
argument to ->ki_complete contains the number of bytes written, which
allows us to trivially derive the starting position.
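
Spelled out, that derivation is just the following (a minimal sketch with a
hypothetical completion callback and helper, not code from the series):

/* hypothetical completion callback, for illustration only */
static void example_ki_complete(struct kiocb *iocb, long res, long res2)
{
        loff_t written_at;

        if (res < 0)
                return;                 /* error, nothing was written */

        /* ki_pos has been advanced by the write; res is the byte count */
        written_at = iocb->ki_pos - res;

        report_append_location(iocb, written_at);       /* hypothetical helper */
}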

2020-07-09 13:58:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/9/20 4:15 AM, Christoph Hellwig wrote:
> On Sun, Jul 05, 2020 at 03:00:47PM -0600, Jens Axboe wrote:
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 155f3d8..cbde4df 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -402,6 +402,8 @@ struct io_rw {
>>> struct kiocb kiocb;
>>> u64 addr;
>>> u64 len;
>>> + /* zone-relative offset for append, in sectors */
>>> + u32 append_offset;
>>> };
>>
>> I don't like this very much at all. As it stands, the first cacheline
>> of io_kiocb is set aside for request-private data. io_rw is already
>> exactly 64 bytes, which means that you're now growing io_rw beyond
>> a cacheline and increasing the size of io_kiocb as a whole.
>>
>> Maybe you can reuse io_rw->len for this, as that is only used on the
>> submission side of things.
>
> We don't actually need any new field at all. By the time the write
> returned ki_pos contains the offset after the write, and the res
> argument to ->ki_complete contains the amount of bytes written, which
> allow us to trivially derive the starting position.

Then let's just do that instead of jumping through hoops either
justifying growing io_rw/io_kiocb or turning kiocb into a global
completion thing.

--
Jens Axboe

2020-07-09 14:01:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
> > We don't actually need any new field at all. By the time the write
> > returned ki_pos contains the offset after the write, and the res
> > argument to ->ki_complete contains the amount of bytes written, which
> > allow us to trivially derive the starting position.
>
> Then let's just do that instead of jumping through hoops either
> justifying growing io_rw/io_kiocb or turning kiocb into a global
> completion thing.

Unfortunately that is a totally separate issue - the in-kernel offset
can be trivially calculated. But we still need to figure out a way to
pass it on to userspace. The current patchset does that by abusing
the flags, which doesn't really work as the flags are way too small.
So we somewhere need to have an address to do the put_user to.
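
Roughly like the following, assuming the kiocb carried a user pointer such as
the ki_uposp field from the sketch earlier in the thread (illustrative only,
not part of the series):

/* completion side: report the write location back to userspace */
static int report_pos_to_user(struct kiocb *iocb, long res)
{
        loff_t written_at = iocb->ki_pos - res;

        /* must run in the submitting task's context for put_user() to work */
        if (iocb->ki_uposp && put_user(written_at, iocb->ki_uposp))
                return -EFAULT;
        return 0;
}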

2020-07-09 14:06:33

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/9/20 8:00 AM, Christoph Hellwig wrote:
> On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
>>> We don't actually need any new field at all. By the time the write
>>> returned ki_pos contains the offset after the write, and the res
>>> argument to ->ki_complete contains the amount of bytes written, which
>>> allow us to trivially derive the starting position.
>>
>> Then let's just do that instead of jumping through hoops either
>> justifying growing io_rw/io_kiocb or turning kiocb into a global
>> completion thing.
>
> Unfortunately that is a totally separate issue - the in-kernel offset
> can be trivially calculated. But we still need to figure out a way to
> pass it on to userspace. The current patchset does that by abusing
> the flags, which doesn't really work as the flags are way too small.
> So we somewhere need to have an address to do the put_user to.

Right, we're just trading the 'append_offset' for a 'copy_offset_here'
pointer, which are stored in the same spot...

--
Jens Axboe

2020-07-09 18:40:19

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Thu, Jul 9, 2020 at 7:36 PM Jens Axboe <[email protected]> wrote:
>
> On 7/9/20 8:00 AM, Christoph Hellwig wrote:
> > On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
> >>> We don't actually need any new field at all. By the time the write
> >>> returned ki_pos contains the offset after the write, and the res
> >>> argument to ->ki_complete contains the amount of bytes written, which
> >>> allow us to trivially derive the starting position.

Deriving the starting position was not the purpose at all.
But yes, append_offset is not needed, for a different reason.
It was kept for uring-specific handling. The completion result from the lower
layer was always coming to uring in ret2 via ki_complete(...., ret2),
and ret2 goes to the CQE (and user-space) without any conversion in between.
For polled completion, there is a short window when we have ret2 but cannot
write into the CQE immediately, so I thought of storing it in append_offset
(but should not have done so - it was solvable without it).

FWIW, if we move to the indirect-offset approach, append_offset gets
eliminated automatically, because there is no need to write to the CQE
itself.

> >> Then let's just do that instead of jumping through hoops either
> >> justifying growing io_rw/io_kiocb or turning kiocb into a global
> >> completion thing.
> >
> > Unfortunately that is a totally separate issue - the in-kernel offset
> > can be trivially calculated. But we still need to figure out a way to
> > pass it on to userspace. The current patchset does that by abusing
> > the flags, which doesn't really work as the flags are way too small.
> > So we somewhere need to have an address to do the put_user to.
>
> Right, we're just trading the 'append_offset' for a 'copy_offset_here'
> pointer, which are stored in the same spot...

The address needs to be stored somewhere, and there does not seem to be
any option other than io_kiocb?
The bigger problem with the address/indirect-offset is being able to write to
it during completion, as the process context is different. Will that require
entering the task_work_add() world, and might that make it a costly affair?

Using flags has not been liked here, but given the upheaval involved so
far I have begun to feel it was keeping things simple. Should it be
reconsidered?


--
Joshi

2020-07-09 18:51:06

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/9/20 12:36 PM, Kanchan Joshi wrote:
> On Thu, Jul 9, 2020 at 7:36 PM Jens Axboe <[email protected]> wrote:
>>
>> On 7/9/20 8:00 AM, Christoph Hellwig wrote:
>>> On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
>>>>> We don't actually need any new field at all. By the time the write
>>>>> returned ki_pos contains the offset after the write, and the res
>>>>> argument to ->ki_complete contains the amount of bytes written, which
>>>>> allow us to trivially derive the starting position.
>
> Deriving starting position was not the purpose at all.
> But yes, append-offset is not needed, for a different reason.
> It was kept for uring specific handling. Completion-result from lower
> layer was always coming to uring in ret2 via ki_complete(....,ret2).
> And ret2 goes to CQE (and user-space) without any conversion in between.
> For polled-completion, there is a short window when we get ret2 but cannot
> write into CQE immediately, so thought of storing that in append_offset
> (but should not have done, solving was possible without it).
>
> FWIW, if we move to indirect-offset approach, append_offset gets
> eliminated automatically, because there is no need to write to CQE
> itself.
>
>>>> Then let's just do that instead of jumping through hoops either
>>>> justifying growing io_rw/io_kiocb or turning kiocb into a global
>>>> completion thing.
>>>
>>> Unfortunately that is a totally separate issue - the in-kernel offset
>>> can be trivially calculated. But we still need to figure out a way to
>>> pass it on to userspace. The current patchset does that by abusing
>>> the flags, which doesn't really work as the flags are way too small.
>>> So we somewhere need to have an address to do the put_user to.
>>
>> Right, we're just trading the 'append_offset' for a 'copy_offset_here'
>> pointer, which are stored in the same spot...
>
> The address needs to be stored somewhere. And there does not seem
> other option but to use io_kiocb?

That is where it belongs, not sure this was ever questioned. And inside
io_rw at that.

> The bigger problem with address/indirect-offset is to be able to write
> to it during completion as process-context is different. Will that
> require entering into task_work_add() world, and may make it costly
> affair?

It might, if you have IRQ context for the completion. task_work isn't
expensive, however. It's not like a thread offload.
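
A rough sketch of what that could look like for the indirect offset, just to
show the shape of it - the fields used here (task, task_work, uposp,
append_pos) and the error handling are assumptions for illustration, not what
the series defines, and the task_work_add() call should be read as
illustrative of the mechanism rather than an exact API reference:

/* runs later in the submitting task's context, so put_user() is safe */
static void io_append_offset_work(struct callback_head *cb)
{
        struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);

        if (put_user(req->append_pos, req->uposp))
                req->result = -EFAULT;  /* hypothetical error propagation */
}

/* called from the (possibly IRQ) completion path */
static void io_append_queue_work(struct io_kiocb *req)
{
        init_task_work(&req->task_work, io_append_offset_work);
        if (task_work_add(req->task, &req->task_work, true))
                req->result = -ECANCELED;       /* task is exiting */
}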

> Using flags have not been liked here, but given the upheaval involved so
> far I have begun to feel - it was keeping things simple. Should it be
> reconsidered?

It's definitely worth considering, especially since we can use cflags
like Pavel suggested upfront and not need any extra storage. But it
brings us back to the 32-bit vs 64-bit discussion, and then using blocks
instead of bytes. Which isn't exactly super pretty.

--
Jens Axboe

2020-07-09 18:53:03

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 09/07/2020 21:36, Kanchan Joshi wrote:
> On Thu, Jul 9, 2020 at 7:36 PM Jens Axboe <[email protected]> wrote:
>>
>> On 7/9/20 8:00 AM, Christoph Hellwig wrote:
>>> On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
>>>>> We don't actually need any new field at all. By the time the write
>>>>> returned ki_pos contains the offset after the write, and the res
>>>>> argument to ->ki_complete contains the amount of bytes written, which
>>>>> allow us to trivially derive the starting position.
>
> Deriving starting position was not the purpose at all.
> But yes, append-offset is not needed, for a different reason.
> It was kept for uring specific handling. Completion-result from lower
> layer was always coming to uring in ret2 via ki_complete(....,ret2).
> And ret2 goes to CQE (and user-space) without any conversion in between.
> For polled-completion, there is a short window when we get ret2 but cannot
> write into CQE immediately, so thought of storing that in append_offset
> (but should not have done, solving was possible without it).

fwiw, there are more cases when it's postponed.

> FWIW, if we move to indirect-offset approach, append_offset gets
> eliminated automatically, because there is no need to write to CQE
> itself.

Right, for the indirect approach we can write the offset right after getting it.
If not, then it's somehow stored in a CQE, so it may be placed into the
existing req->{result,cflags}, which mimic the CQE's fields.

>
>>>> Then let's just do that instead of jumping through hoops either
>>>> justifying growing io_rw/io_kiocb or turning kiocb into a global
>>>> completion thing.
>>>
>>> Unfortunately that is a totally separate issue - the in-kernel offset
>>> can be trivially calculated. But we still need to figure out a way to
>>> pass it on to userspace. The current patchset does that by abusing
>>> the flags, which doesn't really work as the flags are way too small.
>>> So we somewhere need to have an address to do the put_user to.
>>
>> Right, we're just trading the 'append_offset' for a 'copy_offset_here'
>> pointer, which are stored in the same spot...
>
> The address needs to be stored somewhere. And there does not seem
> other option but to use io_kiocb?
> The bigger problem with address/indirect-offset is to be able to write to it
> during completion as process-context is different. Will that require entering
> into task_work_add() world, and may make it costly affair?
>
> Using flags have not been liked here, but given the upheaval involved so
> far I have begun to feel - it was keeping things simple. Should it be
> reconsidered?
>
>
> --
> Joshi
>

--
Pavel Begunkov

2020-07-09 18:58:17

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 09/07/2020 21:50, Pavel Begunkov wrote:
> On 09/07/2020 21:36, Kanchan Joshi wrote:
>> On Thu, Jul 9, 2020 at 7:36 PM Jens Axboe <[email protected]> wrote:
>>>
>>> On 7/9/20 8:00 AM, Christoph Hellwig wrote:
>>>> On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
>>>>>> We don't actually need any new field at all. By the time the write
>>>>>> returned ki_pos contains the offset after the write, and the res
>>>>>> argument to ->ki_complete contains the amount of bytes written, which
>>>>>> allow us to trivially derive the starting position.
>>
>> Deriving starting position was not the purpose at all.
>> But yes, append-offset is not needed, for a different reason.
>> It was kept for uring specific handling. Completion-result from lower
>> layer was always coming to uring in ret2 via ki_complete(....,ret2).
>> And ret2 goes to CQE (and user-space) without any conversion in between.
>> For polled-completion, there is a short window when we get ret2 but cannot
>> write into CQE immediately, so thought of storing that in append_offset
>> (but should not have done, solving was possible without it).
>
> fwiw, there are more cases when it's postponed.
>
>> FWIW, if we move to indirect-offset approach, append_offset gets
>> eliminated automatically, because there is no need to write to CQE
>> itself.
>
> Right, for the indirect approach we can write offset right after getting it.

I take it back; as you mentioned, with task_work we may need the right context.

BTW, there is one more way to get space for it, and it would also shed 8 bytes
from io_kiocb, but that would need some work to be done.

> If not, then it's somehow stored in an CQE, so may be placed into
> existing req->{result,cflags}, which mimic CQE's fields.
>
>>
>>>>> Then let's just do that instead of jumping through hoops either
>>>>> justifying growing io_rw/io_kiocb or turning kiocb into a global
>>>>> completion thing.
>>>>
>>>> Unfortunately that is a totally separate issue - the in-kernel offset
>>>> can be trivially calculated. But we still need to figure out a way to
>>>> pass it on to userspace. The current patchset does that by abusing
>>>> the flags, which doesn't really work as the flags are way too small.
>>>> So we somewhere need to have an address to do the put_user to.
>>>
>>> Right, we're just trading the 'append_offset' for a 'copy_offset_here'
>>> pointer, which are stored in the same spot...
>>
>> The address needs to be stored somewhere. And there does not seem
>> other option but to use io_kiocb?
>> The bigger problem with address/indirect-offset is to be able to write to it
>> during completion as process-context is different. Will that require entering
>> into task_work_add() world, and may make it costly affair?
>>
>> Using flags have not been liked here, but given the upheaval involved so
>> far I have begun to feel - it was keeping things simple. Should it be
>> reconsidered?
>>
>>
>> --
>> Joshi
>>
>

--
Pavel Begunkov

2020-07-09 19:07:34

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 12:20 AM Jens Axboe <[email protected]> wrote:
>
> On 7/9/20 12:36 PM, Kanchan Joshi wrote:
> > On Thu, Jul 9, 2020 at 7:36 PM Jens Axboe <[email protected]> wrote:
> >>
> >> On 7/9/20 8:00 AM, Christoph Hellwig wrote:
> >>> On Thu, Jul 09, 2020 at 07:58:04AM -0600, Jens Axboe wrote:
> >>>>> We don't actually need any new field at all. By the time the write
> >>>>> returned ki_pos contains the offset after the write, and the res
> >>>>> argument to ->ki_complete contains the amount of bytes written, which
> >>>>> allow us to trivially derive the starting position.
> >
> > Deriving starting position was not the purpose at all.
> > But yes, append-offset is not needed, for a different reason.
> > It was kept for uring specific handling. Completion-result from lower
> > layer was always coming to uring in ret2 via ki_complete(....,ret2).
> > And ret2 goes to CQE (and user-space) without any conversion in between.
> > For polled-completion, there is a short window when we get ret2 but cannot
> > write into CQE immediately, so thought of storing that in append_offset
> > (but should not have done, solving was possible without it).
> >
> > FWIW, if we move to indirect-offset approach, append_offset gets
> > eliminated automatically, because there is no need to write to CQE
> > itself.
> >
> >>>> Then let's just do that instead of jumping through hoops either
> >>>> justifying growing io_rw/io_kiocb or turning kiocb into a global
> >>>> completion thing.
> >>>
> >>> Unfortunately that is a totally separate issue - the in-kernel offset
> >>> can be trivially calculated. But we still need to figure out a way to
> >>> pass it on to userspace. The current patchset does that by abusing
> >>> the flags, which doesn't really work as the flags are way too small.
> >>> So we somewhere need to have an address to do the put_user to.
> >>
> >> Right, we're just trading the 'append_offset' for a 'copy_offset_here'
> >> pointer, which are stored in the same spot...
> >
> > The address needs to be stored somewhere. And there does not seem
> > other option but to use io_kiocb?
>
> That is where it belongs, not sure this was ever questioned. And inside
> io_rw at that.
>
> > The bigger problem with address/indirect-offset is to be able to write
> > to it during completion as process-context is different. Will that
> > require entering into task_work_add() world, and may make it costly
> > affair?
>
> It might, if you have IRQ context for the completion. task_work isn't
> expensive, however. It's not like a thread offload.
>
> > Using flags have not been liked here, but given the upheaval involved so
> > far I have begun to feel - it was keeping things simple. Should it be
> > reconsidered?
>
> It's definitely worth considering, especially since we can use cflags
> like Pavel suggested upfront and not need any extra storage. But it
> brings us back to the 32-bit vs 64-bit discussion, and then using blocks
> instead of bytes. Which isn't exactly super pretty.
>
I agree that what we had was not great.
Append required special treatment (sector-to-bytes conversion) for io_uring,
and we were planning a user-space wrapper to abstract that.

But the good part (as it seems now) was: the append result went along with
cflags at virtually no additional cost, and the uring code changes became
super clean/minimal with further revisions.
The indirect-offset approach, by contrast, requires allocation/management in
the application, in io_uring submission and in the completion path (which
seems trickier), and those CQE flags still get written to user-space while
serving no purpose for append-write.

--
Joshi

2020-07-10 13:10:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Thu, Jul 09, 2020 at 12:50:27PM -0600, Jens Axboe wrote:
> It might, if you have IRQ context for the completion. task_work isn't
> expensive, however. It's not like a thread offload.
>
> > Using flags have not been liked here, but given the upheaval involved so
> > far I have begun to feel - it was keeping things simple. Should it be
> > reconsidered?
>
> It's definitely worth considering, especially since we can use cflags
> like Pavel suggested upfront and not need any extra storage. But it
> brings us back to the 32-bit vs 64-bit discussion, and then using blocks
> instead of bytes. Which isn't exactly super pretty.

block doesn't work for the case of writes to files that don't have
to be aligned in any way. And that I think is the more broadly
applicable use case than zone append on block devices.

2020-07-10 13:12:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
> Append required special treatment (conversion for sector to bytes) for io_uring.
> And we were planning a user-space wrapper to abstract that.
>
> But good part (as it seems now) was: append result went along with cflags at
> virtually no additional cost. And uring code changes became super clean/minimal
> with further revisions.
> While indirect-offset requires doing allocation/mgmt in application,
> io-uring submission
> and in completion path (which seems trickier), and those CQE flags
> still get written
> user-space and serve no purpose for append-write.

I have to say that storing the results in the CQE generally makes
so much more sense. I wonder if we need a per-fd "large CQE" flag
that adds two extra u64s to the CQE, and some ops just require this
version.
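
Purely illustrative - nothing like this exists - but a 32-byte CQE variant
along those lines might look like:

/* hypothetical opt-in "large CQE": the normal 16 bytes plus two extra u64s */
struct io_uring_cqe_big {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
        __u64   extra1;         /* e.g. zone-append location in bytes */
        __u64   extra2;         /* reserved for ops needing even more */
};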

2020-07-10 13:33:38

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 6:39 PM Christoph Hellwig <[email protected]> wrote:
>
> On Thu, Jul 09, 2020 at 12:50:27PM -0600, Jens Axboe wrote:
> > It might, if you have IRQ context for the completion. task_work isn't
> > expensive, however. It's not like a thread offload.
> >
> > > Using flags have not been liked here, but given the upheaval involved so
> > > far I have begun to feel - it was keeping things simple. Should it be
> > > reconsidered?
> >
> > It's definitely worth considering, especially since we can use cflags
> > like Pavel suggested upfront and not need any extra storage. But it
> > brings us back to the 32-bit vs 64-bit discussion, and then using blocks
> > instead of bytes. Which isn't exactly super pretty.
>
> block doesn't work for the case of writes to files that don't have
> to be aligned in any way. And that I think is the more broadly
> applicable use case than zone append on block devices.

But when can it happen that we do zone-append on a file (zonefs I
assume), and the device returns a location (the write pointer, essentially)
which is not a multiple of 512b?


--
Joshi

2020-07-10 13:44:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 06:59:45PM +0530, Kanchan Joshi wrote:
> > block doesn't work for the case of writes to files that don't have
> > to be aligned in any way. And that I think is the more broadly
> > applicable use case than zone append on block devices.
>
> But when can it happen that we do zone-append on a file (zonefs I
> assume), and device returns a location (write-pointer essentially)
> which is not in multiple of 512b?

All the time. You open a file with O_APPEND. You write a record to
it of any kind of size, then the next write will return the position
it got written at, which can be anything.

2020-07-10 13:49:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 02:10:54PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
> > Append required special treatment (conversion for sector to bytes) for io_uring.
> > And we were planning a user-space wrapper to abstract that.
> >
> > But good part (as it seems now) was: append result went along with cflags at
> > virtually no additional cost. And uring code changes became super clean/minimal
> > with further revisions.
> > While indirect-offset requires doing allocation/mgmt in application,
> > io-uring submission
> > and in completion path (which seems trickier), and those CQE flags
> > still get written
> > user-space and serve no purpose for append-write.
>
> I have to say that storing the results in the CQE generally make
> so much more sense. I wonder if we need a per-fd "large CGE" flag
> that adds two extra u64s to the CQE, and some ops just require this
> version.

If we're going to go the route of changing the CQE, how about:

struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
- __s32 res; /* result code for this event */
- __u32 flags;
+ union {
+ struct {
+ __s32 res; /* result code for this event */
+ __u32 flags;
+ };
+ __s64 res64;
+ };
};

then we don't need to change the CQE size and it just depends on the SQE
whether the CQE for it uses res+flags or res64.
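
On the application side that would be consumed roughly as below - a
liburing-style sketch against the union above, assuming an initialized
struct io_uring ring; the two handler names are hypothetical:

        struct io_uring_cqe *cqe;

        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res64 < 0) {
                /* negative errno, just as with res today */
                handle_error((int)cqe->res64);
        } else {
                /* the append landed at this absolute byte offset */
                unsigned long long written_at = (unsigned long long)cqe->res64;

                record_append_location(written_at);
        }
        io_uring_cqe_seen(&ring, cqe);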

2020-07-10 13:50:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 02:48:24PM +0100, Matthew Wilcox wrote:
> If we're going to go the route of changing the CQE, how about:
>
> struct io_uring_cqe {
> __u64 user_data; /* sqe->data submission passed back */
> - __s32 res; /* result code for this event */
> - __u32 flags;
> + union {
> + struct {
> + __s32 res; /* result code for this event */
> + __u32 flags;
> + };
> + __s64 res64;
> + };
> };
>
> then we don't need to change the CQE size and it just depends on the SQE
> whether the CQE for it uses res+flags or res64.

How do you return a status code or short write when you just have
a u64 that is needed for the offset?

2020-07-10 13:53:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 02:49:32PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 10, 2020 at 02:48:24PM +0100, Matthew Wilcox wrote:
> > If we're going to go the route of changing the CQE, how about:
> >
> > struct io_uring_cqe {
> > __u64 user_data; /* sqe->data submission passed back */
> > - __s32 res; /* result code for this event */
> > - __u32 flags;
> > + union {
> > + struct {
> > + __s32 res; /* result code for this event */
> > + __u32 flags;
> > + };
> > + __s64 res64;
> > + };
> > };
> >
> > then we don't need to change the CQE size and it just depends on the SQE
> > whether the CQE for it uses res+flags or res64.
>
> How do you return a status code or short write when you just have
> a u64 that is needed for the offset?

It's an s64, not a u64, so you can return a negative errno. I didn't
think we allowed short writes for objects-which-have-a-pos.

2020-07-10 13:59:09

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 6:59 PM Kanchan Joshi <[email protected]> wrote:
>
> On Fri, Jul 10, 2020 at 6:39 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Thu, Jul 09, 2020 at 12:50:27PM -0600, Jens Axboe wrote:
> > > It might, if you have IRQ context for the completion. task_work isn't
> > > expensive, however. It's not like a thread offload.

Not sure about polled completion, but we have IRQ context for regular completion.
If I've got it right, I need to store the task_struct during submission,
and use that to register a task_work during completion. At some point,
when this task_work gets called, it will update the user-space pointer
with the result.
Can it be the case that we get N completions in parallel, but they all
get serialized because all N task-works need to be executed in the
context of a single task/process?

> > > > Using flags have not been liked here, but given the upheaval involved so
> > > > far I have begun to feel - it was keeping things simple. Should it be
> > > > reconsidered?
> > >
> > > It's definitely worth considering, especially since we can use cflags
> > > like Pavel suggested upfront and not need any extra storage. But it
> > > brings us back to the 32-bit vs 64-bit discussion, and then using blocks
> > > instead of bytes. Which isn't exactly super pretty.
> >
> > block doesn't work for the case of writes to files that don't have
> > to be aligned in any way. And that I think is the more broadly
> > applicable use case than zone append on block devices.
>
> But when can it happen that we do zone-append on a file (zonefs I
> asssume), and device returns a location (write-pointer essentially)
> which is not in multiple of 512b?
>
>
> --
> Joshi



--
Joshi

2020-07-10 14:12:40

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 7:21 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jul 10, 2020 at 02:49:32PM +0100, Christoph Hellwig wrote:
> > On Fri, Jul 10, 2020 at 02:48:24PM +0100, Matthew Wilcox wrote:
> > > If we're going to go the route of changing the CQE, how about:
> > >
> > > struct io_uring_cqe {
> > > __u64 user_data; /* sqe->data submission passed back */
> > > - __s32 res; /* result code for this event */
> > > - __u32 flags;
> > > + union {
> > > + struct {
> > > + __s32 res; /* result code for this event */
> > > + __u32 flags;
> > > + };
> > > + __s64 res64;
> > > + };
> > > };
> > >
> > > then we don't need to change the CQE size and it just depends on the SQE
> > > whether the CQE for it uses res+flags or res64.
> >
> > How do you return a status code or short write when you just have
> > a u64 that is needed for the offset?
>
> it's an s64 not a u64 so you can return a negative errno. i didn't
> think we allowed short writes for objects-which-have-a-pos.

If we are doing this for zone-append (and not the general case), "__s64
res64" should work:
64 bits = 1 (sign) + 23 (bytes copied: cqe->res) + 40
(written location: chunk_sectors byte limit)

--
Joshi

2020-07-10 14:12:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/10/20 7:10 AM, Christoph Hellwig wrote:
> On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
>> Append required special treatment (conversion for sector to bytes) for io_uring.
>> And we were planning a user-space wrapper to abstract that.
>>
>> But good part (as it seems now) was: append result went along with cflags at
>> virtually no additional cost. And uring code changes became super clean/minimal
>> with further revisions.
>> While indirect-offset requires doing allocation/mgmt in application,
>> io-uring submission
>> and in completion path (which seems trickier), and those CQE flags
>> still get written
>> user-space and serve no purpose for append-write.
>
> I have to say that storing the results in the CQE generally make
> so much more sense. I wonder if we need a per-fd "large CGE" flag
> that adds two extra u64s to the CQE, and some ops just require this
> version.

I have been pondering the same thing, we could make certain ops consume
two CQEs if it makes sense. It's a bit ugly on the app side with two
different CQEs for a request, though. We can't just treat it as a large
CQE, as they might not be sequential if we happen to wrap. But maybe
it's not too bad.

--
Jens Axboe

2020-07-20 16:50:49

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 7:39 PM Jens Axboe <[email protected]> wrote:
>
> On 7/10/20 7:10 AM, Christoph Hellwig wrote:
> > On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
> >> Append required special treatment (conversion for sector to bytes) for io_uring.
> >> And we were planning a user-space wrapper to abstract that.
> >>
> >> But good part (as it seems now) was: append result went along with cflags at
> >> virtually no additional cost. And uring code changes became super clean/minimal
> >> with further revisions.
> >> While indirect-offset requires doing allocation/mgmt in application,
> >> io-uring submission
> >> and in completion path (which seems trickier), and those CQE flags
> >> still get written
> >> user-space and serve no purpose for append-write.
> >
> > I have to say that storing the results in the CQE generally make
> > so much more sense. I wonder if we need a per-fd "large CGE" flag
> > that adds two extra u64s to the CQE, and some ops just require this
> > version.
>
> I have been pondering the same thing, we could make certain ops consume
> two CQEs if it makes sense. It's a bit ugly on the app side with two
> different CQEs for a request, though. We can't just treat it as a large
> CQE, as they might not be sequential if we happen to wrap. But maybe
> it's not too bad.

I did some work on the two-CQE scheme for zone-append.
The first CQE is the same as before, while the second CQE does not keep
res/flags and instead carries a 64-bit result to report the append location.
It would look like this -

struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
- __s32 res; /* result code for this event */
- __u32 flags;
+ union {
+ struct {
+ __s32 res; /* result code for this event */
+ __u32 flags;
+ };
+ __u64 append_res; /* only used for append, in secondary cqe */
+ };

And the kernel will produce two CQEs for append completion -

 static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
 {
-        struct io_uring_cqe *cqe;
+        struct io_uring_cqe *cqe, *cqe2 = NULL;

-        cqe = io_get_cqring(ctx);
+        if (unlikely(req->flags & REQ_F_ZONE_APPEND))
+                /* obtain two CQEs for append. NULL if two CQEs are not available */
+                cqe = io_get_two_cqring(ctx, &cqe2);
+        else
+                cqe = io_get_cqring(ctx);
+
         if (likely(cqe)) {
                 WRITE_ONCE(cqe->user_data, req->user_data);
                 WRITE_ONCE(cqe->res, res);
                 WRITE_ONCE(cqe->flags, cflags);
+                /* update secondary cqe for zone-append */
+                if (req->flags & REQ_F_ZONE_APPEND) {
+                        WRITE_ONCE(cqe2->append_res,
+                                   (u64)req->append_offset << SECTOR_SHIFT);
+                        WRITE_ONCE(cqe2->user_data, req->user_data);
+                }
                 mutex_unlock(&ctx->uring_lock);


This seems to work fine in the kernel.
But the application side will have a few differences, such as:

- When it submits N appends and decides to wait for all completions,
it needs to specify min_complete as 2*N (or at least 2N-1).
Two appends will produce 4 completion events, and if the application
decides to wait for both it must specify 4 (or 3).

io_uring_enter(unsigned int fd, unsigned int to_submit,
unsigned int min_complete, unsigned int flags,
sigset_t *sig);

- The completion-processing sequence for a mixed workload (a few reads plus
a few appends on the same ring).
Currently there is a one-to-one relationship: the application looks at N
CQE entries and treats each as a distinct IO completion - a for loop
does the work.
With the two-CQE scheme, extracting, from a bunch of completions, the ones
for reads (one CQE) and appends (two CQEs) makes the flow somewhat
non-linear; a sketch of such a loop follows below.

Perhaps this is not too bad, but I felt it must be put here upfront.
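
A sketch of that mixed completion loop under the two-CQE scheme, with
hypothetical helpers (is_append(), handle_rw(), handle_append()) and assuming
the paired CQEs are posted adjacently, just to show where it stops being a
plain one-CQE-per-IO walk:

        /* cqes[0..nr) reaped from the CQ ring, in posting order */
        unsigned int i = 0;

        while (i < nr) {
                struct io_uring_cqe *cqe = &cqes[i++];

                if (!is_append(cqe->user_data)) {       /* app-side bookkeeping */
                        handle_rw(cqe);                 /* reads etc.: one CQE */
                        continue;
                }
                /* append: the next CQE (same user_data) carries the location */
                handle_append(cqe, &cqes[i++]);
        }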

--
Kanchan Joshi

2020-07-20 16:53:13

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 7:41 PM Kanchan Joshi <[email protected]> wrote:
>
> On Fri, Jul 10, 2020 at 7:21 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Jul 10, 2020 at 02:49:32PM +0100, Christoph Hellwig wrote:
> > > On Fri, Jul 10, 2020 at 02:48:24PM +0100, Matthew Wilcox wrote:
> > > > If we're going to go the route of changing the CQE, how about:
> > > >
> > > > struct io_uring_cqe {
> > > > __u64 user_data; /* sqe->data submission passed back */
> > > > - __s32 res; /* result code for this event */
> > > > - __u32 flags;
> > > > + union {
> > > > + struct {
> > > > + __s32 res; /* result code for this event */
> > > > + __u32 flags;
> > > > + };
> > > > + __s64 res64;
> > > > + };
> > > > };
> > > >
> > > > then we don't need to change the CQE size and it just depends on the SQE
> > > > whether the CQE for it uses res+flags or res64.
> > >
> > > How do you return a status code or short write when you just have
> > > a u64 that is needed for the offset?
> >
> > it's an s64 not a u64 so you can return a negative errno. i didn't
> > think we allowed short writes for objects-which-have-a-pos.
>
> If we are doing this for zone-append (and not general cases), "__s64
> res64" should work -.
> 64 bits = 1 (sign) + 23 (bytes-copied: cqe->res) + 40
> (written-location: chunk_sector bytes limit)

And this is for the scheme where a single CQE is used, with the bits
refactored into "__s64 res64" instead of res/flags.

41 bits for the zone-append location = in bytes, sufficient to cover a
chunk_sectors-sized zone
1+22 bits for zone-append bytes-copied = can cover 4MB copied
(a single I/O is capped at 4MB in NVMe)

+/*
+ * zone-append specific flags
+ */
+#define APPEND_OFFSET_BITS (41)
+#define APPEND_RES_BITS (23)
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */
 struct io_uring_cqe {
-        __u64   user_data;      /* sqe->data submission passed back */
-        __s32   res;            /* result code for this event */
-        __u32   flags;
+        __u64   user_data;      /* sqe->data submission passed back */
+        union {
+                struct {
+                        __s32   res;    /* result code for this event */
+                        __u32   flags;
+                };
+                /* Alternate for zone-append */
+                struct {
+                        union {
+                                /*
+                                 * kernel uses this to store append result
+                                 * Most significant 23 bits to return number of
+                                 * bytes or error, and least significant 41 bits
+                                 * to return zone-relative offset in bytes
+                                 */
+                                __s64   res64;
+                                /* for user-space ease, kernel does not use */
+                                struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+                                        __u64   append_offset : APPEND_OFFSET_BITS;
+                                        __s32   append_res : APPEND_RES_BITS;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+                                        __s32   append_res : APPEND_RES_BITS;
+                                        __u64   append_offset : APPEND_OFFSET_BITS;
+#endif
+                                } __attribute__((__packed__));
+                        };
+                };
+        };
 };

--
Joshi

2020-07-20 17:05:35

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 7:13 PM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Jul 10, 2020 at 06:59:45PM +0530, Kanchan Joshi wrote:
> > > block doesn't work for the case of writes to files that don't have
> > > to be aligned in any way. And that I think is the more broadly
> > > applicable use case than zone append on block devices.
> >
> > But when can it happen that we do zone-append on a file (zonefs I
> > asssume), and device returns a location (write-pointer essentially)
> > which is not in multiple of 512b?
>
> All the time. You open a file with O_APPEND. You write a record to
> it of any kind of size, then the next write will return the position
> it got written at, which can be anything.
I understand, if this is about cached writes and we are talking about
O_APPEND in general.
But for direct block I/O writes and ZoneFS writes, the page cache is not
used, so the write (and the zone-append result) will be aligned to the
underlying block size.
Even though this patchset uses O_APPEND, it filters out regular files and
non-zoned block devices by using the new FMODE_ZONE_APPEND flag.

--
Joshi

2020-07-20 17:15:10

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Mon, Jul 20, 2020 at 10:19:57PM +0530, Kanchan Joshi wrote:
> On Fri, Jul 10, 2020 at 7:41 PM Kanchan Joshi <[email protected]> wrote:
> > If we are doing this for zone-append (and not general cases), "__s64
> > res64" should work -.
> > 64 bits = 1 (sign) + 23 (bytes-copied: cqe->res) + 40
> > (written-location: chunk_sector bytes limit)

No, don't do this.

struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
- __s32 res; /* result code for this event */
- __u32 flags;
+ union {
+ struct {
+ __s32 res; /* result code for this event */
+ __u32 flags;
+ };
+ __s64 res64;
+ };
};

Return the value in bytes in res64, or a negative errno. Done.

> * IO completion data structure (Completion Queue Entry)
> */
> struct io_uring_cqe {
> - __u64 user_data; /* sqe->data submission passed back */
> - __s32 res; /* result code for this event */
> - __u32 flags;
> + __u64 user_data; /* sqe->data submission passed back */
> + union {
> + struct {
> + __s32 res; /* result code for
> this event */
> + __u32 flags;
> + };
> + /* Alternate for zone-append */
> + struct {
> + union {
> + /*
> + * kernel uses this to store append result
> + * Most significant 23 bits to return number of
> + * bytes or error, and least significant 41 bits
> + * to return zone-relative offset in bytes
> + * */
> + __s64 res64;
> + /*for user-space ease, kernel does not use*/
> + struct {
> +#if defined(__LITTLE_ENDIAN_BITFIELD)
> + __u64 append_offset :
> APPEND_OFFSET_BITS;
> + __s32 append_res : APPEND_RES_BITS;
> +#elif defined(__BIG_ENDIAN_BITFIELD)
> + __s32 append_res : APPEND_RES_BITS;
> + __u64 append_offset :
> APPEND_OFFSET_BITS;
> +#endif
> + }__attribute__ ((__packed__));
> + };
> + };
> + };
> };
>
> --
> Joshi

2020-07-20 20:18:20

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jul 20, 2020 at 10:19:57PM +0530, Kanchan Joshi wrote:
> > On Fri, Jul 10, 2020 at 7:41 PM Kanchan Joshi <[email protected]> wrote:
> > > If we are doing this for zone-append (and not general cases), "__s64
> > > res64" should work -.
> > > 64 bits = 1 (sign) + 23 (bytes-copied: cqe->res) + 40
> > > (written-location: chunk_sector bytes limit)
>
> No, don't do this.
>
> struct io_uring_cqe {
> __u64 user_data; /* sqe->data submission passed back */
> - __s32 res; /* result code for this event */
> - __u32 flags;
> + union {
> + struct {
> + __s32 res; /* result code for this event */
> + __u32 flags;
> + };
> + __s64 res64;
> + };
> };
>
> Return the value in bytes in res64, or a negative errno. Done.

I concur. Can do away with bytes-copied. It's either in its entirety
or not at all.

2020-07-21 01:00:45

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 2020/07/21 5:17, Kanchan Joshi wrote:
> On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <[email protected]> wrote:
>>
>> On Mon, Jul 20, 2020 at 10:19:57PM +0530, Kanchan Joshi wrote:
>>> On Fri, Jul 10, 2020 at 7:41 PM Kanchan Joshi <[email protected]> wrote:
>>>> If we are doing this for zone-append (and not general cases), "__s64
>>>> res64" should work -.
>>>> 64 bits = 1 (sign) + 23 (bytes-copied: cqe->res) + 40
>>>> (written-location: chunk_sector bytes limit)
>>
>> No, don't do this.
>>
>> struct io_uring_cqe {
>> __u64 user_data; /* sqe->data submission passed back */
>> - __s32 res; /* result code for this event */
>> - __u32 flags;
>> + union {
>> + struct {
>> + __s32 res; /* result code for this event */
>> + __u32 flags;
>> + };
>> + __s64 res64;
>> + };
>> };
>>
>> Return the value in bytes in res64, or a negative errno. Done.
>
> I concur. Can do away with bytes-copied. It's either in its entirety
> or not at all.
>

SAS SMR drives may return a partial completion. So the size written may be less
than requested, but not necessarily 0, which would be an error anyway since any
condition that would lead to 0 bytes being written will cause the drive to fail
the command with an error.

Also, the completed size should be in res in the first cqe, to follow the
current io_uring interface, no? The second cqe would use the res64 field to
return the written offset. Wasn't that the plan?

--
Damien Le Moal
Western Digital Research

2020-07-21 01:15:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Tue, Jul 21, 2020 at 12:59:59AM +0000, Damien Le Moal wrote:
> On 2020/07/21 5:17, Kanchan Joshi wrote:
> > On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <[email protected]> wrote:
> >> struct io_uring_cqe {
> >> __u64 user_data; /* sqe->data submission passed back */
> >> - __s32 res; /* result code for this event */
> >> - __u32 flags;
> >> + union {
> >> + struct {
> >> + __s32 res; /* result code for this event */
> >> + __u32 flags;
> >> + };
> >> + __s64 res64;
> >> + };
> >> };
> >>
> >> Return the value in bytes in res64, or a negative errno. Done.
> >
> > I concur. Can do away with bytes-copied. It's either in its entirety
> > or not at all.
> >
>
> SAS SMR drives may return a partial completion. So the size written may be less
> than requested, but not necessarily 0, which would be an error anyway since any
> condition that would lead to 0B being written will cause the drive to fail the
> command with an error.

Why might it return a short write? And, given how assiduous programmers
are about checking for exceptional conditions, is it useful to tell
userspace "only the first 512 bytes of your 2kB write made it to storage"?
Or would we rather just tell userspace "you got an error" and _not_
tell them that the first N bytes made it to storage?

> Also, the completed size should be in res in the first cqe to follow io_uring
> current interface, no ?. The second cqe would use the res64 field to return the
> written offset. Wasn't that the plan ?

two cqes for one sqe seems like a bad idea to me.

2020-07-21 01:31:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 7/20/20 7:15 PM, Matthew Wilcox wrote:
>> Also, the completed size should be in res in the first cqe to follow
>> io_uring current interface, no ?. The second cqe would use the res64
>> field to return the written offset. Wasn't that the plan ?
>
> two cqes for one sqe seems like a bad idea to me.

I have to agree with that, it's counter to everything else. The app will
then have to wait for two CQEs when it issues that one "special" SQE,
which is really iffy. And we'd have to promise that they are adjacent in
the ring. This isn't necessarily a problem right now, but I've been
playing with un-serialized completions and this would then become an
issue. The io_uring interface is clearly defined as "any sqe will either
return an error on submit (if the error is not specific to the sqe
contents), or post a completion event". Not two events, one.

And imho, zoned device append isn't an interesting enough use case to
warrant doing something special. If there was a super strong (and
generic) use case for passing back more information in the cqe then
maybe it would be considered. But it'd have to be a killer application.
If that's not the case, then the use case should work within the
constraints of the existing API.

--
Jens Axboe

2020-07-21 02:20:37

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On 2020/07/21 10:15, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 12:59:59AM +0000, Damien Le Moal wrote:
>> On 2020/07/21 5:17, Kanchan Joshi wrote:
>>> On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <[email protected]> wrote:
>>>> struct io_uring_cqe {
>>>> __u64 user_data; /* sqe->data submission passed back */
>>>> - __s32 res; /* result code for this event */
>>>> - __u32 flags;
>>>> + union {
>>>> + struct {
>>>> + __s32 res; /* result code for this event */
>>>> + __u32 flags;
>>>> + };
>>>> + __s64 res64;
>>>> + };
>>>> };
>>>>
>>>> Return the value in bytes in res64, or a negative errno. Done.
>>>
>>> I concur. Can do away with bytes-copied. It's either in its entirety
>>> or not at all.
>>>
>>
>> SAS SMR drives may return a partial completion. So the size written may be less
>> than requested, but not necessarily 0, which would be an error anyway since any
>> condition that would lead to 0B being written will cause the drive to fail the
>> command with an error.
>
> Why might it return a short write? And, given how assiduous programmers
> are about checking for exceptional conditions, is it useful to tell
> userspace "only the first 512 bytes of your 2kB write made it to storage"?
> Or would we rather just tell userspace "you got an error" and _not_
> tell them that the first N bytes made it to storage?

If the write hits a bad sector on disk half-way through, a SAS drive may return
a short write with a non-zero residual. SATA drives will fail the entire command
and libata will retry the failed command. That said, if the drive fails to remap
a bad sector and returns an error to the host, it is generally an indicator that
one should go to the store to get a new drive :)

Yes, you have a good point. Returning an error for the entire write would be
fine. The typical error handling for a failed write to a zone is for the user to
first do a zone report to inspect the zone condition and WP position, resync its
view of the zone state, and restart the write in the same zone or somewhere else.
So failing the entire write is OK.
I am actually not 100% sure what the bio interface does if the "restart
remaining" of a partially failed request fails again after all retry attempts.
The entire BIO is, I think, failed. Need to check. So the high-level user would
not see the partial failure, as that stays within the SCSI layer.

>> Also, the completed size should be in res in the first cqe to follow io_uring
>> current interface, no ?. The second cqe would use the res64 field to return the
>> written offset. Wasn't that the plan ?
>
> two cqes for one sqe seems like a bad idea to me.

Yes, this is not very nice. I got lost in the thread. I thought that was the plan.


--
Damien Le Moal
Western Digital Research