2012-10-01 22:23:48

by Kent Overstreet

Subject: [RFC, PATCH] Extensible AIO interface

So, I and other people keep running into things where we really need to
add an interface to pass some auxiliary... stuff along with a pread() or
pwrite().

A few examples:

* IO scheduler hints. Some userspace program wants to, per IO, specify
either priorities or a cgroup - by specifying a cgroup you can have a
fileserver in userspace that makes use of cfq's per cgroup bandwidth
quotas.

* Cache hints. For bcache and other things, userspace may want to specify
"this data should be cached", "this data should bypass the cache", etc.

* Passing checksums out to userspace. We've got bio integrity, which is
a (somewhat) generic interface for passing data checksums between the
filesystem and the hardware. There are various circumstances under which
you may want to pass these checksums out to userspace, and if so we
ought to have a generic way of doing it.

Hence, AIO attributes.

The way it works is I stole the reserved2 field in struct iocb. This
becomes a pointer to struct iocb_attr_list.

An iocb_attr_list is some number of iocb_attrs appended together, along
with the total number of bytes of attributes.

An iocb_attr has an id field, and a size field - and some amount of data
specific to that attribute.

The size fields mean we can iterate through the attributes and find one
with a specific id with generic code - we don't have to know anything
about the attributes we don't care about.

I also added a pointer to struct iocb_attr_list to struct bio. Now, we
can define new attributes, and then anywhere in the block layer (say
cfq, or some driver code) we can lookup any attributes that were
specified for this io.

cfq can then schedule as userspace wants, or bcache can get its cache
hints...

That's pretty much it, for the moment. It's intended to be simple and
extensible.

* FUTURE STUFF:

Return values:

Some attributes are probably going to want to return something to
userspace.

If nothing else, we want this so that userspace can tell if anything
handled the attributes it specified - as dynamic as the io stack can be,
with something extensible like this there really isn't any generic way
of knowing ahead of time if something is going to interpret any
attribute - we want to return at least an error code.

One could imagine sticking the return in the attribute itself, but I
don't want to do this. For some things (checksums), the attribute will
contain a pointer to a buffer - that's fine. But I don't want the
attributes themselves to be writeable.

The reason for this is that having struct iocb point to the attributes
isn't completely ideal, and is IMO a stopgap solution. It would be better if
the attributes were inline with the iocbs. But for this we may (?) want
new aio syscalls to replace io_submit()/io_getevents(), but that isn't
something I want to rush into.

If the attributes are inline with the iocbs, the good and natural thing
to do for return values is stick them inline with the io_event
completions.

But I'm not quite sure what I want to do in the meantime, if we end up
needing return values in the short term.


commit 34232e6f28112f049d633f4ecd2d34e536f8c7cf
Author: Kent Overstreet <[email protected]>
Date: Mon Oct 1 13:17:56 2012 -0700

Extensible AIO interface

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..54acc17 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -424,6 +424,7 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
+ req->ki_attrs = NULL;
req->ki_eventfd = NULL;

return req;
@@ -538,6 +539,7 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
kfree(req->ki_iovec);
+ kfree(req->ki_attrs);
kmem_cache_free(kiocb_cachep, req);
ctx->reqs_active--;

@@ -1504,6 +1506,33 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
return 0;
}

+static int aio_setup_attrs(struct iocb *iocb, struct kiocb *req)
+{
+ u64 size;
+
+ if (!iocb->aio_attrs)
+ return 0;
+
+ if (unlikely(get_user(size, (u64 *) iocb->aio_attrs)))
+ return -EFAULT;
+
+ if (unlikely(size > PAGE_SIZE))
+ return -EFAULT;
+
+ req->ki_attrs = kmalloc(size, GFP_KERNEL);
+ if (unlikely(!req->ki_attrs))
+ return -ENOMEM;
+
+ if (unlikely(copy_from_user(req->ki_attrs,
+ (void *) iocb->aio_attrs, size))) {
+ kfree(req->ki_attrs);
+ req->ki_attrs = NULL;
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, struct kiocb_batch *batch,
bool compat)
@@ -1513,7 +1542,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
ssize_t ret;

/* enforce forwards compatibility on users */
- if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+ if (unlikely(iocb->aio_reserved1 ||
+ (iocb->aio_attrs &&
+ !(iocb->aio_flags & IOCB_FLAG_ATTR)))) {
pr_debug("EINVAL: io_submit: reserve field set\n");
return -EINVAL;
}
@@ -1538,6 +1569,11 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return -EAGAIN;
}
req->ki_filp = file;
+
+ ret = aio_setup_attrs(iocb, req);
+ if (ret)
+ goto out_put_req;
+
if (iocb->aio_flags & IOCB_FLAG_RESFD) {
/*
* If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..f58f44f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -371,6 +371,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
unsigned long flags;

bio->bi_private = dio;
+ bio->bi_attrs = dio->iocb->ki_attrs;

spin_lock_irqsave(&dio->bio_lock, flags);
dio->refcount++;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 31ff6db..60dd6bc 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -119,6 +119,8 @@ struct kiocb {
* for cancellation */
struct list_head ki_batch; /* batch allocation */

+ struct iocb_attr_list *ki_attrs;
+
/*
* If the aio_resfd field of the userspace iocb is not zero,
* this is the underlying eventfd context to deliver events to.
@@ -235,6 +237,55 @@ static inline struct kiocb *list_kiocb(struct list_head *h)
return list_entry(h, struct kiocb, ki_list);
}

+static inline struct iocb_attr *iocb_attr_next(struct iocb_attr_list *attrs,
+ struct iocb_attr *attr)
+{
+ void *end = ((void *) attrs) + attrs->size;
+
+ if (attr)
+ attr = ((void *) attr) + attr->size;
+ else
+ attr = attrs->attrs;
+
+ if ((void *) &attr->data > end)
+ return NULL;
+
+ if (((void *) attr) + attr->size > end)
+ return NULL;
+
+ return attr;
+}
+
+static inline void *__iocb_attr_lookup(struct iocb_attr_list *attrs, unsigned id)
+{
+ struct iocb_attr *attr = NULL;
+
+ if (!attrs)
+ return NULL;
+
+ while (1) {
+ attr = iocb_attr_next(attrs, attr);
+ if (!attr)
+ return NULL;
+
+ if (attr->id == id)
+ return attr;
+ }
+
+ return NULL;
+}
+
+#define iocb_attr_lookup(attrs, id) \
+({ \
+ struct iocb_attr *_attr; \
+ \
+ _attr = __iocb_attr_lookup((attrs), IOCB_ATTR_ ## id); \
+ if (_attr && _attr->size != sizeof(struct iocb_attr_ ## id)) \
+ _attr = NULL; \
+ \
+ (struct iocb_attr_ ## id *) _attr; \
+})
+
/* for sysctl: */
extern unsigned long aio_nr;
extern unsigned long aio_max_nr;
diff --git a/include/linux/aio_abi.h b/include/linux/aio_abi.h
index 86fa7a7..1f46460 100644
--- a/include/linux/aio_abi.h
+++ b/include/linux/aio_abi.h
@@ -53,6 +53,7 @@ enum {
* is valid.
*/
#define IOCB_FLAG_RESFD (1 << 0)
+#define IOCB_FLAG_ATTR (1 << 1)

/* read() from /dev/aio returns these structures. */
struct io_event {
@@ -92,7 +93,7 @@ struct iocb {
__s64 aio_offset;

/* extra parameters */
- __u64 aio_reserved2; /* TODO: use this for a (struct sigevent *) */
+ __u64 aio_attrs;

/* flags for the "struct iocb" */
__u32 aio_flags;
@@ -104,6 +105,26 @@ struct iocb {
__u32 aio_resfd;
}; /* 64 bytes */

+struct iocb_attr {
+ __u32 size;
+ __u32 id;
+ __u8 data[0];
+};
+
+struct iocb_attr_list {
+ __u64 size;
+ struct iocb_attr attrs[];
+};
+
+enum {
+ IOCB_ATTR_proxy_pid,
+};
+
+struct iocb_attr_proxy_pid {
+ struct iocb_attr attr;
+ __u64 pid;
+};
+
#undef IFBIG
#undef IFLITTLE

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7b7ac9c..e2e37f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -68,6 +68,9 @@ struct bio {
bio_end_io_t *bi_end_io;

void *bi_private;
+
+ struct iocb_attr_list *bi_attrs;
+
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio


2012-10-01 23:12:37

by Zach Brown

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().

Sure. Martin (cc:ed) will sympathize.

> A few examples:
>
> * IO scheduler hints...
> * Cache hints...
>
> * Passing checksums out to userspace. We've got bio integrity, which is
> a (somewhat) generic interface for passing data checksums between the
> filesystem and the hardware.

Hmm, careful here. I think that in DIF/DIX the checksums are
per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr
limit in this patch would be magically creating different max IO size
limits on different architectures. That doesn't seem great.

> Hence, AIO attributes.

I have to be honest: I really don't like tying the interface to AIO, but
I guess it's the only per-io facility we have today. It'd be nice to
include sync O_DIRECT when designing the interface to make sure that it
is possible to use generic syscalls in the future without running up
against unexpected problems.

> An iocb_attr has an id field, and a size field - and some amount of data
> specific to that attribute.

I'd hope that we can come up with a less fragile interface. The kernel
would have to scan the attributes to make sure that there aren't
malicious sizes. I only quickly glanced at the loops, but it seemed
like you could have a 0 size attribute in there and _next() would spin
forever.

- z

2012-10-01 23:22:46

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Mon, Oct 01, 2012 at 04:12:22PM -0700, Zach Brown wrote:
> On Mon, Oct 01, 2012 at 03:23:41PM -0700, Kent Overstreet wrote:
> > So, I and other people keep running into things where we really need to
> > add an interface to pass some auxiliary... stuff along with a pread() or
> > pwrite().
>
> Sure. Martin (cc:ed) will sympathize.
>
> > A few examples:
> >
> > * IO scheduler hints...
> > * Cache hints...
> >
> > * Passing checksums out to userspace. We've got bio integrity, which is
> > a (somewhat) generic interface for passing data checksums between the
> > filesystem and the hardware.
>
> Hmm, careful here. I think that in DIF/DIX the checksums are
> per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr
> limit in this patch would be magically creating different max IO size
> limits on different architectures. That doesn't seem great.

Not just per sector, per hardware sector. For passing around checksums
userspace would have to find out the hardware sector size and checksum
type/size via a different interface, and then the attribute would
contain a pointer to a buffer that can hold the appropriate number of
checksums.

>
> > Hence, AIO attributes.
>
> I have to be honest: I really don't like tying the interface to AIO, but
> I guess it's the only per-io facility we have today. It'd be nice to
> include sync O_DIRECT when designing the interface to make sure that it
> is possible to use generic syscalls in the future without running up
> against unexpected problems.

It'd certainly be useful with regular sync IO, I just want to take it
one step at a time particularly since for sync IO we'll probably need
new syscalls.

But yes you're right, it would be good to keep in mind.

> > An iocb_attr has an id field, and a size field - and some amount of data
> > specific to that attribute.
>
> I'd hope that we can come up with a less fragile interface. The kernel
> would have to scan the attributes to make sure that there aren't
> malicious sizes. I only quickly glanced at the loops, but it seemed
> like you could have a 0 size attribute in there and _next() would spin
> forever.

Ouch, yeah that's wrong :/

I don't think there's anything fragile about the basic idea though. Or
do you have some way of improving upon it in mind?

The idea with the size field is that it's just sizeof(the particular
attribute struct), so when userspace is appending attributes it just
sets size = sizeof() and attr_list->size += attr->size.

The kernel is going to have to sanity check the size fields of the
individual attributes anyways to verify the size of the last attr
doesn't extend off the end of the attr list, so I think it makes sense
to keep the current semantics of the size fields and just also check
that the size field is nonzero (actually >= sizeof(struct iocb_attr)).

2012-10-01 23:44:42

by Zach Brown

Subject: Re: [RFC, PATCH] Extensible AIO interface

> Not just per sector, Per hardware sector. For passing around checksums
> userspace would have to find out the hardware sector size and checksum
> type/size via a different interface, and then the attribute would
> contain a pointer to a buffer that can hold the appropriate number of
> checksums.

All problems fall to another layer of indirection? :) But yes, that's
fair. I was obviously just assuming that the checksums would be in the
attribute.

But now we're talking about layers of user pointers. Would the
attribute parser need to verify/copy pointers before downstream kernel
code tries to work with it? Would it be up to the attribute consumers
to verify the pointers that the core doesn't really touch? Are these
second pointers native (enter compat goo) or u64s?

> I don't think there's anything fragile about the basic idea though. Or
> do you have some way of improving upon it in mind?

Nothing super great is springing to mind, no.

> The idea with the size field is that it's just sizeof(the particular
> attribute struct), so when userspace is appending attributes it just
> sets size = sizeof() and attr_list->size += attr->size.

I suppose. But this also raises the spectre of aligning the packed
attributes to match their struct definitions. It's the netlink(3)
macros all over again, right? I guess unaligned accesses aren't *that*
big a deal. But still.

And what about duplicate instances of a given attribute id? Use the
first? The last? Error? Depends on the id?

It just seems like there are a lot of corner cases that can go wrong
with an API that is so free form. I'd like something a lot harder to
make mistakes with.

- z
(being That Guy today, apparently :/)

2012-10-02 00:22:25

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
> > Not just per sector, Per hardware sector. For passing around checksums
> > userspace would have to find out the hardware sector size and checksum
> > type/size via a different interface, and then the attribute would
> > contain a pointer to a buffer that can hold the appropriate number of
> > checksums.
>
> All problems fall to another layer of indirection? :)

Yep :)

> But yes, that's
> fair. I was obviously just assuming that the checksums would be in the
> attribute.
>
> But now we're talking about layers of user pointers. Would the
> attribute parser need to verify/copy pointers before downstream kernel
> code tries to work with it? Would it be up to the attribute consumers
> to verify the pointers that the core doesn't really touch?

The generic code wouldn't know about any user pointers inside
attributes, so it'd have to be downstream consumers. Hopefully there
won't be many attributes with user pointers in them (I don't expect
there to be), so we won't have too much of this messiness.

> Are these
> second pointers native (enter compat goo) or u64s?

Outside the scope of the generic implementation, but AFAIK u64s are the
sane way so I'd prefer to just enforce that.

> > I don't think there's anything fragile about the basic idea though. Or
> > do you have some way of improving upon it in mind?
>
> Nothing super great is springing to mind, no.
>
> > The idea with the size field is that it's just sizeof(the particular
> > attribute struct), so when userspace is appending attributes it just
> > sets size = sizeof() and attr_list->size += attr->size.
>
> I suppose. But this also raises the spectre of aligning the packed
> attributes to match their struct definitions. It's the netlink(3)
> macros all over again, right? I guess unaligned accesses aren't *that*
> big a deal. But still.

I was just thinking about that a few minutes ago. Since the way I'm
doing it embeds struct iocb_attr inside the specific attributes, if we
stick __aligned(8) on struct iocb_attr that should solve alignment.

> And what about duplicate instances of a given attribute id? Use the
> first? The last? Error? Depends on the id?

I suspect duplicate instances will be useful and the sanest approach
for a few things, so I don't want to disallow it.

But - perhaps we could define whether duplicates are allowed of a given
attribute along with the attribute ids, then the kernel in generic code
could check for duplicates of attrs for which it was disallowed and
error.

I think I really like that idea.

> It just seems like there are a lot of corner cases that can go wrong
> with an API that is so free form. I'd like something a lot harder to
> make mistakes with.
>
> - z
> (being That Guy today, apparently :/)

I think it's got to be free form to be useful, but that doesn't mean we
can't avoid as many mistakes as possible ahead of time so please
continue to point out anything you can think of :)

2012-10-02 00:48:06

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Mon, Oct 01, 2012 at 04:44:39PM -0700, Zach Brown wrote:
> And what about duplicate instances of a given attribute id? Use the
> first? The last? Error? Depends on the id?

I thought of a better idea, instead of explicitly checking for
disallowed dups:

We want to return -ENOTHANDLED for unhandled attributes anyways, so
let's just do that for dups - that'll catch erroneous usage just fine
and a generic mechanism's better than a one off hack any day.

This does mean we can't punt on return values, which isn't a bad thing.

Also, if we've got duplicate attributes userspace needs to be able to
figure out which return value was for which attribute.

Two possibilities: one, return values come out in the same order
attributes went in. That'd work, but I dislike the subtlety and I expect
it'd be a pain for userspace.

Instead, let's just stick a u64 cookie in the attribute and include that
in the return, just like we do everywhere else.

2012-10-02 17:41:37

by Jeff Moyer

Subject: Re: [RFC, PATCH] Extensible AIO interface

Kent Overstreet <[email protected]> writes:

> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().
>
> A few examples:
>
> * IO scheduler hints. Some userspace program wants to, per IO, specify
> either priorities or a cgroup - by specifying a cgroup you can have a
> fileserver in userspace that makes use of cfq's per cgroup bandwidth
> quotas.

You can do this today by splitting I/O between processes and placing
those processes in different cgroups. For io priority, there is
ioprio_set, which incurs an extra system call, but can be used. Not
elegant, but possible.

> * Cache hints. For bcache and other things, userspace may want to specify
> "this data should be cached", "this data should bypass the cache", etc.

Please explain how you will differentiate this from posix_fadvise.

> * Passing checksums out to userspace. We've got bio integrity, which is
> a (somewhat) generic interface for passing data checksums between the
> filesystem and the hardware. There are various circumstances under which
> you may want to pass these checksums out to userspace, and if so we
> ought to have a generic way of doing it.

Yes, that needs a new interface.

> Hence, AIO attributes.

*No.* Start with the non-AIO case first.

> * FUTURE STUFF:
>
> Return values:
>
> Some attributes are probably going to want to return something to
> userspace.
>
> If nothing else, we want this so that userspace can tell if anything
> handled the attributes it specified - as dynamic as the io stack can be,
> with something extensible like this there really isn't any generic way
> of knowing ahead of time if something is going to interpret any
> attribute - we want to return at least an error code.

Seems odd to me. Why not expose supported attributes via some other
call? fcntl?

> One could imagine sticking the return in the attribute itself, but I
> don't want to do this. For some things (checksums), the attribute will
> contain a pointer to a buffer - that's fine. But I don't want the
> attributes themselves to be writeable.

One could imagine that attributes don't return anything, because, well,
they're properties of something else, and properties don't return
anything.

Cheers,
Jeff

2012-10-02 17:43:27

by Zach Brown

Subject: Re: [RFC, PATCH] Extensible AIO interface

> The generic code wouldn't know about any user pointers inside
> attributes, so it'd have to be downstream consumers. Hopefully there
> won't be many attributes with user pointers in them (I don't expect
> there to be), so we won't have too much of this messyness.

I really don't like this. We should have learned this lesson with ioctl
structs that are nested pointers.

What about security bits that are trying to determine if attributes are
OK?

What about contexts that can't easily deal with userspace pointers? We
tend to copy into relatively more accessible kernel memory as early as
possible.

What about fuse trying to extend this interface out to their fs clients?
Look at the horrible mess it implements to bounce nested ioctl data
parsing between the kernel's user pointer copying and the fuse client's
parsing of that copied data.

Let's not do this again, please. I think it's a fallacy to claim that
the interface can be opaque to high levels and only parsed by lower
levels. The interface should be explicit and fully specified on entry
so that all levels have trivial access to it.

- z

2012-10-02 19:34:53

by Andi Kleen

Subject: Re: [RFC, PATCH] Extensible AIO interface

Kent Overstreet <[email protected]> writes:

> So, I and other people keep running into things where we really need to
> add an interface to pass some auxiliary... stuff along with a pread() or
> pwrite().

How would you enumerate this?
How does the application know what the underlying stack supports/need?
How is versioning handled?

Frankly, in this form it looks more like a special purpose hack than a general
facility.

-Andi

--
[email protected] -- Speaking for myself only

2012-10-02 21:41:23

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Tue, Oct 02, 2012 at 10:43:23AM -0700, Zach Brown wrote:
> > The generic code wouldn't know about any user pointers inside
> > attributes, so it'd have to be downstream consumers. Hopefully there
> > won't be many attributes with user pointers in them (I don't expect
> > there to be), so we won't have too much of this messyness.
>
> I really don't like this. We should have learned this lesson with ioctl
> structs that are nested pointers.

I will in no way claim to be skilled at kernel/userspace interface
design (my expertise is more in the block layer), so feel free to
educate me as much as you have the patience for :)

The comparison with ioctl is definitely apt. (And I don't care for
ioctl either, my first preference is for textual interfaces wherever
possible - but I don't think that's a reasonable approach here :p)

I'm not sure what your exact complaint about nested pointers is - I
agree we want to avoid pointers as much as we can, for multiple
reasons. I do expect to get rid of one layer of indirection eventually,
but that'll probably require new syscalls.

Also, I emphatically hope and expect that nested pointers will _not_ be
the norm.

> What about security bits that are trying to determine if attributes are
> OK?

Seems to me it'd be no different from security considerations when
introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
sometimes no way around it.

It really has to be ad hoc if it's extensible, unfortunately.

The only way of getting around _that_ would be with some kind of
reflective type system, so that generic code could parse (in some
fashion) the types of the various attributes, and for pointers copy the
user data into the kernel and do whatever access controls in generic
code.

Which might not be a crazy idea if it could be applied to ioctls...
hmm...

> What about contexts that can't easily deal with userspace pointers? We
> tend to copy into relatively more accessible kernel memory as early as
> possible.

That tends not to be the case submission side. submit_bio() and
make_request functions are all run in normal process context.

And also, like I mentioned I expect nested pointers to be fairly unusual
so this shouldn't come up often.

> What about fuse trying to extend this interface out to their fs clients?
> Look at the horrible mess it implements to bounce nested ioctl data
> parsing between the kernel's user pointer copying and the fuse client's
> parsing of that copied data.

Ugh, yeah that's something I should look into/think about. It's been
ages since I looked at fuse, and I never looked at how it handled ioctls
(not sure if that even existed when I was looking at it).

> Let's not do this again, please. I think it's a fallacy to claim that
> the interface can be opaque to high levels and only parsed by lower
> levels. The interface should be explicit and fully specified on entry
> so that all levels have trivial access to it.

Hmm - that is definitely a good principle.

To restate a bit, the parsing and validation ought to be done in generic
centralized code to the greatest extent possible (I _think_ that it's
not always possible in general, but maybe it will be in practice).

Also - IMO, one of the nastier things about ioctls is that
parsing/validation tends to be completely mixed with implementation. It
would be nice to get away from that.

Pondering a bit - that actually is the situation with my patch for
attributes that are simple data; it's just data and the data is
trivially accessible at all levels.

But, this isn't just an issue for attributes that contain user pointers
- the first attribute we prototyped was the proxy_pid attr, so a process
can have io done in another cgroup's context.

That requires validation, permission checking, and depending on how it's
used refcounting. That stuff _definitely_ shouldn't be buried in the
middle of block/cfq-iosched.c.

Two approaches I can think of:

For every attr, define a struct user_iocb_attr_foo and struct
kern_iocb_attr_foo. Outside of the attr parsing code, nothing in the
kernel sees the user version - they see a kern version which has had
said parsing/validation done on it.

I do like this approach in that it ought to make things easy to
audit/hard to screw up.

One difficulty I can see with that approach is refcounting - if
parsing/validation is done on kernel entry, we've got to take refcounts
to whatever kernel structures are referred to (like a different
task_struct). This can often be avoided if we delay parsing/validation
until when it's actually used, and then use rcu.

Or maybe we could go with a halfway approach - the userspace structs are
passed around within the kernel like in my existing patch, but for any
attr that isn't simple data we define a parse_iocb_attr_foo function,
and those functions are implemented in a common location - not scattered
around defined where they're used.

2012-10-02 22:34:22

by Martin K. Petersen

Subject: Re: [RFC, PATCH] Extensible AIO interface

>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

>> Hmm, careful here. I think that in DIF/DIX the checksums are
>> per-sector, not per IO, right? That'd mean that the PAGE_SIZE attr
>> limit in this patch would be magically creating different max IO size
>> limits on different architectures. That doesn't seem great.

Kent> Not just per sector, Per hardware sector.

Per logical block (or for some devices less).


Kent> For passing around checksums userspace would have to find out the
Kent> hardware sector size and checksum type/size via a different
Kent> interface,

The relevant information is already exported in sysfs. Including the
format, how many bytes of integrity metadata go with how many bytes of
data, etc.

--
Martin K. Petersen Oracle Linux Engineering

2012-10-03 00:20:36

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> Kent Overstreet <[email protected]> writes:
>
> > So, I and other people keep running into things where we really need to
> > add an interface to pass some auxiliary... stuff along with a pread() or
> > pwrite().
> >
> > A few examples:
> >
> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > either priorities or a cgroup - by specifying a cgroup you can have a
> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > quotas.
>
> You can do this today by splitting I/O between processes and placing
> those processes in different cgroups. For io priority, there is
> ioprio_set, which incurs an extra system call, but can be used. Not
> elegant, but possible.

Yes - those are things I'm trying to replace. Doing it that way is a
real pain: it's a lousy interface for this, and it does impact
performance (ioprio_set doesn't really work well with aio, either).

> > * Cache hints. For bcache and other things, userspace may want to specify
> > "this data should be cached", "this data should bypass the cache", etc.
>
> Please explain how you will differentiate this from posix_fadvise.

Oh sorry, I think about SSD caching so much I forget to say that's what
I'm talking about. posix_fadvise is for the page cache, we want
something different for an SSD cache (IMO it'd be really ugly to use it
for both, and posix_fadvise() can't really specify everything we'd
want to for an SSD cache).

> > * Passing checksums out to userspace. We've got bio integrity, which is
> > a (somewhat) generic interface for passing data checksums between the
> > filesystem and the hardware. There are various circumstances under which
> > you may want to pass these checksums out to userspace, and if so we
> > ought to have a generic way of doing it.
>
> Yes, that needs a new interface.
>
> > Hence, AIO attributes.
>
> *No.* Start with the non-AIO case first.

Why? It is orthogonal to AIO (and I should make that clearer), but to do
it for sync IO we'd need new syscalls that take an extra argument, so IMO
it's a bit easier to start with AIO.

Might be worth implementing the sync interface sooner rather than later
just to discover any potential issues, I suppose.


> > * FUTURE STUFF:
> >
> > Return values:
> >
> > Some attributes are probably going to want to return something to
> > userspace.
> >
> > If nothing else, we want this so that userspace can tell if anything
> > handled the attributes it specified - as dynamic as the io stack can be,
> > with something extensible like this there really isn't any generic way
> > of knowing ahead of time if something is going to interpret any
> > attribute - we want to return at least an error code.
>
> Seems odd to me. Why not expose supported attributes via some other
> call? fcntl?

It's not possible in general - consider stacking block devices, and
attrs that are supported only by specific block drivers. I.e. if you've
got lvm on top of bcache or bcache on top of md, we can pass the attr
down with the IO but we can't determine ahead of time, in general, where
the IO is going to go.

But that probably isn't true for most attrs so it probably would be a
good idea to have an interface for querying what's supported, and even
for device specific ones you could query what a device supports.

> > One could imagine sticking the return in the attribute itself, but I
> > don't want to do this. For some things (checksums), the attribute will
> > contain a pointer to a buffer - that's fine. But I don't want the
> > attributes themselves to be writeable.
>
> One could imagine that attributes don't return anything, because, well,
> they're properties of something else, and properties don't return
> anything.

With a strict definition of attribute, yeah. One of the real use cases
we have for this is per IO timings, for aio - right now we've got an
interface for the kernel to tell userspace how long a syscall took
(don't think it's upstream yet - Paul's been behind that stuff), but it
only really makes sense with synchronous syscalls.

These AIO attributes would be useful for that too, but I'd _much_ prefer
if the timing information was explicitly returned instead of using a
pointer to a buffer.

2012-10-03 01:28:33

by Dave Chinner

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > Kent Overstreet <[email protected]> writes:
> >
> > > So, I and other people keep running into things where we really need to
> > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > pwrite().
> > >
> > > A few examples:
> > >
> > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > quotas.
> >
> > You can do this today by splitting I/O between processes and placing
> > those processes in different cgroups. For io priority, there is
> > ioprio_set, which incurs an extra system call, but can be used. Not
> > elegant, but possible.
>
> Yes - those are things I'm trying to replace. Doing it that way is a
> real pain, both as it's a lousy interface for this and it does impact
> performance (ioprio_set doesn't really work too well with aio, too).
>
> > > * Cache hints. For bcache and other things, userspace may want to specify
> > > "this data should be cached", "this data should bypass the cache", etc.
> >
> > Please explain how you will differentiate this from posix_fadvise.
>
> Oh sorry, I think about SSD caching so much I forget to say that's what
> I'm talking about. posix_fadvise is for the page cache, we want
> something different for an SSD cache (IMO it'd be really ugly to use it
> for both, and posix_fadvise() can't really specify everything we'd
> want to for an SSD cache).

Similar discussions about posix_fadvise() are being had for marking
ranges of files as volatile (i.e. useful for determining what can be
evicted from a cache when space reclaim is required).

https://lkml.org/lkml/2012/10/2/501

If you have requirements for specific cache management, then it
might be worth seeing if you can steer an existing interface
proposal for some form of cache management in the direction you
need.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-03 01:41:17

by Tejun Heo

Subject: Re: [RFC, PATCH] Extensible AIO interface

Hello, Kent.

On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
> Seems to me it'd be no different from security considerations when
> introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
> sometimes no way around it.
>
> It really has to be ad hoc if it's extensible, unfortunately.
>
> The only way of getting around _that_ would be with some kind of
> reflective type system, so that generic code could parse (in some
> fashion) the types of the various attributes, and for pointers copy the
> user data into the kernel and do whatever access controls in generic
> code.

I'm not a userland API expert by any stretch of the imagination, but the
basic mechanism to pass data around seems sane to me, and aio, as stinky
as it is, seems to be the only logical place for IOs w/ extra
attributes - although alignment is always painful with any form of
concatenated opaque structures.

However, I don't think it's a good idea to try to implement something
which is a neutral transport of opaque data between userland and lower
layers. Things like that sound attractive with unlimited
possibilities but reality seems to have the tendency to make a big
mess out of setups like that.

So, if we're gonna do this, let's define what attributes we want to
have and let them be processed at the interface layer and fed to lower
layers afterwards - e.g. for cgroup context association, associate the
resulting bios with the target cgroup from the aio layer.

I'm quite skeptical of general usefulness of having opaque knobs to
lower IO stack which don't have proper generic abstraction. Nobody
can make proper use of things like that. Well, not nobody, maybe if
the lower stack, the interface and the application are implemented by
a single organization over relatively short span of time, maybe it
would be useful for them, but that isn't something which generic
interface design should focus on.

It's okay to allow some side channel thing for specific hacky uses but
I really hope the general design were focused around properly
abstracted attributes which can be understood and handled by the upper
layer.

Thanks.

--
tejun

2012-10-03 02:41:49

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
> On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > > Kent Overstreet <[email protected]> writes:
> > >
> > > > So, I and other people keep running into things where we really need to
> > > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > > pwrite().
> > > >
> > > > A few examples:
> > > >
> > > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > > quotas.
> > >
> > > You can do this today by splitting I/O between processes and placing
> > > those processes in different cgroups. For io priority, there is
> > > ioprio_set, which incurs an extra system call, but can be used. Not
> > > elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
> >
> > > > * Cache hints. For bcache and other things, userspace may want to specify
> > > > "this data should be cached", "this data should bypass the cache", etc.
> > >
> > > Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specify everything we'd
> > want to for an SSD cache).
>
> Similar discussions about posix_fadvise() are being had for marking
> ranges of files as volatile (i.e. useful for determining what can be
> evicted from a cache when space reclaim is required).
>
> https://lkml.org/lkml/2012/10/2/501

Hmm, interesting

Speaking as an implementor though, hints that aren't associated with any
specific IO are harder to make use of - stuff is in the cache. What you
really want is to know, for a given IO, whether to cache it or not, and
possibly where in the LRU to stick it.

Well, it's quite possible that different implementations would have no
trouble making use of those kinds of hints, I'm no doubt biased by
having implemented bcache. With bcache though, cache replacement is done
in terms of physical address space, not logical (i.e. the address space
of the device being cached).

So to handle posix_fadvise, we'd have to walk the btree and chase
pointers to buckets, and modify the bucket priorities up or down... but
what about the other data in those buckets? It's not clear what should
happen, but there isn't any good way to take that into account.

(The exception is dropping data from the cache entirely, we can just
invalidate that part of the keyspace and garbage collection will reclaim
the buckets they pointed to. Bcache does that for discard requests,
currently).

> If you have requirements for specific cache management, then it
> might be worth seeing if you can steer an existing interface
> proposal for some form of cache management in the direction you
> need.

Certainly - I don't plan on implementing anything bcache specific, or
implementing anything from scratch if there's a good proposal out there.
But a per-io interface does seem useful from an implementation pov and
natural to use for at least some classes of applications.

2012-10-03 03:00:57

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Wed, Oct 03, 2012 at 10:41:06AM +0900, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Oct 02, 2012 at 02:41:13PM -0700, Kent Overstreet wrote:
> > Seems to me it'd be no different from security considerations when
> > introducing new ioctls. I.e., messy, ad hoc, easy to get wrong, but
> > sometimes no way around it.
> >
> > It really has to be ad hoc if it's extensible, unfortunately.
> >
> > The only way of getting around _that_ would be with some kind of
> > reflective type system, so that generic code could parse (in some
> > fashion) the types of the various attributes, and for pointers copy the
> > user data into the kernel and do whatever access controls in generic
> > code.
>
> I'm not a userland API expert by any stretch of the imagination, but the
> basic mechanism to pass data around seems sane to me, and aio, as stinky
> as it is, seems to be the only logical place for IOs w/ extra
> attributes - although alignment is always painful with any form of
> concatenated opaque structures.
>
> However, I don't think it's a good idea to try to implement something
> which is a neutral transport of opaque data between userland and lower
> layers. Things like that sound attractive with unlimited
> possibilities but reality seems to have the tendency to make a big
> mess out of setups like that.

I don't see how the "neutral transport of opaque data" itself is a bad
thing. We want something simple and sane to build actual interfaces on
top of - once we've got that, we can either build clean generic well
defined interfaces or we can make a mess like with ioctls :P

It's like any other mechanism. There's good syscalls and bad syscalls...

> So, if we're gonna do this, let's define what attributes we want to
> have and let them be processed at the interface layer and fed to lower
> layers afterwards - e.g. for cgroup context association, associate the
> resulting bios with the target cgroup from the aio layer.

That sounds perfectly reasonable to me (the emails with Zach ended up
heading in that direction).

> I'm quite skeptical of general usefulness of having opaque knobs to
> lower IO stack which don't have proper generic abstraction. Nobody
> can make proper use of things like that. Well, not nobody, maybe if
> the lower stack, the interface and the application are implemented by
> a single organization over relatively short span of time, maybe it
> would be useful for them, but that isn't something which generic
> interface design should focus on.

I think it could work fine, but I'm not convinced either way on what the
correct approach is.

Say we implement an attr to control a block layer cache. That attr could
be parsed/validated in high level code (if there's any to do) - that I
don't object to. But the high level code isn't going to /know/ whether
there was any block cache in the stack that handled the attr. If the
attr is passed down to the block cache, that block cache can return that
it was handled.

Or if we implement an attr that says "return whether it was a cache hit
or miss, and maybe other statistics". Similar thing.

We could conceivably confine attrs to the upper layer, and define in
kernel interfaces for passing all this stuff around, but I'm doubtful
it's worth the trouble - at least if attrs themselves are implemented
sanely. And honestly I think it'd make (more of) a mess of the block
layer and the rest of the io stack to have to explicitly pass around
cache hints/cache controlling options/whatever else we think of in the
future - especially when most of this stuff isn't going to be in use
most of the time.

But, like I sort of mentioned in another email with Zach - passing the
attrs from userspace around is _not_ mutually exclusive with standard
code for parsing/validating them that exists in one place. We shouldn't
have driver code reaching in and grabbing the raw attr unless there _is_
no parsing/validation to do.


> It's okay to allow some side channel thing for specific hacky uses but
> I really hope the general design were focused around properly
> abstracted attributes which can be understood and handled by the upper
> layer.

Completely agreed. I want to leave that side channel open for
experimentation, and so we have a way of implementing one off hacky
stuff when we need to - but normal mainline stuff should be sane and
well designed.

2012-10-03 19:54:46

by Jeff Moyer

Subject: Re: [RFC, PATCH] Extensible AIO interface

Kent Overstreet <[email protected]> writes:

> On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
>> Kent Overstreet <[email protected]> writes:
>>
>> > So, I and other people keep running into things where we really need to
>> > add an interface to pass some auxiliary... stuff along with a pread() or
>> > pwrite().
>> >
>> > A few examples:
>> >
>> > * IO scheduler hints. Some userspace program wants to, per IO, specify
>> > either priorities or a cgroup - by specifying a cgroup you can have a
>> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
>> > quotas.
>>
>> You can do this today by splitting I/O between processes and placing
>> those processes in different cgroups. For io priority, there is
>> ioprio_set, which incurs an extra system call, but can be used. Not
>> elegant, but possible.
>
> Yes - those are things I'm trying to replace. Doing it that way is a
> real pain, both as it's a lousy interface for this and it does impact
> performance (ioprio_set doesn't really work too well with aio, too).

ioprio_set works fine with aio, since the I/O is issued in the caller's
context. Perhaps you're thinking of writeback I/O?

>> > * Cache hints. For bcache and other things, userspace may want to specify
>> > "this data should be cached", "this data should bypass the cache", etc.
>>
>> Please explain how you will differentiate this from posix_fadvise.
>
> Oh sorry, I think about SSD caching so much I forget to say that's what
> I'm talking about. posix_fadvise is for the page cache, we want
> something different for an SSD cache (IMO it'd be really ugly to use it
> for both, and posix_fadvise() can't really specify everything we'd
> want to for an SSD cache).

DESCRIPTION
Programs can use posix_fadvise() to announce an intention to
access file data in a specific pattern in the future, thus
allowing the kernel to perform appropriate optimizations.

That description seems broad enough to include disk caches as well. You
haven't exactly stated what's missing.

>> > Hence, AIO attributes.
>>
>> *No.* Start with the non-AIO case first.
>
> Why? It is orthogonal to AIO (and I should make that clearer), but to do
> it for sync IO we'd need new syscalls that take an extra argument so IMO
> it's a bit easier to start with AIO.
>
> Might be worth implementing the sync interface sooner rather than later
> just to discover any potential issues, I suppose.

Looking back to preadv and pwritev, it was wrong to only add them to
libaio (and that later got corrected). I'd just like to see things
start out with the sync interfaces, since you'll get more eyes on the
code (not everyone cares about aio) and that helps to work out any
interface issues.

>> > * FUTURE STUFF:
>> >
>> > Return values:
>> >
>> > Some attributes are probably going to want to return something to
>> > userspace.
>> >
>> > If nothing else, we want this so that userspace can tell if anything
>> > handled the attributes it specified - as dynamic as the io stack can be,
>> > with something extensible like this there really isn't any generic way
>> > of knowing ahead of time if something is going to interpret any
>> > attribute - we want to return at least an error code.
>>
>> Seems odd to me. Why not expose supported attributes via some other
>> call? fcntl?
>
> It's not possible in general - consider stacking block devices, and
> attrs that are supported only by specific block drivers. I.e. if you've
> got lvm on top of bcache or bcache on top of md, we can pass the attr
> down with the IO but we can't determine ahead of time, in general, where
> the IO is going to go.

If the io stack is static (meaning you setup a device once, then open it
and do io to it, and it doesn't change while you're doing io), how would
you not know where the IO is going to go?

> But that probably isn't true for most attrs so it probably would be a
> good idea to have an interface for querying what's supported, and even
> for device specific ones you could query what a device supports.

OK.

>> > One could imagine sticking the return in the attribute itself, but I
>> > don't want to do this. For some things (checksums), the attribute will
>> > contain a pointer to a buffer - that's fine. But I don't want the
>> > attributes themselves to be writeable.
>>
>> One could imagine that attributes don't return anything, because, well,
>> they're properties of something else, and properties don't return
>> anything.
>
> With a strict definition of attribute, yeah. One of the real use cases
> we have for this is per IO timings, for aio - right now we've got an
> interface for the kernel to tell userspace how long a syscall took
> (don't think it's upstream yet - Paul's been behind that stuff), but it
> only really makes sense with synchronous syscalls.

Something beyond recording the time spent in the kernel? Paul who? I
agree the per io timing for aio may be coarse-grained today (you can
time the difference between io_submit returning and the event being
returned by io_getevents, but that says nothing of when the io was
issued to the block layer). I'm curious to know exactly what
granularity you want here, and what an application would do with that
information. You can currently access a whole lot of detail of the io
path through blktrace, but that is not easily done from within an
application.

> These AIO attributes would be useful for that too, but I'd _much_ prefer
> if the timing information was explicitly returned instead of using a
> pointer to a buffer.

I'm having a hard time understanding exactly what you are timing.

Cheers,
Jeff

2012-10-03 21:58:17

by Tejun Heo

Subject: Re: [RFC, PATCH] Extensible AIO interface

Hello, Kent.

On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
> > However, I don't think it's a good idea to try to implement something
> > which is a neutral transport of opaque data between userland and lower
> > layers. Things like that sound attractive with unlimited
> > possibilities but reality seems to have the tendency to make a big
> > mess out of setups like that.
>
> I don't see how the "neutral transport of opaque data" itself is a bad
> thing. We want something simple and sane to build actual interfaces on
> top of - once we've got that, we can either build clean generic well
> defined interfaces or we can make a mess like with ioctls :P
>
> It's like any other mechanism. There's good syscalls and bad syscalls...

Depending on what a feature aims for, the design and implementation
vary greatly. If you go for completely generic extensible stuff which
can be used to warp the space-time continuum, it's easy to end up with a
monstrosity with generic and programmable parser, verifier, accessor
and so on.

> Say we implement an attr to control a block layer cache. That attr could
> be parsed/validated in high level code (if there's any to do) - that I
> don't object to. But the high level code isn't going to /know/ whether
> there was any block cache in the stack that handled the attr. If the
> attr is passed down to the block cache, that block cache can return that
> it was handled.

My point is that if it doesn't fit the generic abstract model as in
fadvise(2), it probably isn't worth supporting in any generic manner.

> > It's okay to allow some side channel thing for specific hacky uses but
> > I really hope the general design were focused around properly
> > abstracted attributes which can be understood and handled by the upper
> > layer.
>
> Completely agreed. I want to leave that side channel open for
> experimentation, and so we have a way of implementing one off hacky
> stuff when we need to - but normal mainline stuff should be sane and
> well designed.

So, I think we can aim for something simple and modest (the only thing
I can think of at the moment is task association) and provide a simple
framework which can be used for specific custom usages. Let's please
not go overboard with generic parser / verifier which supports pointer
indirection or whatnot.

Thanks.

--
tejun

2012-10-04 01:04:20

by Dave Chinner

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Tue, Oct 02, 2012 at 07:41:10PM -0700, Kent Overstreet wrote:
> On Wed, Oct 03, 2012 at 11:28:25AM +1000, Dave Chinner wrote:
> > On Tue, Oct 02, 2012 at 05:20:29PM -0700, Kent Overstreet wrote:
> > > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> > > > Kent Overstreet <[email protected]> writes:
> > > >
> > > > > So, I and other people keep running into things where we really need to
> > > > > add an interface to pass some auxiliary... stuff along with a pread() or
> > > > > pwrite().
> > > > >
> > > > > A few examples:
> > > > >
> > > > > * IO scheduler hints. Some userspace program wants to, per IO, specify
> > > > > either priorities or a cgroup - by specifying a cgroup you can have a
> > > > > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> > > > > quotas.
> > > >
> > > > You can do this today by splitting I/O between processes and placing
> > > > those processes in different cgroups. For io priority, there is
> > > > ioprio_set, which incurs an extra system call, but can be used. Not
> > > > elegant, but possible.
> > >
> > > Yes - those are things I'm trying to replace. Doing it that way is a
> > > real pain, both as it's a lousy interface for this and it does impact
> > > performance (ioprio_set doesn't really work too well with aio, too).
> > >
> > > > > * Cache hints. For bcache and other things, userspace may want to specify
> > > > > "this data should be cached", "this data should bypass the cache", etc.
> > > >
> > > > Please explain how you will differentiate this from posix_fadvise.
> > >
> > > Oh sorry, I think about SSD caching so much I forget to say that's what
> > > I'm talking about. posix_fadvise is for the page cache, we want
> > > something different for an SSD cache (IMO it'd be really ugly to use it
> > > for both, and posix_fadvise() can't really specify everything we'd
> > > want to for an SSD cache).
> >
> > Similar discussions about posix_fadvise() are being had for marking
> > ranges of files as volatile (i.e. useful for determining what can be
> > evicted from a cache when space reclaim is required).
> >
> > https://lkml.org/lkml/2012/10/2/501
>
> Hmm, interesting
>
> Speaking as an implementor though, hints that aren't associated with any
> specific IO are harder to make use of - stuff is in the cache. What you
> really want is to know, for a given IO, whether to cache it or not, and
> possibly where in the LRU to stick it.

I can see how it might be useful, but it needs to have a defined set
of attributes that a file IO is allowed to have. If you don't define
the set, then what you really have is an arbitrary set of storage-device
specific interfaces.

Of course, once we have a defined set of per-file IO policy
attributes, we don't really need per-IO attributes - you can just
set them through a range interface like fadvise() or fallocate().

> Well, it's quite possible that different implementations would have no
> trouble making use of those kinds of hints, I'm no doubt biased by
> having implemented bcache. With bcache though, cache replacement is done
> in terms of physical address space, not logical (i.e. the address space
> of the device being cached).
>
> So to handle posix_fadvise, we'd have to walk the btree and chase
> pointers to buckets, and modify the bucket priorities up or down... but
> what about the other data in those buckets? It's not clear what should
> happen, but there isn't any good way to take that into account.
>
> (The exception is dropping data from the cache entirely, we can just
> invalidate that part of the keyspace and garbage collection will reclaim
> the buckets they pointed to. Bcache does that for discard requests,
> currently).

It sounds to me like you are saying that the design of bcache is
unsuited to file-level management of caching policy, and that is why
you want to pass attributes directly to bcache with specific IOs. Is
that a fair summary of the problem you are describing here?

My problem with this approach has nothing to do with the per-IO
nature of it - it's to do with the layering violations and the
amount of storage specific knowledge needed to make effective use of
it. i.e. it seems like an interface that can only be used by people
intimately familiar with underlying storage implementation. You
happen to be one of those people, so I figure that you don't see a
problem with that. ;)

However, it also implies that an application must understand and use
a specific storage configuration that matches the attributes an
application sends. I understand how this model is appealing to
Google because they control the whole application and storage stack
(hardware and software) from top to bottom. However, I just don't
think that it is a solution that the rest of the world can use
effectively.

The scope of data locality, aging and IO priority policy
control is much larger than just controlling SSD caching.
SSD caching is just a another implementation of automatic tiering,
and HSMs have been doing this for years and years. It's the same
problem - space management and deciding what to leave in the
frequently accessed pool of fast storage for best performance.

Given that we have VFS level hot inode and offset range tracking not
that far away, we're soon going to have file-level access frequency
data available to both userspace and filesystems. Hence widespread
support for automatic heterogeneous data tiering controlled at the
file range level isn't that far away.

As such, it follows that the management interface for data locality
policy (e.g. access frequency hints) needs to align with the method
of tracking access frequency that is being proposed, i.e. it should
also be file-range based. And if the hints are abstract, then the
underlying storage layers can translate that hint into something
appropriate for the given storage layer. Storage layer specific
hints (e.g. cache this IO) do not mean anything to layers that
don't have the functionality that is being asked for.

I'll also point out that a file range interface is the natural level
at which to manage access policies from an application developer's
POV, as it matches their existing view of how they store data. Most
applications don't know anything about how storage is implemented,
but they do know which files or parts of files they access
frequently.

Realistically, this is a complex problem, but I think we need to
solve the general access policy management problem rather than inventing
ways of punching application/storage specific access information
through to random layers of the storage stack from userspace....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-04 19:38:06

by Kent Overstreet

Subject: Re: [RFC, PATCH] Extensible AIO interface

On Wed, Oct 03, 2012 at 03:15:26PM -0400, Jeff Moyer wrote:
> Kent Overstreet <[email protected]> writes:
>
> > On Tue, Oct 02, 2012 at 01:41:17PM -0400, Jeff Moyer wrote:
> >> Kent Overstreet <[email protected]> writes:
> >>
> >> > So, I and other people keep running into things where we really need to
> >> > add an interface to pass some auxiliary... stuff along with a pread() or
> >> > pwrite().
> >> >
> >> > A few examples:
> >> >
> >> > * IO scheduler hints. Some userspace program wants to, per IO, specify
> >> > either priorities or a cgroup - by specifying a cgroup you can have a
> >> > fileserver in userspace that makes use of cfq's per cgroup bandwidth
> >> > quotas.
> >>
> >> You can do this today by splitting I/O between processes and placing
> >> those processes in different cgroups. For io priority, there is
> >> ioprio_set, which incurs an extra system call, but can be used. Not
> >> elegant, but possible.
> >
> > Yes - those are things I'm trying to replace. Doing it that way is a
> > real pain, both as it's a lousy interface for this and it does impact
> > performance (ioprio_set doesn't really work too well with aio, too).
>
> ioprio_set works fine with aio, since the I/O is issued in the caller's
> context. Perhaps you're thinking of writeback I/O?

Until you want to issue different IOs with different priorities...

> >> > * Cache hints. For bcache and other things, userspace may want to specify
> >> > "this data should be cached", "this data should bypass the cache", etc.
> >>
> >> Please explain how you will differentiate this from posix_fadvise.
> >
> > Oh sorry, I think about SSD caching so much I forget to say that's what
> > I'm talking about. posix_fadvise is for the page cache, we want
> > something different for an SSD cache (IMO it'd be really ugly to use it
> > for both, and posix_fadvise() can't really specify everything we'd
> > want to for an SSD cache).
>
> DESCRIPTION
> Programs can use posix_fadvise() to announce an intention to
> access file data in a specific pattern in the future, thus
> allowing the kernel to perform appropriate optimizations.
>
> That description seems broad enough to include disk caches as well. You
> haven't exactly stated what's missing.

It _could_ work for SSD caches, but that doesn't mean you'd want it to -
it doesn't have any way of specifying which cache you want the hint to
apply to, and there are certainly circumstances under which you
_wouldn't_ want it to apply to both.

And making it apply to SSD caches would be silently changing the
behavior, and also like I mentioned it's not sufficient for SSD caches.

> >> > Hence, AIO attributes.
> >>
> >> *No.* Start with the non-AIO case first.
> >
> > Why? It is orthogonal to AIO (and I should make that clearer), but to do
> > it for sync IO we'd need new syscalls that take an extra argument so IMO
> > it's a bit easier to start with AIO.
> >
> > Might be worth implementing the sync interface sooner rather than later
> > just to discover any potential issues, I suppose.
>
> Looking back to preadv and pwritev, it was wrong to only add them to
> libaio (and that later got corrected). I'd just like to see things
> start out with the sync interfaces, since you'll get more eyes on the
> code (not everyone cares about aio) and that helps to work out any
> interface issues.

I agree we don't want to leave out sync versions, but honestly this
stuff is more useful with AIO and that's the easier place to start.

> > It's not possible in general - consider stacking block devices, and
> > attrs that are supported only by specific block drivers. I.e. if you've
> > got lvm on top of bcache or bcache on top of md, we can pass the attr
> > down with the IO but we can't determine ahead of time, in general, where
> > the IO is going to go.
>
> If the io stack is static (meaning you setup a device once, then open it
> and do io to it, and it doesn't change while you're doing io), how would
> you not know where the IO is going to go?

With something like dm, md or bcache - you've got multiple underlying
devices, and which of those devices a given IO ends up on isn't
something you can, in general, predict ahead of time.
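The size field in each attr is what makes this workable: whatever layer the
IO eventually reaches can walk the list generically for the ids it
understands and skip the rest. A userspace sketch of that lookup, with the
layout as described in the patch and field widths assumed:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct iocb_attr {
	uint32_t id;
	uint32_t size;	/* total size of this attr, header included */
	uint8_t  data[];
};

struct iocb_attr_list {
	uint32_t size;	/* total bytes of attrs that follow */
	uint8_t  attrs[];
};

/*
 * Generic lookup: any layer the IO lands on (dm, md, bcache, a
 * driver) can call this on the bio's attr list without knowing
 * anything about the attrs it doesn't care about.
 */
static struct iocb_attr *attr_find(struct iocb_attr_list *list, uint32_t id)
{
	uint32_t off = 0;

	while (off + sizeof(struct iocb_attr) <= list->size) {
		struct iocb_attr *attr =
			(struct iocb_attr *)(list->attrs + off);

		if (attr->size < sizeof(*attr))
			return NULL;	/* malformed list, stop walking */
		if (attr->id == id)
			return attr;
		off += attr->size;
	}
	return NULL;
}
```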

> > But that probably isn't true for most attrs so it probably would be a
> > good idea to have an interface for querying what's supported, and even
> > for device specific ones you could query what a device supports.
>
> OK.
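A query interface could look something like the sketch below, where a
stacked device reports the union of what it and its children support.
Nothing like this exists in the posted patch; it's purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Each device exports the attr ids it handles; a stacked device
 * (lvm on bcache, bcache on md, ...) also reports whatever its
 * underlying devices support.
 */
struct attr_dev {
	const uint32_t *supported;	/* attr ids this layer handles */
	size_t n_supported;
	struct attr_dev **children;	/* underlying devices, if stacked */
	size_t n_children;
};

static bool attr_supported(const struct attr_dev *dev, uint32_t id)
{
	for (size_t i = 0; i < dev->n_supported; i++)
		if (dev->supported[i] == id)
			return true;
	for (size_t i = 0; i < dev->n_children; i++)
		if (attr_supported(dev->children[i], id))
			return true;
	return false;
}
```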
>
> >> > One could imagine sticking the return in the attribute itself, but I
> >> > don't want to do this. For some things (checksums), the attribute will
> >> > contain a pointer to a buffer - that's fine. But I don't want the
> >> > attributes themselves to be writeable.
> >>
> >> One could imagine that attributes don't return anything, because, well,
> >> they're properties of something else, and properties don't return
> >> anything.
> >
> > With a strict definition of attribute, yeah. One of the real use cases
> > we have for this is per IO timings, for aio - right now we've got an
> > interface for the kernel to tell userspace how long a syscall took
> > (don't think it's upstream yet - Paul's been behind that stuff), but it
> > only really makes sense with synchronous syscalls.
>
> Something beyond recording the time spent in the kernel? Paul who? I
> agree the per io timing for aio may be coarse-grained today (you can
> time the difference between io_submit returning and the event being
> returned by io_getevents, but that says nothing of when the io was
> issued to the block layer). I'm curious to know exactly what
> granularity you want here, and what an application would do with that
> information. You can currently access a whole lot of detail of the io
> path through blktrace, but that is not easily done from within an
> application.

Paul Turner, scheduler guy.

I believe it's both syscall time and IO time (i.e. what you'd get from
blktrace). It's basically used for visibility into filesystem-type
workloads, for monitoring latency - RPC latency alone isn't enough; you
really need to know why things are slow, and that could be something as
simple as a disk going bad.
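That kind of per-IO timing fits the pattern mentioned earlier for
checksums: the attr itself isn't writeable, but its payload is a pointer
to a buffer the kernel fills in at completion. A minimal sketch, with all
names and fields hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Result buffer the caller owns; the kernel writes into it. */
struct io_timing {
	uint64_t submit_ns;	/* IO entered the block layer */
	uint64_t complete_ns;	/* IO completed */
};

/* Payload of the (hypothetical) timing attribute: */
struct attr_io_timing {
	uint64_t buf;	/* userspace pointer to a struct io_timing */
};

/* What the completion path would do, in miniature: */
static void fill_timing(struct attr_io_timing *attr,
			uint64_t submit_ns, uint64_t complete_ns)
{
	struct io_timing *t = (struct io_timing *)(uintptr_t)attr->buf;

	t->submit_ns   = submit_ns;
	t->complete_ns = complete_ns;
}
```

The application gets per-IO latency at io_getevents() time without
resorting to blktrace.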

2012-10-04 19:50:20

by Kent Overstreet

[permalink] [raw]
Subject: Re: [RFC, PATCH] Extensible AIO interface

On Thu, Oct 04, 2012 at 06:58:06AM +0900, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Oct 02, 2012 at 08:00:20PM -0700, Kent Overstreet wrote:
> > > However, I don't think it's a good idea to try to implement something
> > > which is a neutral transport of opaque data between userland and lower
> > > layers. Things like that sound attractive with unlimited
> > > possibilities but reality seems to have the tendency to make a big
> > > mess out of setups like that.
> >
> > I don't see how the "neutral transport of opaque data" itself is a bad
> > thing. We want something simple and sane to build actual interfaces on
> > top of - once we've got that, we can either build clean generic well
> > defined interfaces or we can make a mess like with ioctls :P
> >
> > It's like any other mechanism. There's good syscalls and bad syscalls...
>
> Depending on what a feature aims for, the design and implementation
> vary greatly. If you go for completely generic extensible stuff which
> can be used to warp space-time continuum, it's easy to end up with a
> monstrosity with generic and programmable parser, verifier, accessor
> and so on.

I don't think that's concrete enough that I can comment - I think this
is becoming too abstract.

You didn't have any complaints about the code when I posted it, and I
don't plan on making it much more complicated than that. I do think we
need explicit return values, but honestly that makes it less generic.


> > Say we implement an attr to control a block layer cache. That attr could
> > be parsed/validated in high level code (if there's any to do) - that I
> > don't object to. But the high level code isn't going to /know/ whether
> > there was any block cache in the stack that handled the attr. If the
> > attr is passed down to the block cache, that block cache can return that
> > it was handled.
>
> My point is that if it doesn't fit the generic abstract model as in
> fadvise(2), it probably isn't worth supporting in any generic manner.

How so? Do you mean the file range part? I think that's orthogonal to
the rest (the hints fadvise specifies could be used per IO or with a
file range like they are now), but the hints themselves are inadequate
for SSD caches.

> > > It's okay to allow some side channel thing for specific hacky uses but
> > > I really hope the general design were focused around properly
> > > abstracted attributes which can be understood and handled by the upper
> > > layer.
> >
> > Completely agreed. I want to leave that side channel open for
> > experimentation, and so we have a way of implementing one off hacky
> > stuff when we need to - but normal mainline stuff should be sane and
> > well designed.
>
> So, I think we can aim for something simple and modest (the only thing
> I can think of at the moment is task association) and provide simple
> framework which can be used for specific custom usages. Let's please
> not go overboard with generic parser / verifier which supports pointer
> indirection or whatnot.

I wasn't seriously proposing implementing a generic parser/verifier -
certainly not just for this, that was idle musing; all I'm saying is
that when an attr needs parsing/verification, that should be done in the
attr code, not driver code.
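That split could look like the following sketch: a per-id validate hook,
so drivers only ever see attrs that are already known to be well formed.
Ids, payload layouts and names are all made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct iocb_attr {
	uint32_t id;
	uint32_t size;	/* total size of this attr, header included */
	uint8_t  data[];
};

/* Per-attr validation: the ioprio attr checks its own payload. */
static bool validate_ioprio(const struct iocb_attr *attr)
{
	uint16_t prio;

	if (attr->size != sizeof(*attr) + sizeof(prio))
		return false;
	memcpy(&prio, attr->data, sizeof(prio));
	return prio < 8;	/* e.g. eight priority levels */
}

/* Table mapping attr id -> validate hook, owned by the attr code. */
static bool (*const attr_validators[])(const struct iocb_attr *) = {
	[1] = validate_ioprio,	/* hypothetical ATTR_IOPRIO */
};

static bool attr_validate(const struct iocb_attr *attr)
{
	size_t n = sizeof(attr_validators) / sizeof(attr_validators[0]);

	if (attr->id >= n || !attr_validators[attr->id])
		return false;	/* unknown attr id */
	return attr_validators[attr->id](attr);
}
```

The driver just calls attr_validate() (or relies on the submission path
having done so) and never parses payloads itself.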