2009-01-08 08:26:10

by Greg Banks

Subject: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

Instead of a single list which confusingly contains a mixture of
cache_request and cache_reader structures in various states, use two
separate lists.

Both new lists contain cache_request structures; the cache_reader
structure is eliminated. Its only purpose was to hold state which
supports partial reads of upcalls by userspace. However, the
implementation of partial reads is broken in the presence of the
multi-threaded rpc.mountd, in two different ways.

Firstly, the kernel code assumes that each reader uses a separate
struct file; because rpc.mountd fork()s *after* opening the cache
file descriptor this is not true. Thus the single struct file and
the single rp->offset field are shared between multiple threads.
Unfortunately rp->offset is not maintained in a safe manner. This can
lead to the BUG_ON() in cache_read() being tripped.

Secondly, even if the kernel code worked perfectly, it's sharing
a single offset between multiple reading rpc.mountd threads. If a
thread does a partial read, there's no way to match up the remaining
bytes in the upcall to the thread that read the initial part. So a
partial read will result in any second reading thread that comes
along being given a spurious part of an upcall. Both threads will
then fail to parse their respective mangled upcalls. At the very
least this will result in clients seeing NFS calls which triggered
an upcall being randomly dropped under load.
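
Concretely (the sizes here are only illustrative): if an upcall is 120
bytes long and one thread read()s only the first 100 of them, the next
thread to call read() is handed the tail as though it were a fresh
message:

    thread A: read(fd, buf, 100)  -> bytes   0..99  of upcall X
    thread B: read(fd, buf, 2048) -> bytes 100..119 of upcall X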

The "right" way to fix this would be to implement a primitive such as
recvmsg() that an rpc.mountd thread could use to atomically retrieve an
entire upcall message. However in this case we know that the size of
the messages is limited by existing code to PAGE_SIZE and by usage to
even less. We also know that gssd and recent rpc.mountd do 2048 byte
read()s, so partial reads should be unnecessary. These circumstances
should allow removing support for partial reads.
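
For illustration only -- this is not part of the patch and not the
actual rpc.mountd or gssd source, and the buffer size and function name
are assumptions -- a reader that never needs partial reads looks
roughly like this:

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#define UPCALL_BUFSZ 8192	/* comfortably >= PAGE_SIZE */

static void read_upcalls(const char *channel)
{
	char buf[UPCALL_BUFSZ];
	int fd = open(channel, O_RDWR);
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t n;

	if (fd < 0)
		return;
	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* each read() returns exactly one whole upcall, or 0 */
		n = read(fd, buf, sizeof(buf));
		if (n < 0)
			break;
		if (n == 0)
			continue;	/* another reader got there first */
		/* parse buf[0..n) and write() the reply back to fd */
	}
	close(fd);
}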

Having made that decision, we can remove struct cache_reader and
greatly simplify all the code that deals with the upcall queue and
with cache file descriptors.

Further, the old code kept, in its single list, cache_request objects
in two different states: waiting to be sent up to an rpc.mountd thread
in response to a read(), and waiting for a reply (or pre-emptive reply)
from an rpc.mountd thread which arrives in a write(). The difference
was tracked by some very gnarly code which relied on the relative
position of cache_reader and cache_request objects in the single list.
This code is very hard to understand and debug. The new code uses
two separate lists and much simpler logic.
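
Schematically, the life of a request under the new scheme (this is just
a summary of the code below):

    cache_make_upcall():  request is added to cd->to_read and any
                          readers are woken
    cache_read():         the whole message is copied to userspace in
                          one go, then the request is moved to
                          cd->to_write
    reply or expiry:      cache_remove_queued() takes the request off
                          whichever list it is on and frees it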

Signed-off-by: Greg Banks <[email protected]>
---

include/linux/sunrpc/cache.h | 3
net/sunrpc/cache.c | 246 ++++++++++++--------------------
2 files changed, 97 insertions(+), 152 deletions(-)

Index: bfields/include/linux/sunrpc/cache.h
===================================================================
--- bfields.orig/include/linux/sunrpc/cache.h
+++ bfields/include/linux/sunrpc/cache.h
@@ -97,7 +97,8 @@ struct cache_detail {

/* fields for communication over channel */
spinlock_t queue_lock;
- struct list_head queue;
+ struct list_head to_read;
+ struct list_head to_write;
wait_queue_head_t queue_wait;

struct proc_dir_entry *proc_ent;
Index: bfields/net/sunrpc/cache.c
===================================================================
--- bfields.orig/net/sunrpc/cache.c
+++ bfields/net/sunrpc/cache.c
@@ -361,7 +361,8 @@ int cache_register(struct cache_detail *
rwlock_init(&cd->hash_lock);
spin_lock_init(&cd->queue_lock);
init_waitqueue_head(&cd->queue_wait);
- INIT_LIST_HEAD(&cd->queue);
+ INIT_LIST_HEAD(&cd->to_read);
+ INIT_LIST_HEAD(&cd->to_write);
spin_lock(&cache_list_lock);
cd->nextcheck = 0;
cd->entries = 0;
@@ -659,106 +660,91 @@ void cache_clean_deferred(void *owner)
}

/*
- * communicate with user-space
+ * Caches communicate with user-space.
*
* We have a magic /proc file - /proc/sunrpc/<cachename>/channel.
- * On read, you get a full request, or block.
- * On write, an update request is processed.
+ *
+ * On read, you get a full request. If the length passed
+ * to read() is too short, you get nothing and the message is dropped,
+ * which is bad. So you should use a sufficiently large length,
+ * for example PAGE_SIZE. If there are no requests queued,
+ * read() returns 0.
+ *
+ * On write, an update is processed. This may, as a side effect,
+ * cause a previously queued request to be de-queued and removed.
+ * Userspace can also pre-emptively write updates which the kernel
+ * has not yet requested.
+ *
* Poll works if anything to read, and always allows write.
*
- * Implemented by linked list of requests. Each open file has
- * a ->private that also exists in this list. New requests are added
- * to the end and may wakeup and preceding readers.
- * New readers are added to the head. If, on read, an item is found with
- * CACHE_UPCALLING clear, we free it from the list.
+ * The channel is implemented by two linked lists of cache_request
+ * objects. cd->to_read is requests which have been generated in
+ * the kernel and are waiting for a userspace process to read them.
+ * cd->to_write is requests which have been read by userspace and
+ * are awaiting a reply to be written.
*
+ * Both lists are protected by cd->queue_lock.
*/

-struct cache_queue {
- struct list_head list;
- int reader; /* if 0, then request */
-};
struct cache_request {
- struct cache_queue q;
+ struct list_head list;
struct cache_head *item;
char * buf;
int len;
- int readers;
-};
-struct cache_reader {
- struct cache_queue q;
- int offset; /* if non-0, we have a refcnt on next request */
};

static ssize_t
cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
{
- struct cache_reader *rp = filp->private_data;
- struct cache_request *rq;
+ struct cache_request *rq = NULL;
struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
int err;

if (count == 0)
return 0;

- again:
+ /* de-queue the next request which is waiting to be read */
spin_lock(&cd->queue_lock);
- /* need to find next request */
- while (rp->q.list.next != &cd->queue &&
- list_entry(rp->q.list.next, struct cache_queue, list)
- ->reader) {
- struct list_head *next = rp->q.list.next;
- list_move(&rp->q.list, next);
- }
- if (rp->q.list.next == &cd->queue) {
- spin_unlock(&cd->queue_lock);
- BUG_ON(rp->offset);
- return 0;
+ if (!list_empty(&cd->to_read)) {
+ rq = container_of(cd->to_read.next, struct cache_request, list);
+ list_del_init(&rq->list);
}
- rq = container_of(rp->q.list.next, struct cache_request, q.list);
- BUG_ON(rq->q.reader);
- if (rp->offset == 0)
- rq->readers++;
spin_unlock(&cd->queue_lock);

- if (rp->offset == 0 && !test_bit(CACHE_PENDING, &rq->item->flags)) {
- err = -EAGAIN;
- spin_lock(&cd->queue_lock);
- list_move(&rp->q.list, &rq->q.list);
- spin_unlock(&cd->queue_lock);
- } else {
- if (rp->offset + count > rq->len)
- count = rq->len - rp->offset;
- err = -EFAULT;
- if (copy_to_user(buf, rq->buf + rp->offset, count))
- goto out;
- rp->offset += count;
- if (rp->offset >= rq->len) {
- rp->offset = 0;
- spin_lock(&cd->queue_lock);
- list_move(&rp->q.list, &rq->q.list);
- spin_unlock(&cd->queue_lock);
- }
- err = 0;
- }
- out:
- if (rp->offset == 0) {
- /* need to release rq */
- spin_lock(&cd->queue_lock);
- rq->readers--;
- if (rq->readers == 0 &&
- !test_bit(CACHE_PENDING, &rq->item->flags)) {
- list_del(&rq->q.list);
- spin_unlock(&cd->queue_lock);
- cache_put(rq->item, cd);
- kfree(rq->buf);
- kfree(rq);
- } else
- spin_unlock(&cd->queue_lock);
- }
- if (err == -EAGAIN)
- goto again;
- return err ? err : count;
+ if (rq == NULL)
+ return 0; /* no queued requests */
+
+ err = -EAGAIN; /* gnb:TODO...this used to cause a loop, wtf */
+ if (!test_bit(CACHE_PENDING, &rq->item->flags))
+ goto error;
+
+ /* gnb:TODO whine to dmesg; stat */
+ err = -EFAULT;
+ if (count < rq->len)
+ goto error; /* We make no pretence at handling short reads */
+ count = rq->len;
+
+ err = -EFAULT;
+ if (copy_to_user(buf, rq->buf, count))
+ goto error;
+
+ /*
+ * Done reading, append to the list of requests
+ * which are waiting for a write from userspace.
+ */
+ spin_lock(&cd->queue_lock);
+ list_add_tail(&rq->list, &cd->to_write);
+ spin_unlock(&cd->queue_lock);
+
+ return count;
+
+error:
+ /* need to release rq */
+ cache_put(rq->item, cd);
+ kfree(rq->buf);
+ kfree(rq);
+
+ return err;
}

static ssize_t
@@ -796,28 +782,21 @@ out:
static unsigned int
cache_poll(struct file *filp, poll_table *wait)
{
- unsigned int mask;
- struct cache_reader *rp = filp->private_data;
- struct cache_queue *cq;
+ unsigned int mask = 0;
struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;

poll_wait(filp, &cd->queue_wait, wait);

- /* alway allow write */
- mask = POLL_OUT | POLLWRNORM;
+ if (filp->f_mode & FMODE_WRITE)
+ mask = POLL_OUT | POLLWRNORM;

- if (!rp)
- return mask;
-
- spin_lock(&cd->queue_lock);
-
- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
+ if (filp->f_mode & FMODE_READ) {
+ spin_lock(&cd->queue_lock);
+ if (!list_empty(&cd->to_read))
mask |= POLLIN | POLLRDNORM;
- break;
- }
- spin_unlock(&cd->queue_lock);
+ spin_unlock(&cd->queue_lock);
+ }
+
return mask;
}

@@ -826,26 +805,23 @@ cache_ioctl(struct inode *ino, struct fi
unsigned int cmd, unsigned long arg)
{
int len = 0;
- struct cache_reader *rp = filp->private_data;
- struct cache_queue *cq;
+ struct cache_request *rq;
struct cache_detail *cd = PDE(ino)->data;

- if (cmd != FIONREAD || !rp)
+ if (cmd != FIONREAD)
+ return -EINVAL;
+ if (!(filp->f_mode & FMODE_READ))
return -EINVAL;

spin_lock(&cd->queue_lock);

- /* only find the length remaining in current request,
- * or the length of the next request
+ /* only find the length of the next request
*/
- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
- struct cache_request *rq =
- container_of(cq, struct cache_request, q);
- len = rq->len - rp->offset;
- break;
- }
+ if (!list_empty(&cd->to_read)) {
+ rq = container_of(cd->to_read.next, struct cache_request, list);
+ len = rq->len;
+ }
+
spin_unlock(&cd->queue_lock);

return put_user(len, (int __user *)arg);
@@ -854,51 +830,20 @@ cache_ioctl(struct inode *ino, struct fi
static int
cache_open(struct inode *inode, struct file *filp)
{
- struct cache_reader *rp = NULL;
-
nonseekable_open(inode, filp);
if (filp->f_mode & FMODE_READ) {
struct cache_detail *cd = PDE(inode)->data;
-
- rp = kmalloc(sizeof(*rp), GFP_KERNEL);
- if (!rp)
- return -ENOMEM;
- rp->offset = 0;
- rp->q.reader = 1;
atomic_inc(&cd->readers);
- spin_lock(&cd->queue_lock);
- list_add(&rp->q.list, &cd->queue);
- spin_unlock(&cd->queue_lock);
}
- filp->private_data = rp;
return 0;
}

static int
cache_release(struct inode *inode, struct file *filp)
{
- struct cache_reader *rp = filp->private_data;
struct cache_detail *cd = PDE(inode)->data;

- if (rp) {
- spin_lock(&cd->queue_lock);
- if (rp->offset) {
- struct cache_queue *cq;
- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
- container_of(cq, struct cache_request, q)
- ->readers--;
- break;
- }
- rp->offset = 0;
- }
- list_del(&rp->q.list);
- spin_unlock(&cd->queue_lock);
-
- filp->private_data = NULL;
- kfree(rp);
-
+ if (filp->f_mode & FMODE_READ) {
cd->last_close = get_seconds();
atomic_dec(&cd->readers);
}
@@ -918,26 +863,39 @@ static const struct file_operations cach
.release = cache_release,
};

+static struct cache_request *
+cache_queue_find_locked(struct list_head *listp, struct cache_head *h)
+{
+ struct cache_request *rq;
+
+ list_for_each_entry(rq, listp, list) {
+ if (rq->item == h)
+ return rq;
+ }
+ return NULL;
+}

static void cache_remove_queued(struct cache_detail *cd, struct cache_head *h)
{
- struct cache_queue *cq;
+ struct cache_request *rq;
+
+ /* find and de-queue */
spin_lock(&cd->queue_lock);
- list_for_each_entry(cq, &cd->queue, list)
- if (!cq->reader) {
- struct cache_request *rq = container_of(cq, struct cache_request, q);
- if (rq->item != h)
- continue;
- if (rq->readers != 0)
- continue;
- list_del(&rq->q.list);
- spin_unlock(&cd->queue_lock);
- cache_put(rq->item, cd);
- kfree(rq->buf);
- kfree(rq);
- return;
- }
+
+ rq = cache_queue_find_locked(&cd->to_read, h);
+ if (!rq)
+ rq = cache_queue_find_locked(&cd->to_write, h);
+ if (rq)
+ list_del(&rq->list);
+
spin_unlock(&cd->queue_lock);
+
+ /* if found, destroy */
+ if (rq) {
+ cache_put(rq->item, cd);
+ kfree(rq->buf);
+ kfree(rq);
+ }
}

/*
@@ -1063,13 +1021,11 @@ static int cache_make_upcall(struct cach
kfree(rq);
return -EAGAIN;
}
- rq->q.reader = 0;
rq->item = cache_get(h);
rq->buf = buf;
rq->len = PAGE_SIZE - len;
- rq->readers = 0;
spin_lock(&cd->queue_lock);
- list_add_tail(&rq->q.list, &cd->queue);
+ list_add_tail(&rq->list, &cd->to_read);
spin_unlock(&cd->queue_lock);
wake_up(&cd->queue_wait);
return 0;

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-08 19:57:53

by J. Bruce Fields

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

On Thu, Jan 08, 2009 at 07:25:20PM +1100, Greg Banks wrote:
> Instead of a single list which confusingly contains a mixture of
> cache_request and cache_reader structures in various states, use two
> separate lists.
>
> Both new lists contain cache_request structures, the cache_reader
> structure is eliminated. It's only purpose was to hold state which
> supports partial reads of upcalls from userspace. However the
> implementation of partial reads is broken in the presence of the
> multi-threaded rpc.mountd, in two different ways.
>
> Firstly, the kernel code assumes that each reader uses a separate
> struct file; because rpc.mountd fork()s *after* opening the cache
> file descriptor this is not true. Thus the single struct file and
> the single rp->offset field are shared between multiple threads.
> Unfortunately rp->offset is not maintained in a safe manner. This can
> lead to the BUG_ON() in cache_read() being tripped.

Hm, have you seen this happen?

All the current upcalls are small enough (and mountd's default read
buffer large enough) that messages should always be read in one gulp.

>
> Secondly, even if the kernel code worked perfectly it's sharing
> a single offset between multiple reading rpc.mountd threads.

I made what I suspect is a similar patch a while ago and never got
around to submitting it. (Arrgh! Bad me!) Appended below if it's of
any use to you to compare. (Happy to apply either your patch or mine
depending which looks better; I haven't tried to compare carefully.)

--b.

commit f948255274fdc8b5932ad4e88d5610c6ea3de9f7
Author: J. Bruce Fields <[email protected]>
Date: Wed Dec 5 14:52:51 2007 -0500

sunrpc: simplify dispatching of upcalls to file descriptors

If you want to handle a lot of upcall requests, especially requests that
may take a while (e.g. because they require talking to ldap or kerberos
servers, or spinning up a newly accessed disk), then you need to process
upcalls in parallel.

When there are multiple requests available for processing, we'd prefer
not to hand the same request to multiple readers, as that either wastes
resources or requires the upcall-processing code to detect such cases itself.

The current code keeps a single list containing all open file
descriptors and requests. Each file descriptor is advanced through the
list until it finds a request. Multiple requests may find the same
request at once, and requests are left on the list until they are
responded to. Under this scheme the likelihood of two readers getting
the same request is high.

You can mitigate the problem by reading from a single file descriptor
that's shared between processes (as mountd currently does), but that
requires the read to always provide a buffer large enough to get the
whole request in one atomic read. That's less practical for gss
init_sec_context calls, which could vary in size from a few hundred
bytes to 100k or so.

So, instead move requests that are being read off the common list and
into the file descriptor's private area, then move them back to the end
of the list after they're read.

That makes it a little less likely they'll be re-read by another process
immediately.

This also simplifies the code.

Signed-off-by: J. Bruce Fields <[email protected]>

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 636c8e0..5512fbb 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -670,39 +670,31 @@ void cache_clean_deferred(void *owner)
* On write, an update request is processed.
* Poll works if anything to read, and always allows write.
*
- * Implemented by linked list of requests. Each open file has
- * a ->private that also exists in this list. New requests are added
- * to the end and may wakeup and preceding readers.
- * New readers are added to the head. If, on read, an item is found with
- * CACHE_UPCALLING clear, we free it from the list.
- *
+ * Keep a list of requests associated with each cache descriptor.
+ * On read from an open file descriptor, pick the request at the
+ * head of the list and move it to the file descriptor's private area.
+ * After it is read, return it to the tail of the list. Keep requests
+ * on the list until we notice that they have been responded to (at
+ * which point CACHE_PENDING is cleared).
*/

static DEFINE_SPINLOCK(queue_lock);
static DEFINE_MUTEX(queue_io_mutex);

-struct cache_queue {
- struct list_head list;
- int reader; /* if 0, then request */
-};
struct cache_request {
- struct cache_queue q;
+ struct list_head list;
struct cache_head *item;
char * buf;
+ int offset;
int len;
- int readers;
-};
-struct cache_reader {
- struct cache_queue q;
- int offset; /* if non-0, we have a refcnt on next request */
};

static ssize_t
cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
{
- struct cache_reader *rp = filp->private_data;
- struct cache_request *rq;
+ struct cache_request *rq = filp->private_data;
struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
+ struct list_head *queue = &cd->queue;
int err;

if (count == 0)
@@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
* readers on this file */
again:
- spin_lock(&queue_lock);
- /* need to find next request */
- while (rp->q.list.next != &cd->queue &&
- list_entry(rp->q.list.next, struct cache_queue, list)
- ->reader) {
- struct list_head *next = rp->q.list.next;
- list_move(&rp->q.list, next);
+ if (rq == NULL) {
+ if (!list_empty(queue)) {
+ spin_lock(&queue_lock);
+ rq = list_first_entry(queue, struct cache_request,
+ list);
+ list_del_init(&rq->list);
+ filp->private_data = rq;
+ spin_unlock(&queue_lock);
+ }
}
- if (rp->q.list.next == &cd->queue) {
- spin_unlock(&queue_lock);
+ if (rq == NULL) {
mutex_unlock(&queue_io_mutex);
- BUG_ON(rp->offset);
- return 0;
+ return -EAGAIN;
}
- rq = container_of(rp->q.list.next, struct cache_request, q.list);
- BUG_ON(rq->q.reader);
- if (rp->offset == 0)
- rq->readers++;
- spin_unlock(&queue_lock);

- if (rp->offset == 0 && !test_bit(CACHE_PENDING, &rq->item->flags)) {
+ if (!test_bit(CACHE_PENDING, &rq->item->flags))
err = -EAGAIN;
- spin_lock(&queue_lock);
- list_move(&rp->q.list, &rq->q.list);
- spin_unlock(&queue_lock);
- } else {
- if (rp->offset + count > rq->len)
- count = rq->len - rp->offset;
+ else {
+ if (rq->offset + count > rq->len)
+ count = rq->len - rq->offset;
err = -EFAULT;
- if (copy_to_user(buf, rq->buf + rp->offset, count))
+ if (copy_to_user(buf, rq->buf + rq->offset, count))
goto out;
- rp->offset += count;
- if (rp->offset >= rq->len) {
- rp->offset = 0;
- spin_lock(&queue_lock);
- list_move(&rp->q.list, &rq->q.list);
- spin_unlock(&queue_lock);
- }
+ rq->offset += count;
+ if (rq->offset >= rq->len)
+ rq->offset = 0;
err = 0;
}
out:
- if (rp->offset == 0) {
- /* need to release rq */
+ if (rq->offset == 0) {
+ filp->private_data = NULL;
spin_lock(&queue_lock);
- rq->readers--;
- if (rq->readers == 0 &&
- !test_bit(CACHE_PENDING, &rq->item->flags)) {
- list_del(&rq->q.list);
- spin_unlock(&queue_lock);
+ if (!test_bit(CACHE_PENDING, &rq->item->flags)) {
cache_put(rq->item, cd);
kfree(rq->buf);
kfree(rq);
} else
- spin_unlock(&queue_lock);
+ list_add_tail(&rq->list, queue);
+ spin_unlock(&queue_lock);
}
if (err == -EAGAIN)
goto again;
@@ -808,26 +785,19 @@ static unsigned int
cache_poll(struct file *filp, poll_table *wait)
{
unsigned int mask;
- struct cache_reader *rp = filp->private_data;
- struct cache_queue *cq;
struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
+ struct list_head *queue = &cd->queue;

poll_wait(filp, &queue_wait, wait);

/* alway allow write */
mask = POLL_OUT | POLLWRNORM;

- if (!rp)
- return mask;
-
spin_lock(&queue_lock);

- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
- mask |= POLLIN | POLLRDNORM;
- break;
- }
+ if (filp->private_data || !list_empty(queue))
+ mask |= POLLIN | POLLRDNORM;
+
spin_unlock(&queue_lock);
return mask;
}
@@ -837,11 +807,11 @@ cache_ioctl(struct inode *ino, struct file *filp,
unsigned int cmd, unsigned long arg)
{
int len = 0;
- struct cache_reader *rp = filp->private_data;
- struct cache_queue *cq;
+ struct cache_request *rq = filp->private_data;
struct cache_detail *cd = PDE(ino)->data;
+ struct list_head *queue = &cd->queue;

- if (cmd != FIONREAD || !rp)
+ if (cmd != FIONREAD)
return -EINVAL;

spin_lock(&queue_lock);
@@ -849,14 +819,11 @@ cache_ioctl(struct inode *ino, struct file *filp,
/* only find the length remaining in current request,
* or the length of the next request
*/
- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
- struct cache_request *cr =
- container_of(cq, struct cache_request, q);
- len = cr->len - rp->offset;
- break;
- }
+ if (!rq)
+ rq = list_first_entry(queue, struct cache_request, list);
+ if (rq)
+ len = rq->len - rq->offset;
+
spin_unlock(&queue_lock);

return put_user(len, (int __user *)arg);
@@ -865,51 +832,33 @@ cache_ioctl(struct inode *ino, struct file *filp,
static int
cache_open(struct inode *inode, struct file *filp)
{
- struct cache_reader *rp = NULL;
-
nonseekable_open(inode, filp);
if (filp->f_mode & FMODE_READ) {
struct cache_detail *cd = PDE(inode)->data;

- rp = kmalloc(sizeof(*rp), GFP_KERNEL);
- if (!rp)
- return -ENOMEM;
- rp->offset = 0;
- rp->q.reader = 1;
atomic_inc(&cd->readers);
- spin_lock(&queue_lock);
- list_add(&rp->q.list, &cd->queue);
- spin_unlock(&queue_lock);
}
- filp->private_data = rp;
+ filp->private_data = NULL;
return 0;
}

static int
cache_release(struct inode *inode, struct file *filp)
{
- struct cache_reader *rp = filp->private_data;
+ struct cache_request *rq = filp->private_data;
struct cache_detail *cd = PDE(inode)->data;
+ struct list_head *queue = &cd->queue;
+
+ filp->private_data = NULL;

- if (rp) {
+ if (rq) {
+ rq->offset = 0;
spin_lock(&queue_lock);
- if (rp->offset) {
- struct cache_queue *cq;
- for (cq= &rp->q; &cq->list != &cd->queue;
- cq = list_entry(cq->list.next, struct cache_queue, list))
- if (!cq->reader) {
- container_of(cq, struct cache_request, q)
- ->readers--;
- break;
- }
- rp->offset = 0;
- }
- list_del(&rp->q.list);
+ list_add_tail(&rq->list, queue);
spin_unlock(&queue_lock);

- filp->private_data = NULL;
- kfree(rp);
-
+ }
+ if (filp->f_mode & FMODE_READ) {
cd->last_close = get_seconds();
atomic_dec(&cd->readers);
}
@@ -932,20 +881,15 @@ static const struct file_operations cache_file_operations = {

static void queue_loose(struct cache_detail *detail, struct cache_head *ch)
{
- struct cache_queue *cq;
+ struct cache_request *rq;
spin_lock(&queue_lock);
- list_for_each_entry(cq, &detail->queue, list)
- if (!cq->reader) {
- struct cache_request *cr = container_of(cq, struct cache_request, q);
- if (cr->item != ch)
- continue;
- if (cr->readers != 0)
- continue;
- list_del(&cr->q.list);
+ list_for_each_entry(rq, &detail->queue, list)
+ if (rq->item == ch) {
+ list_del(&rq->list);
spin_unlock(&queue_lock);
- cache_put(cr->item, detail);
- kfree(cr->buf);
- kfree(cr);
+ cache_put(rq->item, detail);
+ kfree(rq->buf);
+ kfree(rq);
return;
}
spin_unlock(&queue_lock);
@@ -1074,13 +1018,12 @@ static int cache_make_upcall(struct cache_detail *detail, struct cache_head *h)
kfree(crq);
return -EAGAIN;
}
- crq->q.reader = 0;
crq->item = cache_get(h);
crq->buf = buf;
crq->len = PAGE_SIZE - len;
- crq->readers = 0;
+ crq->offset = 0;
spin_lock(&queue_lock);
- list_add_tail(&crq->q.list, &detail->queue);
+ list_add_tail(&crq->list, &detail->queue);
spin_unlock(&queue_lock);
wake_up(&queue_wait);
return 0;

2009-01-09 02:48:38

by Greg Banks

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

J. Bruce Fields wrote:
> On Thu, Jan 08, 2009 at 07:25:20PM +1100, Greg Banks wrote:
>
>> [...]
>> Firstly, the kernel code assumes that each reader uses a separate
>> struct file; because rpc.mountd fork()s *after* opening the cache
>> file descriptor this is not true. Thus the single struct file and
>> the single rp->offset field are shared between multiple threads.
>> Unfortunately rp->offset is not maintained in a safe manner. This can
>> lead to the BUG_ON() in cache_read() being tripped.
>>
>
> Hm, have you seen this happen?
>

Yes, at last count at least a half-dozen times with two separate
customers. For future reference, the SGI bug report is PV 988959.
Here's a representative crash report:

System dropped to KDB with the following crash string:

<4>kernel BUG at net/sunrpc/cache.c:578!
<4>rpc.mountd[50696]: bugcheck! 0 [1]
<4>nfsd: request from insecure port (10.150.3.54:911)!
<4>Modules linked in: af_packet svcxprt_rdma nfsd exportfs lockd<4>nfsd: request from insecure port (10.150.3.59:838)!
<4> nfs_acl sunrpc rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_umad iw_cxgb3 cxgb3 mlx4_ib mlx4_core iscsi_trgt crc32c libcrc32c mst_pciconf mst_pci binfmt_misc button ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables scsi_dump dump_blockdev dump_gzi zlib_deflate dump xfs_quota xfs_dmapi dmapi<4>nfsd: request from insecure port (10.150.3.84:744)!
<4>nfsd: request from insecure port (10.150.3.68:768)!
<4> nls_iso8859_1 loop sgi_xvm sgi_os_lib ib_ipoib ib_cm ib_sa ipv6 kdbm_vm kdbm_task kdbm_sched kdbm_pg mmtimer mca_recovery arsess xpmem xp numatools mspec shpchp pci_hotplug ib_mthca ide_cd ib_mad ehci_hcd ohci_hcd cdrom ib_core tg3 usbcore pata_sil680siimage xfs fan thermal processor ide_generic qla2xxx firmware_class mptfc scsi_transport_fc mptspi scsi_transport_spi sg mptsas mptscsih mptbase scsi_transport_sas qla1280 sata_vsc libata sgiioc4 ioc4 sd_mod scsi_mod ide_disk ide_core
<4>
<4>Pid: 50696, CPU 1, comm: rpc.mountd
<4>psr : 0000101008526010 ifs : 800000000000050d ip : [<a000000214362120>] Tainted: P U
<4>ip is at cache_read+0x2a0/0x840 [sunrpc]
<4>unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003
<4>rnat: 0000000000000000 bsps: 0000000000000000 pr : 95a95666659aa959
<4>ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
<4>csd : 0000000000000000 ssd : 0000000000000000
<4>b0 : a000000214362120 b6 : a000000100431640 b7 : a00000010000c320
<4>f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf
<4>f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000
<4>f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
<4>r1 : a000000100b31db0 r2 : e000006003118e6b r3 : a00000010087fad8
<4>r8 : 0000000000000029 r9 : ffffffffffff8e68 r10 : ffffffffffff0028
<4>r11 : 8000020310000380 r12 : e0000171ebb77e20 r13 : e0000171ebb70000
<4>r14 : 0000000000004000 r15 : a00000010087fad8 r16 : a00000010087fae0
<4>r17 : e0000171e575fe18 r18 : 0000000000000027 r19 : 0000020000000000
<4>r20 : 0fd0000000000000 r21 : 80000001fdc00820 r22 : 0000000000000004
<4>r23 : 0000000000000820 r24 : 80000001fdc00000 r25 : 0000000000000820
<4>r26 : 0000000000000000 r27 : e000006003118e6b r28 : e000006003118e68
<4>r29 : ffffffffffff8e68 r30 : e000006003120000 r31 : ffffffffffff0028
<4>
<4>Call Trace:
<4> [<a000000100014ce0>] show_stack+0x40/0xa0
<4> sp=e0000171ebb779b0 bsp=e0000171ebb71478
<4> [<a0000001000155e0>] show_regs+0x840/0x880
<4> sp=e0000171ebb77b80 bsp=e0000171ebb71420
<4> [<a000000100039460>] die+0x1c0/0x360
<4> sp=e0000171ebb77b80 bsp=e0000171ebb713d0
<4> [<a000000100039650>] die_if_kernel+0x50/0x80
<4> sp=e0000171ebb77ba0 bsp=e0000171ebb713a0
<4> [<a000000100559b80>] ia64_bad_break+0x260/0x540
<4> sp=e0000171ebb77ba0 bsp=e0000171ebb71378
<4> [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
<4> sp=e0000171ebb77c50 bsp=e0000171ebb71378
<4> [<a000000214362120>] cache_read+0x2a0/0x840 [sunrpc]
<4> sp=e0000171ebb77e20 bsp=e0000171ebb71310
<4> [<a0000001001615a0>] vfs_read+0x240/0x3e0
<4> sp=e0000171ebb77e20 bsp=e0000171ebb712c0
<4> [<a000000100161e10>] sys_read+0x70/0xe0
<4> sp=e0000171ebb77e20 bsp=e0000171ebb71248
<4> [<a00000010000c980>] ia64_ret_from_syscall+0x0/0x20
<4> sp=e0000171ebb77e30 bsp=e0000171ebb71248
<4> [<a000000000010620>] __kernel_syscall_via_break+0x0/0x20
<4> sp=e0000171ebb78000 bsp=e0000171ebb71248


Rpc.mountd is doing a read() on /proc/net/rpc/auth.unix.ip/channel. The
BUG_ON() tripped is here:

553 static ssize_t
554 cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
555 {
556 struct cache_reader *rp = filp->private_data;
557 struct cache_request *rq;
558 struct cache_detail *cd = PDE(filp->f_dentry->d_inode)->data;
...
567 spin_lock(&queue_lock);
568 /* need to find next request */
569 while (rp->q.list.next != &cd->queue &&
570 list_entry(rp->q.list.next, struct cache_queue, list)
571 ->reader) {
572 struct list_head *next = rp->q.list.next;
573 list_move(&rp->q.list, next);
574 }
575 if (rp->q.list.next == &cd->queue) {
576 spin_unlock(&queue_lock);
577 up(&queue_io_sem);
578 BUG_ON(rp->offset); <----- SPLAT!
579 return 0;
580 }


Note there's one obvious problem: the BUG_ON() is testing the condition
(rp->offset) outside any locking. In at least one dump rp->offset is
actually zero by the time the page got saved to the dump device, so I
suspect a race here.

> All the current upcalls are small enough (and mountd's default read
> buffer large enough) that messages should always be read in one gulp.
>
Indeed. This is why SGI shipped a workaround, comprising a backported
commit to mountd to use the larger read buffer size and thus avoid
triggering the partial read path. Unfortunately this seems not to have
helped. I *suspect* that the BUG_ON() is racing with another thread
running through this code later in the same function:

596 if (copy_to_user(buf, rq->buf + rp->offset, count))
597 goto out;
598 rp->offset += count;
599 if (rp->offset >= rq->len) {
600 rp->offset = 0;
601 spin_lock(&queue_lock);


Note that even if userspace does full sized reads, rp->offset briefly
becomes nonzero in code which is serialised by queue_io_mutex. The
BUG_ON() is not serialised by that lock.
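
One interleaving consistent with those dumps (schematic, and which
thread is which is of course illustrative) would be:

    thread A (lines 598-600, inside the serialised region):
            rp->offset += count;    /* now nonzero */
    thread B (past the unlock at line 577):
            BUG_ON(rp->offset);     /* sees the nonzero value: splat */
    thread A:
            rp->offset = 0;         /* too late, but explains why the
                                       dump later shows offset == 0 */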

The two customers to which this is occurring have in common a very large
set of NFS clients (>6000 in one case) all mounting over TCP/IPoIB at
the same time. This puts a lot of stress on the auth.unix.ip cache :-/

>
>> Secondly, even if the kernel code worked perfectly it's sharing
>> a single offset between multiple reading rpc.mountd threads.
>>
>
> I made what I suspect is a similar patch a while ago and never got
> around to submitting it. (Arrgh! Bad me!)
Heh, I do way too much of the same thing :-(

> Appended below if it's of
> any use to you to compare. (Happy to apply either your patch or mine
> depending which looks better; I haven't tried to compare carefully.)
>

Ok, I'll take a look at yours.
> [...]
>
> You can mitigate the problem by reading from a single file descriptor
> that's shared between processes (as mountd currently does), but that
> requires the read to always provide a buffer large enough to get the
> whole request in one atomic read. That's less practical for gss
> init_sec_context calls, which could vary in size from a few hundred
> bytes to 100k or so.
>
I'm confused -- doesn't the current cache_make_upcall() code allocate a
buffer of length PAGE_SIZE and not allow it to be resized?
> [...]
> -struct cache_queue {
> - struct list_head list;
> - int reader; /* if 0, then request */
> -};
> struct cache_request {
> - struct cache_queue q;
> + struct list_head list;
> struct cache_head *item;
> char * buf;
> + int offset;
> int len;
> - int readers;
> -};
> -struct cache_reader {
> - struct cache_queue q;
> - int offset; /* if non-0, we have a refcnt on next request */
> };
>
>
Looking very similar so far.

> static ssize_t
> cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> {
> - struct cache_reader *rp = filp->private_data;
> - struct cache_request *rq;
> + struct cache_request *rq = filp->private_data;
> struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
> + struct list_head *queue = &cd->queue;
> int err;
>
> if (count == 0)
> @@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
> * readers on this file */
>

Ah, so you still have a single global lock which is serialising all
reads and writes to all caches.

Also, I think you'd want to sample filp->private_data after taking
queue_io_mutex.
> + if (rq == NULL) {
> mutex_unlock(&queue_io_mutex);
> - BUG_ON(rp->offset);
>

Good riddance to that BUG_ON().
> - return 0;
> + return -EAGAIN;
> }
> - rq = container_of(rp->q.list.next, struct cache_request, q.list);
> - BUG_ON(rq->q.reader);
> - if (rp->offset == 0)
> - rq->readers++;
> - spin_unlock(&queue_lock);
>
> - if (rp->offset == 0 && !test_bit(CACHE_PENDING, &rq->item->flags)) {
> + if (!test_bit(CACHE_PENDING, &rq->item->flags))
> err = -EAGAIN;
> - spin_lock(&queue_lock);
> - list_move(&rp->q.list, &rq->q.list);
> - spin_unlock(&queue_lock);
> - } else {
> - if (rp->offset + count > rq->len)
> - count = rq->len - rp->offset;
> + else {
> + if (rq->offset + count > rq->len)
> + count = rq->len - rq->offset;
> err = -EFAULT;
> - if (copy_to_user(buf, rq->buf + rp->offset, count))
> + if (copy_to_user(buf, rq->buf + rq->offset, count))
> goto out;
>

Ok, so you try to keep partial read support but move the offset into the
upcall request itself. Interesting idea.

I think partial reads are Just Too Hard to do properly, i.e. without
risk of racy message corruption under all combinations of message size
and userspace behaviour. In particular I think your code will corrupt
upcall data if multiple userspace threads race to do partial reads on
the same struct file (as rpc.mountd is doing at SGI's customer sites).

For the same reasons I think the FIONREAD support is utterly pointless
(even though I preserved it).

But I still don't understand how this 100K message size number for gssd
can happen? If that's really necessary then the design equation changes
considerably. This seems to be the most significant difference between
our patches.

A smaller issue is that you keep a single list and use the value of the
CACHE_PENDING bit to tell the difference between states. I think this
could be made to work; however my technique of using two lists makes
most of the code even simpler at the small cost of having to do two list
searches in queue_loose().

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-09 02:57:18

by J. Bruce Fields

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > [...]
> >
> > You can mitigate the problem by reading from a single file descriptor
> > that's shared between processes (as mountd currently does), but that
> > requires the read to always provide a buffer large enough to get the
> > whole request in one atomic read. That's less practical for gss
> > init_sec_context calls, which could vary in size from a few hundred
> > bytes to 100k or so.
> >
> I'm confused -- doesn't the current cache_make_upcall() code allocate a
> buffer of length PAGE_SIZE and not allow it to be resized?

Yeah, sorry for the confusion: this was written as cleanup in
preparation for patches to support larger gss init_sec_context calls
needed for spkm3, which I'm told likes to send across entire certificate
trains in the initial NULL calls. (But the spkm3 work is stalled for
now).

--b.

2009-01-09 03:20:33

by Greg Banks

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

J. Bruce Fields wrote:
> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
>
>> J. Bruce Fields wrote:
>>
>>> [...]
>>>
>>> whole request in one atomic read. That's less practical for gss
>>> init_sec_context calls, which could vary in size from a few hundred
>>> bytes to 100k or so.
>>>
>>>
>> I'm confused -- doesn't the current cache_make_upcall() code allocate a
>> buffer of length PAGE_SIZE and not allow it to be resized?
>>
>
> Yeah, sorry for the confusion: this was written as cleanup in
> preparation for patches to support larger gss init_sec_context calls
> needed for spkm3, which I'm told likes to send across entire certificate
> trains in the initial NULL calls. (But the spkm3 work is stalled for
> now).
>
Aha.

So if at some point in the future we actually need to send 100K in an
upcall, I think we have two options:

a) support partial reads but do so properly:
- track offset in the cache_request
- also track reader's pid in the cache request so partially read
requests are matched to threads
- handle multiple requests being in a state where they have been
partially read
- handle the case where a thread dies after doing a partial read but
before finishing, so the request is left dangling
- handle the similar case where a thread does a partial read then fails
to ever finish the read without dying
- handle both the "multiple struct files, 1 thread per struct file" and
"1 struct file, multiple threads" cases cleanly

b) don't support partial reads but require userspace to do larger full
reads. I don't think 100K is too much to ask.

My patch does most of what we need for option b). Yours does some of
what we need for option a). Certainly a) is a lot more complex.
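
Just to make the cost of a) concrete, here is a rough sketch of the
extra per-request bookkeeping it would need (the field names are
hypothetical, this is not proposed code):

struct cache_request {
	struct list_head	list;
	struct cache_head	*item;
	char			*buf;
	int			len;
	/* partial-read state: */
	int			offset;		/* bytes already copied to userspace */
	pid_t			reader_pid;	/* thread that owns the partial read */
	unsigned long		read_started;	/* jiffies, so abandoned reads can be reaped */
};

plus code on every exit path (reader thread dies, closes the fd, or
simply never issues the follow-up read) to clear that state, otherwise
the request is stuck.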

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-09 16:54:11

by Chuck Lever III

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

On Jan 8, 2009, at 10:12 PM, Greg Banks wrote:
> J. Bruce Fields wrote:
>> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
>>
>>> J. Bruce Fields wrote:
>>>
>>>> [...]
>>>>
>>>> whole request in one atomic read. That's less practical for gss
>>>> init_sec_context calls, which could vary in size from a few
>>>> hundred
>>>> bytes to 100k or so.
>>>>
>>>>
>>> I'm confused -- doesn't the current cache_make_upcall() code
>>> allocate a
>>> buffer of length PAGE_SIZE and not allow it to be resized?
>>>
>>
>> Yeah, sorry for the confusion: this was written as cleanup in
>> preparation for patches to support larger gss init_sec_context calls
>> needed for spkm3, which I'm told likes to send across entire
>> certificate
>> trains in the initial NULL calls. (But the spkm3 work is stalled for
>> now).
>>
> Aha.
>
> So if at some point in the future we actually need to send 100K in an
> upcall, I think we have two options:
>
> a) support partial reads but do so properly:
> - track offset in the cache_request
> - also track reader's pid in the cache request so partially read
> requests are matched to threads
> - handle multiple requests being in a state where they have been
> partially read
> - handle the case where a thread dies after doing a partial read but
> before finishing, so the request is left dangling
> - handle the similar case where a thread does a partial read then
> fails
> to ever finish the read without dying
> - handle both the "multiple struct files, 1 thread per struct file"
> and
> "1 struct file, multiple threads" cases cleanly
>
> b) don't support partial reads but require userspace to do larger full
> reads. I don't think 100K is too much to ask.

How about:

c) Use an mmap like API to avoid copying 100K of data between user
space and kernel.

> My patch does most of what we need for option b). Yours does some of
> what we need for option a). Certainly a) is a lot more complex.
>
> --
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


2009-01-09 21:29:23

by J. Bruce Fields

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > Appended below if it's of
> > any use to you to compare. (Happy to apply either your patch or mine
> > depending which looks better; I haven't tried to compare carefully.)
> >
>
> Ok, I'll take a look at yours.

Thanks for doing this, by the way.

> > static ssize_t
> > cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> > {
> > - struct cache_reader *rp = filp->private_data;
> > - struct cache_request *rq;
> > + struct cache_request *rq = filp->private_data;
> > struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
> > + struct list_head *queue = &cd->queue;
> > int err;
> >
> > if (count == 0)
> > @@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
> > mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
> > * readers on this file */
> >
>
> Ah, so you still have a single global lock which is serialising all
> reads and writes to all caches.

Yes, making this per-cd seems sensible (though if the problem is
typically a single cache (auth_unix) then I don't know how significant a
help it is).

> Also, I think you'd want to sample filp->private_data after taking
> queue_io_mutex.

Whoops, yes.

> > + if (rq == NULL) {
> > mutex_unlock(&queue_io_mutex);
> > - BUG_ON(rp->offset);
> >
>
> Good riddance to that BUG_ON().
> > - return 0;
> > + return -EAGAIN;
> > }
> > - rq = container_of(rp->q.list.next, struct cache_request, q.list);
> > - BUG_ON(rq->q.reader);
> > - if (rp->offset == 0)
> > - rq->readers++;
> > - spin_unlock(&queue_lock);
> >
> > - if (rp->offset == 0 && !test_bit(CACHE_PENDING, &rq->item->flags)) {
> > + if (!test_bit(CACHE_PENDING, &rq->item->flags))
> > err = -EAGAIN;
> > - spin_lock(&queue_lock);
> > - list_move(&rp->q.list, &rq->q.list);
> > - spin_unlock(&queue_lock);
> > - } else {
> > - if (rp->offset + count > rq->len)
> > - count = rq->len - rp->offset;
> > + else {
> > + if (rq->offset + count > rq->len)
> > + count = rq->len - rq->offset;
> > err = -EFAULT;
> > - if (copy_to_user(buf, rq->buf + rp->offset, count))
> > + if (copy_to_user(buf, rq->buf + rq->offset, count))
> > goto out;
> >
>
> Ok, so you try to keep partial read support but move the offset into the
> upcall request itself. Interesting idea.
>
> I think partial reads are Just Too Hard to do properly, i.e. without
> risk of racy message corruption under all combinations of message size
> and userspace behaviour . In particular I think your code will corrupt
> upcall data if multiple userspace threads race to do partial reads on
> the same struct file (as rpc.mountd is doing at SGI's customer sites).

Yes, but what mountd's doing is just dumb, as far as I can tell; is
there any real reason not to just keep a separate open for each thread?

If we just tell userland to keep a number of opens equal to the number
of concurrent upcalls it wants to handle, then all of this becomes
very easy.
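
Sketch of what that looks like on the userland side (purely
illustrative, not current rpc.mountd code):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static void *upcall_worker(void *arg)
{
	/* each worker open()s the channel itself, so each one gets its
	 * own struct file and its own private read state */
	int fd = open((const char *)arg, O_RDWR);

	if (fd < 0)
		return NULL;
	/* ... poll()/read()/write() upcall loop goes here ... */
	close(fd);
	return NULL;
}

i.e. one pthread_create() (or one forked child doing its own open())
per concurrent upcall, all workers using the same channel path.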

Forget sharing a single struct file between tasks that do too-small
reads: we should make sure that we don't oops, but that's all.

> For the same reasons I think the FIONREAD support is utterly pointless
> (even though I preserved it).
>
> But I still don't understand how this 100K message size number for gssd
> can happen? If that's really necessary then the design equation changes
> considerably. This seems to be the most significant difference between
> our patches.

So, the somewhat depressing situation with spkm3, which was to be the
public-key-based gss mechanism for nfs: we (citi) implemented it (modulo
this problem and maybe one or two others), but found some problems along
the way that required revising the spec. I think there might have been
an implementation for one other platform too. But the ietf seems to
have given up on spkm3, and instead is likely to pursue a new mechanism,
pku2u. Nobody's implementing that yet. Consulting my local expert, it
sounds like pku2u will have some more reasonable bound on init-sec-ctx
token sizes. (Not sure if it'll be under a page, without checking, but
it shouldn't be much more than that, anyway).

So: the immediate pressure for larger upcalls is probably gone. And
anyway more changes would be required (the ascii-hex encoding and
upcall-downcall matching are very wasteful in this case, etc.), so we'd
likely end up using a different mechanism.

That said, I think it's easy enough to handle just the case of multiple
reads on unshared struct files that it might make sense to keep that
piece.

> A smaller issue is that you keep a single list and use the value of the
> CACHE_PENDING bit to tell the difference between states. I think this
> could be made to work; however my technique of using two lists makes
> most of the code even simpler at the small cost of having to do two list
> searches in queue_loose().

OK. When exactly do they get moved between lists? I'll take a look at
the code.

--b.

2009-01-09 21:41:47

by J. Bruce Fields

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

On Fri, Jan 09, 2009 at 04:29:21PM -0500, bfields wrote:
> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
> > A smaller issue is that you keep a single list and use the value of the
> > CACHE_PENDING bit to tell the difference between states. I think this
> > could be made to work; however my technique of using two lists makes
> > most of the code even simpler at the small cost of having to do two list
> > searches in queue_loose().
>
> OK. When exactly do they get moved between lists? I'll take a look at
> the code.

The one thing I'd be curious about here would be robustness in the face
of a userland daemon that was restarted: would requests marked as read
but not responded to be stuck there indefinitely at the time the daemon
went down? That wouldn't be fatal, since if nothing else the client
will retry eventually, but it might lead to some unnecessary delays if
the admin needed to restart a daemon for some reason.

--b.

2009-01-09 23:37:37

by Greg Banks

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

J. Bruce Fields wrote:
> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
>
>> J. Bruce Fields wrote:
>>
>>
>
>>> static ssize_t
>>> cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
>>> {
>>> - struct cache_reader *rp = filp->private_data;
>>> - struct cache_request *rq;
>>> + struct cache_request *rq = filp->private_data;
>>> struct cache_detail *cd = PDE(filp->f_path.dentry->d_inode)->data;
>>> + struct list_head *queue = &cd->queue;
>>> int err;
>>>
>>> if (count == 0)
>>> @@ -711,60 +703,45 @@ cache_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
>>> mutex_lock(&queue_io_mutex); /* protect against multiple concurrent
>>> * readers on this file */
>>>
>>>
>> Ah, so you still have a single global lock which is serialising all
>> reads and writes to all caches.
>>
>
> Yes, making this per-cd seems sensible (though if the problem is
> typically a single cache (auth_unix) then I don't know how significant a
> help it is).
>
The usual pattern of traffic I see (on a SLES10 platform with the older
set of export caches) when the first NFS packet arrives during a client
mount is:

- upcall to mountd via auth.unix.ip cache
- mountd writes pre-emptively to nfsd.export and nfsd.expkey caches
- mountd writes the reply to the auth.unix.ip cache

So it's not just a single cache.

However, I have no measurements for any performance improvement. Based
on earlier experience I believe the elapsed mounting time to be
dominated by the latency of the forward and reverse DNS lookup that
mountd does, so the improvement is probably small.

>
>>
>> I think partial reads are Just Too Hard to do properly, i.e. without
>> risk of racy message corruption under all combinations of message size
>> and userspace behaviour . In particular I think your code will corrupt
>> upcall data if multiple userspace threads race to do partial reads on
>> the same struct file (as rpc.mountd is doing at SGI's customer sites).
>>
>
> Yes, but what mountd's doing is just dumb, as far as I can tell; is
> there any real reason not to just keep a separate open for each thread?
>
None at all. The current rpc.mountd behaviour is a historical accident
of the "look, we can put a fork() here and everything will Just Work"
variety. I was hoping to avoid changes to the current userspace
behaviour to limit deployment hassles with shipping a fix, but
ultimately it can be changed.

> If we just tell userland to keep a number of opens equal to the number
> of concurrent upcalls it wants to handle, and then all of this becomes
> very easy.
>
If we put that requirement on userspace, and partial reads are still
necessary, then your approach of using filp->private_data as a parking
spot for requests currently being read would be the right way to go.

> Forget sharing a single struct file between tasks that do too-small
> reads: we should make sure that we don't oops, but that's all.
>
We should definitely not oops :-) Consistently delivering correct
messages to userspace would be nice too.

> So, the somewhat depressing situation with spkm3, which was to be the
> public-key-based gss mechanism for nfs: we (citi) implemented it (modulo
> this problem and maybe one or two others), but found some problems along
> the way that required revising the spec.
This is frequently the way with specs :-/

> [...]
> So: the immediate pressure for larger upcalls is probably gone.
Sweet.

> That said, I think it's easy enough to handle just the case of multiple
> reads on unshared struct files that it might make sense to keep that
> piece.
>
>
>> A smaller issue is that you keep a single list and use the value of the
>> CACHE_PENDING bit to tell the difference between states. [...]
>>
>
> OK. When exactly do they get moved between lists?
Immediately after being copied out to userspace in cache_read().

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-09 23:48:42

by Greg Banks

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

J. Bruce Fields wrote:
> On Fri, Jan 09, 2009 at 04:29:21PM -0500, bfields wrote:
>
>> On Fri, Jan 09, 2009 at 01:40:47PM +1100, Greg Banks wrote:
>>
>>> A smaller issue is that you keep a single list and use the value of the
>>> CACHE_PENDING bit to tell the difference between states. I think this
>>> could be made to work; however my technique of using two lists makes
>>> most of the code even simpler at the small cost of having to do two list
>>> searches in queue_loose().
>>>
>> OK. When exactly do they get moved between lists? I'll take a look at
>> the code.
>>
>
> The one thing I'd be curious about here would be robustness in the face
> of a userland daemon that was restarted:
Fair enough.
> would requests marked as read
> but not responded to be stuck there indefinitely at the time the daemon
> went down?
I don't think so. Requests on the to_write list have the CACHE_PENDING
bit set, and that will trigger a call to cache_remove_queued() in either
of cache_fresh_unlocked() or cache_clean(). The latter will happen when
the cache_head expires or when the .../flush file is written, e.g. the
next exportfs change. One of those should happen if the daemons are
started normally.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


2009-01-10 01:36:33

by Greg Banks

Subject: Re: [patch 10/14] sunrpc: Reorganise the queuing of cache upcalls.

Chuck Lever wrote:
> On Jan 8, 2009, at 10:12 PM, Greg Banks wrote:
>>
>> a) support partial reads but do so properly: [...]
>>
>> b) don't support partial reads but require userspace to do larger full
>> reads. I don't think 100K is too much to ask.
>
> How about:
>
> c) Use an mmap like API to avoid copying 100K of data between user
> space and kernel.

I think that would be a significant complexity hit, and even at 100K the
size of the data just doesn't warrant it.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.