2007-08-12 13:39:46

by Fredrik Noring

Subject: Improving read/write/close system call reliability when used with pthreads

Hello!

The attached patch attempts to improve read/write/close system call
reliability when used with pthreads and pipes. At least two issues
are handled:

1. Read/write system calls hanging indefinitely when the fd is closed
by another thread.

2. Read/write fd (reuse) alias problems in combination with -ERESTARTSYS.

For (1), a simple scenario has two threads: a reader and a watchdog.
The reader thread is stuck in a blocking read of an unreliable pipe.
The watchdog thread is designed to cancel the blocking read using
close on the reader's fd.

Unfortunately, with the current kernel, the reader thread is not
woken up by the close call issued by the watchdog. Instead, the
reader thread remains stuck unless some unrelated signal happens to
arrive, in which case it wakes up and terminates with -EBADF. This is
also the current workaround: have the process signal itself after the
close call in the watchdog thread.
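
A minimal sketch of that workaround in C (the reader thread id, the
signal number and the no-op handler are illustrative, not part of the
attached programs):

static void noop(int sig)
{
	/* Empty on purpose: the signal exists only to interrupt read. */
}

/* In the watchdog thread, with noop installed for SIGUSR1 via
 * sigaction() and SA_RESTART left unset: */
close(reader_fd);                  /* does not wake the reader... */
pthread_kill(reader_tid, SIGUSR1); /* ...so interrupt it by hand; the
                                      read then fails with EBADF */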

For (2), the scenario is the same as (1) above, with an open system
call added between the close and the signal. Since the read fd has
been closed, open reuses it for a completely unrelated file, and when
the signal finally wakes up the blocked read, it starts reading that
file instead.

In both of these cases, I believe it would be desirable to have read
terminate immediately with -EBADF. Similar scenarios can be made for
the write system call (and perhaps others as well).

The attached sample programs (watchdog.c and aliasing.c) demonstrate
the effects.

Please regard the attached patch as a proof of concept for trying to
solve these problems. The FIXME certainly needs more thought, as do
the locking scheme, the performance impact, related fd issues, and
the approach taken in general. I'd be delighted if these problems
could be sorted out. :-)

Many thanks,
Fredrik



Notes regarding the patch:

The patch adds a required_fds list to task_struct. System calls such
as read and write each register their fd in this list, since the fd
is required for successful completion of the call. When close is
called, it scans the required_fds lists and marks closed fds with
-EBADF, and also notifies the file through the new
file->f_op->closing_fd file operation. This allows files to wake up
waiting read/write calls. When read/write wakes up, it checks its
required_fds list and cancels if its fd has been marked -EBADF.

fs/open.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/pipe.c | 28 ++++++++++++++++++++
fs/read_write.c | 6 ++++
include/linux/fs.h | 5 +++
include/linux/init_task.h | 1
include/linux/sched.h | 13 ++++++++-
kernel/fork.c | 1
7 files changed, 115 insertions(+), 2 deletions(-)



Notes regarding POSIX:

Whether or not close may cancel I/O operations appears to be
implementation-defined, if I interpret the following section
correctly. Or does anyone know more specific details?

The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Copyright 2001-2004 The IEEE and The Open Group, All Rights reserved.

When there is an outstanding cancelable asynchronous I/O operation
against fildes when close() is called, that I/O operation may be
canceled. An I/O operation that is not canceled completes as if the
close() operation had not yet occurred. All operations that are not
canceled shall complete as if the close() blocked until the
operations completed. The close() operation itself need not block
awaiting such I/O completion. Whether any I/O operation is canceled,
and which I/O operation may be canceled upon close(), is
implementation-defined.



watchdog.c:

#include <sys/types.h>
#include <sys/uio.h>
#include <pthread.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <errno.h>

int inout[2] = { -1, -1 };	/* inout[0]: read end, inout[1]: write end */

/* Watchdog thread: closes the reader's fd after one second. */
void *watchdog(void *arg)
{
	fprintf(stderr, "Sleeping...\n");
	sleep(1);
	fprintf(stderr, "Closing...\n");
	close(inout[0]);
	fprintf(stderr, "Closed.\n");
	return NULL;
}

/* No-op handler: its only purpose is to interrupt the blocking read. */
void wakeup(int sig)
{
	fprintf(stderr, "Alarmed.\n");
}

int main(int argc, char *argv[])
{
	pthread_t th;
	char buf[1];
	ssize_t r;

	if (pipe(inout) == -1) {
		perror("pipe");
		return 1;
	}
	pthread_create(&th, NULL, watchdog, NULL);
	signal(SIGALRM, wakeup);
	alarm(5);	/* safety net: wakes the read if close does not */

	fprintf(stderr, "Reading...\n");
	r = read(inout[0], buf, sizeof(buf));
	if (r == -1)
		perror("read");
	fprintf(stderr, "Read done.\n");

	pthread_join(th, NULL);
	fprintf(stderr, "Exit.\n");

	return 0;
}
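
(To try it, something like "cc -pthread watchdog.c -o watchdog &&
./watchdog" should do. On an affected kernel, "Read done." appears
only after the five-second alarm fires; with the patch it should
follow "Closed." immediately.)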



aliasing.c:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <pthread.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <errno.h>

int inout[2] = { -1, -1 };	/* inout[0]: read end, inout[1]: write end */

/* Aliasing thread: closes the reader's fd, then immediately opens an
 * unrelated file, which reuses the just-freed descriptor number. */
void *aliasing(void *arg)
{
	fprintf(stderr, "Sleeping...\n");
	sleep(1);
	fprintf(stderr, "Closing...\n");
	close(inout[0]);
	fprintf(stderr, "Closed.\n");
	fprintf(stderr, "Opening...\n");
	open(__FILE__, O_RDONLY);
	fprintf(stderr, "Opened.\n");
	return NULL;
}

/* No-op handler: its only purpose is to interrupt the blocking read. */
void wakeup(int sig)
{
	fprintf(stderr, "Alarmed.\n");
}

int main(int argc, char *argv[])
{
	pthread_t th;
	char buf[1];
	ssize_t r;

	if (pipe(inout) == -1) {
		perror("pipe");
		return 1;
	}
	pthread_create(&th, NULL, aliasing, NULL);
	signal(SIGALRM, wakeup);
	alarm(5);	/* wakes the read, which by then holds a stale fd */

	fprintf(stderr, "Reading...\n");
	r = read(inout[0], buf, 1);
	if (r == -1)
		perror("read");
	else
		fprintf(stderr, "Alias read!\n");
	fprintf(stderr, "Read done.\n");

	pthread_join(th, NULL);
	fprintf(stderr, "Exit.\n");

	return 0;
}
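
(Built and run the same way, this should print "Alias read!" after
the alarm on an affected kernel: the restarted read consumes a byte
from aliasing.c itself, the unrelated file that reused the
descriptor.)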



required-fds.patch (not sure Apple Mail can handle this properly
though...):

--- linux-2.6.19-gentoo-r5/include/linux/init_task.h	2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/include/linux/init_task.h	2007-08-12 03:51:34.000000000 +0200
@@ -126,6 +126,7 @@
.thread = INIT_THREAD, \
.fs = &init_fs, \
.files = &init_files, \
+ .required_fds = LIST_HEAD_INIT(tsk.required_fds), \
.signal = &init_signals, \
.sighand = &init_sighand, \
.nsproxy = &init_nsproxy, \
--- linux-2.6.19-gentoo-r5/include/linux/sched.h	2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/include/linux/sched.h	2007-08-12 13:47:44.000000000 +0200
@@ -907,6 +907,8 @@
struct fs_struct *fs;
/* open file information */
struct files_struct *files;
+/* file descriptors required to complete current I/O operation successfully */
+	struct list_head required_fds;
/* namespaces */
struct nsproxy *nsproxy;
/* signal handlers */
@@ -930,7 +932,7 @@
/* Thread group tracking */
u32 parent_exec_id;
u32 self_exec_id;
-/* Protection of (de-)allocation: mm, files, fs, tty, keyrings */
+/* Protection of (de-)allocation: mm, files, required_fds, fs, tty, keyrings */
spinlock_t alloc_lock;
/* Protection of the PI data structures: */
@@ -1025,6 +1027,13 @@
#endif
};
+#define REQUIRED_FD_INIT(fd) { .fd = fd }
+
+struct required_fd {
+	struct list_head list;
+	int fd;
+};
+
static inline pid_t process_group(struct task_struct *tsk)
{
return tsk->signal->pgrp;
@@ -1429,7 +1438,7 @@
(thread_group_leader(p) && !thread_group_empty(p))
/*
- * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
+ * Protects ->fs, ->files, ->mm, ->group_info, ->comm, ->required_fds, keyring
* subscriptions and synchronises with wait4(). Also used in procfs. Also
* pins the final release of task.io_context. Also protects ->cpuset.
*
--- linux-2.6.19-gentoo-r5/include/linux/fs.h	2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/include/linux/fs.h	2007-08-12 13:27:02.000000000 +0200
@@ -1120,6 +1120,7 @@
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
+	void (*closing_fd) (struct inode *, struct file *, struct files_struct *files, unsigned int fd);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
@@ -1497,6 +1498,10 @@
int mode);
extern struct file *filp_open(const char *, int, int);
extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
+extern int required_fds_are_bad(struct task_struct *task);
+extern void add_required_fd(struct task_struct *task,
+ struct required_fd *req_fd);
+extern void del_required_fds(struct task_struct *task);
extern int filp_close(struct file *, fl_owner_t id);
extern char * getname(const char __user *);
--- linux-2.6.19-gentoo-r5/kernel/fork.c	2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/kernel/fork.c	2007-08-12 13:12:59.000000000 +0200
@@ -1156,6 +1156,7 @@
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespaces;
+ INIT_LIST_HEAD(&p->required_fds);
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
/*
--- linux-2.6.19-gentoo-r5/fs/read_write.c	2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/fs/read_write.c	2007-08-12 13:19:59.000000000 +0200
@@ -355,10 +355,12 @@
asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
{
+ struct required_fd req_fd = REQUIRED_FD_INIT(fd);
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;
+ add_required_fd(current, &req_fd);
file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
@@ -366,6 +368,7 @@
file_pos_write(file, pos);
fput_light(file, fput_needed);
}
+ del_required_fds(current);
return ret;
}
@@ -373,10 +376,12 @@
asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf, size_t count)
{
+ struct required_fd req_fd = REQUIRED_FD_INIT(fd);
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;
+ add_required_fd(current, &req_fd);
file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
@@ -384,6 +389,7 @@
file_pos_write(file, pos);
fput_light(file, fput_needed);
}
+ del_required_fds(current);
return ret;
}
--- linux-2.6.19-gentoo-r5/fs/open.c 2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/fs/open.c	2007-08-12 16:30:03.000000000 +0200
@@ -2,6 +2,9 @@
* linux/fs/open.c
*
* Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * 2007-08-12 Implemented required_fds functionality to close files
+ * reliably with pthreads, by Fredrik Noring <[email protected]>.
*/
#include <linux/string.h>
@@ -1015,6 +1018,53 @@
#endif
+void mark_required_fd_bad(struct task_struct *task, unsigned int fd)
+{
+	struct required_fd *req_fd;
+
+	task_lock(task);
+	list_for_each_entry(req_fd, &task->required_fds, list)
+		if (req_fd->fd == fd)
+			req_fd->fd = -EBADF;
+	task_unlock(task);
+}
+
+int required_fds_are_bad(struct task_struct *task)
+{
+	struct required_fd *req_fd;
+	int ret = 0;
+
+	task_lock(task);
+	list_for_each_entry(req_fd, &task->required_fds, list)
+		if (req_fd->fd == -EBADF) {
+			ret = 1;
+			break;
+		}
+	task_unlock(task);
+
+	return ret;
+}
+
+EXPORT_SYMBOL(required_fds_are_bad);
+
+void add_required_fd(struct task_struct *task, struct required_fd *req_fd)
+{
+	task_lock(task);
+	list_add(&req_fd->list, &task->required_fds);
+	task_unlock(task);
+}
+
+EXPORT_SYMBOL(add_required_fd);
+
+void del_required_fds(struct task_struct *task)
+{
+	task_lock(task);
+	INIT_LIST_HEAD(&task->required_fds);
+	task_unlock(task);
+}
+
+EXPORT_SYMBOL(del_required_fds);
+
/*
* "id" is the POSIX thread ID. We use the
* files pointer for this..
@@ -1048,9 +1098,17 @@
{
struct file * filp;
struct files_struct *files = current->files;
+ struct task_struct *task;
struct fdtable *fdt;
int retval;
+ /* FIXME: Proof of concept but obviously inefficient. Mark
+ * within f_op->closing_fd instead, where we know the tasks
+ * that are really interested in learning this? */
+	for_each_process(task)
+		if (task->files == files)
+			mark_required_fd_bad(task, fd);
+
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
@@ -1062,6 +1120,10 @@
FD_CLR(fd, fdt->close_on_exec);
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
+	if (filp->f_op && filp->f_op->closing_fd) {
+		struct inode *inode = filp->f_dentry->d_inode;
+		filp->f_op->closing_fd(inode, filp, files, fd);
+	}
retval = filp_close(filp, files);
/* can't restart close syscall because file table entry was cleared */
--- linux-2.6.19-gentoo-r5/fs/pipe.c 2007-07-12 22:03:14.000000000 +0200
+++ linux-2.6.19-required-fds/fs/pipe.c	2007-08-12 13:43:13.000000000 +0200
@@ -316,6 +316,11 @@
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
+	if (required_fds_are_bad(current)) {
+		if (!ret)
+			ret = -EBADF;
+		break;
+	}
pipe_wait(pipe);
}
mutex_unlock(&inode->i_mutex);
@@ -488,6 +493,11 @@
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
do_wakeup = 0;
}
+	if (required_fds_are_bad(current)) {
+		if (!ret)
+			ret = -EBADF;
+		break;
+	}
pipe->waiting_writers++;
pipe_wait(pipe);
pipe->waiting_writers--;
@@ -576,6 +586,18 @@
return mask;
}
+static void
+pipe_closing_fd(struct inode *inode, struct file *file,
+		struct files_struct *files, unsigned int fd)
+{
+	struct pipe_inode_info *pipe;
+
+	mutex_lock(&inode->i_mutex);
+	pipe = inode->i_pipe;
+	wake_up_interruptible(&pipe->wait);
+	mutex_unlock(&inode->i_mutex);
+}
+
static int
pipe_release(struct inode *inode, int decr, int decw)
{
@@ -727,6 +749,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_read_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_read_release,
.fasync = pipe_read_fasync,
};
@@ -739,6 +762,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_write_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_write_release,
.fasync = pipe_write_fasync,
};
@@ -752,6 +776,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_rdwr_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_rdwr_release,
.fasync = pipe_rdwr_fasync,
};
@@ -764,6 +789,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_read_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_read_release,
.fasync = pipe_read_fasync,
};
@@ -776,6 +802,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_write_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_write_release,
.fasync = pipe_write_fasync,
};
@@ -789,6 +816,7 @@
.poll = pipe_poll,
.ioctl = pipe_ioctl,
.open = pipe_rdwr_open,
+ .closing_fd = pipe_closing_fd,
.release = pipe_rdwr_release,
.fasync = pipe_rdwr_fasync,
};


2007-08-12 14:17:33

by Paul Jackson

Subject: Re: Improving read/write/close system call reliability when used with pthreads

Fredrik wrote:
> required-fds.patch (not sure Apple Mail can handle this properly
> though...):

You suspected correctly - Apple Mail line-wrapped it at 72 columns
(it added newlines to any line going past column 72). This makes it
impossible to apply the patch without a lot of hand editing.

As you probably also suspect (correctly, again), given the volume of
patches we handle on lkml, the burden is on the submitter to get the
patch format correct, so it applies cleanly.

You will have to experiment some, sending the patch to yourself and
making sure that it applies cleanly. I'm not an Apple Mail expert, so
I can't help; sorry. Perhaps some other email program...?


For example, this line wrapped:

fs/open.c | 63 +++++++++++++++++++++++++++++++++++++
+++++++++

and more critically, in the patch, many lines, such as:

+/* file descriptors required to complete current I/O operation
successfully */
+ struct list_head required_fds;



--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-08-12 15:06:48

by Alan

Subject: Re: Improving read/write/close system call reliability when used with pthreads

> For (1), a simple scenario has two threads: a reader and a watchdog.
> The reader thread is stuck in a blocking read of an unreliable pipe.
> The watchdog thread is designed to cancel the blocking read using
> close on the reader's fd.

The reader still has the fd open so the fd is not yet closed. This seems
completely sane to me. Closing the *writer* fd would create an EOF which
is the proper response.
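
Applied to watchdog.c from the first mail, that would be a one-line
change in the watchdog thread (a sketch, not part of the posted
programs):

close(inout[1]);	/* close the *write* end instead: the blocked
			   read() wakes up immediately and returns 0 (EOF) */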

Alan

2007-08-12 16:34:38

by Fredrik Noring

Subject: Re: Improving read/write/close system call reliability when used with pthreads

Alan,

Alan Cox wrote:
> The reader still has the fd open so the fd is not yet closed. This seems
> completely sane to me. Closing the *writer* fd would create an EOF which
> is the proper response.

The fd is removed from the file descriptor table, but, sure, the
file (as in the kernel struct file pointer) is still valid. The
problem is that the fd used when entering the system call, as an
alias for the file pointer, is no longer valid. As a consequence,
-ERESTARTSYS no longer works reliably.

The second problem is that a process may not always have access to
the writer fd and therefore may not be able to close it. So when
closing the reader fd, the process has to rely on the following:

- Having a signal such as SIGALRM wake up the reader (in case the
writer misbehaves indefinitely). This results in -EBADF being
returned, provided:

- The fd has not been reused in the meantime (in which case the
reader will likely begin reading the wrong file when it wakes up).

Applications can take either of two approaches:

1) Never close the reader fd. But wouldn't it be practical to be
able to build things like a threaded watchdog?

2) Do close the reader fd. But what results can applications then
reliably expect? What would be the sane intention of an application
closing the reader fd? Do programmers expect all of the current
outcomes?

POSIX appears to leave it "implementation-defined" provided I
interpret this correctly, of course. So wouldn't it be great to make
one of the current results a reliable feature?

(A "reader" could be a "writer" above for similar cases.)

All the best,
Fredrik

2007-08-12 18:31:56

by Fredrik Noring

Subject: Re: [PATCH] Improving read/write/close system call reliability when used with pthreads

Paul,

Paul Jackson wrote:
> You suspected correctly - Apple Mail line wrapped it at 72 columns
> (added newlines in any line going past column 72.) This makes it
> impossible to apply the patch without alot of hand editing.

Of course, sorry about that. New try!

Regarding the FIXME: Marking bad fds in close currently walks all
tasks, but it ought to be possible to do this much more efficiently,
for example by considering only the tasks that actually use the file
table in question. Alternatively, only the tasks that are waiters in
the pipe_inode_info->wait queue (they are the ones that are woken up
to handle the close anyway; the wakeup, by the way, can perhaps also
be made more fine-grained).

I have a feeling that the latter is the most efficient, but perhaps
a bit more tricky regarding races. It also puts the burden of marking
bad fds onto every implementation of the f_op->closing_fd file
operation.

The rest of the bookkeeping to handle required_fds should be
efficient, I believe.

Regarding generality: For the maximum benefit of being able to close
all kinds of fds reliably, even non-pipe ones, a lot of places would
need to be updated. This is not a requirement, though, as current
behaviour is otherwise maintained. (It might, however, be worthwhile
to have sys_read/sys_write etc. refuse -ERESTARTSYS when their fd
apparently is bad.)

Cheers,
Fredrik


Attachments:
required-fds.patch (8.91 kB)

2007-08-12 18:47:56

by Alan

Subject: Re: Improving read/write/close system call reliability when used with pthreads

> POSIX appears to leave it "implementation-defined" provided I
> interpret this correctly, of course. So wouldn't it be great to make
> one of the current results a reliable feature?

Given that 99.99% of programs don't appear to care, and you
materially slow down a critical path for every read and write, I'm
skeptical.

Teaching the pipe code specifically to behave more nicely in this
case might be worthwhile (and, as it happens, with sockets in the
same case you can use shutdown() before close to get your preferred
behaviour).
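
For sockets, that would look something like the following sketch
(sockfd is an assumed connected socket):

#include <sys/socket.h>
#include <unistd.h>

void cancel_socket_reader(int sockfd)
{
	shutdown(sockfd, SHUT_RDWR);	/* wakes threads blocked in
					   read()/recv(); they see EOF */
	close(sockfd);			/* now safe to release the fd */
}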

Alan

2007-08-12 19:20:45

by Fredrik Noring

Subject: Re: Improving read/write/close system call reliability when used with pthreads

Alan,

Alan Cox wrote:
> Given that 99.99% of programs don't appear to care and you materially
> slow down a critical path for every read and write I'm skeptical.

I've made required_fds a struct list_head list to accommodate
multiple fds (not sure that's absolutely needed, though), so a couple
of pointers need to be updated in sys_read/write. However, if that's
too slow, and we only care about bookkeeping a single fd, which is
the case with sys_read/write, it can be changed into a single integer
fd assignment.
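
As a toy model of that single-fd variant (the names are illustrative,
not the actual kernel structures):

#include <errno.h>

struct task {
	int required_fd;	/* -1 when no I/O syscall is in flight */
};

/* On entry to and exit from read/write: single assignments instead
 * of list operations. */
static void enter_io(struct task *t, int fd)
{
	t->required_fd = fd;
}

static void leave_io(struct task *t)
{
	t->required_fd = -1;
}

/* In close: mark the fd bad, as mark_required_fd_bad() does above. */
static void mark_closed(struct task *t, int fd)
{
	if (t->required_fd == fd)
		t->required_fd = -EBADF;
}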

Would that be fast enough?

Fredrik

2007-08-13 03:15:07

by David Schwartz

Subject: RE: Improving read/write/close system call reliability when used with pthreads


> 2) Do close reader fd, but what results can then applications
> reliably expect? What would be the sane intention of applications
> closing reader fd? Do programmers expect all of the current results?

> Fredrik

Since there's no atomic "unlock and read" function, any code that
could ever close a socket in one thread while another thread is
blocked on read might call close just before the reading thread
actually blocks in read. Nothing stops yet another thread from
opening something, getting the same file descriptor, and then
letting the first thread call read on the wrong file descriptor
entirely.
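
Spelled out as a timeline (a sketch of the race described above; no
kernel change can close this window):

/* thread A:                    about to call read(fd), not yet blocked
 * thread B: close(fd);         the descriptor number is now free
 * thread C: open(...);         the kernel reuses the same number
 * thread A: read(fd, buf, n);  silently reads C's file
 *
 * A had not yet entered read() when B called close, so there was
 * never an in-flight operation for the kernel to cancel. */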

Since this can never be made sane in general, I see little point in making
one variation of what can go wrong a bit saner. It is still irresponsible to
code like this.

DS


2007-08-13 12:28:31

by Fredrik Noring

Subject: Re: Improving read/write/close system call reliability when used with pthreads

David,

True. Even though there is a point in making the kernel detect this
case and behave consistently, applications (often) end up with a mess
of their own that they cannot easily handle. A few more use cases
would then work OK, but probably not enough to make the improvement
worthwhile.

Thanks,
Fredrik

On 13 Aug 2007, at 05:14, David Schwartz wrote:
> Since there's no atomic "unlock and read" function, any code that
> could ever close a socket in one thread while another thread is
> blocked on read might call close just before the reading thread
> actually blocks in read. Nothing stops yet another thread from
> opening something, getting the same file descriptor, and then
> letting the first thread call read on the wrong file descriptor
> entirely.
>
> Since this can never be made sane in general, I see little point in
> making one variation of what can go wrong a bit saner. It is still
> irresponsible to code like this.
>
> DS