2002-10-21 00:01:33

by Davide Libenzi

[permalink] [raw]
Subject: [patch] sys_epoll ...


Per Linus request I implemented a syscall interface to the old /dev/epoll
to avoid the creation of magic inodes inside /dev. Since the new
implementation shares 95% of the code with the old one, the old interface
is maintained for compatibility with existing users. The new syscall
interface adds three system calls ( Linus doesn't like multiplexing ) :


#define EP_CTL_ADD 1
#define EP_CTL_DEL 2
#define EP_CTL_MOD 3

asmlinkage int sys_epoll_create(int maxfds);
asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);


The function sys_epoll_create() creates a sys_epoll "object" by allocation
space for "maxfds" descriptors. The sys_epoll "object" is basically a
file descriptor, and this enable the new interface to :

1) Mantain compatibility with the existing interface
2) Avoid the creation of a sys_epoll_close() syscall
3) Reuse 95% of the existing code
4) Inherit the file* automatic cleanup code w/out having to code a
dedicated one

The function sys_epoll_ctl() is the controller interface and what it does
is pretty obvious. The "op" parameter is either EP_CTL_ADD, EP_CTL_DEL or
EP_CTL_MOD and the parameter "fd" is the target of the operation. The last
parameter "events" is used in both EP_CTL_ADD and EP_CTL_MOD and rapresent
the event interest mask. The function sys_epoll_wait() waits for events by
allowing a maximum timeout "timeout" in milliseconds and returns the
number of events ( struct pollfd ) that the caller will find available at
"*events". The port of the old /dev/epoll to 2.5.44 and the new sys_epoll
are available at :

http://www.xmailserver.org/linux-patches/nio-improve.html#patches




- Davide





2002-10-21 00:44:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

Davide Libenzi wrote:
>
> Per Linus request I implemented a syscall interface to the old /dev/epoll
> to avoid the creation of magic inodes inside /dev.

>From a (very) quick read:


+static struct vm_operations_struct eventpoll_mmap_ops = {
+ open: eventpoll_mm_open,
+ close: eventpoll_mm_close,
+};

Please use the C99 form.

+ ep = (struct eventpoll *) file->private_data;

The cast there just adds noise. private_data is void *. In
fact you lose a bit of compile-time checking by having the cast
there.


+ addr = do_mmap_pgoff(file, 0, EP_MAP_SIZE(maxfds + 1), PROT_READ | PROT_WRITE,
+ MAP_PRIVATE, 0);

You must hold current->mm->mmap-sem for writing before calling
do_mmap_pgoff(). I shall add the missing comment to do_mmap_pgoff.


+asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout)
+{
+ int error = -EBADF;
+ unsigned long eaddr;
+ struct file *file;
+ struct eventpoll *ep;
+ struct evpoll dvp;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d)\n",
+ current, epfd, events, timeout));
+
+ file = fget(epfd);
+ if (!file)
+ goto fget_fail;
+ ep = (struct eventpoll *) file->private_data;
+
+ error = -EINVAL;
+ if (!atomic_read(&ep->mmapped))

Will oops if not passed one of your fd's?

(Ditto in sys_epoll_ctl)

+ clear_bit(PG_reserved, &virt_to_page(pages[ii])->flags);

Please use SetPageReserved()/ClearPageReserved().


+static void ep_free(struct eventpoll *ep)
+{
+ int ii;
+ struct list_head *lnk;
+
+ lock_kernel();
+ for (ii = 0; ii <= ep->hmask; ii++) {
+ while ((lnk = list_first(&ep->hash[ii]))) {
+ struct epitem *dpi = list_entry(lnk, struct epitem, llink);
+

What is locking the nodes in this hashtable? Here it seems to
be lock_kernel. Elsewhere it seems to be write_lock_irqsave(&ep->lock);

+static inline struct epitem *ep_find_nl(struct eventpoll *ep, int fd)

Should be uninlined.

+ __copy_from_user(&pfd, buffer, sizeof(pfd));

Check for EFAULT.


+ if (ep->eventcnt || !timeout)
+ break;
+ if (signal_pending(current)) {
+ res = -EINTR;
+ break;
+ }
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ write_unlock_irqrestore(&ep->lock, flags);
+ timeout = schedule_timeout(timeout);

You should set current->state before performing the checks.

+ if (res > 0)
+ copy_to_user((void *) arg, &dvp, sizeof(struct evpoll));

EFAULT?


+ write_lock_irqsave(&ep->lock, flags);
+
+ res = -EINVAL;
+ if (numpages != (2 * ep->numpages))
+ goto out;
+
+ start = vma->vm_start;
+ for (ii = 0; ii < ep->numpages; ii++) {
+ if (remap_page_range(vma, start, __pa(ep->pages0[ii]),
+ PAGE_SIZE, vma->vm_page_prot))

remap_page_range() can sleep in pmd_alloc(), etc. May not be called under
spinlock.

2002-10-21 01:14:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Per Linus request I implemented a syscall interface to the old /dev/epoll
> > to avoid the creation of magic inodes inside /dev.
>
> >From a (very) quick read:
>
>
> +static struct vm_operations_struct eventpoll_mmap_ops = {
> + open: eventpoll_mm_open,
> + close: eventpoll_mm_close,
> +};
>
> Please use the C99 form.

Consider It Done.



> + ep = (struct eventpoll *) file->private_data;
>
> The cast there just adds noise. private_data is void *. In
> fact you lose a bit of compile-time checking by having the cast
> there.

CID



> + addr = do_mmap_pgoff(file, 0, EP_MAP_SIZE(maxfds + 1), PROT_READ | PROT_WRITE,
> + MAP_PRIVATE, 0);
>
> You must hold current->mm->mmap-sem for writing before calling
> do_mmap_pgoff(). I shall add the missing comment to do_mmap_pgoff.

True, CID



> +asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout)
> +{
> + int error = -EBADF;
> + unsigned long eaddr;
> + struct file *file;
> + struct eventpoll *ep;
> + struct evpoll dvp;
> +
> + DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d)\n",
> + current, epfd, events, timeout));
> +
> + file = fget(epfd);
> + if (!file)
> + goto fget_fail;
> + ep = (struct eventpoll *) file->private_data;
> +
> + error = -EINVAL;
> + if (!atomic_read(&ep->mmapped))
>
> Will oops if not passed one of your fd's?

Yes it will. I will find a solution asap.



> + clear_bit(PG_reserved, &virt_to_page(pages[ii])->flags);
>
> Please use SetPageReserved()/ClearPageReserved().

CID



> +static void ep_free(struct eventpoll *ep)
> +{
> + int ii;
> + struct list_head *lnk;
> +
> + lock_kernel();
> + for (ii = 0; ii <= ep->hmask; ii++) {
> + while ((lnk = list_first(&ep->hash[ii]))) {
> + struct epitem *dpi = list_entry(lnk, struct epitem, llink);
> +
>
> What is locking the nodes in this hashtable? Here it seems to
> be lock_kernel. Elsewhere it seems to be write_lock_irqsave(&ep->lock);



> +static inline struct epitem *ep_find_nl(struct eventpoll *ep, int fd)
>
> Should be uninlined.

CID



>
> + __copy_from_user(&pfd, buffer, sizeof(pfd));
>
> Check for EFAULT.

CID



> + if (ep->eventcnt || !timeout)
> + break;
> + if (signal_pending(current)) {
> + res = -EINTR;
> + break;
> + }
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + write_unlock_irqrestore(&ep->lock, flags);
> + timeout = schedule_timeout(timeout);
>
> You should set current->state before performing the checks.



> + if (res > 0)
> + copy_to_user((void *) arg, &dvp, sizeof(struct evpoll));
>
> EFAULT?

CID



> + write_lock_irqsave(&ep->lock, flags);
> +
> + res = -EINVAL;
> + if (numpages != (2 * ep->numpages))
> + goto out;
> +
> + start = vma->vm_start;
> + for (ii = 0; ii < ep->numpages; ii++) {
> + if (remap_page_range(vma, start, __pa(ep->pages0[ii]),
> + PAGE_SIZE, vma->vm_page_prot))
>
> remap_page_range() can sleep in pmd_alloc(), etc. May not be called under
> spinlock.

Yep, true. CID




- Davide



2002-10-21 01:39:51

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

On Sun, 20 Oct 2002, Andrew Morton wrote:

> + if (ep->eventcnt || !timeout)
> + break;
> + if (signal_pending(current)) {
> + res = -EINTR;
> + break;
> + }
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + write_unlock_irqrestore(&ep->lock, flags);
> + timeout = schedule_timeout(timeout);
>
> You should set current->state before performing the checks.

Why this Andrew ?



- Davide


2002-10-21 01:55:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

Davide Libenzi wrote:
>
> On Sun, 20 Oct 2002, Andrew Morton wrote:
>
> > + if (ep->eventcnt || !timeout)
> > + break;
> > + if (signal_pending(current)) {
> > + res = -EINTR;
> > + break;
> > + }
> > +
> > + set_current_state(TASK_INTERRUPTIBLE);
> > +
> > + write_unlock_irqrestore(&ep->lock, flags);
> > + timeout = schedule_timeout(timeout);
> >
> > You should set current->state before performing the checks.
>
> Why this Andrew ?
>

Well I'm assuming that you don't want to sleep if, say,
ep->eventcnt is non-zero. The code is currently (simplified):

add_wait_queue(...);
if (ep->eventcnt)
break;
/* window here */
set_current_state(TASK_INTERRUPTIBLE);
schedule();

If another CPU increments eventcnt and sends this task a wakeup in that
window, it is missed and we still sleep. The conventional fix for that
is:

add_wait_queue(...);
set_current_state(TASK_INTERRUPTIBLE);
if (ep->eventcnt)
break;
/* harmless window here */
schedule();

So if someone delivers a wakeup in the "harmless window" then this task
still calls schedule(), but the wakeup has turned the state from
TASK_INTERRUPTIBLE into TASK_RUNNING, so the schedule() doesn't actually
take this task off the runqueue. This task will zoom straight through the
schedule() and will then loop back and notice the incremented ep->eventcnt.

So it is important that the waker increment eventcnt _before_ delivering
the wake_up, too.

2002-10-21 02:03:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Well I'm assuming that you don't want to sleep if, say,
> ep->eventcnt is non-zero. The code is currently (simplified):
>
> add_wait_queue(...);
> if (ep->eventcnt)
> break;
> /* window here */
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
>
> If another CPU increments eventcnt and sends this task a wakeup in that
> window, it is missed and we still sleep. The conventional fix for that
> is:
>
> add_wait_queue(...);
> set_current_state(TASK_INTERRUPTIBLE);
> if (ep->eventcnt)
> break;
> /* harmless window here */
> schedule();
>
> So if someone delivers a wakeup in the "harmless window" then this task
> still calls schedule(), but the wakeup has turned the state from
> TASK_INTERRUPTIBLE into TASK_RUNNING, so the schedule() doesn't actually
> take this task off the runqueue. This task will zoom straight through the
> schedule() and will then loop back and notice the incremented ep->eventcnt.
>
> So it is important that the waker increment eventcnt _before_ delivering
> the wake_up, too.

It's true ... but the window is pretty small there :) Anyway it makes the
code more correct and I changed it. I have the new patch with your
suggestions ready and I will post as sonn as it'll pass a few tests.



- Davide


2002-10-21 03:19:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll ...

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Per Linus request I implemented a syscall interface to the old /dev/epoll
> > to avoid the creation of magic inodes inside /dev.
>
> >From a (very) quick read:

[ lots of required fixes ]

The new patch ( 0.3 ) with the fixes Andrew suggested is available here :

http://www.xmailserver.org/linux-patches/nio-improve.html#patches




- Davide


2002-10-21 06:45:35

by Dan Kegel

[permalink] [raw]
Subject: re: [patch] sys_epoll ...

Davide wrote:
>asmlinkage int sys_epoll_create(int maxfds);
>asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
>asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);

Hey Davide,
I've always been a bit bothered by the need to specify maxfds in
advance. What's the preferred way to handle the situation
where you guess wrong on the value of maxfds? Create a new
epoll and register all the old fds with it? (Sounds like
a good job for a userspace wrapper library.)

Regardless, thanks for pushing /dev/epoll along towards inclusion
in 2.5. I'm looking forward to seeing it it integrated.
Even if the interface doesn't please everyone, the performance
should...
- Dan

2002-10-21 16:55:15

by Davide Libenzi

[permalink] [raw]
Subject: re: [patch] sys_epoll ...

On Mon, 21 Oct 2002, Dan Kegel wrote:

> Davide wrote:
> >asmlinkage int sys_epoll_create(int maxfds);
> >asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
> >asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);
>
> Hey Davide,
> I've always been a bit bothered by the need to specify maxfds in
> advance. What's the preferred way to handle the situation
> where you guess wrong on the value of maxfds? Create a new
> epoll and register all the old fds with it? (Sounds like
> a good job for a userspace wrapper library.)

I'm currently looking at doing it automatically, w/out the need of
specifying it. I don't like it either ...


> Regardless, thanks for pushing /dev/epoll along towards inclusion
> in 2.5. I'm looking forward to seeing it it integrated.
> Even if the interface doesn't please everyone, the performance
> should...

You should thank Hanna Linder that bugged Linus way more than I did :)



- Davide