2002-10-30 00:33:20

by Davide Libenzi

[permalink] [raw]
Subject: [patch] sys_epoll 0.14 ...


Thanks to Andrew and John suggestions I coded another version of the
sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
patch to not waste bandwidth, the patch is available here :

http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff

Comments are welcome ...



- Davide



2002-10-30 03:29:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

Davide Libenzi wrote:
>
> Thanks to Andrew and John suggestions I coded another version of the
> sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> patch to not waste bandwidth, the patch is available here :
>
> http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
>
> Comments are welcome ...
>

Looking good to me, Davide. I think you've nailed everything there
except:

- Do we want to introduce new list macros, or change epoll a little
to use the existing list manipulators (I think the latter)

- Do we keep the vmalloc, replace it with alloc_pages() or go all the
way and remove the hash table?

Seems that you're leaning toward the final option but either way,
it would be best to get that vmalloc out of there - it's basically
hit-or-miss for anything larger than a single page - for a really
robust implementation it is best to avoid its use.

2002-10-30 03:35:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Tue, 29 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Thanks to Andrew and John suggestions I coded another version of the
> > sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> > patch to not waste bandwidth, the patch is available here :
> >
> > http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
> >
> > Comments are welcome ...
> >
>
> Looking good to me, Davide. I think you've nailed everything there
> except:

Also, a few more comments are scheduled to go in ...



- Davide


2002-10-30 03:34:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Tue, 29 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Thanks to Andrew and John suggestions I coded another version of the
> > sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> > patch to not waste bandwidth, the patch is available here :
> >
> > http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
> >
> > Comments are welcome ...
> >
>
> Looking good to me, Davide. I think you've nailed everything there
> except:
>
> - Do we want to introduce new list macros, or change epoll a little
> to use the existing list manipulators (I think the latter)

Andrew, sys_epoll uses linux/list.h interface. Doesn't it ?
And more, we can schedule the sys_close() review later if you want ...



> - Do we keep the vmalloc, replace it with alloc_pages() or go all the
> way and remove the hash table?
>
> Seems that you're leaning toward the final option but either way,
> it would be best to get that vmalloc out of there - it's basically
> hit-or-miss for anything larger than a single page - for a really
> robust implementation it is best to avoid its use.

This will be the last patch that will use the hash. So vmalloc() will go
away. I'm still failing to find creep in my proposed plan to remove the
hash ...



- Davide


2002-10-30 03:48:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

Davide Libenzi wrote:
>
> On Tue, 29 Oct 2002, Andrew Morton wrote:
>
> > Davide Libenzi wrote:
> > >
> > > Thanks to Andrew and John suggestions I coded another version of the
> > > sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> > > patch to not waste bandwidth, the patch is available here :
> > >
> > > http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
> > >
> > > Comments are welcome ...
> > >
> >
> > Looking good to me, Davide. I think you've nailed everything there
> > except:
> >
> > - Do we want to introduce new list macros, or change epoll a little
> > to use the existing list manipulators (I think the latter)
>
> Andrew, sys_epoll uses linux/list.h interface. Doesn't it ?

I was referring to these guys:

+#define list_first(head) (((head)->next != (head)) ? (head)->next: (struct list_head *) 0)
+#define list_last(head) (((head)->prev != (head)) ? (head)->prev: (struct list_head *) 0)
+#define list_next(pos, head) (((pos)->next != (head)) ? (pos)->next: (struct list_head *) 0)
+#define list_prev(pos, head) (((pos)->prev != (head)) ? (pos)->prev: (struct list_head *) 0)

if we are to add such things to list.h then lots of people need
to hum and hah over them first and ask questions like "why doesn't
it use list_empty?" ;)

It would be better to recode epoll's list walks to use the existing
list accessors.

2002-10-30 03:54:02

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Tue, 29 Oct 2002, Andrew Morton wrote:

> I was referring to these guys:
>
> +#define list_first(head) (((head)->next != (head)) ? (head)->next: (struct list_head *) 0)
> +#define list_last(head) (((head)->prev != (head)) ? (head)->prev: (struct list_head *) 0)
> +#define list_next(pos, head) (((pos)->next != (head)) ? (pos)->next: (struct list_head *) 0)
> +#define list_prev(pos, head) (((pos)->prev != (head)) ? (pos)->prev: (struct list_head *) 0)
>
> if we are to add such things to list.h then lots of people need
> to hum and hah over them first and ask questions like "why doesn't
> it use list_empty?" ;)
>
> It would be better to recode epoll's list walks to use the existing
> list accessors.

Andrew, don't they better describe what you're actually doing instead of
the list_empty() trick ?



- Davide


2002-10-30 04:13:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Tue, 29 Oct 2002, Andrew Morton wrote:

> They are a reasonable addition to the list library. They
> should be implemented as:
>
> /*
> * kernel-doc description goes here
> */
> static inline struct list_head *list_first(struct list_head *list)
> {
> if (list_empty(list))
> return NULL;
> return list->next;
> }
>
> But it shouldn't be quietly snuck in as part of epoll. Everyone in
> the world uses list.h.
>
> Given that they are used in just a handful of places in epoll and nowhere
> else in the kernel it is a little hard to justify adding them.
>
> Unless people leap out and say "I've always wanted one of them" it would
> be best to redo epoll to use
>
> while (!list_empty(list)) {
> item = list_entry(list, ...);
> list_del(item->list);
> ...
> }
>
> or one of the other eighty-seven list helpers which we already have.

Ok, let's drop them off ...



- Davide


2002-10-30 04:06:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

Davide Libenzi wrote:
>
> On Tue, 29 Oct 2002, Andrew Morton wrote:
>
> > I was referring to these guys:
> >
> > +#define list_first(head) (((head)->next != (head)) ? (head)->next: (struct list_head *) 0)
> > +#define list_last(head) (((head)->prev != (head)) ? (head)->prev: (struct list_head *) 0)
> > +#define list_next(pos, head) (((pos)->next != (head)) ? (pos)->next: (struct list_head *) 0)
> > +#define list_prev(pos, head) (((pos)->prev != (head)) ? (pos)->prev: (struct list_head *) 0)
> >
> > if we are to add such things to list.h then lots of people need
> > to hum and hah over them first and ask questions like "why doesn't
> > it use list_empty?" ;)
> >
> > It would be better to recode epoll's list walks to use the existing
> > list accessors.
>
> Andrew, don't they better describe what you're actually doing instead of
> the list_empty() trick ?
>

They are a reasonable addition to the list library. They
should be implemented as:

/*
* kernel-doc description goes here
*/
static inline struct list_head *list_first(struct list_head *list)
{
if (list_empty(list))
return NULL;
return list->next;
}

But it shouldn't be quietly snuck in as part of epoll. Everyone in
the world uses list.h.

Given that they are used in just a handful of places in epoll and nowhere
else in the kernel it is a little hard to justify adding them.

Unless people leap out and say "I've always wanted one of them" it would
be best to redo epoll to use

while (!list_empty(list)) {
item = list_entry(list, ...);
list_del(item->list);
...
}

or one of the other eighty-seven list helpers which we already have.

2002-10-30 17:09:35

by Mark Hamblin

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

I've been trying to understand these changes and I thought it might be nice
to share my notes with others. This could be a good example for how to do
certain things in the kernel. For example, from these changes, I figured
out that to create a new system call, you have to:

1) Add entry points into arch/i386/kernel/entry.S
2) Define the new system calls, starting with sys_ and using asmlinkage in
the function definition.
3) Add #defines for each into include/asm-i386/unistd.h
4) Increase NR_syscalls in include/linux/sys.h

I have included some other high-level notes, including a kind of ad-hoc
cross reference for eventpoll.c. I would appreciate any comments if anyone
thinks this kind of documentation would be useful.



diff -Nru linux-2.5.44.vanilla/arch/i386/kernel/entry.S
linux-2.5.44.epoll/arch/i386/kernel/entry.S
// Add entry points for the three new system calls
diff -Nru linux-2.5.44.vanilla/drivers/char/Makefile
linux-2.5.44.epoll/drivers/char/Makefile
// Add eventpoll
diff -Nru linux-2.5.44.vanilla/drivers/char/eventpoll.c
linux-2.5.44.epoll/drivers/char/eventpoll.c
// New module
// In cases where one entity is used by only one other entity, I indent
that entity under its user.
// Define struct eventpoll (created by open_eventpoll and stored in
file->private_data field)
// Define struct epitem, which is contained in a list by eventpoll.
// You can identify the system calls because they all use asmlinkage:
// sys_epoll_create
// Uses ep_getfd();
// Uses get_eventpoll_inode();
// sys_epoll_ctl
// sys_epoll_wait
// Additionally, you have these function pointer tables:
// eventpoll_fops
// Contains write_eventpoll
// Contains ioctl_eventpoll
// Contains mmap_eventpoll
// Contains open_eventpoll
// Contains close_eventpoll
// Uses ep_free();
// Contains poll_eventpoll
// Used by eventpoll_miscdev
// Used by eventpoll_init
// Used by eventpoll_exit
// Used by ep_getfd
// Used by get_eventpoll_inode
// eventpoll_mmap_ops
// Contains eventpoll_mm_open,
// Contains eventpoll_mm_close,
// Used by mmap_eventpoll (part of eventpoll_fops)
// eventpollfs_dentry_operations
// Contains eventpollfs_delete_dentry();
// Used by ep_getfd
// eventpoll_fs_type
// Uses eventpollfs_get_sb()
// Used by eventpoll_init
// Used by eventpoll_exit
// And then you have the functions:
// ep_free_pages();
// Used by ep_free
// Used by ep_do_alloc_pages
// Used by ioctl_eventpoll
// ep_find_nl();
// Used by ep_find
// Used by ep_remove
// Used by ioctl_eventpoll
// ep_find();
// Used by sys_epoll_ctl
// Used by write_eventpoll
// ep_insert();
// Used by sys_epoll_ctl
// Used by write_eventpoll
// Uses ep_hashresize();
// ep_remove();
// Used by sys_epoll_ctl
// Used by write_eventpoll
// notify_proc()
// Used by ep_free (as parm to file_notify_...)
// Used by ep_insert (as parm to file_notify_...)
// Used by ep_remove (as parm to file_notify_...)
// open_eventpoll();
// Used by eventpoll_fops
// Used by sys_epoll_create
// Uses ep_init();
// ep_poll();
// Used by sys_epoll_wait
// Used by ioctl_eventpoll
// ep_do_alloc_pages();
// Used by sys_epoll_create
// Used by ioctl_eventpoll
// Uses ep_alloc_pages();
diff -Nru linux-2.5.44.vanilla/fs/Makefile linux-2.5.44.epoll/fs/Makefile
// Add fcblist
diff -Nru linux-2.5.44.vanilla/fs/fcblist.c linux-2.5.44.epoll/fs/fcblist.c
// New module
diff -Nru linux-2.5.44.vanilla/fs/file_table.c
linux-2.5.44.epoll/fs/file_table.c
// Add calls to file_notify_init and file_notify_cleanup (defined in
fcblist)
diff -Nru linux-2.5.44.vanilla/fs/pipe.c linux-2.5.44.epoll/fs/pipe.c
// Add call to file_send_notify when pipe ceases to be full/empty.
// Add support for POLL_HUP to pipe_release
// Add code to set up/initialize PIPE_READFILE/WRITEFILE
diff -Nru linux-2.5.44.vanilla/include/asm-i386/poll.h
linux-2.5.44.epoll/include/asm-i386/poll.h
// Add POLLREMOVE (reference only by write_eventpoll, perhaps used by
user-code?)
diff -Nru linux-2.5.44.vanilla/include/asm-i386/unistd.h
linux-2.5.44.epoll/include/asm-i386/unistd.h
// Create #defines for the three news system calls.
diff -Nru linux-2.5.44.vanilla/include/linux/eventpoll.h
linux-2.5.44.epoll/include/linux/eventpoll.h
// New module
diff -Nru linux-2.5.44.vanilla/include/linux/fcblist.h
linux-2.5.44.epoll/include/linux/fcblist.h
// New module
diff -Nru linux-2.5.44.vanilla/include/linux/fs.h
linux-2.5.44.epoll/include/linux/fs.h
// Add file callback list and rw lock to ???.
diff -Nru linux-2.5.44.vanilla/include/linux/list.h
linux-2.5.44.epoll/include/linux/list.h
// Added #defines for various generic list-access operations.
// Only one used is list_first, by these functions: ep_free,
ep_hashresize, file_notify_cleanup
diff -Nru linux-2.5.44.vanilla/include/linux/pipe_fs_i.h
linux-2.5.44.epoll/include/linux/pipe_fs_i.h
// Add new fields for rdfile and wrfile
// Add macros PIPE_READFILE and PIPE_WRITEFILE to access new fields.
diff -Nru linux-2.5.44.vanilla/include/linux/sys.h
linux-2.5.44.epoll/include/linux/sys.h
// Increased NR_syscalls (by 4 instead of 3...why???)
diff -Nru linux-2.5.44.vanilla/include/net/sock.h
linux-2.5.44.epoll/include/net/sock.h
// Added code to sk_wake_async to call file_send_notify. This call given
priority over the older method.
// My only question here is what was sk->socket->file being used for
before? It's not a new field, and I see no other code
// that references it.
diff -Nru linux-2.5.44.vanilla/net/ipv4/tcp.c
linux-2.5.44.epoll/net/ipv4/tcp.c
// Changes old call to sock_wake_async to now call sk_wake_async. It
appears that sk_wake_async winds up calling
// sock_wake_async for the old case anyway.


2002-10-30 20:08:45

by Janet Morgan

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

> Thanks to Andrew and John suggestions I coded another version of the
> sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> patch to not waste bandwidth, the patch is available here :
>
> http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
>
> Comments are welcome ...


The previous and current versions of the sys_epoll patch are performing
comparably and continue to far exceed the results from standard poll.
Hopefully this is another endorsement for it's inclusion in the 2.5 kernel.

We re-ran the SMP tests that were used to collect the sys_epoll performance
data recently posted at http://lse.sourceforge.net/epoll/. Detailed data
follows for those who want a closer look, or check out the website for
more information.


Here are my results from pipetest:

Token Passes-Per-Second
Number sys_epoll- sys_epoll- Standard
of Pipes 0.9 (Old) 0.14 (New) poll()
---------- ----------------------------------------------
10 289160 285510 98469
100 283463 283948 16514
500 278084 281745 3346
1000 277441 279754 1667
2000 274988 278708 841
3000 273139 275484 545
4000 274039 273873 397
5000 274687 273741 295
6000 272489 271055 217
7000 271543 269841 168
8000 270669 270178 132
9000 268353 269213 105
10000 268974 272780 86
11000 269641 273580 74
12000 267961 273834 63
13000 267249 272904 55
14000 266456 272307 48
15000 265396 270909 43
16000 270268 272928 38



Here are Shailabh's results from dphtpdd:

Reply-rate
Number of sys_epoll sys_epoll Standard
Connections 0.9 (Old) 0.14 (New) poll()
---------- -------------------------------------------
0 13459.7 13438.8 13331.1
50 13218.3 13477.2 14098.9
100 13417.3 13598.6 14150.3
200 12781.0 13502.7 13792.2
400 13507.5 13507.5 14250.2
800 13428.3 13450.8 13184.6
1000 13358.6 13479.6 12900.2
5000 13715.7 13569.2 5039.6
10000 13106.6 13557.3 2359.2
15000 13181.7 13615.6 1669.4
20000 13203.4 13691.0 1270.0
25000 13441.9 13536.3 1008.7
30000 13101.0 13487.8 744.2
35000 13207.0 13648.5 705.0
40000 13213.4 13567.1 631.3
45000 12644.8 13638.5 559.5
50000 13466.3 13689.4 502.0
55000 13458.3 13647.5 452.5
60000 13459.4 13781.0 414.5

2002-10-30 20:33:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...


On Wed, 30 Oct 2002, Davide Libenzi wrote:
>
> Linus, it would be good if you could wait a couple of days so that cleanup
> and regression testing is completed. Friday looks good for me ...

Friday is next month, and quite frankly I'm planning on just making a
release tomorrow before the kids start trick-or-treating, and then just
sit back for a long weekend (actually, catch up on real work, but as far
as the kernel is concerned I'm not there any more on Friday).

Linus

2002-10-30 20:29:42

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Wed, 30 Oct 2002, Linus Torvalds wrote:

>
> On Wed, 30 Oct 2002, Davide Libenzi wrote:
> >
> > Thank you very much Janet for doing performance and stability test.
> > Working with Andrew we agreed to remove the main hash table since, by
> > using in full the fcblist.c capabilities, it is not needed any more.
>
> Ok. I was just about to apply the 0.14 stuff, so please just cc me with
> the patch (and an appropriate description for the ChangeSet comment) when
> you're done with the cleanup..

Linus, it would be good if you could wait a couple of days so that cleanup
and regression testing is completed. Friday looks good for me ...



- Davide


2002-10-30 20:21:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...

On Wed, 30 Oct 2002, Janet Morgan wrote:

> > Thanks to Andrew and John suggestions I coded another version of the
> > sys_epoll patch ( 0.13 skipped ... superstition :) ). I won't send the
> > patch to not waste bandwidth, the patch is available here :
> >
> > http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff
> >
> > Comments are welcome ...
>
>
> The previous and current versions of the sys_epoll patch are performing
> comparably and continue to far exceed the results from standard poll.
> Hopefully this is another endorsement for it's inclusion in the 2.5 kernel.
>
> We re-ran the SMP tests that were used to collect the sys_epoll performance
> data recently posted at http://lse.sourceforge.net/epoll/. Detailed data
> follows for those who want a closer look, or check out the website for
> more information.

Thank you very much Janet for doing performance and stability test.
Working with Andrew we agreed to remove the main hash table since, by
using in full the fcblist.c capabilities, it is not needed any more. Hash
allocation was a big Andrew concern because of its dimension and because
it was allocated with vmalloc() that might fail when fragmentation
increase. The new patch will have a reviewed locking behavior and will
introduce a slab allocator inside fs/fcblist.c. My plan is to release 0.16
today so that you guys can run regression test on it. To test the locking
"goodness" it should be run on an SMP system, with multiple copies of
ephttpd running on different ports, each one stressed either by httperf
or the sys_epoll HTTP blaster. Also, building an threaded application that
makes "bad use" of the interface might help in checking the API robustness
against every condition. Also, the usual performance test should be run to
check if we loose something along the path ...




- Davide


2002-10-30 20:25:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] sys_epoll 0.14 ...


On Wed, 30 Oct 2002, Davide Libenzi wrote:
>
> Thank you very much Janet for doing performance and stability test.
> Working with Andrew we agreed to remove the main hash table since, by
> using in full the fcblist.c capabilities, it is not needed any more.

Ok. I was just about to apply the 0.14 stuff, so please just cc me with
the patch (and an appropriate description for the ChangeSet comment) when
you're done with the cleanup..

Linus