2001-03-12 14:16:52

by Ingo Oeser

[permalink] [raw]
Subject: Feedback for fastselect and one-copy-pipe

Hi Manfred,

I'm running your patches [1] with sucess for a while now.

Did you get any feedback about problems regarding these patches?

They seem to work for me, but there seems to be a memleak in
2.4.x (x: 0-2), which I'm chasing down.

The problem is, it only shows up after about 3-4 days of uptime.
So there is no quick test and I'm even not sure about the
kernel version where this exactly occurs, because I run sometimes
2.4.0 for working and sometimes the latest one, to see whether
the problem still persists.

Regards

Ingo Oeser

[1] put on http://www.tu-chemnitz.de/~ioe/fastpipe.patch
and http://www.tu-chemnitz.de/~ioe/poll-2.4.0.patch
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>


2001-03-12 16:21:01

by Mike Galbraith

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

On Mon, 12 Mar 2001, Ingo Oeser wrote:

> They seem to work for me, but there seems to be a memleak in
> 2.4.x (x: 0-2), which I'm chasing down.

I just happen to have a 2.4.2 IKD patch sitting here, and therein
sits Ingo's memory leak detector... poor thing is bored to tears 8)

-Mike

2001-03-12 16:43:11

by Manfred Spraul

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 4
// SUBLEVEL = 2
// EXTRAVERSION = -ac17
--- 2.4/fs/select.c Thu Feb 22 22:29:47 2001
+++ build-2.4/fs/select.c Mon Mar 12 17:01:45 2001
@@ -24,12 +24,6 @@
#define ROUND_UP(x,y) (((x)+(y)-1)/(y))
#define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)

-struct poll_table_entry {
- struct file * filp;
- wait_queue_t wait;
- wait_queue_head_t * wait_address;
-};
-
struct poll_table_page {
struct poll_table_page * next;
struct poll_table_entry * entry;
@@ -52,11 +46,36 @@
* poll table.
*/

+/*
+ * Memory free and alloc took a significant part of the total
+ * sys_poll()/sys_select() execution time, thus I moved several
+ * structures on the stack:
+ * - sys_select has a 192 byte (enough for 256 fds) buffer on the stack.
+ * Please avoid selecting more than 5000 descriptors
+ * (kmalloc > 4096 bytes), and you can't select
+ * more than 170.000 fds (kmalloc > 128 kB)
+ * - sys_poll stores the first 24 file descriptors on the
+ * stack. If more than 24 descriptors are polled, then
+ * additional memory is allocated, but the first 24 descriptors
+ * always lie on the stack.
+ * - the poll table contains 8 wait queue entries. This means that no dynamic
+ * memory allocation is necessary for the wait queues if one of the first
+ * 8 file descriptors has new data.
+ * <[email protected]>
+ */
+
void poll_freewait(poll_table* pt)
{
struct poll_table_page * p = pt->table;
+ struct poll_table_entry * entry;
+ entry = pt->internal + pt->nr;
+ while(pt->nr > 0) {
+ pt->nr--;
+ entry--;
+ remove_wait_queue(entry->wait_address,&entry->wait);
+ fput(entry->filp);
+ }
while (p) {
- struct poll_table_entry * entry;
struct poll_table_page *old;

entry = p->entry;
@@ -67,39 +86,42 @@
} while (entry > p->entries);
old = p;
p = p->next;
- free_page((unsigned long) old);
+ kfree(old);
}
}

void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
- struct poll_table_page *table = p->table;
-
- if (!table || POLL_TABLE_FULL(table)) {
- struct poll_table_page *new_table;
+ struct poll_table_entry * entry;

- new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
- if (!new_table) {
- p->error = -ENOMEM;
- __set_current_state(TASK_RUNNING);
- return;
+ if(p->nr < POLL_TABLE_INTERNAL) {
+ entry = p->internal+p->nr++;
+ } else {
+ struct poll_table_page *table = p->table;
+
+ if (!table || POLL_TABLE_FULL(table)) {
+ struct poll_table_page *new_table;
+
+ new_table = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!new_table) {
+ p->error = -ENOMEM;
+ __set_current_state(TASK_RUNNING);
+ return;
+ }
+ new_table->entry = new_table->entries;
+ new_table->next = table;
+ p->table = new_table;
+ table = new_table;
}
- new_table->entry = new_table->entries;
- new_table->next = table;
- p->table = new_table;
- table = new_table;
- }
-
- /* Add a new entry */
- {
- struct poll_table_entry * entry = table->entry;
+ entry = table->entry;
table->entry = entry+1;
- get_file(filp);
- entry->filp = filp;
- entry->wait_address = wait_address;
- init_waitqueue_entry(&entry->wait, current);
- add_wait_queue(wait_address,&entry->wait);
}
+ /* Add a new entry */
+ get_file(filp);
+ entry->filp = filp;
+ entry->wait_address = wait_address;
+ init_waitqueue_entry(&entry->wait, current);
+ add_wait_queue(wait_address,&entry->wait);
}

#define __IN(fds, n) (fds->in + n)
@@ -233,14 +255,18 @@
return retval;
}

-static void *select_bits_alloc(int size)
+#define SELECT_INLINE_BYTES 32
+static inline void *select_bits_alloc(int size, void* internal)
{
+ if(size <= SELECT_INLINE_BYTES)
+ return internal;
return kmalloc(6 * size, GFP_KERNEL);
}

-static void select_bits_free(void *bits, int size)
+static inline void select_bits_free(void *bits, void* internal)
{
- kfree(bits);
+ if(bits != internal)
+ kfree(bits);
}

/*
@@ -254,10 +280,12 @@
#define MAX_SELECT_SECONDS \
((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)

+
asmlinkage long
sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp)
{
fd_set_bits fds;
+ char ibuf[6*SELECT_INLINE_BYTES];
char *bits;
long timeout;
int ret, size;
@@ -295,7 +323,7 @@
*/
ret = -ENOMEM;
size = FDS_BYTES(n);
- bits = select_bits_alloc(size);
+ bits = select_bits_alloc(size, ibuf);
if (!bits)
goto out_nofds;
fds.in = (unsigned long *) bits;
@@ -340,12 +368,18 @@
set_fd_set(n, exp, fds.res_ex);

out:
- select_bits_free(bits, size);
+ select_bits_free(bits, ibuf);
out_nofds:
return ret;
}

-#define POLLFD_PER_PAGE ((PAGE_SIZE) / sizeof(struct pollfd))
+struct poll_list {
+ struct poll_list *next;
+ int len;
+ struct pollfd entries[0];
+};
+
+#define POLLFD_PER_PAGE ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd))

static void do_pollfd(unsigned int num, struct pollfd * fdpage,
poll_table ** pwait, int *count)
@@ -379,39 +413,44 @@
}
}

-static int do_poll(unsigned int nfds, unsigned int nchunks, unsigned int nleft,
- struct pollfd *fds[], poll_table *wait, long timeout)
+static int do_poll(int nfds, struct poll_list *list,
+ poll_table *wait, long timeout)
{
- int count;
+ int count = 0;
poll_table* pt = wait;
-
+
for (;;) {
- unsigned int i;
-
+ struct poll_list* walk;
set_current_state(TASK_INTERRUPTIBLE);
- count = 0;
- for (i=0; i < nchunks; i++)
- do_pollfd(POLLFD_PER_PAGE, fds[i], &pt, &count);
- if (nleft)
- do_pollfd(nleft, fds[nchunks], &pt, &count);
+ walk = list;
+ while(walk != NULL) {
+ do_pollfd( walk->len, walk->entries, &pt, &count);
+ walk = walk->next;
+ }
pt = NULL;
if (count || !timeout || signal_pending(current))
break;
count = wait->error;
if (count)
break;
+
timeout = schedule_timeout(timeout);
}
current->state = TASK_RUNNING;
return count;
}

+#define INLINE_POLL_COUNT 24
asmlinkage long sys_poll(struct pollfd * ufds, unsigned int nfds, long timeout)
{
- int i, j, fdcount, err;
- struct pollfd **fds;
+ int fdcount, err;
+ unsigned int i;
+ struct poll_list *pollwalk;
+ struct {
+ struct poll_list head;
+ struct pollfd entries[INLINE_POLL_COUNT];
+ } polldata;
poll_table table, *wait;
- int nchunks, nleft;

/* Do a sanity check on nfds ... */
if (nfds > current->files->max_fds)
@@ -431,63 +470,65 @@
wait = NULL;

err = -ENOMEM;
- fds = NULL;
- if (nfds != 0) {
- fds = (struct pollfd **)kmalloc(
- (1 + (nfds - 1) / POLLFD_PER_PAGE) * sizeof(struct pollfd *),
- GFP_KERNEL);
- if (fds == NULL)
- goto out;
- }
+ polldata.head.next = NULL;
+ polldata.head.len = INLINE_POLL_COUNT;
+ if(nfds <= INLINE_POLL_COUNT)
+ polldata.head.len = nfds;

- nchunks = 0;
- nleft = nfds;
- while (nleft > POLLFD_PER_PAGE) { /* allocate complete PAGE_SIZE chunks */
- fds[nchunks] = (struct pollfd *)__get_free_page(GFP_KERNEL);
- if (fds[nchunks] == NULL)
+ pollwalk = &polldata.head;
+ i = nfds;
+ err = -ENOMEM;
+ goto start;
+ while(i!=0) {
+ struct poll_list *pp;
+ pp = kmalloc(sizeof(struct poll_list)+
+ sizeof(struct pollfd)*
+ (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i),
+ GFP_KERNEL);
+ if(pp==NULL)
goto out_fds;
- nchunks++;
- nleft -= POLLFD_PER_PAGE;
- }
- if (nleft) { /* allocate last PAGE_SIZE chunk, only nleft elements used */
- fds[nchunks] = (struct pollfd *)__get_free_page(GFP_KERNEL);
- if (fds[nchunks] == NULL)
+ pp->next=NULL;
+ pp->len = (i>POLLFD_PER_PAGE?POLLFD_PER_PAGE:i);
+ pollwalk->next = pp;
+ pollwalk = pp;
+start:
+ if (copy_from_user(pollwalk+1, ufds + nfds-i,
+ sizeof(struct pollfd)*pollwalk->len)) {
+ err = -EFAULT;
goto out_fds;
+ }
+ i -= pollwalk->len;
}
+
+ fdcount = do_poll(nfds, &polldata.head,
+ wait, timeout);

+ /* OK, now copy the revents fields back to user space. */
+ i = nfds;
+ pollwalk = &polldata.head;
err = -EFAULT;
- for (i=0; i < nchunks; i++)
- if (copy_from_user(fds[i], ufds + i*POLLFD_PER_PAGE, PAGE_SIZE))
- goto out_fds1;
- if (nleft) {
- if (copy_from_user(fds[nchunks], ufds + nchunks*POLLFD_PER_PAGE,
- nleft * sizeof(struct pollfd)))
- goto out_fds1;
+ while(pollwalk != NULL) {
+ struct pollfd * fds = pollwalk->entries;
+ int j;
+
+ for (j=0; j < pollwalk->len; j++, ufds++) {
+ if(__put_user(fds[j].revents, &ufds->revents))
+ goto out_fds;
+ }
+ i -= pollwalk->len;
+ pollwalk = pollwalk->next;
}
-
- fdcount = do_poll(nfds, nchunks, nleft, fds, wait, timeout);
-
- /* OK, now copy the revents fields back to user space. */
- for(i=0; i < nchunks; i++)
- for (j=0; j < POLLFD_PER_PAGE; j++, ufds++)
- __put_user((fds[i] + j)->revents, &ufds->revents);
- if (nleft)
- for (j=0; j < nleft; j++, ufds++)
- __put_user((fds[nchunks] + j)->revents, &ufds->revents);
-
err = fdcount;
if (!fdcount && signal_pending(current))
err = -EINTR;

-out_fds1:
- if (nleft)
- free_page((unsigned long)(fds[nchunks]));
out_fds:
- for (i=0; i < nchunks; i++)
- free_page((unsigned long)(fds[i]));
- if (nfds != 0)
- kfree(fds);
-out:
+ pollwalk = polldata.head.next;
+ while(pollwalk!=NULL) {
+ struct poll_list *pp = pollwalk->next;
+ kfree(pollwalk);
+ pollwalk = pp;
+ }
poll_freewait(&table);
return err;
}
--- 2.4/include/linux/poll.h Thu Jan 4 23:51:10 2001
+++ build-2.4/include/linux/poll.h Mon Mar 12 16:03:07 2001
@@ -12,9 +12,18 @@

struct poll_table_page;

+struct poll_table_entry {
+ struct file * filp;
+ wait_queue_t wait;
+ wait_queue_head_t * wait_address;
+};
+
+#define POLL_TABLE_INTERNAL 8
typedef struct poll_table_struct {
int error;
+ int nr;
struct poll_table_page * table;
+ struct poll_table_entry internal[POLL_TABLE_INTERNAL];
} poll_table;

extern void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p);
@@ -28,6 +37,7 @@
static inline void poll_initwait(poll_table* pt)
{
pt->error = 0;
+ pt->nr = 0;
pt->table = NULL;
}
extern void poll_freewait(poll_table* pt);


Attachments:
patch-kiopipe (12.75 kB)
patch-poll (9.80 kB)
Download all attachments

2001-03-12 18:13:45

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

Hello!

> * davem's patch breaks apps that assume that write(,PIPE_BUF) after
> poll(POLLOUT) never blocks, even for blocking pipes.

Pardon, but PIPE_BUF <= PAGE_SIZE yet, so that fears have no reasons.

Alexey

2001-03-12 18:43:15

by Manfred Spraul

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

From: <[email protected]>
> Hello!
>
> > * davem's patch breaks apps that assume that write(,PIPE_BUF) after
> > poll(POLLOUT) never blocks, even for blocking pipes.
>
> Pardon, but PIPE_BUF <= PAGE_SIZE yet, so that fears have no reasons.
>

The difference is the =

> <<<<< davem's patch
> + if (count >= PAGE_SIZE &&
> ^^
> + !(filp->f_flags & O_NONBLOCK)) {
> <<<<<<< my patch
> + if (count > PIPE_BUF && chars == PIPE_SIZE &&
^
> + (!(filp->f_flags & O_NONBLOCK))) {
> <<<<<<<

davem used >=, I used >. All other differences between our patches are
code cleanups.

Just try this on i386: (PIPE_BUF is defined to 4096 on i386 - I really
don't understand why, but now it's too late to reverse it back to 512)

<<<<
char buf[PIPE_BUF];
void main()
{
int pipes[2];
pipe(pipes);
write(pipes[1],buf,sizeof(buf));
}
<<<<<<<

It returns immediately on all unix platforms I tested, including all
linux versions, except with davem's patch.
It's not guaranteed in sus or posix, but I'm reluctant to change it.

--
Manfred


2001-03-12 19:41:13

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

Hello!

> It returns immediately on all unix platforms I tested

I see. It is essential moment. PAGE_SIZE was really bad threshold value.
Sigh and alas.

Alexey


PS BTW "all unix" is unlikely to include freebsd. 8)

2001-03-12 20:06:34

by Manfred Spraul

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

From: <[email protected]>
>
> PS BTW "all unix" is unlikely to include freebsd. 8)
>

freebsd, openbsd, netbsd, tru64, openvms - all unix versions I found
free telnet guest accounts for.

Running for cover,
Manfred

2001-03-12 20:09:24

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

Hello!

> freebsd

Very funny, the idea is borrowed from there.

As you could understand your patch kills it. PAGE_SIZE is one of the most
frequently used transfer unit.

Alexey

2001-03-12 20:29:24

by Manfred Spraul

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

From: <[email protected]>
>
> > freebsd
>
> Very funny, the idea is borrowed from there.
>
> As you could understand your patch kills it. PAGE_SIZE is one of the
most
> frequently used transfer unit.
>

freebsd-4.0 doesn't use direct transfers for PAGE_SIZE'd pipe write()s:
it uses MINDIRECT=8192. (and PIPE_BUF is 512, so 4096 was possible for
them)


--
Manfred

2001-03-12 20:32:34

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

Hello!

> freebsd-4.0 doesn't use direct transfers for PAGE_SIZE'd pipe write()s:
> it uses MINDIRECT=8192.

I see.

> (and PIPE_BUF is 512, so 4096 was possible for
> them)

8) I see.

Thank you for patience. 8)

Alexey

2001-03-12 16:59:21

by Ingo Oeser

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

On Mon, Mar 12, 2001 at 05:20:12PM +0100, Mike Galbraith wrote:
> I just happen to have a 2.4.2 IKD patch sitting here, and therein
> sits Ingo's memory leak detector... poor thing is bored to tears 8)

Could to point me to mingos[1] memleak-detector? I need to know,
whats going on here.

I have had a load of 10 today, a much to busy disk and a full
swap.

I killed X and killed -9 netscape several times and waited a
while. Things didn't settle down until reboot.

So I definitly would like to try all that doesn't corrupt my fs ;-)

Regards

Ingo Oeser

[1] If people wonder, why I ask about "my own" patch: Mike means
Ingo *Molnar* (nickname mingo), which is NOT me (nickname ioe).

I'm NOT going to change my name because of this, ok? ;-)
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-03-12 17:30:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: Feedback for fastselect and one-copy-pipe

On Mon, 12 Mar 2001, Ingo Oeser wrote:

> On Mon, Mar 12, 2001 at 05:20:12PM +0100, Mike Galbraith wrote:
> > I just happen to have a 2.4.2 IKD patch sitting here, and therein
> > sits Ingo's memory leak detector... poor thing is bored to tears 8)
>
> Could to point me to mingos[1] memleak-detector? I need to know,
> whats going on here.

I see that Andrea has released a new IKD. You can find it in
the people/andrea/ikd/v2.4 directory of your favorite mirror.

-Mike