Attached is a new version of the zerocopy pipe code.

A few months ago, Hubertus Franke found a severe performance problem with my
last version. Now I have figured out how to solve it:

For good pipe performance, sys_read() must try to return as much data as
possible with one syscall, even if the writer writes in small chunks. The
current code uses PIPE_WAITING_WRITERS, but that doesn't work with
nonblocking IO and is not that efficient.

I added a sched_yield() into that path, and that fixed the performance
degradation: if pipe_read made progress and the user wants even more data,
then call sched_yield() and give the writers a chance to write additional
data. After sched_yield() returns, try again until there is either no more
data or the user request is completely fulfilled. Then return to userspace.
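
Roughly, the retry loop looks like this (a sketch only, not the actual patch;
do_pipe_copy() and do_pipe_copy_nonblock() are made-up helpers standing in
for the real buffer-copy steps of pipe_read):

/* Sketch of the retry idea only -- not the real pipe_read().
 * do_pipe_copy() behaves like the normal (possibly blocking) read path,
 * do_pipe_copy_nonblock() copies whatever is queued right now. */
static ssize_t pipe_read_sketch(struct inode *inode, char *buf, size_t count)
{
	ssize_t total = do_pipe_copy(inode, buf, count);

	while (total > 0 && total < (ssize_t) count) {
		ssize_t more;

		/* progress was made, but the caller wants more data:
		 * yield so a writer gets a chance to queue additional bytes */
		yield();

		more = do_pipe_copy_nonblock(inode, buf + total, count - total);
		if (more <= 0)
			break;		/* no new data arrived: return what we have */
		total += more;
	}
	return total;			/* request fulfilled, or no further data */
}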

Now I get +50% on UP with pipeflex -c2 -r32 -w1, comparing 2.5.2-pre8+zerocopy
(SMP kernel) against 2.4.2-UP.

Unfortunately the patch has virtually no effect on the 4 kB write / 4 kB read
case, and that's the most common one. (The number of context switches is cut
in half, but there is a slight performance loss on a K6, probably due to
cache thrashing.)

The only solution I see for that problem is larger kernel buffers - more data
must be queued in the kernel. Either grow them on demand (kmalloc() up to
<sysctl-limit, around 512 kB> if a request for <= 4096 bytes arrives and would
block; if the allocation fails, then block), or allocate them at pipe creation
time, as in Hubertus' patch.
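
The on-demand variant could look roughly like this (again only a sketch under
assumed names: pipe_size_limit stands in for the proposed sysctl, and the
buffer is treated as a plain kmalloc()ed area here, unlike the real page-based
allocation):

static size_t pipe_size_limit = 512 * 1024;	/* stand-in for the proposed sysctl */

/* Sketch of the on-demand idea: grow the buffer instead of blocking a small
 * write.  Returns 1 if the buffer was grown, 0 if the caller should block. */
static int pipe_try_grow(struct pipe_inode_info *info, size_t cur_size,
			 size_t queued, size_t start)
{
	size_t new_size = 2 * cur_size;
	size_t tail;
	char *new_buf;

	if (new_size > pipe_size_limit)
		return 0;			/* already at the limit */

	new_buf = kmalloc(new_size, GFP_KERNEL);
	if (!new_buf)
		return 0;			/* allocation failed: block as before */

	/* unwrap the circular buffer into the start of the new one */
	tail = cur_size - start;
	if (tail > queued)
		tail = queued;
	memcpy(new_buf, info->base + start, tail);
	memcpy(new_buf + tail, info->base, queued - tail);

	kfree(info->base);			/* old buffer is kmalloc()ed in this sketch */
	info->base = new_buf;
	return 1;				/* caller resets the start offset to 0 */
}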
--
Manfred
----- Original Message -----
From: "Manfred Spraul" <[email protected]>
To: <[email protected]>
Cc: <[email protected]>
Sent: Saturday, January 05, 2002 5:27 PM
Subject: [Lse-tech] zerocopy pipe, new version
> Attached is a new version of the zerocopy pipe code.
>
> A few months ago, Hubertus Franke found a severe performance problem with my
> last version. Now I have figured out how to solve it:
> For good pipe performance, sys_read() must try to return as much data as
> possible with one syscall, even if the writer writes in small chunks. The
> current code uses PIPE_WAITING_WRITERS, but that doesn't work with
> nonblocking IO and is not that efficient.
>
> I added a sched_yield() into that path, and that fixed the performance
> degradation: if pipe_read made progress and the user wants even more data,
> then call sched_yield() and give the writers a chance to write additional
> data. After sched_yield() returns, try again until there is either no more
> data or the user request is completely fulfilled. Then return to userspace.
>
> Now I get +50% on UP with pipeflex -c2 -r32 -w1, comparing 2.5.2-pre8+zerocopy
> (SMP kernel) against 2.4.2-UP.
>
> Unfortunately the patch has virtually no effect on the 4 kB write / 4 kB read
> case, and that's the most common one. (The number of context switches is cut
> in half, but there is a slight performance loss on a K6, probably due to
> cache thrashing.)
>
> The only solution I see for that problem is larger kernel buffers - more data
> must be queued in the kernel. Either grow them on demand (kmalloc() up to
> <sysctl-limit, around 512 kB> if a request for <= 4096 bytes arrives and would
> block; if the allocation fails, then block), or allocate them at pipe creation
> time, as in Hubertus' patch.
>
> --
> Manfred
>
We can only echo Manfred's recent findings and improvements to
the zerocopy patch he has submitted. Looks like a winner to us.
As you might recall, we had worked from the other end, namely
larger pipe buffers, and below we show that a combination of zero-copy
and large pipes can basically deal with all the performance issues.
We show the performance gain/loss for various benchmarks on
2-way, 1-way and UP systems.

For this test we applied our new large pipe patch on top of Manfred Spraul's
zero-copy patch, and all tests were run on a 2.4.17 kernel.

Here are our initial results, which show the performance improvement of
large pipe buffers + zero copy over the vanilla 2.4.17 kernel. The pipe size
we used for this testing was 32 kB (i.e. 8 pages). In the tables below, TS
is the transfer size used in each run.
Grep over 50 MB file
--------------------
            2-way    1-way       UP
            -----    -----       --
%imp       206.76     8.02    -1.55

LMBench
-------
                   %imp
TS          2-way    1-way       UP
----        -----    -----       --
2k         102.61    -5.22    -7.25
4k            136     4.01    -5.02
6k          95.98   -22.15   -26.67
8k          29.75   -31.84   -44.23
12k          5.99   -14.09   -23.41
16k          8.49   -13.21   -22.19
24k         -0.92    -3.11   -15.39
32k          3.11    -3.14   -14.75
64k         26.55     0.99    25.76
128k        22.69    55.72    -7.43

Pipeflex -c2 -t 20 -x 500 -y 0 -r #TS -w 1 -o 0 -m 0
----------------------------------------------------
                   %imp
TS          2-way    1-way       UP
----        -----    -----       --
2k            1.6     1.07     0.27
4k           0.14        0    -0.41
6k          21.08     0.29    -1.21
8k          15.01    -1.25    -3.07
12k          12.1    -3.81    -5.66
16k         27.02    -4.29     -5.8
24k         48.34    -4.86    -7.47
32k         27.89    -6.34    -8.24
64k         34.74    -6.67   -11.53
128k        58.23     2.09    -2.79
Here are the results showing the % improvement of large pipe (32 kB) + zero
copy over zero copy alone.
Grep over 50 MB file
--------------------
            2-way    1-way       UP
            -----    -----       --
%imp        96.62    -3.77    -4.64

LMBench
-------
TS          2-way    1-way       UP
----        -----    -----       --
2k           26.3   -20.19    -8.54
4k          26.98   -20.56   -29.92
6k          11.69   -47.83   -58.35
8k          -5.26    -53.8   -65.39
12k        -22.11   -28.19   -32.76
16k        -22.62    -30.1   -34.15
24k        -18.88   -27.02   -29.29
32k        -20.44   -28.92   -29.95
64k         -1.38   -29.75     1.92
128k          7.3    -2.02   -18.49

Pipeflex -c2 -t 20 -x 500 -y 0 -r #TS -w 1 -o 0 -m 0
----------------------------------------------------
TS          2-way    1-way       UP
----        -----    -----       --
2k           1.33     0.27    -0.53
4k            6.1     0.14    -0.54
6k          15.38    -0.47    -1.66
8k          22.69     -1.4       -3
12k          24.1    -3.25    -4.98
16k         25.49    -3.88    -5.96
24k         41.48    -5.22    -8.19
32k         21.84    -5.52    -9.04
64k         35.35    -8.32   -14.06
128k         42.8    -8.78   -16.16
Here are the results showing the % improvement (or degradation) introduced
by the large pipe code itself when added to the zero copy patch; for this
run the pipe size was left at the default 4 kB.
Grep
----
                2-way    1-way       UP
                -----    -----       --
%overhead       -0.73    -0.97        0

LMBench
-------
                % overhead
TS          2-way    1-way       UP
----        -----    -----       --
2k          -9.59    13.51     1.75
4k            5.5    -1.22   -10.11
6k          -2.44     3.93    -2.92
8k          -1.61     0.51    -0.92
12k         -1.74     0.71    -0.85
16k         -0.04     0.44    -0.46
24k         -0.15     0.37     0.05
32k         -0.39    -0.02    -0.14
64k          -0.8    -1.05     1.52
128k         1.94    -7.46    -2.96

Pipeflex -c2 -t 20 -x 500 -y 0 -r #TS -w 1 -o 0 -m 0
----------------------------------------------------
                % overhead
TS          2-way    1-way       UP
----        -----    -----       --
2k          -0.27     0.27    -0.26
4k           1.45        0    -0.14
6k           2.35    -0.09    -0.09
8k           7.96    -0.37    -0.14
12k         13.89     0.37     0.05
16k         17.66    -1.96        0
24k         12.41    -0.92    -0.66
32k         16.12    -0.11    -0.64
64k         32.97    -0.35    -0.64
128k        67.87    -0.25    -0.89
Conclusion
----------
Manfred Spraul's new zero copy patch showed a very good performance
improvement for pipeflex as well as grep on both 2-way and 1-way systems.
By adding our large pipe support, we see performance improvements of up to
42% for pipeflex and 96% for grep on 2-way systems.
No significant difference has been observed for LMBench on 1-way systems.
The right way to configure the system would then be to set the buffer size
to the default 4 kB on UP and 1-way systems, while larger pipe buffers
should be encouraged on SMP.
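
One way such a policy could be expressed in code (only a sketch, reusing the
pipe_stat and MAX_PIPE_ORDER names from the patch below; the SMP/UP defaults
shown are illustrative only, not something the patch implements):

/* Sketch: choose the default pipe order from the kernel configuration.
 * Uses pipe_stat.pipe_size_order and MAX_PIPE_ORDER from the patch below;
 * the SMP/UP split shown here is an assumption, not part of the patch. */
static int __init pipe_size_init(void)
{
#ifdef CONFIG_SMP
	pipe_stat.pipe_size_order = MAX_PIPE_ORDER;	/* 8 pages (32 kB with 4 kB pages) */
#else
	pipe_stat.pipe_size_order = 0;			/* keep the one-page, 4 kB pipe */
#endif
	return 0;
}
__initcall(pipe_size_init);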
Below is the large pipe patch, which has to be applied on top of Manfred's
latest zero-copy patch.
diff -urbN linux-2.4.17-manfred-up/fs/pipe.c linux-2.4.17-pipe-up/fs/pipe.c
--- linux-2.4.17-manfred-up/fs/pipe.c Mon Jan 7 15:54:00 2002
+++ linux-2.4.17-pipe-up/fs/pipe.c Thu Jan 10 11:23:05 2002
@@ -28,6 +28,8 @@
* -- Julian Bradfield 1999-06-07.
*/
+struct pipe_stat_t pipe_stat;
+
/* Drop the inode semaphore and wait for a pipe event, atomically */
void pipe_wait(struct inode * inode)
{
@@ -158,7 +160,7 @@
* case. Write into the internal buffer before
* checking for signals/error conditions.
*/
- size_t j = min((size_t)PIPE_SIZE, pio->len);
+ size_t j = min((size_t)PIPE_SIZE(*inode), pio->len);
if (PIPE_LEN(*inode)) BUG();
if (PIPE_START(*inode)) BUG();
if (!copy_from_user(PIPE_BASE(*inode), buf + i, j)) {
@@ -201,8 +203,8 @@
int offset = PIPE_START(*inode)%PIPE_BUF;
if (chars > count)
chars = count;
- if (chars > PIPE_SIZE-offset)
- chars = PIPE_SIZE-offset;
+ if (chars > PIPE_SIZE(*inode)-offset)
+ chars = PIPE_SIZE(*inode)-offset;
if (unlikely(copy_to_user(buf, pipebuf+offset,
chars))) {
if (!read)
read = -EFAULT;
@@ -326,11 +328,11 @@
if (PIPE_PIOLEN(*inode))
goto skip_int_buf;
/* write to internal buffer - could be cyclic */
- while(start = PIPE_LEN(*inode),chars = PIPE_SIZE - start, chars >= min) {
+ while(start = PIPE_LEN(*inode),chars = PIPE_SIZE(*inode) - start, chars >= min) {
start += PIPE_START(*inode);
- start %= PIPE_SIZE;
- if (chars > PIPE_BUF - start)
- chars = PIPE_BUF - start;
+ start %= PIPE_SIZE(*inode);
+ if (chars > PIPE_SIZE(*inode) - start)
+ chars = PIPE_SIZE(*inode) - start;
if (chars > count)
chars = count;
if (unlikely(copy_from_user(PIPE_BASE(*inode)+start,
@@ -470,7 +472,7 @@
if (!PIPE_READERS(*inode) && !PIPE_WRITERS(*inode)) {
struct pipe_inode_info *info = inode->i_pipe;
inode->i_pipe = NULL;
- free_page((unsigned long) info->base);
+ free_pages((unsigned long) info->base, info->order);
kfree(info);
} else {
wake_up_interruptible(PIPE_WAIT(*inode));
@@ -604,8 +606,12 @@
struct inode* pipe_new(struct inode* inode)
{
unsigned long page;
+ int pipe_order = pipe_stat.pipe_size_order;
+
+ if (pipe_order > MAX_PIPE_ORDER)
+ pipe_order = MAX_PIPE_ORDER;
- page = __get_free_page(GFP_USER);
+ page = __get_free_pages(GFP_USER, pipe_order);
if (!page)
return NULL;
@@ -619,10 +625,11 @@
PIPE_START(*inode) = PIPE_LEN(*inode) = PIPE_PIOLEN(*inode) = 0;
PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;
+ PIPE_ORDER(*inode) = pipe_order;
return inode;
fail_page:
- free_page(page);
+ free_pages(page, pipe_order);
return NULL;
}
@@ -737,7 +744,7 @@
close_f12_inode_i:
put_unused_fd(i);
close_f12_inode:
- free_page((unsigned long) PIPE_BASE(*inode));
+ free_pages((unsigned long) PIPE_BASE(*inode), PIPE_ORDER(*inode));
kfree(inode->i_pipe);
inode->i_pipe = NULL;
iput(inode);
diff -urbN linux-2.4.17-manfred-up/include/linux/pipe_fs_i.h linux-2.4.17-pipe-up/include/linux/pipe_fs_i.h
--- linux-2.4.17-manfred-up/include/linux/pipe_fs_i.h Mon Jan 7 15:54:00 2002
+++ linux-2.4.17-pipe-up/include/linux/pipe_fs_i.h Tue Jan 8 13:42:18 2002
@@ -2,6 +2,7 @@
#define _LINUX_PIPE_FS_I_H
#define PIPEFS_MAGIC 0x50495045
+#define MAX_PIPE_ORDER 3
struct pipe_inode_info {
wait_queue_head_t wait;
char *base;
@@ -13,12 +14,21 @@
unsigned int writers;
unsigned int r_counter;
unsigned int w_counter;
+ unsigned int order;
};
+struct pipe_stat_t{
+ int pipe_size_order;
+ int pipe_seg_order;
+};
+
+extern struct pipe_stat_t pipe_stat;
+
/* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
memory allocation, whereas PIPE_BUF makes atomicity guarantees. */
-#define PIPE_SIZE PAGE_SIZE
+#define PIPE_SIZE(inode) ((1 << PIPE_ORDER(inode)) * PAGE_SIZE)
+#define PIPE_ORDER(inode) ((inode).i_pipe->order)
#define PIPE_SEM(inode) (&(inode).i_sem)
#define PIPE_WAIT(inode) (&(inode).i_pipe->wait)
#define PIPE_BASE(inode) ((inode).i_pipe->base)
@@ -31,8 +41,8 @@
#define PIPE_RCOUNTER(inode) ((inode).i_pipe->r_counter)
#define PIPE_WCOUNTER(inode) ((inode).i_pipe->w_counter)
-#define PIPE_FREE(inode) (PIPE_SIZE - PIPE_LEN(inode))
-#define PIPE_END(inode) ((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE-1))
+#define PIPE_FREE(inode) (PIPE_SIZE(inode) - PIPE_LEN(inode))
+#define PIPE_END(inode) ((PIPE_START(inode) + PIPE_LEN(inode)) & (PIPE_SIZE(inode)-1))
/* Drop the inode semaphore and wait for a pipe event, atomically */
void pipe_wait(struct inode * inode);
diff -urbN linux-2.4.17-manfred-up/include/linux/sysctl.h linux-2.4.17-pipe-up/include/linux/sysctl.h
--- linux-2.4.17-manfred-up/include/linux/sysctl.h Tue Jan 8 16:16:55 2002
+++ linux-2.4.17-pipe-up/include/linux/sysctl.h Tue Jan 8 16:04:01 2002
@@ -543,6 +543,7 @@
FS_LEASES=13, /* int: leases enabled */
FS_DIR_NOTIFY=14, /* int: directory notification enabled */
FS_LEASE_TIME=15, /* int: maximum time to wait for a lease break */
+ FS_PIPE_SIZE=16, /* int: number of pages allocated for PIPE */
};
/* CTL_DEBUG names: */
diff -urbN linux-2.4.17-manfred-up/kernel/sysctl.c linux-2.4.17-pipe-up/kernel/sysctl.c
--- linux-2.4.17-manfred-up/kernel/sysctl.c Fri Dec 21 12:42:04 2001
+++ linux-2.4.17-pipe-up/kernel/sysctl.c Tue Jan 8 13:42:20 2002
@@ -307,6 +307,8 @@
sizeof(int), 0644, NULL, &proc_dointvec},
{FS_LEASE_TIME, "lease-break-time", &lease_break_time, sizeof(int),
0644, NULL, &proc_dointvec},
+ {FS_PIPE_SIZE, "pipe-sz", &pipe_stat, 2*sizeof(int),
+ 0644, NULL, &proc_dointvec},
{0}
};
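
With the patch applied, the order could also be changed at run time through
the new fs sysctl; here is a hypothetical, untested userspace example (the
path and values follow from the sysctl table entry above):

/* Hypothetical example: set the pipe size order through the fs "pipe-sz"
 * sysctl registered above (two integers: pipe_size_order, pipe_seg_order;
 * the second is not used by the hunks shown). */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/fs/pipe-sz", "w");

	if (!f) {
		perror("fopen /proc/sys/fs/pipe-sz");
		return 1;
	}
	/* order 3 = 8 pages = 32 kB pipes, the size used in the tests above */
	fprintf(f, "3 0\n");
	fclose(f);
	return 0;
}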
On Tue, Jan 15, 2002 at 12:20:25PM -0500, Hubertus Franke wrote:
> Conclusion
> ----------
> Manfred Spraul's new zero copy patch showed a very good performance
> improvement for pipeflex as well as grep on both 2-way and 1-way systems.
> By adding our large pipe support, we see performance improvements of up to
> 42% for pipeflex and 96% for grep on 2-way systems.
> No significant difference has been observed for LMBench on 1-way systems.
> The right way to configure the system would then be to set the buffer size
> to the default 4 kB on UP and 1-way systems, while larger pipe buffers
> should be encouraged on SMP.
Any conclusions are incomplete without mentioning the serious (30%+)
degradation on UP systems under a good chunk of the benchmarks. Such
aspects of the patch make it unsuitable as-is for both the mainstream
and vendor kernels.
-ben
From: "Benjamin LaHaise" <[email protected]>
>
> Any conclusions are incomplete without mentioning the serious (30%+)
> degradation on UP systems under a good chunk of the benchmarks. Such
> aspects of the patch make it unsuitable as-is for both the mainstream
> and vendor kernels.
>
My patch is definitely still a work in progress - right now I have again
broken the -ENOMEM and -EFAULT handling.
--
Manfred
On Tue, Jan 15, 2002 at 09:08:45PM +0100, Manfred Spraul wrote:
> My patch is definitely still a work in progress - right now I have again
> broken the -ENOMEM and -EFAULT handling.
I am aware of that, but the lse-tech posting made it sound as if things are
great now since the SMP numbers improved. Please folks, remember that UP
numbers are important too.
-ben
--
Fish.
On Tue, Jan 15, 2002 at 03:20:38PM -0500, Benjamin LaHaise wrote:
> On Tue, Jan 15, 2002 at 09:08:45PM +0100, Manfred Spraul wrote:
> > My patch is definitely still a work in progress - right now I have again
> > broken the -ENOMEM and -EFAULT handling.
>
> I am aware of that, but the lse-tech posting made it sound as if things are
> great now since the SMP numbers improved. Please folks, remember that UP
> numbers are important too.
>
> -ben
> --
> Fish.
Ben, yes, you are right - on a second reading, the lse-tech posting is
misleading. As reported previously (http://lse.sourceforge.net/pipe/pipe-report),
the UP numbers see degradations for LMBench; the other benchmarks are OK.
This is not solved by an integration of zero-copy with large pipes either.
We have, however, shown that for SMP systems adding larger pipes to
zero-copy pipes makes sense, and that for UP and 1-way systems sticking with
a one-page pipe does not degrade Manfred's patch.
It hence boils down to a proper parameterization of the pipe size depending
on the configuration.
-- Hubertus