2002-07-17 22:51:58

by Andrea Arcangeli

[permalink] [raw]
Subject: 2.4.19rc2aa1

I would appreciate any feedback on the last patches for the i_size
atomic accesses on 32bit archs. Thanks,

URL:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1/

diff between 2.4.19rc1aa2 and 2.4.19rc2aa1:

Only in 2.4.19rc1aa2: 000_e100-2.0.30-k1.gz
Only in 2.4.19rc2aa1: 000_e100-2.1.6.gz
Only in 2.4.19rc1aa2: 000_e1000-4.2.17-k1.gz
Only in 2.4.19rc2aa1: 000_e1000-4.3.2.gz

Upgrade to latest drivers to see if they fix the reports.

Only in 2.4.19rc1aa2: 00_free_pgtable-and-p4-tlb-race-fixes-1

Merged into mainline.

Only in 2.4.19rc1aa2: 00_nfs-bkl-2
Only in 2.4.19rc2aa1: 00_nfs-bkl-3
Only in 2.4.19rc1aa2: 00_rwsem-fair-29
Only in 2.4.19rc1aa2: 00_rwsem-fair-29-recursive-8
Only in 2.4.19rc2aa1: 00_rwsem-fair-30
Only in 2.4.19rc2aa1: 00_rwsem-fair-30-recursive-8
Only in 2.4.19rc1aa2: 00_sched-O1-rml-2.4.19-pre9-1.gz
Only in 2.4.19rc2aa1: 00_sched-O1-rml-2.4.19-pre9-2.gz
Only in 2.4.19rc2aa1: 10_rawio-vary-io-10
Only in 2.4.19rc1aa2: 10_rawio-vary-io-9
Only in 2.4.19rc1aa2: 70_xfs-1.1-3.gz
Only in 2.4.19rc2aa1: 70_xfs-1.1-4.gz
Only in 2.4.19rc1aa2: 90_init-survive-threaded-race-4
Only in 2.4.19rc2aa1: 90_init-survive-threaded-race-5
Only in 2.4.19rc1aa2: 91_zone_start_pfn-5
Only in 2.4.19rc2aa1: 91_zone_start_pfn-6

Rediffed.

Only in 2.4.19rc1aa2: 20_pte-highmem-25
Only in 2.4.19rc2aa1: 20_pte-highmem-26

Rediffed.

Only in 2.4.19rc1aa2: 00_readv-writev-1

Dropped to stay in sync with mainline (alpha has some changes now).

Only in 2.4.19rc2aa1: 00_set_64bit-atomic-1

I noticed set_64bit wasn't atomic due to the lack of a lock prefix. This
fixes the problem, making sure the cpu sees coherent values while
walking pagetables.
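
For reference, a minimal sketch of the kind of change described here,
assuming the usual i386 set_64bit shape; this is not the actual
00_set_64bit-atomic-1 patch and the exact asm may differ:

	/*
	 * Hedged sketch (not the actual patch): on i386, set_64bit loops on
	 * cmpxchg8b until the 8-byte store succeeds.  Without the lock prefix
	 * the compare-and-store is not atomic against other cpus, so another
	 * cpu walking the pagetables could see a half-updated 64-bit entry.
	 */
	static inline void set_64bit(unsigned long long *ptr, unsigned long long val)
	{
		unsigned long long prev = *ptr;	/* initial guess for the compare */

		__asm__ __volatile__(
			"1:	lock; cmpxchg8b %0\n\t"	/* the "lock" is the fix */
			"	jnz 1b"			/* lost the race: edx:eax reloaded, retry */
			: "+m" (*ptr), "+A" (prev)
			: "b" ((unsigned long) val),		/* low 32 bits to store */
			  "c" ((unsigned long) (val >> 32))	/* high 32 bits to store */
			: "cc");
	}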

Only in 2.4.19rc1aa2: 00_setfl-race-fix-1
Only in 2.4.19rc2aa1: 00_setfl-race-fix-2

Merge the ioctl part of the new ->fasync locking from Marcus Alanen.
BTW, it looks like the fasync callback could also be called without the
big kernel lock (fasync_helper seems thread safe), but I preferred not
to risk it in 2.4 in case some driver is relying on the big kernel lock
somehow.
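
For context, a hedged sketch of the ->fasync/fasync_helper pattern
being discussed; the mydev_* names are made up for illustration:

	#include <linux/fs.h>

	/*
	 * Illustrative only: a typical 2.4 driver ->fasync method is a thin
	 * wrapper around fasync_helper(), which maintains the async queue
	 * with its own locking; whether it really needs the BKL on top is
	 * the open question above.
	 */
	static struct fasync_struct *mydev_async_queue;

	static int mydev_fasync(int fd, struct file *filp, int on)
	{
		return fasync_helper(fd, filp, on, &mydev_async_queue);
	}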

Only in 2.4.19rc1aa2: 00_shm_destroy-deadlock-1
Only in 2.4.19rc2aa1: 00_shm_destroy-deadlock-2

Mainline changes shm_tot inside the shmid lock, which isn't
needed; the semaphore serializes shm_tot instead.

Only in 2.4.19rc2aa1: 00_sock_fasync-memleak-1

Fix memleak if sock lookup fails, from Robert Love.

Only in 2.4.19rc2aa1: 00_wait_kio-cleanup-1

Drop the pointless buffer_locked check; wait_on_buffer does
it internally (noticed by Jens).

Only in 2.4.19rc1aa2: 07_qlogicfc-1.gz
Only in 2.4.19rc2aa1: 07_qlogicfc-2.gz
Only in 2.4.19rc1aa2: 08_qlogicfc-template-aa-1
Only in 2.4.19rc2aa1: 08_qlogicfc-template-aa-2
Only in 2.4.19rc1aa2: 09_qlogic-link-1
Only in 2.4.19rc2aa1: 09_qlogic-link-2

Upgrade to 6.1b2 driver.

Only in 2.4.19rc2aa1: 50_uml-patch-2.4.18-40

New uml updates from Jeff (36->40).

Only in 2.4.19rc2aa1: 76_xfs-64bit-1

64bit fixes.

Only in 2.4.19rc1aa2: 80_x86_64-common-code-4
Only in 2.4.19rc2aa1: 80_x86_64-common-code-5
Only in 2.4.19rc2aa1: 81_x86_64-arch-2.4.19rc1-1.gz
Only in 2.4.19rc1aa2: 81_x86_64-arch-6.gz
Only in 2.4.19rc1aa2: 82_x86-64-compile-aa-6
Only in 2.4.19rc1aa2: 82_x86-64-pte-highmem-2
Only in 2.4.19rc2aa1: 82_x86_64-suse-2
Only in 2.4.19rc2aa1: 83_x86_64-cvs-020716-1
Only in 2.4.19rc1aa2: 83_x86_64-setup64-compile-1
Only in 2.4.19rc2aa1: 84_x86-64-mtrr-compile-1
Only in 2.4.19rc1aa2: 84_x86_64-io-compile-1
Only in 2.4.19rc1aa2: 85_x86_64-mmx-xmm-init-4
Only in 2.4.19rc1aa2: 86_x86_64-FIOQSIZE-1
Only in 2.4.19rc2aa1: 87_x86_64-o1sched-1
Only in 2.4.19rc2aa1: 88_x86_64-poll-1

Synchronize with CVS, code from SuSE Labs.

Only in 2.4.19rc2aa1: 90_proc-mapped-base-1

Allow tasks to choose the start of the heap using
/proc/<pid>/mapped_base.

Only in 2.4.19rc1aa2: 94_numaq-tsc-3
Only in 2.4.19rc2aa1: 94_numaq-tsc-4

Don't disable the tsc feature anymore unless the user specifies
notsc; the user needs to specify notsc anyway to be sure to get a
failure if userspace tries to read the tsc.

Only in 2.4.19rc2aa1: 95_fsync-corruption-fix-2

Do all the fsync in two passes, and avoid rolling the bh in the inode
list every time we write to it.

Only in 2.4.19rc2aa1: 96_inode_read_write-atomic-1
Only in 2.4.19rc2aa1: 97_i_size-corruption-fixes-1

While checking for another potential race, I found this race between
i_size writers and i_size readers. The i_size readers aren't taking
the i_sem, so they can read i_size while only the upper 32 bits have
been updated; any update of i_size that changes both the high 32 bits
and the low 32 bits of the 64-bit i_size may therefore lead a reader
of the i_size to an out-of-bounds result (one such reader is writepage,
which will then go and ask the fs to allocate indirect blocks at weird
offsets beyond the end of the i_size). This race can happen only on
SMP, for example while growing the file size over 4G. Only the readers
inside the i_sem don't risk getting random i_size values. The patch
fixes this incrementally, by implementing two methods (i_size_read,
i_size_write) that ensure the lockless readers get coherent i_size
values. The rules are: 1) i_size_write must be used for all i_size
updates (at least when there can be potential parallel readers outside
the i_sem), 2) i_size_read must be used for all lockless reads when
an i_size change can happen from under us.

The i_size_read/write on x86 are implemented with read_64bit and
set_64bit (see also the above fix making set_64bit atomic, which
should also be needed for the PAE pgd). The only idea I had for
read_64bit on x86 has been to use cmpxchg8b for it too, with the
exchange values matching the compare values read off the pointer. I
think it should work fine, but I would appreciate it if anybody could
check it or suggest a better way to do it. And now I also realized
this kernel will segfault on a 486/386 (cmpxchg8b first appeared on
the Pentium); I will fix it tomorrow (not a very high priority) with a
spinlock when the kernel is compiled for 386/486, since scalability of
386/486 SMP doesn't matter much :) In the meantime I'd appreciate any
feedback on this bugfix, thanks.
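
To make the cmpxchg8b idea concrete, here is a hedged sketch of what
such a read_64bit could look like; this is a reconstruction of the
description above, not the code from the patch:

	/*
	 * Illustrative only.  The (possibly torn) plain load is used as both
	 * the compare value (edx:eax) and the exchange value (ecx:ebx).  If
	 * the compare matches, the same value is stored back and memory is
	 * unchanged; if it does not match, lock cmpxchg8b atomically loads
	 * the real 64-bit value into edx:eax.  Either way we return a
	 * coherent snapshot.  A 386/486 build would need a spinlock fallback
	 * instead, since cmpxchg8b only exists from the Pentium on.
	 */
	static inline unsigned long long read_64bit(unsigned long long *ptr)
	{
		unsigned long long val = *ptr;	/* may be torn; only a guess */

		__asm__ __volatile__(
			"lock; cmpxchg8b %1"
			: "+A" (val), "+m" (*ptr)
			: "b" ((unsigned long) val),
			  "c" ((unsigned long) (val >> 32))
			: "cc");

		return val;
	}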


Andrea


2002-07-18 10:13:04

by Tobias Ringstrom

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1

Hello Andrea!

Why are you not changing the EXTRAVERSION in your patch? It would make it
much easier to differentiate between kernels.

/Tobias

2002-07-18 21:32:07

by Thunder from the hill

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1

Hi,

On Thu, 18 Jul 2002, Tobias Ringstrom wrote:
> Why are you not changing the EXTRAVERSION in your patch? It would make it
> much easier to differentiate between kernels.

I did that for me.

# uname -r
2.4.19-rc2-aa1
#

It's working fine for some hours now. The EXTRAVERSION is the only thing
that I changed, and -rc2-aa1 works just fine. But my bdflush seems - with
the same values as from -rc1-aa2 - not to have 100% of the old efficiency
any more.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-19 16:55:43

by J.A. Magallon

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1


On 2002.07.18 Andrea Arcangeli wrote:
>I would appreciate any feedback on the last patches for the i_size
>atomic accesses on 32bit archs. Thanks,
>
>URL:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1.gz
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1/
>
>diff between 2.4.19rc1aa2 and 2.4.19rc2aa1:
>
>Only in 2.4.19rc1aa2: 000_e100-2.0.30-k1.gz
>Only in 2.4.19rc2aa1: 000_e100-2.1.6.gz
>Only in 2.4.19rc1aa2: 000_e1000-4.2.17-k1.gz
>Only in 2.4.19rc2aa1: 000_e1000-4.3.2.gz
>
> Upgrade to latest drivers to see if they fix the reports.

Any decent changelog for these two? The docs on Intel's website and in the
downloadable packages say just nothing.

And they look interesting. I am running 2.4.19-rc2-jam1 on the cluster and
NetPipe performance jumped from 400 to 500 Mbit/s (i.e., 100 Mbit/s just for
free).

Ehem, the new kernel also included smptimers, so it is really the mix that
improves throughput. Can anyone say whether the scalable timers can influence
network performance?

TIA

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc2-jam1, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.8mdk)

2002-07-19 17:01:24

by J.A. Magallon

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1


On 2002.07.18 Andrea Arcangeli wrote:
>I would appreciate any feedback on the last patches for the i_size
>atomic accesses on 32bit archs. Thanks,
>
>URL:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1.gz
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19rc2aa1/
>
>diff between 2.4.19rc1aa2 and 2.4.19rc2aa1:
>
>Only in 2.4.19rc1aa2: 000_e100-2.0.30-k1.gz
>Only in 2.4.19rc2aa1: 000_e100-2.1.6.gz
>Only in 2.4.19rc1aa2: 000_e1000-4.2.17-k1.gz
>Only in 2.4.19rc2aa1: 000_e1000-4.3.2.gz
>

More on this.

We have two interfaces:
04:04.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
03:01.0 Ethernet controller: Intel Corp. 82543GC Gigabit Ethernet Controller (rev 02)

NetPipe (tcp) shows numbers like 80Mb/s for e100 and 500Mb/s for e1000, so
efficiency relative to link speed is much, much higher for the e100
driver+card than for the e1000. I have to dig; perhaps e100 is doing zerocopy
and e1000 is not?

Any ideas ?

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc2-jam1, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.8mdk)

2002-07-19 23:03:09

by Feldman, Scott

[permalink] [raw]
Subject: RE: 2.4.19rc2aa1

J.A. Magallon wrote:

> >diff between 2.4.19rc1aa2 and 2.4.19rc2aa1:
> >
> >Only in 2.4.19rc1aa2: 000_e100-2.0.30-k1.gz
> >Only in 2.4.19rc2aa1: 000_e100-2.1.6.gz
> >Only in 2.4.19rc1aa2: 000_e1000-4.2.17-k1.gz
> >Only in 2.4.19rc2aa1: 000_e1000-4.3.2.gz
> >

>More on this.

>We have two interfaces:
>04:04.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
>03:01.0 Ethernet controller: Intel Corp. 82543GC Gigabit Ethernet Controller (rev 02)

>NetPipe (tcp) shows numbers like 80Mb/s for e100 and 500Mb/s for e1000. So
>efficiency is much much higher for e100 driver+card than e1000. I have to
>dig, perhaps e100 is doing zerocopy and e1000 is not ?

>Any ideas ?

If e100 is sending from the zerocopy path, e1000 is doing the same.

There are several factors that may be limiting your throughput on e1000.
Assuming you have enough CPU umph and bus bandwidth, and your netpipe link
partner and switch are willing, you should be able to approach wire speed.

-scott

2002-07-20 00:07:46

by J.A. Magallon

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1


On 2002.07.20 "Feldman, Scott" wrote:
>J.A. Magallon wrote:
>
>> >diff between 2.4.19rc1aa2 and 2.4.19rc2aa1:
>> >
>> >Only in 2.4.19rc1aa2: 000_e100-2.0.30-k1.gz
>> >Only in 2.4.19rc2aa1: 000_e100-2.1.6.gz
>> >Only in 2.4.19rc1aa2: 000_e1000-4.2.17-k1.gz
>> >Only in 2.4.19rc2aa1: 000_e1000-4.3.2.gz
>> >
>
>>More on this.
>
>>We have two interfaces:
>>04:04.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
>>03:01.0 Ethernet controller: Intel Corp. 82543GC Gigabit Ethernet Controller (rev 02)
>
>>NetPipe (tcp) shows numbers like 80Mb/s for e100 and 500Mb/s for e1000. So
>>efficiency is much much higher for e100 driver+card than e1000. I have to
>>dig, perhaps e100 is doing zerocopy and e1000 is not ?
>
>>Any ideas ?
>
>If e100 is sending from the zerocopy path, e1000 is doing the same.
>

e100.txt:
- Support for Zero copy on 82550-based adapters. This feature provides
faster data throughput and significant CPU usage improvement in systems
that use the relevant system call (sendfile(2)).
(does this include the on-board 82557, not even listed in e100.txt ? )

e1000.txt:
- Zero copy. This feature provides faster data throughput. Enabled by
default in supporting kernels. It is not supported on the Intel(R)
PRO/1000 Gigabit Server Adapter. (==82542)
(so I
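
As an aside, the "relevant system call" the zero-copy notes refer to is
sendfile(2); a hedged userspace sketch of a sender using it follows
(send_whole_file is a made-up name, file/socket setup is omitted):

	/*
	 * Hedged illustration (not from the thread): the zero-copy transmit
	 * path in these drivers is exercised by sendfile(2), which lets the
	 * stack send straight from the page cache instead of copying
	 * through a userspace buffer.
	 */
	#include <sys/sendfile.h>
	#include <sys/types.h>

	static ssize_t send_whole_file(int sock_fd, int file_fd, size_t len)
	{
		off_t offset = 0;
		size_t total = 0;

		while (total < len) {
			ssize_t sent = sendfile(sock_fd, file_fd, &offset, len - total);
			if (sent <= 0)
				return sent;	/* 0 at EOF, -1 on error */
			total += sent;
		}
		return total;
	}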

>There are several factors that may be limiting your throughput on e1000.
>Assuming you have enough CPU umph and bus bandwidth, and your netpipe link
>partner and switch are willing, you should be able to approach wire speed.
>

Master/sender:
Dual P4Xeon 1.8GHz, Pro/1000 T Server Board (64bit slot, 66MHz)
Slave/receiver:
Dual PIII 1GHz, same board, same slot
Switch:
Intel(R) NetStructure(TM) 470T

netpipe over e100:
Node receiver...
Master transmitter...
Latency: 0.000057
Now starting main loop
0: 4096 bytes 7 times --> 24.96 Mbps in 0.001252 sec
1: 8192 bytes 7 times --> 46.45 Mbps in 0.001345 sec
2: 12288 bytes 92 times --> 46.03 Mbps in 0.002037 sec
3: 16384 bytes 81 times --> 58.32 Mbps in 0.002143 sec
4: 20480 bytes 87 times --> 56.26 Mbps in 0.002777 sec
5: 24576 bytes 72 times --> 71.59 Mbps in 0.002619 sec
6: 28672 bytes 79 times --> 62.76 Mbps in 0.003486 sec
7: 32768 bytes 61 times --> 75.23 Mbps in 0.003323 sec
8: 36864 bytes 65 times --> 66.52 Mbps in 0.004228 sec
9: 40960 bytes 52 times --> 77.20 Mbps in 0.004048 sec
10: 45056 bytes 55 times --> 69.32 Mbps in 0.004959 sec
11: 49152 bytes 45 times --> 74.98 Mbps in 0.005002 sec
12: 53248 bytes 45 times --> 71.41 Mbps in 0.005689 sec
13: 57344 bytes 40 times --> 76.24 Mbps in 0.005738 sec
14: 61440 bytes 40 times --> 73.03 Mbps in 0.006419 sec
15: 65536 bytes 36 times --> 77.13 Mbps in 0.006483 sec
16: 69632 bytes 36 times --> 74.10 Mbps in 0.007170 sec
17: 73728 bytes 32 times --> 78.04 Mbps in 0.007208 sec
18: 77824 bytes 32 times --> 75.42 Mbps in 0.007872 sec
19: 81920 bytes 30 times --> 78.61 Mbps in 0.007950 sec
20: 86016 bytes 29 times --> 76.45 Mbps in 0.008584 sec

(around 75Mb/s, 75% of bandwidth)

netpipe over e1000:
Node receiver...
Master transmitter...
Latency: 0.000058
Now starting main loop
0: 4096 bytes 7 times --> 204.44 Mbps in 0.000153 sec
1: 8192 bytes 7 times --> 303.61 Mbps in 0.000206 sec
2: 12288 bytes 607 times --> 361.83 Mbps in 0.000259 sec
3: 16384 bytes 643 times --> 408.76 Mbps in 0.000306 sec
4: 20480 bytes 613 times --> 424.47 Mbps in 0.000368 sec
5: 24576 bytes 543 times --> 458.94 Mbps in 0.000409 sec
6: 28672 bytes 509 times --> 474.85 Mbps in 0.000461 sec
7: 32768 bytes 465 times --> 491.99 Mbps in 0.000508 sec
8: 36864 bytes 430 times --> 443.68 Mbps in 0.000634 sec
9: 40960 bytes 350 times --> 448.35 Mbps in 0.000697 sec
10: 45056 bytes 322 times --> 455.34 Mbps in 0.000755 sec
11: 49152 bytes 301 times --> 464.07 Mbps in 0.000808 sec
12: 53248 bytes 283 times --> 464.18 Mbps in 0.000875 sec
13: 57344 bytes 263 times --> 467.06 Mbps in 0.000937 sec
14: 61440 bytes 247 times --> 476.95 Mbps in 0.000983 sec
15: 65536 bytes 237 times --> 482.73 Mbps in 0.001036 sec
16: 69632 bytes 226 times --> 488.26 Mbps in 0.001088 sec
17: 73728 bytes 216 times --> 473.16 Mbps in 0.001189 sec
18: 77824 bytes 198 times --> 472.22 Mbps in 0.001257 sec
19: 81920 bytes 188 times --> 481.13 Mbps in 0.001299 sec
20: 86016 bytes 182 times --> 478.84 Mbps in 0.001371 sec

(peak at 491, not even 50% bandwidth...)

I have not played with sysctl:
annwn:/proc/sys/net/ipv4> cat tcp_wmem
4096 16384 131072
annwn:/proc/sys/net/ipv4> cat tcp_rmem
4096 87380 174760

Could something be limiting bandwidth in the switch?

TIA

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc2-jam1, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.8mdk)

2002-07-20 00:25:01

by J.A. Magallon

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1


On 2002.07.20 "Feldman, Scott" wrote:
>Jamagallon wrote:
>
[...]
>>We have two interfaces:
>>04:04.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
>03:01.0 Ethernet
>>controller: Intel Corp. 82543GC Gigabit Ethernet Controller (rev 02)
>
[...]
>
>There are several factors that may be limiting your throughput on e1000.
>Assuming you have enough CPU umph and bus bandwidth, and your netpipe link
>partner and switch are willing, you should be able to approach wire speed.
>

More info, in case it is useful. During a similar run of netpipe, at about
500Mb/s, the switch stats look like this:

------------------------------------------------------------------------------
Switch Overview < 2 sec >
------------------------------------------------------------------------------
Update interval:< 2 sec >
Port TX/sec RX/sec %Util. Port TX/sec RX/sec %Util.

1 |23773 |30928 | 26 | 5 |0 |0 | 0 |

2 |30923 |23773 | 26 | 6 |0 |0 | 0 |

3 |0 |0 | 0 | 7-GBIC|n/a |n/a | n/a |

4 |0 |0 | 0 | 8-GBIC|n/a |n/a | n/a |

Firmware versions:
Boot PROM version: v2.00.04
Firmware version: v2.00.14

I have seen on Intel's website that there is a new firmware numbered 2.00.17
that corrects some issues. I will try it...

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-rc2-jam1, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.8mdk)

2002-07-23 14:59:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.19rc2aa1

On Thu, Jul 18, 2002 at 03:34:51PM -0600, Thunder from the hill wrote:
> Hi,
>
> On Thu, 18 Jul 2002, Tobias Ringstrom wrote:
> > Why are you not changing the EXTRAVERSION in your patch? It would make it
> > much easier to differentiate between kernels.
>
> I did that for me.
>
> # uname -r
> 2.4.19-rc2-aa1
> #
>
> It's working fine for some hours now. The EXTRAVERSION is the only thing
> that I changed, and -rc2-aa1 works just fine. But my bdflush seems - with
> the same values as from -rc1-aa2 - not to have 100% of the old efficiency
> any more.

I guess it's your userspace workload that changed; there are no
bdflush/vm/blkdev related changes between rc1aa2 and rc2aa1 that could
explain a change of behaviour in bdflush.

Andrea