Hello all,
Again, I did an rsync operation as described in
"[2.4.17rc1] Swapping", MID <[email protected]>.
This time, the kernel had a swap partition of about 200MB. Once the
swap partition was fully used, the kernel killed all knode processes.
Nearly 50% of RAM was being used for buffers at that moment. Why is
so much memory used for buffers?
I know I repeat it, but please:
Fix the VM management in kernel 2.4.x. It's unusable. Believe
me! For comparison: kernel 2.2.19 needed hardly any swap for
the same operation!
Please consider that I'm using 512 MB of RAM. This should - or rather,
must - be enough to do the rsync operation with hardly any swapping -
kernel 2.2.19 manages it!
The performance of kernel 2.4.18pre1 is very poor, which is no surprise,
because the machine swaps nearly nonstop.
Regards,
Andreas Hartmann
On Fri, 28 Dec 2001, Andreas Hartmann wrote:
> Fix the VM management in kernel 2.4.x. It's unusable. Believe
> me! For comparison: kernel 2.2.19 needed hardly any swap for
> the same operation!
If you feel adventurous you can try my rmap based
VM, the latest version is on:
http://surriel.com/patches/2.4/2.4.17-rmap-8
This VM should behave a bit better (it does on my machines),
but isn't yet bug-free enough to be used on production machines.
Also, the changes it introduces are, IMHO, too big for a stable
kernel series ;)
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
Andrew Morton wrote:
> Andreas Hartmann wrote:
>
>>Hello all,
>>
>>Again, I did an rsync operation as described in
>>"[2.4.17rc1] Swapping", MID <[email protected]>.
>>
>>This time, the kernel had a swap partition of about 200MB. Once the
>>swap partition was fully used, the kernel killed all knode processes.
>>Nearly 50% of RAM was being used for buffers at that moment. Why is
>>so much memory used for buffers?
>>
>
> It's very strange. The large amount of buffercache usage is to
> be expected from statting 20 gigs worth of files, but the kernel
> should (and normally does) free up that memory on demand.
>
> Which filesystem(s) are you using?
>
> Are you using NFS/NBD/SMBFS or anything like that?
>
Basically, I'm using NFS and reiserfs. But I haven't used any file on NFS
since the last reboot - and the NFS shares haven't been mounted.
There are 2 IDE-Harddisks in this machine:
hda: WDC WD205AA, ATA DISK drive (40079088 sectors (20520 MB) w/2048KiB
cache, CHS=2494/255/63, UDMA(66))
hdb: WDC WD450AA-00BAA0, ATA DISK drive (87930864 sectors (45021 MB)
w/2048KiB cache, CHS=5473/255/63, UDMA(66))
On hda, I have got 7 partitions (plus one little "boot" partition, which
isn't mounted, and a 200MB swap partition).
On hdb, I have got 12 partitions plus one more swap partition, which is
now 1GB.
All partitions are formatted with reiserfs.
Regards,
Andreas Hartmann
> Fix the VM management in kernel 2.4.x. It's unusable. Believe
> me! For comparison: kernel 2.2.19 needed hardly any swap for
> the same operation!
> The performance of kernel 2.4.18pre1 is very poor, which is no surprise,
> because the machine swaps nearly nonstop.
Does the 2.4.9 Red Hat kernel (if you are using RH) or 2.4.12-ac8 show the
same problem?
Andreas Hartmann wrote:
> Hello all,
>
> Again, I did an rsync operation as described in
> "[2.4.17rc1] Swapping", MID <[email protected]>.
>
Some other examples:
I just did a
cp -Rd linux-2.4.16 linux-2.4.17
(with object files). Before starting this action, I had about 120 MB of
free RAM. During the copy - I did nothing else meanwhile - 2MB of swap was
used and 12 MB of RAM were free. The biggest part of memory was
used for caching - which is OK.
After copying, only 10 MB of memory had been freed again. 490MB of RAM
were now in use (mostly for caching).
Starting from this situation, I launched another little cp action:
cp -Rd linux-2.4.18pre1 linux-2.4.test
(again including object files).
Result: the swap usage stayed nearly constant; nevertheless there were
6 accesses to swap.
Now, I deleted the linux-2.4.test-directory with
rm -R linux-2.4.test
This action was very fast (approximately 1s).
Afterwards, a big part of the cache memory was freed again (about
100MB). Now, 122MB of RAM were free again.
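As an aside: a throwaway helper like the following untested sketch - it just
greps a handful of lines out of /proc/meminfo - makes it easy to capture these
numbers before, during and after each step instead of reading them off top:

/* meminfo-snap.c - untested sketch, not part of any patch.  It only greps
 * a few lines out of /proc/meminfo (2.4 exports these fields) so the
 * figures above can be logged instead of being read off top by hand. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        static const char *keys[] = { "MemTotal:", "MemFree:", "Buffers:",
                                      "Cached:", "SwapTotal:", "SwapFree:" };
        char line[256];
        unsigned int i;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
                        if (strncmp(line, keys[i], strlen(keys[i])) == 0)
                                fputs(line, stdout);
        fclose(f);
        return 0;
}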
Next example (running after the last):
SuSE run-crons have been running. This means:
-> updatedb
-> sort
-> frcode
-> find
-> mandb
47MB swap used, 2/3 of memory is used for buffers (Don't forget: I've
got 512MB of RAM) and about 30MB of RAM are free.
My observation:
Why does the kernel swap in order to get free memory for caching /
buffering? I can't see any sense in this. Wouldn't it be better to shrink
the caching / buffering RAM to the amount of memory that is actually free?
Swapping should principally be used when RAM runs out for real memory
(memory that is used by running applications). First of all, the
memory usage of cache and buffers should be reduced before starting to
swap, IMHO.
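Spelled out as a toy userspace model - this has nothing to do with the real
mm/vmscan.c, and the numbers as well as the CACHE_FLOOR constant are invented -
the ordering I mean would look like this:

/* Toy model of the proposed ordering: satisfy an allocation from free
 * pages first, then by shrinking cache/buffers down to a floor, and only
 * as a last resort by swapping out pages owned by programs. */
#include <stdio.h>

struct toy_vm {
        long free;      /* completely unused pages          */
        long cache;     /* page cache + buffers             */
        long anon;      /* pages owned by running programs  */
        long swap;      /* pages already pushed to swap     */
};

#define CACHE_FLOOR 64  /* invented minimum cache to keep, in pages */

static long toy_reclaim(struct toy_vm *vm, long need)
{
        long got = 0, n;

        n = need < vm->free ? need : vm->free;          /* 1. free pages   */
        vm->free -= n; got += n;

        n = vm->cache > CACHE_FLOOR ? vm->cache - CACHE_FLOOR : 0;
        if (n > need - got)
                n = need - got;                         /* 2. shrink cache */
        vm->cache -= n; got += n;

        n = need - got < vm->anon ? need - got : vm->anon;
        vm->anon -= n; vm->swap += n; got += n;         /* 3. swap last    */

        return got;
}

int main(void)
{
        /* roughly a 512MB box: 30MB free, 340MB cache/buffers, 142MB programs */
        struct toy_vm vm = { 30 * 256, 340 * 256, 142 * 256, 0 };
        long got = toy_reclaim(&vm, 100 * 256);         /* ask for ~100MB  */

        printf("got %ld pages: free=%ld cache=%ld anon=%ld swapped=%ld\n",
               got, vm.free, vm.cache, vm.anon, vm.swap);
        return 0;
}

With these made-up numbers the request is satisfied entirely from free pages
and cache, and nothing is swapped - which is exactly the behaviour I would
expect from the kernel.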
Or would it be possible, to implement more than one swapping strategy,
which could be configured during make menuconfig? This would give the
user the chance to find the best swapping strategy for his purpose.
Regards,
Andreas Hartmann
On Fri, Dec 28, 2001 at 09:16:38PM +0100, Andreas Hartmann wrote:
> Hello all,
>
> Again, I did a rsync-operation as described in
> "[2.4.17rc1] Swapping" MID <[email protected]>.
>
> This time, the kernel had a swappartition which was about 200MB. As the
> swap-partition was fully used, the kernel killed all processes of knode.
> Nearly 50% of RAM had been used for buffers at this moment. Why is there
> so much memory used for buffers?
>
> I know I repeat it, but please:
>
> Fix the VM-management in kernel 2.4.x. It's unusable. Believe
> me! As comparison: kernel 2.2.19 didn't need nearly any swap for
> the same operation!
>
> Please consider that I'm using 512 MB of RAM. This should, or better:
> must be enough to do the rsync-operation nearly without any swapping -
> kernel 2.2.19 does it!
>
> The performance of kernel 2.4.18pre1 is very poor, which is no surprise,
> because the machine swaps nearly nonstop.
please try to reproduce on 2.4.17rc2aa2, thanks.
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17rc2aa2.bz2
Andrea
> Re: [2.4.17/18pre] VM and swap - it's really unusable
>
>
> My observation:
> Why does the kernel swap in order to get free memory for caching /
> buffering? I can't see any sense in this. Wouldn't it be better to shrink
> the caching / buffering RAM to the amount of memory that is actually free?
>
> Swapping should principally be used when RAM runs out for real memory
> (memory that is used by running applications). First of all, the
> memory usage of cache and buffers should be reduced before starting to
> swap, IMHO.
>
My main "problem" with 2.4 as well. Using free memory for Cache/Buffers
is great, but only as long as the memory is not needed for running
tasks. As soon as a task is requesting more memory, it should be first
taken from Cache/Buffer (they are caches, aren't they :-). Only if this
has been used up (to a tunable minimum, see below), swapping and finally
OOM killing should happen.
To prevent completely trashing IO performance, there should be tunable
parameters for minimum and maximum Cache/Buffer usage (lets say in
percent of total memory). Maybe those tunables are even there today and
I am just to stupid to find them :-))
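For what it's worth, the quickest way I know to see which VM knobs a given
2.4 tree actually exposes is simply to dump /proc/sys/vm - the set of files
differs between the VM variants. A small untested sketch:

/* Print the name and current contents of every file under /proc/sys/vm. */
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main(void)
{
        char path[512], buf[512];
        struct dirent *d;
        FILE *f;
        DIR *dir = opendir("/proc/sys/vm");

        if (!dir) {
                perror("/proc/sys/vm");
                return 1;
        }
        while ((d = readdir(dir)) != NULL) {
                if (d->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path), "/proc/sys/vm/%s", d->d_name);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(buf, sizeof(buf), f))
                        printf("%-16s %s", d->d_name, buf);
                fclose(f);
        }
        closedir(dir);
        return 0;
}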
In any case, 2.4 Caches/Buffers show too much persistence. This is
basically true for both branches of the VM. I was using the -ac kernels
because, while far from perfect, their VM gave considerably better
interactive behaviour.
> Or would it be possible, to implement more than one swapping strategy,
> which could be configured during make menuconfig? This would give the
> user the chance to find the best swapping strategy for his purpose.
>
That is another option. But I would prefer something that could be
selected dynamically or at boot time.
Martin
--
+-----------------------------------------------------+
|Martin Knoblauch |
|-----------------------------------------------------|
|http://www.knobisoft.de/cats |
|-----------------------------------------------------|
|e-mail: [email protected] |
+-----------------------------------------------------+
Martin Knoblauch wrote:
>>Re: [2.4.17/18pre] VM and swap - it's really unusable
>>
>>
>>My observation:
>>Why does the kernel swap in order to get free memory for caching /
>>buffering? I can't see any sense in this. Wouldn't it be better to shrink
>>the caching / buffering RAM to the amount of memory that is actually free?
>>
>>Swapping should principally be used when RAM runs out for real memory
>>(memory that is used by running applications). First of all, the
>>memory usage of cache and buffers should be reduced before starting to
>>swap, IMHO.
[...]
> In any case, 2.4 Caches/Buffers show too much persistence. This is
> basically true for both branches of the VM. I was using the -ac kernels
> because, while far from perfect, their VM gave considerably better
> interactive behaviour.
I did some tests with different VM patches. I tested one -ac patch, too.
I saw the same thing as you described - but the memory consumption and
the overall behaviour aren't any better. If you want to, you can test
another patch, which worked best in my tests. It's nearly as good as kernel
2.2.x. Ask M.H.vanLeeuwen ([email protected]) for his oom patch for
kernel 2.4.17.
But beware: maybe this strategy doesn't fit your applications. And
it's not for production use.
I - and surely some others too - would be interested in your experience
with this patch.
>>Or would it be possible, to implement more than one swapping strategy,
>>which could be configured during make menuconfig? This would give the
>>user the chance to find the best swapping strategy for his purpose.
>>
>>
>
> That is another option. But I would prefer something that could be
> selected dynamically or at boot time.
Dynamic selection - this would be great. For every situation the best
swapping strategy. On servers, switched by a cronjob or an intelligent
daemon.
Regards,
Andreas Hartmann
> My main "problem" with 2.4 as well. Using free memory for Cache/Buffers
> is great, but only as long as the memory is not needed for running
> tasks. As soon as a task is requesting more memory, it should be first
> taken from Cache/Buffer (they are caches, aren't they :-). Only if this
> has been used up (to a tunable minimum, see below), swapping and finally
> OOM killing should happen.
>
> To prevent completely trashing IO performance, there should be tunable
> parameters for minimum and maximum Cache/Buffer usage (lets say in
> percent of total memory). Maybe those tunables are even there today and
> I am just to stupid to find them :-))
Yes!! I second that motion!! On top of that, we need buffer/page cache hit
rate statistics!! Once your read hit rate gets up into the high 90
percent range, additional buffer/page cache memory is wasted.
If Linux is to succeed in enterprise-level usage, we *must* have tools to
measure, manage and tune performance -- in short, to do capacity planning
like we do on any other system. And the kernel variables that affect
performance *must* be under control of the system administrator and
ultimately the machine's *customers*, *not* a bunch of kernel geeks! That
means keeping them in variables accessible by a system administrator, *not*
#defines in code that must be entirely recompiled when you want to tweak a
parameter.
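Purely as an illustration of the difference: anything exported as a sysctl can
be read and changed at run time, without rebuilding anything. The untested
sketch below uses /proc/sys/kernel/sysrq only as a stand-in for whatever VM
parameter ought to be exported this way:

/* Read (and optionally set) a runtime tunable through /proc/sys.
 * Usage: tunable [new-value]   (writing needs root) */
#include <stdio.h>
#include <stdlib.h>

static int read_tunable(const char *path, long *val)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (fscanf(f, "%ld", val) != 1) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return 0;
}

static int write_tunable(const char *path, long val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%ld\n", val);
        fclose(f);
        return 0;
}

int main(int argc, char **argv)
{
        /* stand-in knob; substitute whichever VM parameter gets exported */
        const char *knob = "/proc/sys/kernel/sysrq";
        long v;

        if (argc > 1 && write_tunable(knob, atol(argv[1])) != 0)
                perror(knob);           /* same effect as: echo 1 > .../sysrq */
        if (read_tunable(knob, &v) == 0)
                printf("%s = %ld\n", knob, v);
        return 0;
}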
If you build it, they will come :). If you *refuse* to build it, they will
use something else -- it's as simple as that.
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
On Sun, 30 Dec 2001, M. Edward Borasky wrote:
> Yes!! I second that motion!! On top of that, we need buffer/page cache
> hit rate statistics!!
> If Linux is to succeed in enterprise-level usage, we *must* have tools
> to measure, manage and tune performance -- in short, to do capacity
> planning like we do on any other system.
Indeed, VM statistics added back to the TODO list for
my VM ;)
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
Rik van Riel wrote:
>
> On Sun, 30 Dec 2001, M. Edward Borasky wrote:
>
> > Yes!! I second that motion!! On top of that, we need buffer/page cache
> > hit rate statistics!!
>
Yes, I forgot to mention them. And we need them not only for the
capacity planning, but also for developing the VM [strategies]. Most of the
time these days we just say things like "it sucks" or "no, it does not"
:-) It's even worse when it comes to the interactive "feeling".
> > If Linux is to succeed in enterprise-level usage, we *must* have tools
> > to measure, manage and tune performance -- in short, to do capacity
> > planning like we do on any other system.
>
> Indeed, VM statistics added back to the TODO list for
> my VM ;)
>
Very good to hear.
Happy new year
Martin
--
+-----------------------------------------------------+
|Martin Knoblauch |
|-----------------------------------------------------|
|http://www.knobisoft.de/cats |
|-----------------------------------------------------|
|e-mail: [email protected] |
+-----------------------------------------------------+
Unfortunately, I lost the response that basically said "2.4 looks stable
to me", but let me count the ways in which I agree with Andreas'
sentiment:
A) VM has major issues
1) about a dozen recent OOPS reports in VM code
2) VM falls down on large-memory machines with a
high inode count (slocate/updatedb, i/dcache)
3) Memory allocation failures and OOM triggers
even though caches remain full.
4) Other bugs fixed in -aa and others
B) Live- and dead-locks that I'm seeing on all 2.4 production
machines > 2.4.9, possibly related to A. But how will I
ever find out?
C) IO-APIC code that requires noapic on any and all SMP
machines that I've ever run on.
I don't have anything against anyone here -- I think everyone is doing a
fine job. It's an issue of acceptance of the problem and focus. These
issues are all showstoppers for me, and while I don't represent the 90%
of the Linux market that is UP desktops, IMHO future work on the kernel
will be degraded by basic functionality that continues to cause
problems.
I think seeing some of Andrea's and Andrew's et al patches actually
*happen* would be a good thing, since 2.4 kernels are decidedly not
ready for production here. I am forced to apply 26 distinct patch sets
to my kernels, and I am NOT the right person to make these judgements.
Which is why I was interested in an LKML summary source, though I
haven't yet had a chance to catch up on that thread of comment.
Having a glitch in the radeon driver is one thing; having persistent,
fatal, and reproducible failures in universal kernel code is entirely
another.
--
Ken.
[email protected]
On Fri, Dec 28, 2001 at 09:16:38PM +0100, Andreas Hartmann wrote:
| Hello all,
|
| Again, I did an rsync operation as described in
| "[2.4.17rc1] Swapping", MID <[email protected]>.
|
| This time, the kernel had a swap partition of about 200MB. Once the
| swap partition was fully used, the kernel killed all knode processes.
| Nearly 50% of RAM was being used for buffers at that moment. Why is
| so much memory used for buffers?
|
| I know I repeat it, but please:
|
| Fix the VM management in kernel 2.4.x. It's unusable. Believe
| me! For comparison: kernel 2.2.19 needed hardly any swap for
| the same operation!
|
| Please consider that I'm using 512 MB of RAM. This should - or rather,
| must - be enough to do the rsync operation with hardly any swapping -
| kernel 2.2.19 manages it!
|
| The performance of kernel 2.4.18pre1 is very poor, which is no surprise,
| because the machine swaps nearly nonstop.
|
|
| Regards,
| Andreas Hartmann
|
On Thu, 3 Jan 2002, Ken Brownfield wrote:
> A) VM has major issues
> 1) about a dozen recent OOPS reports in VM code
> 2) VM falls down on large-memory machines with a
> high inode count (slocate/updatedb, i/dcache)
> 3) Memory allocation failures and OOM triggers
> even though caches remain full.
> 4) Other bugs fixed in -aa and others
> B) Live- and dead-locks that I'm seeing on all 2.4 production
> machines > 2.4.9, possibly related to A. But how will I
> ever find out?
I've spent ages trying to fix these bugs in the -ac kernel,
but they all got backed out in search of better performance.
Right now I'm developing a VM again, but I have no interest
at all in fixing the livelocks in the main kernel, they'll
just get removed again after a while.
If you want to test my VM stuff, you can get patches from
http://surriel.com/patches/ or direct access at the bitkeeper
tree on http://linuxvm.bkbits.net/
cheers,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
Ken Brownfield wrote:
>
> Unfortunately, I lost the response that basically said "2.4 looks stable
> to me", but let me count the ways in which I agree with Andreas'
> sentiment:
>
> A) VM has major issues
> 1) about a dozen recent OOPS reports in VM code
Ben LaHaise's fix for page_cache_release() is absolutely required.
> 2) VM falls down on large-memory machines with a
> high inode count (slocate/updatedb, i/dcache)
> 3) Memory allocation failures and OOM triggers
> even though caches remain full.
> 4) Other bugs fixed in -aa and others
> B) Live- and dead-locks that I'm seeing on all 2.4 production
> machines > 2.4.9, possibly related to A. But how will I
> ever find out?
Does this happen with the latest -aa patch? If so, please send
a full system description and report.
> C) IO-APIC code that requires noapic on any and all SMP
> machines that I've ever run on.
Dunno about this one. Have you prepared a description?
-
> Unfortunately, I lost the response that basically said "2.4 looks stable
> to me", but let me count the ways in which I agree with Andreas'
> sentiment:
>
> A) VM has major issues
On all boxes I run currently (all 1GB or below RAM), I cannot find
_major_ issues.
> 2) VM falls down on large-memory machines with a
> high inode count (slocate/updatedb, i/dcache)
Must be beyond the GB range.
> 3) Memory allocation failures and OOM triggers
> even though caches remain full.
I have not had one up to now in everyday life with 2.4.17
> 4) Other bugs fixed in -aa and others
Hm, well I would expect Andrea to do tuning and fixing as experience
evolves...
> B) Live- and dead-locks that I'm seeing on all 2.4 production
> machines > 2.4.9, possibly related to A. But how will I
> ever find out?
Me = none up to now I could track down to a kernel issue. The single
one I had was with a distro kernel around 2.4.10 and flaky hardware.
> C) IO-APIC code that requires noapic on any and all SMP
> machines that I've ever run on.
I am currently running 5 Asus CUV4X-D based SMP boxes all with apic
_on_, amongst which are squids, sql servers, workstation type setups
(2 my very own).
> I don't have anything against anyone here -- I think everyone is doing a
> fine job. It's an issue of acceptance of the problem and focus. These
> issues are all showstoppers for me, and while I don't represent the 90%
> of the Linux market that is UP desktops, IMHO future work on the kernel
> will be degraded by basic functionality that continues to cause
> problems.
Have you run _yourself_ into a problem with 2.4.17?
I mean it is not perfect of course, but it is far better than you make
it look.
I could hand the brown bag to all versions below about 2.4.15 pretty
easy, but since 2.4.16 it has really become hard to shoot it down for
me. Ok, I use only pretty selected hardware, but there are reasons I
do, and they are not related to the kernel in first place.
Regards,
Stephan
Stephan von Krawczynski wrote:
>
> On Mon, 31 Dec 2001 11:14:04 -0600
> "M.H.VanLeeuwen" <[email protected]> wrote:
>
> > [...]
> > vmscan patch:
> >
> > a. instead of calling swap_out as soon as max_mapped is reached, continue
> > to try to free pages. this reduces the number of times we hit
> > try_to_free_pages() and swap_out().
>
> I experimented with this some time ago, but found out it hit performance and
> (to my own surprise) did not do any good at all. Have you tried this
> stand-alone/on top of the rest to view its results?
>
> Regards,
> Stephan
Stephan,
Here is what I've run thus far. I'll add nfs file copy into the mix also...
System: SMP 466 Celeron 192M RAM, running KDE, xosview, and other minor apps.
Each run after the clean & cache builds has 1 more setiathome client running, up to a
max of 8 seti clients. No, this isn't my normal way of running setiathome, but
each instance uses a nice chunk of memory.
Note: this is a single run for each of the columns using "make -j2 bzImage" each time.
I will try to run aa and rmap this evening and/or tomorrow.
Martin
----------------------------------------------------------------------------
STOCK KERNEL MH KERNEL STOCK + SWAP MH + SWAP
(no swap) (no swap)
CLEAN
BUILD real 7m19.428s 7m19.546s 7m26.852s 7m26.256s
user 12m53.640s 12m50.550s 12m53.740s 12m47.110s
sys 0m47.890s 0m54.960s 0m58.810s 1m1.090s
1.1M swp 0M swp
CACHE
BUILD real 7m3.823s 7m3.520s 7m4.040s 7m4.266s
user 12m47.710s 12m49.110s 12m47.640s 12m40.120s
sys 0m46.660s 0m46.270s 0m47.480s 0m51.440s
1.1M swp 0M swp
SETI 1
real 9m51.652s 9m50.601s 9m53.153s 9m53.668s
user 13m5.250s 13m4.420s 13m5.040s 13m4.470s
sys 0m49.020s 0m50.460s 0m51.190s 0m50.580s
1.1M swp 0M swp
SETI 2
real 13m9.730s 13m7.719s 13m4.279s 13m4.768s
user 13m16.810s 13m15.150s 13m15.950s 13m13.400s
sys 0m50.880s 0m50.460s 0m50.930s 0m52.520s
5.8M swp 1.9M swp
SETI 3
real 15m49.331s 15m41.264s 15m40.828s 15m45.551s
user 13m22.150s 13m21.560s 13m14.390s 13m20.790s
sys 0m49.250s 0m49.910s 0m49.850s 0m50.910s
16.2M swp 3.1M swp
SETI 4
real OOM KILLED 19m8.435s 19m5.584s 19m3.618s
user kdeinit 13m24.570s 13m24.000s 13m22.520s
sys 0m51.430s 0m50.320s 0m51.390s
18.7M swp 8.3M swp
SETI 5
real NA 21m35.515s 21m48.543s 22m0.240s
user 13m9.680s 13m22.030s 13m28.820s
sys 0m49.910s 0m50.850s 0m52.270s
31.7M swp 11.3M swp
SETI 6
real NA 24m37.167s 25m5.244s 25m13.429s
user 13m7.650s 13m26.590s 13m32.640s
sys 0m51.390s 0m51.260s 0m52.790s
35.3M swp 17.1M swp
SETI 7
real NA 28m40.446s 28m3.612s 28m12.981s
user 13m16.460s 13m26.130s 13m31.520s
sys 0m57.940s 0m52.510s 0m53.570s
38.8M swp 25.4M swp
SETI 8
real NA 29m31.743s 31m16.275s 32m29.534s
user 13m37.610s 13m27.740s 13m33.630s
sys 1m4.450s 0m52.100s 0m54.140s
41.5M swp 49.7M swp (hmmm?)
Actually, I posted about C) many moons ago, and had some chats with
Manfred Spraul and Alan. It's a tough one to crack, and I have my own
workaround patch (below) that I've been using for a while now. My posts
are in the archives, but I can send a summary by request.
I haven't yet managed to get -aa through my own bag check and into production, which
is where I'm able to reproduce these problems. Part of the problem is me,
in that I can't easily test with -aa. And part of the problem is
chicken vs egg -- can't test unless it's in mainline, don't want to put
questionable stuff in a release kernel, even a -pre... But I do think
the -aa stuff is worth breaking out into Marcelo-digestible chunks as
soon as Andrea can.
The machines that are OOPSing are in production and right now don't have
serial consoles available... that will change in a month or so, but
right now I can't decode OOPSes without hand-copying. I might get that
desperate unless the problem goes away with 2.4.18 (with -aa merged,
hopefully. :)
Thanks much,
--
Ken.
[email protected]
Applies to any recent 2.4. Changing indent sucks.
--- linux/arch/i386/kernel/io_apic.c.orig Tue Nov 13 17:28:41 2001
+++ linux/arch/i386/kernel/io_apic.c Tue Dec 18 15:10:45 2001
@@ -172,6 +172,7 @@
int pirq_entries [MAX_PIRQS];
int pirqs_enabled;
int skip_ioapic_setup;
+int pintimer_setup;
static int __init ioapic_setup(char *str)
{
@@ -179,7 +180,14 @@
return 1;
}
+static int __init do_pintimer_setup(char *str)
+{
+ pintimer_setup = 1;
+ return 1;
+}
+
__setup("noapic", ioapic_setup);
+__setup("pintimer", do_pintimer_setup);
static int __init ioapic_pirq_setup(char *str)
{
@@ -1524,27 +1532,31 @@
printk(KERN_ERR "..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
}
- printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
- if (pin2 != -1) {
- printk("\n..... (found pin %d) ...", pin2);
- /*
- * legacy devices should be connected to IO APIC #0
- */
- setup_ExtINT_IRQ0_pin(pin2, vector);
- if (timer_irq_works()) {
- printk("works.\n");
- if (nmi_watchdog == NMI_IO_APIC) {
- setup_nmi();
- check_nmi_watchdog();
+ if ( pintimer_setup )
+ printk(KERN_INFO "...skipping 8259A init for IRQ0\n");
+ else {
+ printk(KERN_INFO "...trying to set up timer (IRQ0) through the 8259A ... ");
+ if (pin2 != -1) {
+ printk("\n..... (found pin %d) ...", pin2);
+ /*
+ * legacy devices should be connected to IO APIC #0
+ */
+ setup_ExtINT_IRQ0_pin(pin2, vector);
+ if (timer_irq_works()) {
+ printk("works.\n");
+ if (nmi_watchdog == NMI_IO_APIC) {
+ setup_nmi();
+ check_nmi_watchdog();
+ }
+ return;
}
- return;
+ /*
+ * Cleanup, just in case ...
+ */
+ clear_IO_APIC_pin(0, pin2);
}
- /*
- * Cleanup, just in case ...
- */
- clear_IO_APIC_pin(0, pin2);
+ printk(" failed.\n");
}
- printk(" failed.\n");
if (nmi_watchdog) {
printk(KERN_WARNING "timer doesnt work through the IO-APIC - disabling NMI Watchdog!\n");
On Thu, Jan 03, 2002 at 01:54:14PM -0800, Andrew Morton wrote:
| Ken Brownfield wrote:
| >
| > Unfortunately, I lost the response that basically said "2.4 looks stable
| > to me", but let me count the ways in which I agree with Andreas'
| > sentiment:
| >
| > A) VM has major issues
| > 1) about a dozen recent OOPS reports in VM code
|
| Ben LaHaise's fix for page_cache_release() is absolutely required.
|
| > 2) VM falls down on large-memory machines with a
| > high inode count (slocate/updatedb, i/dcache)
| > 3) Memory allocation failures and OOM triggers
| > even though caches remain full.
| > 4) Other bugs fixed in -aa and others
| > B) Live- and dead-locks that I'm seeing on all 2.4 production
| > machines > 2.4.9, possibly related to A. But how will I
| > ever find out?
|
| Does this happen with the latest -aa patch? If so, please send
| a full system description and report.
|
| > C) IO-APIC code that requires noapic on any and all SMP
| > machines that I've ever run on.
|
| Dunno about this one. Have you prepared a description?
|
|
| -
On Fri, Jan 04, 2002 at 01:19:28AM +0100, Stephan von Krawczynski wrote:
| > A) VM has major issues
|
| On all boxes I run currently (all 1GB or below RAM), I cannot find
| _major_ issues.
Yeah, I'm seeing it primarily with 1-4GB, though I have very few <1GB
machines in production.
| > 2) VM falls down on large-memory machines with a
| > high inode count (slocate/updatedb, i/dcache)
|
| Must be beyond the GB range.
The critical part is the high inode count -- memory amount increases the
severity rather than triggering the problem.
| > 3) Memory allocation failures and OOM triggers
| > even though caches remain full.
|
| I have not had one up to now in everyday life with 2.4.17
I'm seeing this in malloc()-heavy apps, but fairly sporadic unless I
create a test case. On desktops, most of these issues disappear, but I
do think the mindset behind the kernel needs to at least partially break
free of the grip of UP desktops, at least to the point of fixing issues
like I'm mentioning.
Not critical for me; but high-profile on lkml.
[...]
| > C) IO-APIC code that requires noapic on any and all SMP
| > machines that I've ever run on.
|
| I am currently running 5 Asus CUV4X-D based SMP boxes all with apic
| _on_, amongst which are squids, sql servers, workstation type setups
| (2 my very own).
Do they have *sustained* heavy hit/IRQ/IO load? For example, sending
25Mbit and >1,000 connections/s of sustained small images traffic
through khttpd will kill 2.4 (slow loss of timer and eventual total
freeze) in a couple of hours. Trivially reproducible for me on SMP with
any amount of memory. On HP, Tyan, Intel, Asus... etc.
| Have you run _yourself_ into a problem with 2.4.17?
| I mean it is not perfect of course, but it is far better than you make
| it look.
2.4.17 (and -pre/-rc) is my yardstick, actually. With the exception of
-aa, I stay very close to the bleeding edge.
Please don't misunderstand -- I don't think any 2.4 kernel sucks (with
the exception of the two or three DONTUSE kernels. :) In fact, I have
zero complaints other than the ones I've listed. I was ecstatic when
2.2 came out, and 2.4 is just as impressive.
It's not that the kernel is bad, it's that there are specific things
that shouldn't be forgotten because of a "the kernel is good"
evaluation. Especially those that make Linux regularly unstable in
common production environments.
| I could hand the brown bag to all versions below about 2.4.15 pretty
| easy, but since 2.4.16 it has really become hard to shoot it down for
| me. Ok, I use only pretty selected hardware, but there are reasons I
| do, and they are not related to the kernel in first place.
I use pretty selected hardware as well -- scaling hundreds of servers
for varied uses really depends on having someone track and select
hardware, and using it homogenously. Of course, of all of the selected
hardware I've used over the last two years since 2.4.0-test1, C) has
persisted on all configurations, but the others are more recent but
equally omnipresent.
Like I said, I suspect that most people with machines in lower-load
environments don't have these issues, but "number of people affected" is
only one metric to judge the importance of an issue.
Of course, I'm not biased or anything. ;-)
Thanks for the input,
--
Ken.
[email protected]
|
| Regards,
| Stephan
|
On Thu, Jan 03, 2002 at 11:26:01PM -0600, you [Ken Brownfield] claimed:
>
> | > 3) Memory allocation failures and OOM triggers
> | > even though caches remain full.
> |
> | I have not had one up to now in everyday life with 2.4.17
>
> I'm seeing this in malloc()-heavy apps, but fairly sporadic unless I
> create a test case.
I'm seeing this on 2GB IA64 (2.4.16-17). I posted a _very_ simple test case
to lkml a while ago. It didn't happen on 256MB x86.
I plan to try -aa shortly, now that I got patches to make it compile on
IA64.
-- v --
[email protected]
On Thu, 03 Jan 2002 20:14:42 -0600
"M.H.VanLeeuwen" <[email protected]> wrote:
> Stephan,
>
> Here is what I've run thus far. I'll add nfs file copy into the mix also...
Ah, Martin, thanks for sending the patch. I think I saw the voodoo in your
patch. When I did that last time I did _not_ do this:
+ if (PageReferenced(page)) {
+ del_page_from_inactive_list(page);
+ add_page_to_active_list(page);
+ }
+ continue;
This may shorten your inactive list through consecutive runs.
And there is another difference here:
+ if (max_mapped <= 0 && nr_pages > 0)
+ swap_out(priority, gfp_mask, classzone);
+
It sounds reasonable _not_ to swap in case of success (nr_pages == 0).
To me this looks pretty interesting. Is something like this already in -aa?
This patch may be worth applying in 2.4. It is small and looks like the right
thing to do.
Regards,
Stephan
On Fri, 4 Jan 2002 10:06:05 +0200
Ville Herva <[email protected]> wrote:
> On Thu, Jan 03, 2002 at 11:26:01PM -0600, you [Ken Brownfield] claimed:
> >
> > | > 3) Memory allocation failures and OOM triggers
> > | > even though caches remain full.
> > |
> > | I have not had one up to now in everyday life with 2.4.17
> >
> > I'm seeing this in malloc()-heavy apps, but fairly sporadic unless I
> > create a test case.
>
> I'm seeing this on 2GB IA64 (2.4.16-17). I posted a _very_ simple test case
> to lkml a while ago. It didn't happen on 256MB x86.
>
> I plan to try -aa shortly, now that I got patches to make it compile on
> IA64.
Ok, I am going to buy more mem right now to see what you see.
Regards,
Stephan
On Thu, 3 Jan 2002 23:26:01 -0600
Ken Brownfield <[email protected]> wrote:
> On Fri, Jan 04, 2002 at 01:19:28AM +0100, Stephan von Krawczynski wrote:
> | > A) VM has major issues
> |
> | On all boxes I run currently (all 1GB or below RAM), I cannot find
> | _major_ issues.
>
> Yeah, I'm seeing it primarily with 1-4GB, though I have very few <1GB
> machines in production.
Ok. It would be really nice to know if the -aa patches do any good at your
configs. Andrea has possibly done something on the issue. But let me take this
chance to state an open word: last time Andrea talked about his personal
hardware I couldn't really believe it - because it was so ridiculously small. I
wonder if anyone at SuSE management _does_ actually read this list and think
about how someone can do a good job without good equipment. If you really want
to do something groundbreaking about highmem you have to have a _box_. A box
_somewhere_ in the world or a patch for highmem-in-lowmem is not really the
same thing. Even Schumacher wouldn't have won formula one by sitting inside a
Fiat Uno with a patched speedometer.
> but I
> do think the mindset behind the kernel needs to at least partially break
> free of the grip of UP desktops, at least to the point of fixing issues
> like I'm mentioning.
>
> Not critical for me; but high-profile on lkml.
You are right.
> [...]
> | > C) IO-APIC code that requires noapic on any and all SMP
> | > machines that I've ever run on.
> |
> | I am currently running 5 Asus CUV4X-D based SMP boxes all with apic
> | _on_, amongst which are squids, sql servers, workstation type setups
> | (2 my very own).
>
> Do they have *sustained* heavy hit/IRQ/IO load? For example, sending
> 25Mbit and >1,000 connections/s of sustained small images traffic
> through khttpd will kill 2.4 (slow loss of timer and eventual total
> freeze) in a couple of hours. Trivially reproducible for me on SMP with
> any amount of memory. On HP, Tyan, Intel, Asus... etc.
Hm, I have about 24GB of NFS traffic every day, which may be too little. What
exactly are you seeing in this case (logfiles etc.)?
> It's not that the kernel is bad, it's that there are specific things
> that shouldn't be forgotten because of a "the kernel is good"
> evaluation.
Hopefully nobody does this here, I don't.
> Like I said, I suspect that most people with machines in lower-load
> environments don't have these issues, but "number of people affected" is
> only one metric to judge the importance of an issue.
The number of people is not really interesting for me; as the boxes get bigger
every day, it is only a matter of time before we see more people with lots of GB
(as an example).
> Of course, I'm not biased or anything. ;-)
How could you ? ;-))
Regards,
Stephan
On Thu, Jan 03, 2002 at 08:14:42PM -0600, M.H.VanLeeuwen wrote:
> Stephan von Krawczynski wrote:
> >
> > On Mon, 31 Dec 2001 11:14:04 -0600
> > "M.H.VanLeeuwen" <[email protected]> wrote:
> >
> > > [...]
> > > vmscan patch:
> > >
> > > a. instead of calling swap_out as soon as max_mapped is reached, continue
> > > to try to free pages. this reduces the number of times we hit
> > > try_to_free_pages() and swap_out().
> >
> > I experimented with this some time ago, but found out it hit performance and
> > (to my own surprise) did not do any good at all. Have you tried this
> > stand-alone/on top of the rest to view its results?
> >
> > Regards,
> > Stephan
>
> Stephan,
>
> Here is what I've run thus far. I'll add nfs file copy into the mix also...
>
> System: SMP 466 Celeron 192M RAM, running KDE, xosview, and other minor apps.
>
> Each run after the clean & cache builds has 1 more setiathome client running, up to a
> max of 8 seti clients. No, this isn't my normal way of running setiathome, but
> each instance uses a nice chunk of memory.
>
> Note: this is a single run for each of the columns using "make -j2 bzImage" each time.
>
> I will try to run aa and rmap this evening and/or tomorrow.
The design change Linus did was explicitly to leave the mapped pages
on the inactive list so we learn when we should trigger swapout. Also
it is nicer to spread the swapout over the shrinking. rc2aa2 should work just fine.
Have a look at how such logic is implemented there. (btw, I will shortly
sync with 18pre, 2.2 and 2.5)
Andrea
On Fri, Jan 04, 2002 at 01:33:21PM +0100, Stephan von Krawczynski wrote:
> On Thu, 03 Jan 2002 20:14:42 -0600
> "M.H.VanLeeuwen" <[email protected]> wrote:
>
> > Stephan,
> >
> > Here is what I've run thus far. I'll add nfs file copy into the mix also...
>
> Ah, Martin, thanks for sending the patch. I think I saw the voodoo in your
> patch. When I did that last time I did _not_ do this:
>
> + if (PageReferenced(page)) {
> + del_page_from_inactive_list(page);
> + add_page_to_active_list(page);
> + }
> + continue;
>
> This may shorten your inactive list through consecutive runs.
>
> And there is another difference here:
>
> + if (max_mapped <= 0 && nr_pages > 0)
> + swap_out(priority, gfp_mask, classzone);
> +
>
> It sounds reasonable _not_ to swap in case of success (nr_pages == 0).
> To me this looks pretty interesting. Is something like this already in -aa?
> This patch may be worth applying in 2.4. It is small and looks like the right
> thing to do.
-aa swaps out as soon as max_mapped hits zero. So it basically does it
internally (i.e. way more times) and so it will most certainly be able
to sustain a higher swap transfer rate. You can check with the mtest01
-w test from ltp.
Andrea
On Fri, 4 Jan 2002 15:14:38 +0100
Andrea Arcangeli <[email protected]> wrote:
> On Fri, Jan 04, 2002 at 01:33:21PM +0100, Stephan von Krawczynski wrote:
> > On Thu, 03 Jan 2002 20:14:42 -0600
> > "M.H.VanLeeuwen" <[email protected]> wrote:
> >
> > And there is another difference here:
> >
> > + if (max_mapped <= 0 && nr_pages > 0)
> > + swap_out(priority, gfp_mask, classzone);
> > +
> >
> > It sounds reasonable _not_ to swap in case of success (nr_pages == 0).
> > To me this looks pretty interesting. Is something like this already in -aa?
> > This patch may be worth applying in 2.4. It is small and looks like the
> > right thing to do.
>
> -aa swaps out as soon as max_mapped hits zero. So it basically does it
> internally (i.e. way more times) and so it will most certainly be able
> to sustain a higher swap transfer rate. You can check with the mtest01
> -w test from ltp.
Hm, but do you think this is really good for overall performance, especially in the
frequent cases where no swap should be needed _at all_ to do a successful
shrinking? And - as can be seen in Martin's tests - if you have no swap at
all, you seem to trigger OOM earlier through the short-path exit in shrink,
which is an obvious no-no. I would call it wrong to fix the oom-killer for this
case, because it should not be reached at all.
?
Regards,
Stephan
On Fri, Jan 04, 2002 at 03:24:09PM +0100, Stephan von Krawczynski wrote:
> On Fri, 4 Jan 2002 15:14:38 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
> > On Fri, Jan 04, 2002 at 01:33:21PM +0100, Stephan von Krawczynski wrote:
> > > On Thu, 03 Jan 2002 20:14:42 -0600
> > > "M.H.VanLeeuwen" <[email protected]> wrote:
> > >
> > > And there is another difference here:
> > >
> > > + if (max_mapped <= 0 && nr_pages > 0)
> > > + swap_out(priority, gfp_mask, classzone);
> > > +
> > >
> > > It sounds reasonable _not_ to swap in case of success (nr_pages == 0).
> > > To me this looks pretty interesting. Is something like this already in -aa?
> > > This patch may be worth applying in 2.4. It is small and looks like the
> > > right thing to do.
> >
> > -aa swaps out as soon as max_mapped hits zero. So it basically does it
> > internally (i.e. way more times) and so it will most certainly be able
> > to sustain a higher swap transfer rate. You can check with the mtest01
> > -w test from ltp.
>
> Hm, but do you think this is really good for overall performance, especially in the
> frequent cases where no swap should be needed _at all_ to do a successful
> shrinking? And - as can be seen in Martin's tests - if you have no swap at
the common case is that max_mapped doesn't reach zero, so either way
(mainline or -aa) it's the same (i.e. no special exit path).
> all, you seem to trigger OOM earlier through the short-path exit in shrink,
> which is an obvious no-no. I would call it wrong to fix the oom-killer for this
> case, because it should not be reached at all.
Correct. Incidentally swap or no swap doesn't make any difference on the
exit-path of shrink_cache in -aa (furthermore if swap is full or if everything is
unfreeable I stop wasting a huge amount of time on the swap_out path at the
first failure, this allows graceful oom handling or it would nearly deadlock
there trying to swap unfreeable stuff at every max_mapped reached). In -aa
max_mapped doesn't influence in any way the exit path of shrink_cache.
max_mapped only controls the swapout frequency. See:
if (!page->mapping || page_count(page) > 1) {
spin_unlock(&pagecache_lock);
UnlockPage(page);
page_mapped:
if (--max_mapped < 0) {
spin_unlock(&pagemap_lru_lock);
shrink_dcache_memory(vm_scan_ratio, gfp_mask);
shrink_icache_memory(vm_scan_ratio, gfp_mask);
#ifdef CONFIG_QUOTA
shrink_dqcache_memory(vm_scan_ratio, gfp_mask);
#endif
if (!*failed_swapout)
*failed_swapout = !swap_out(classzone);
max_mapped = orig_max_mapped;
spin_lock(&pagemap_lru_lock);
}
continue;
}
So it should work just fine as far as I can tell.
Andrea
Stephan von Krawczynski wrote:
>>Unfortunately, I lost the response that basically said "2.4 looks stable
>>to me", but let me count the ways in which I agree with Andreas'
>>sentiment:
>>
>>A) VM has major issues
Unfortunately you are right.
>>
>
> On all boxes I run currently (all 1GB or below RAM), I cannot find
> _major_ issues.
The question is: what is the nature of your application / the load on the
system? You wrote something about a database server. How many rows altogether? What's
the size of the table(s)? How many concurrent accesses do you have? Do
you do "easy" searches where all of the conditions are located in the
index? How big is your index? How big is the throughput of your
database? Do you have your tables on raw partitions (without caching; as
you can do it with UDB)?
You mentioned squid, too. I'm running squid here on an AMD K6-2 400, 256
MB RAM. It's mostly just for myself (sometimes plus my wife). No more users.
In this situation, I can't see any problem either. Why? There is no load,
no throughput, ... .
How big are the partitions you are mounting at once? In my case, all the
partitions together have about 70GB (all reiserfs).
I want to know because I think the problem depends on how much
different HD memory is accessed. If you have applications which don't
access too much memory, you won't see the problems.
If you access more than 1G (and you do not just copy, but rsync e.g.)
and you have only 512MB of RAM, the machine swaps a lot with most current
2.4 kernels (patches).
Another question:
Are there any tools to measure the data throughput an application causes?
Interesting would be the sum at the end of the process, the maximum and
average throughput (input and output separated), and the same for swap activity.
It could probably help to find optimization potential. At least it would
give the chance to directly compare the demands of different applications.
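Something like the following rough, untested sketch would at least report the
system-wide paging and swap activity while a command runs, by diffing the
"page" and "swap" counters of the 2.4 /proc/stat before and after (if I read
that file right). It is not true per-application accounting - anything else
running on the box is counted too:

/* vm-activity.c - run a command and report how much paging/swapping
 * happened system-wide while it ran (reads the "page" and "swap" lines
 * of the 2.4 /proc/stat; counters are in pages). */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static void counters(unsigned long v[4])
{
        char line[512];
        FILE *f = fopen("/proc/stat", "r");

        v[0] = v[1] = v[2] = v[3] = 0;
        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                sscanf(line, "page %lu %lu", &v[0], &v[1]); /* paged in/out   */
                sscanf(line, "swap %lu %lu", &v[2], &v[3]); /* swapped in/out */
        }
        fclose(f);
}

int main(int argc, char **argv)
{
        unsigned long a[4], b[4];

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
                return 1;
        }
        counters(a);
        if (fork() == 0) {
                execvp(argv[1], argv + 1);
                perror("execvp");
                _exit(127);
        }
        wait(NULL);
        counters(b);
        printf("page in %lu, page out %lu, swap in %lu, swap out %lu\n",
               b[0] - a[0], b[1] - a[1], b[2] - a[2], b[3] - a[3]);
        return 0;
}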
Regards,
Andreas Hartmann
On Fri, 04 Jan 2002 21:15:42 +0100
Andreas Hartmann <[email protected]> wrote:
[I will not answer all of your questions, as this is a matter of business, too]
> > On all boxes I run currently (all 1GB or below RAM), I cannot find
> > _major_ issues.
>
>
> The question is: what is the nature of your application / the load on the system?
Generally we do not drive the boxes up to the edge. Our philosophy is to throw
money at the problem, before it actually arises. Yes, I can see the future ...
;-)
> [...] Do you have your tables on raw partitions (without caching; as
> you can do it with UDB)?
No.
> How big are the partitions you are mounting at once? In my case, all the
> partitions together have about 70GB (all reiserfs).
about 130 GB, all reiserfs.
> I want to know because I think the problem depends on how much
> different HD memory is accessed.
I guess you should tilt that theory.
Have you already tried to throw a big SPARC at the problem?
> If you have applications which don't
> access too much memory, you won't see the problems.
> If you access more than 1G (and you do not just copy, but rsync e.g.)
> and you have only 512MB of RAM, the machine swaps a lot with most current
> 2.4 kernels (patches).
Can you provide a simple and reproducible test case (e.g. some demo source),
where things break? I am very willing to test it here.
Regards,
Stephan
On Fri, Jan 04, 2002 at 02:03:21PM +0100, Stephan von Krawczynski wrote:
[...]
| Ok. It would be really nice to know if the -aa patches do any good at your
I'd love to, but unfortunately my problems reproduce only in production,
and -- nothing against Andrea -- I'm hesitant to deploy -aa live, since
it hasn't received the widespread use that mainline has. I may be
forced to soon if the VM fixes don't get merged.
[...]
| > Do they have *sustained* heavy hit/IRQ/IO load? For example, sending
| > 25Mbit and >1,000 connections/s of sustained small images traffic
| > through khttpd will kill 2.4 (slow loss of timer and eventual total
| > freeze) in a couple of hours. Trivially reproducible for me on SMP with
| > any amount of memory. On HP, Tyan, Intel, Asus... etc.
|
| Hm, I have about 24GB of NFS traffic every day, which may be too little. What
| exactly are you seeing in this case (logfiles etc.)?
Well, the nature of the problem is that the timer "slows" and stops,
causing the machine to get more and more sluggish until it falls off the
net and stops dead.
I suspect that high IRQ rates cause the issue -- large sequential
transfers are not necessarily culprits due to the lowish overhead.
[...]
| > It's not that the kernel is bad, it's that there are specific things
| > that shouldn't be forgotten because of a "the kernel is good"
| > evaluation.
|
| Hopefully nobody does this here, I don't.
I don't think it's intentional, and I realize that VM changes are hard
to swallow in a stable kernel release. I just hope that the severity
and fairly wide negative effect is enough to make people more
comfortable with accepting VM fixes that may be somewhat invasive.
Thanks,
--
Ken.
[email protected]
Andrea Arcangeli wrote:
>
> On Fri, Jan 04, 2002 at 03:24:09PM +0100, Stephan von Krawczynski wrote:
> > On Fri, 4 Jan 2002 15:14:38 +0100
> > Andrea Arcangeli <[email protected]> wrote:
> >
> > > On Fri, Jan 04, 2002 at 01:33:21PM +0100, Stephan von Krawczynski wrote:
> > > > On Thu, 03 Jan 2002 20:14:42 -0600
> > > > "M.H.VanLeeuwen" <[email protected]> wrote:
> > > >
> > > > And there is another difference here:
> > > >
> > > > + if (max_mapped <= 0 && nr_pages > 0)
> > > > + swap_out(priority, gfp_mask, classzone);
> > > > +
> > > >
> > > > It sounds reasonable _not_ to swap in case of success (nr_pages == 0).
> > > > To me this looks pretty interesting. Is something like this already in -aa?
> > > > This patch may be worth applying in 2.4. It is small and looks like the
> > > > right thing to do.
> > >
> > > -aa swaps out as soon as max_mapped hits zero. So it basically does it
> > > internally (i.e. way more times) and so it will most certainly be able
> > > to sustain a higher swap transfer rate. You can check with the mtest01
> > > -w test from ltp.
> >
> > Hm, but do you think this is really good for overall performance, especially in the
> > frequent cases where no swap should be needed _at all_ to do a successful
> > shrinking? And - as can be seen in Martin's tests - if you have no swap at
>
> > the common case is that max_mapped doesn't reach zero, so either way
> (mainline or -aa) it's the same (i.e. no special exit path).
>
> > all, you seem to trigger OOM earlier through the short-path exit in shrink,
> > which is an obvious no-no. I would call it wrong to fix the oom-killer for this
> > case, because it should not be reached at all.
>
> Correct. Incidentally swap or no swap doesn't make any difference on the
> exit-path of shrink_cache in -aa (furthermore if swap is full or if everything is
> unfreeable I stop wasting a huge amount of time on the swap_out path at the
> first failure, this allows graceful oom handling or it would nearly deadlock
> there trying to swap unfreeable stuff at every max_mapped reached). In -aa
> max_mapped doesn't influence in any way the exit path of shrink_cache.
> max_mapped only controls the swapout frequency. See:
>
> if (!page->mapping || page_count(page) > 1) {
> spin_unlock(&pagecache_lock);
> UnlockPage(page);
> page_mapped:
> if (--max_mapped < 0) {
> spin_unlock(&pagemap_lru_lock);
>
> shrink_dcache_memory(vm_scan_ratio, gfp_mask);
> shrink_icache_memory(vm_scan_ratio, gfp_mask);
> #ifdef CONFIG_QUOTA
> shrink_dqcache_memory(vm_scan_ratio, gfp_mask);
> #endif
>
> if (!*failed_swapout)
> *failed_swapout = !swap_out(classzone);
>
> max_mapped = orig_max_mapped;
> spin_lock(&pagemap_lru_lock);
> }
> continue;
>
> }
>
> So it should work just fine as far as I can tell.
>
> Andrea
OK, here is RMAP 10c and RC2AA2 as well with the same load as previously used.
System: SMP 466 Celeron 192M RAM, ATA-66 40G IDE
Each run after the clean & cache builds has 1 more setiathome client running, up to a
max of 8 seti clients. No, this isn't my normal way of running setiathome, but
each instance uses a nice chunk of memory.
Andrea, is there a later version of aa that I could test?
Martin
STOCK KERNEL MH KERNEL STOCK + SWAP MH + SWAP RMAP 10c RC2AA2
(no swap) (no swap) (no swap) (no swap)
CLEAN
BUILD real 7m19.428s 7m19.546s 7m26.852s 7m26.256s 7m46.760s 7m17.698s
user 12m53.640s 12m50.550s 12m53.740s 12m47.110s 13m2.420s 12m33.440s
sys 0m47.890s 0m54.960s 0m58.810s 1m1.090s 1m7.890s 0m53.790s
1.1M swp 0M swp
CACHE
BUILD real 7m3.823s 7m3.520s 7m4.040s 7m4.266s 7m16.386s 7m3.209s
user 12m47.710s 12m49.110s 12m47.640s 12m40.120s 12m56.390s 12m46.200s
sys 0m46.660s 0m46.270s 0m47.480s 0m51.440s 0m55.200s 0m46.450s
1.1M swp 0M swp
SETI 1
real 9m51.652s 9m50.601s 9m53.153s 9m53.668s 10m8.933s 9m51.954s
user 13m5.250s 13m4.420s 13m5.040s 13m4.470s 13m16.040s 13m19.310s
sys 0m49.020s 0m50.460s 0m51.190s 0m50.580s 1m1.080s 0m51.800s
1.1M swp 0M swp
SETI 2
real 13m9.730s 13m7.719s 13m4.279s 13m4.768s OOM KILLED 13m13.181s
user 13m16.810s 13m15.150s 13m15.950s 13m13.400s kdeinit 13m0.640s
sys 0m50.880s 0m50.460s 0m50.930s 0m52.520s 0m48.840s
5.8M swp 1.9M swp
SETI 3
real 15m49.331s 15m41.264s 15m40.828s 15m45.551s NA 15m52.202s
user 13m22.150s 13m21.560s 13m14.390s 13m20.790s 13m21.650s
sys 0m49.250s 0m49.910s 0m49.850s 0m50.910s 0m52.410s
16.2M swp 3.1M swp
SETI 4
real OOM KILLED 19m8.435s 19m5.584s 19m3.618 NA 19m24.081s
user kdeinit 13m24.570s 13m24.000s 13m22.520s 13m20.140s
sys 0m51.430s 0m50.320s 0m51.390s 0m52.810s
18.7M swp 8.3M swp
SETI 5
real NA 21m35.515s 21m48.543s 22m0.240s NA 22m10.033s
user 13m9.680s 13m22.030s 13m28.820s 13m12.740s
sys 0m49.910s 0m50.850s 0m52.270s 0m52.180s
31.7M swp 11.3M swp
SETI 6
real NA 24m37.167s 25m5.244s 25m13.429s NA 25m25.686s
user 13m7.650s 13m26.590s 13m32.640s 13m21.610s
sys 0m51.390s 0m51.260s 0m52.790s 0m49.590s
35.3M swp 17.1M swp
SETI 7
real NA 28m40.446s 28m3.612s 28m12.981s NA VM: killing process cc1
user 13m16.460s 13m26.130s 13m31.520s
sys 0m57.940s 0m52.510s 0m53.570s
38.8M swp 25.4M swp
SETI 8
real NA 29m31.743s 31m16.275s 32m29.534s NA
user 13m37.610s 13m27.740s 13m33.630s
sys 1m4.450s 0m52.100s 0m54.140s
(NO SWAP ;) 41.5M swp 49.7M swp
Stephan von Krawczynski wrote:
[...]
>>If you have applications which don't
>>access too much memory, you won't see the problems.
>>If you access more than 1G (and you do not just copy, but rsync e.g.)
>>and you have only 512MB of RAM, the machine swaps a lot with most current
>>2.4 kernels (patches).
>>
>
> Can you provide a simple and reproducible test case (e.g. some demo source),
> where things break? I am very willing to test it here.
>
It's easy - take a grown inn newsserver partition with reiserfs (*) (a
lot of small files and a lot of directories), about 1.3 GB or more, and
do a complete rsync of this partition to transport it somewhere else.
But you have to do it with an existing target, not an empty target, so that
rsync must scan the whole target partition, too.
I don't like special test programs. They seldom reflect reality.
What we need is a kernel that behaves well in reality - not in test cases.
And before starting the test, make sure that most of RAM is already
used for cache, buffers or applications.
I did this test with several VM patches and there are huge differences
in swap consumption between them: 319MB with 2.4.17rc2 and 59MB with
the 2.4.17 oom patch (maximum values).
It's more than a little difference :-).
Regards,
Andreas Hartmann
(*) If I had DSL, I would send it to you (as tar.gz) - but with modem,
it's a bit too much :-)!
But your squid cache should be fine, too. It has a similar structure: a
lot of small files and a lot of subdirectories. But I think that your
squid cache isn't as big as my inn partition.
"We" (Auctionwatch.com) are experiencing problems that appear to be
related to VM, I realize that this question was not directed at me:
On Fri, Jan 04, 2002 at 09:15:42PM +0100, Andreas Hartmann wrote:
> Stephan von Krawczynski wrote:
> The question is: what is the nature of your application / the load on the system? You
> wrote something about a database server. How many rows altogether? What's
Mysql running a dual 650 PIII, 2 gig ram. Rows? Dunno, but several
million tables (about 85 gig of tables averaging 45-50k IIRC).
> the size of the table(s)? How many concurrent accesses do you have? Do
We will have 2-400+ tables open at once.
> you do "easy" searches where all of the conditions are located in the
> index? How big is your index? How big is the throughput of your
> database? Do you have your tables on raw partitions (without caching; as
> you can do it with UDB)?
I don't know much about the specific design, other than I've been
told it's non-optimal.
> How big are the partitions you are mounting at once? In my case, all the
> partitions together have about 70GB (all reiserfs).
One 250G logical volume, a couple smaller ones (3 gig, 30 gig).
--
Share and Enjoy.
On Sat, 5 Jan 2002, Andreas Hartmann wrote:
> I don't like special test programs. They seldom reflect reality.
> What we need is a kernel that behaves well in reality - not in
> test cases. And before starting the test, make sure that most of RAM
> is already used for cache, buffers or applications.
OK, here's some pseudo-code for a real-world test case. I haven't had a
chance to code it up, but I'm guessing I know what it's going to do. I'd
*love* to be proved wrong :).
# build and boot a kernel with "Magic SysRq" turned on
# echo 1 > /proc/sys/kernel/sysrq
# fire up "nice --19 top" as "root"
# read "MemTotal" from /proc/meminfo
# now start the next two jobs concurrently
# write a disk file with "MemTotal" data or more in it
# perform a 2D in-place FFT of total size at least "MemTotal/2" but less
# than "MemTotal"
Watch the "top" window like a hawk. "Cached" will grow because of the
disk write and "free" will drop because the page cache is growing and
the 2D FFT is using *its* memory. Eventually the two will start
competing for the last bits of free memory. "kswapd" and "kupdated" will
start working furiously, bringing the system CPU utilization to 99+
percent. At this point the system will appear highly unresponsive.
Even with the "nice --19" setting, "top" is going to have a hard time
keeping its five-second screen updates going. You will quite possibly
end up going to the console and doing alt-sysrq-m, which dumps the
memory status on the console and into /var/log/messages. Then if you do
alt-sysrq-i, which kills everything but "init", you should be able to
log on again.
I'm going to try this on my 512 MB machine just to see what happens, but
I'd like to see what someone with a larger machine, say 4 GB, gets when
they do this. I think attempting to write a large file and do a 2D FFT
concurrently is a perfectly reasonable thing to expect an image
processing system to do in the real world. A "traditional" UNIX would do
the I/O of the file write and the compute/memory processing of the FFT
together with little or no problem. But because the 2.4 kernel insists
on keeping all those buffers around, the 2D FFT is going to have
difficulty, because it has to have its data in core.
What's worse is if the page cache gets so big that the FFT has to start
swapping. For those who aren't familiar with 2D FFTs, they take two
passes over the data. The first pass will be unit strides -- sequential
addresses. But the second pass will be large strides -- a power of two.
That second pass is going to be brutal if every page it hits has to be
swapped in!
The solution is to limit page cache size to, say, 1/4 of "MemTotal",
which I'm guessing will have a *negligible* impact on the performance of
the file write. I used to work in an image processing lab, which is
where I learned this little trick for bringing a VM to its knees, and
which is probably where the designers of other UNIX systems learned that
the memory used for buffering I/O needs to be limited :). There's
probably a VAX or two out there still that shudders when it remembers
what I did to it. :))
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
On Fri, 4 Jan 2002 17:50:50 -0600
Ken Brownfield <[email protected]> wrote:
> On Fri, Jan 04, 2002 at 02:03:21PM +0100, Stephan von Krawczynski wrote:
> [...]
> | Ok. It would be really nice to know if the -aa patches do any good at your
>
> I'd love to, but unfortunately my problems reproduce only in production,
> and -- nothing against Andrea -- I'm hesitant to deploy -aa live, since
> it hasn't received the widespread use that mainline has. I may be
> forced to soon if the VM fixes don't get merged.
I am pretty impressed by Martins test case where nearly all VM patches fail
with the exception of his own :-) The thing is, this test is not of a
"very special" nature but more like "a system driven to its limit by normal
processes". And that is the really interesting part about it.
> | Hm, I have about 24GB of NFS traffic every day, which may be too little. What
> | exactly are you seeing in this case (logfiles etc.)?
>
> Well, the nature of the problem is that the timer "slows" and stops,
> causing the machine to get more and more sluggish until it falls off the
> net and stops dead.
>
> I suspect that high IRQ rates cause the issue -- large sequential
> transfers are not necessarily culprits due to the lowish overhead.
What exactly do you mean by "high IRQ rate"? Can you show some numbers from
/proc/interrupts and uptime for clarification?
> | Hopefully nobody does this here, I don't.
>
> I don't think it's intentional, and I realize that VM changes are hard
> to swallow in a stable kernel release. I just hope that the severity
> and fairly wide negative effect is enough to make people more
> comfortable with accepting VM fixes that may be somewhat invasive.
Hm, I don't think really "big" patches are needed. Rik's is, according to Martins
test, currently no gain, as rmap flops in this test, too.
The problem is: you should really use one of your problem machines for at least
very simple testing. If you don't, you cannot really expect your problem to be
solved soon. We would need input from your side. If I were you, I'd start off
with Martins patch. It is simple (very simple indeed), small and pinned to a
single procedure. Martins test shows - under "normal" high load (not especially
IRQ) - good results and no difference under standard load; I cannot see a risk for
oops or deadlock.
Regards,
Stephan
On Sat, 5 Jan 2002 01:24:42 -0800
Petro <[email protected]> wrote:
> "We" (Auctionwatch.com) are experiencing problems that appear to be
> related to VM, I realize that this question was not directed at me:
And what exactly do the problems look like?
Regards,
Stephan
M. Edward (Ed) Borasky wrote:
> On Sat, 5 Jan 2002, Andreas Hartmann wrote:
>
>
>>I don't like special test programs. They seldom reflect reality.
>>What we need is a kernel that behaves well in reality - not in
>>testcases. And before starting the test, take care that most of RAM
>>is already used for cache, buffers, or applications.
>>
>
> OK, here's some pseudo-code for a real-world test case. I haven't had a
> chance to code it up, but I'm guessing I know what it's going to do. I'd
> *love* to be proved wrong :).
I would like to try it with the oom-patch, which needed less swap in my
tests. It could be a good test to verify the results of the rsync-test.
> # build and boot a kernel with "Magic SysRq" turned on
> # echo 1 > /proc/sys/kernel/sysrq
> # fire up "nice --19 top" as "root"
> # read "MemTotal" from /proc/meminfo
>
> # now start the next two jobs concurrently
>
> # write a disk file with "MemTotal" data or more in it
>
> # perform a 2D in-place FFT of total size at least "MemTotal/2" but less
> # than "MemTotal"
>
> Watch the "top" window like a hawk. "Cached" will grow because of the
> disk write and "free" will drop because the page cache is growing and
> the 2D FFT is using *its* memory.
Could you please tell me a program that does 2D FFT? I would like to
do this test, too!
Regards,
Andreas Hartmann
On Sat, 5 Jan 2002, Andreas Hartmann wrote:
> Could you please tell me a program that does 2D FFT? I would like to
> do this test, too!
Try http://www.fftw.org. This is a public domain (GPL I think) general
purpose FFT library. If I get a chance I'll download it this weekend and
figure out how to code a 2D FFT.
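For reference, a 2D in-place transform with the FFTW 3 interface looks
roughly like the sketch below (the older FFTW 2 interface spells these
calls differently); N is only an example size, so pick it to give the
amount of memory pressure you want (N*N*16 bytes of data).
fft2d.c:
#include <stdlib.h>
#include <fftw3.h>                 /* assumes the FFTW 3 API; link with -lfftw3 -lm */
#define N 4096                     /* 4096 * 4096 * 16 bytes = 256 MB */
int main(void)
{
    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * N * N);
    fftw_plan plan;
    long i;
    if (!data)
        return 1;
    /* plan first; in == out makes this an in-place transform */
    plan = fftw_plan_dft_2d(N, N, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
    for (i = 0; i < (long)N * N; i++) {    /* fill with something */
        data[i][0] = (double)(i & 0xff);
        data[i][1] = 0.0;
    }
    fftw_execute(plan);
    fftw_destroy_plan(plan);
    fftw_free(data);
    return 0;
}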
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
Never play leapfrog with a unicorn.
On Thursday, 03 January 2002, at 20:14:42 -0600,
M.H.VanLeeuwen wrote:
> Here is what I've run thus far. I'll add nfs file copy into the mix also...
> System: SMP 466 Celeron 192M RAM, running KDE, xosview, and other minor apps.
>
I applied your little patch "vmscan.patch.2.4.17.c" to a plain 2.4.17
source tree, recompiled, and tried it. Swap usage is _much_ less than in
original 2.4.17: hardware is a Pentium166 with 64 MB RAM and 75 MB
swap, and my workload includes X 4.1.x, icewm, nightly Mozilla, several
rxvt, gkrellm, some MP3 listening via XMMS, xchat, several links web
browsers and a couple of little daemons running.
I have not done scientific measurements of swap usage, but with your patch
it seems caches don't grow too much, and swap is usually 10-20 MB lower
than using plain 2.4.17. I have also observed in /proc/meminfo that
"Inactive:" seems to be much lower with your patch.
If someone wants more tests/numbers done/reported, just ask.
Hope this helps.
--
José Luis Domingo López
Linux Registered User #189436 Debian Linux Woody (P166 64 MB RAM)
jdomingo AT internautas DOT org => Spam at your own risk
On Sat, Jan 05, 2002 at 04:08:33PM +0100, Stephan von Krawczynski wrote:
| I am pretty impressed by Martins test case where nearly all VM patches fail
| with the exception of his own :-) The thing is, this test is not of a
| "very special" nature but more like "a system driven to its limit by normal
| processes". And that is the really interesting part about it.
One problem is that I've never heard of it and don't know where to get
it. ;)
| What exactly do you mean by "high IRQ rate"? Can you show some numbers from
| /proc/interrupts and uptime for clarification?
I did, back in the archives. I don't have easy access to archives etc,
right now, but I might repost since it's been a while.
| The problem is: you should really use one of your problem machines for at least
| very simple testing. If you don't, you cannot really expect your problem to be
| solved soon. We would need input from your side. If I were you, I'd start off
| with Martins patch. It is simple (very simple indeed), small and pinned to a
| single procedure. Martins test shows - under "normal" high load (not especially
| IRQ) - good results and no difference under standard load; I cannot see a risk for
| oops or deadlock.
Well, reboots are the problem over possible oopses (or data corruption,
even more fun.) But on your recommendation I'll give Martin's mod a
try, given a URL. Does Martin's patch play well with -aa? How about
Martin+10_vm in -pre2? ;-)
At any rate, right now there are three or four people with different VM
patch sets, probably more. There is a certain amount of work this group
can do in judging which concepts are cleaner or most suitable to 2.4.x.
It would be cool to give rmap a try, but I don't want to maintain a
2.4.x kernel with speculative features that aren't intended for 2.4.x.
I can see using patches back-ported from 2.5, but I'm a firm believer
that 2.4 should stay stable and that the benefit of 2.4 to admins is the
control by the maintainer and stability -- not the VM of the month.
I can test, but it's slow going with so many patches. And many of the
patches haven't been properly merged with any kernel (e.g., -aa 10_vm
reverting previously applied 2.4 changes, etc.)
While I've reproduced the issues and explained them here in the past,
it's difficult for me to iterate fast enough in an environment that
easily reproduces the problem. I'm iterating as fast as I can, but when
I do iterate I'd prefer some support from the maintainers or other parts
of the community that "Yes, this patch has a good chance of fixing the
specific problems we've been seeing, give it a try." Right now that
doesn't exist (with the exception of your recommendation of this Martin
patch), and that's one reason I'm hesitant to iterate too much and
affect a lot of people.
Thanks,
--
Ken.
[email protected]
|
| Regards,
| Stephan
|
On 5 January 2002 10:59, M. Edward (Ed) Borasky wrote:
> OK, here's some pseudo-code for a real-world test case. I haven't had a
> chance to code it up, but I'm guessing I know what it's going to do. I'd
> *love* to be proved wrong :).
>
> # build and boot a kernel with "Magic SysRq" turned on
> # echo 1 > /proc/sys/kernel/sysrq
> # fire up "nice --19 top" as "root"
> # read "MemTotal" from /proc/meminfo
>
> # now start the next two jobs concurrently
>
> # write a disk file with "MemTotal" data or more in it
Like dd if=/dev/zero of=/tmp/file bs=... count=... ?
> # perform a 2D in-place FFT of total size at least "MemTotal/2" but less
> # than "MemTotal"
I'm willing to try. What program can I use for FFT?
> What's worse is if the page cache gets so big that the FFT has to start
> swapping. For those who aren't familiar with 2D FFTs, they take two
> passes over the data. The first pass will be unit strides -- sequential
> addresses. But the second pass will be large strides -- a power of two.
> That second pass is going to be brutal if every page it hits has to be
> swapped in!
Can you describe FFT memory access pattern in more detail?
I'd like to write a simple testcase with similar 'bad' pattern.
--
vda
On Sat, 5 Jan 2002 15:40:53 -0600
Ken Brownfield <[email protected]> wrote:
> One problem is that I've never heard of it and don't know where to get
> it. ;)
[Sent in off-LKML mail]
> | What exactly do you mean by "high IRQ rate"? Can you show some numbers from
> | /proc/interrupts and uptime for clarification?
>
> I did, back in the archives. I don't have easy access to archives etc,
> right now, but I might repost since it's been a while.
I read all your LKML mails since the beginning of November and could find a lot
about cpu, configs, tops etc. but not a single "cat /proc/interrupts" together with
uptime.
> Well, reboots are the problem over possible oopses (or data corruption,
> even more fun.) But on your recommendation I'll give Martin's mod a
> try, given a URL. Does Martin's patch play well with -aa? How about
> Martin+10_vm in -pre2? ;-)
Judging from your mails you seem to be trying really a lot of things
to make it work out. I recommend not intermixing the patches a lot. I would
stay close to marcelo's tree and try _single_ small patches on top of that. If
you mix them up (even only two of them) you won't be able to track down very
well what is really better or worse.
One thing I would like to ask here is this (as you are dealing with oracle
stuff): why does oracle recommend to compile the kernel in 486 mode? I talked
to someone who uses oracle on 2.4.x and he told me it is even in the latest
docs. What is the voodoo behind that? Btw he has no freezes or the like, but
occasional coredumps from oracle processes, which he states as "not nice, but
no showstopper" as his clients reconnect/retransmit with only a slight delay.
This may be related to VM, thats why I will try to convince him of some patches
:-) and have a look at the coredump-frequency.
Regards,
Stephan
On Sun, 6 Jan 2002, [email protected] wrote:
> Like dd if=/dev/zero of=/tmp/file bs=... count=... ?
>
That would do it, but I was trying to give a real-world example from
image processing, like copying a large image file.
> > # perform a 2D in-place FFT of total size at least "MemTotal/2" but less
> > # than "MemTotal"
>
> I'm willing to try. What program can I use for FFT?
I use FFTW from http://www.fftw.org.
> Can you describe FFT memory access pattern in more detail?
> I'd like to write a simple testcase with similar 'bad' pattern.
Imagine a 16384 by 16384 array of double complex values. That's a 4
GByte image. Scale down to fit your machine, of course :). The first
pass will do an FFT on every row (column) if your language is C
(FORTRAN). The "stride" is 16 bytes (one complex value) in the inner
loop. Each row (column) is 16384*16 = 262144 bytes long, which works out
to 64 pages if the page size is 4096 bytes.
Then the second pass will do an FFT on every column (row). The stride is
16384*16 = 262144 bytes. This is a new page for each 16-byte complex
value you process :-). That is, all 16384 pages have to be in memory, or
swapped into memory if you've run out of real memory and the kernel has
swapped them out.
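If you just want a testcase with a similar "bad" pattern and no FFT library
at all, something like the sketch below imitates only the access pattern --
one unit-stride pass and one large-stride pass over 16-byte elements --
without computing any transform; N is a placeholder to scale to the machine
under test.
stride.c:
#include <stdlib.h>
#define N 8192L                        /* N*N*16 bytes = 1 GB for N = 8192 */
struct cplx { double re, im; };        /* 16-byte "complex" element */
int main(void)
{
    struct cplx *a = malloc(sizeof(struct cplx) * N * N);
    long row, col;
    if (!a)
        return 1;
    /* pass 1: unit stride, walk each row sequentially (16-byte steps) */
    for (row = 0; row < N; row++)
        for (col = 0; col < N; col++)
            a[row * N + col].re = 1.0;
    /* pass 2: large stride, walk each column; once the row length exceeds
       the page size every element touched lands on a different page */
    for (col = 0; col < N; col++)
        for (row = 0; row < N; row++)
            a[row * N + col].im = 2.0;
    free(a);
    return 0;
}
Once N*N*16 bytes no longer fits in free RAM, the second pass forces a page
fault (or a swap-in) for nearly every element it touches.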
Please ... *don't* try to do this on a 512 MB machine and think that an
efficient VM is gonna make it work :),
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
What phrase will you *never* hear Miss Piggy use?
"You can't make a silk purse out of a sow's ear!"
> >
> That would do it, but I was trying to give a real-world example from
> image processing, like copying a large image file.
Image processing people use tiling. Try loading a giant image into
the gimp and into a non-smart application like xpaint. The difference is
huge just by careful implementation of the algorithms.
> Then the second pass will do an FFT on every column (row). The stride is
> 16384*16 = 262144 bytes. This is a new page for each 16-byte complex
> value you process :-). That is, all 16384 pages have to be in memory, or
> swapped into memory if you've run out of real memory and the kernel has
> swapped them out.
Yes but you don't do it that way, you do stripes of parallel fft
computations. We can all write dumb programs that don't behave well with the
VM layer.
Alan
2.4.17rc2aa2 seems to handle the growing/shrinking of buffer_head,
dentry_cache, inode_cache slabs and page/buffer caches pretty
well.
Machine has 1GB RAM and CONFIG_NOHIGHMEM=y
1027 MB swap.
root# dmesg|grep Mem
Memory: 901804k/917504k available (1049k kernel code, 15312k reserved, 259k data, 236k init, 0k highmem)
root# uptime
2:47pm up 4 days, 17:04, 12 users, load average: 1.00, 1.00, 1.00
# Run updatedb to expand the size of dentry_cache.
root# updatedb
# At first, slabs are fairly large.
root# egrep 'inode_cache|dentry_cache|buffer_head' /proc/slabinfo
inode_cache 469072 469168 512 67024 67024 1
dentry_cache 539651 539670 128 17989 17989 1
buffer_head 119438 126330 128 4211 4211 1
Machine has been up for a while, compiling things, running tests, etc,
so page and buffer caches are large too. That's expected/desirable
as there is no memory shortage.
root# free -mt
total used free shared buffers cached
Mem: 880 868 11 0 152 316
-/+ buffers/cache: 400 480
Swap: 1027 0 1027
Total: 1908 868 1039
# count processes
root# ps axu|wc -l
64
# allocate and write to 99% of virtual memory
root# time ./mtest01 -p 99 -w
Filling up 99% of ram which is 1855075778d bytes
PASS ... 1855979520 bytes allocated.
real 0m39.006s
user 0m8.610s
sys 0m7.170s
# notice how much the slabs shrank compared to above.
root# egrep 'inode_cache|dentry_cache|buffer_head' /proc/slabinfo
inode_cache 254 658 512 94 94 1
dentry_cache 193 1140 128 38 38 1
buffer_head 8609 43650 128 1455 1455 1
# Memory allocated to buffers/cache has decreased considerably too.
root# free -mt
total used free shared buffers cached
Mem: 880 71 809 0 29 6
-/+ buffers/cache: 35 845
Swap: 1027 27 1000
Total: 1908 98 1810
# no processes killed by OOM
root# ps aux|wc -l
64
# Now force an OOM condition
root# time ./mtest01 -p 101 -w
Filling up 101% of ram which is 1916602941d bytes
Killed
real 0m40.083s
user 0m8.770s
sys 0m7.270s
Note the time to notice and kill mtest01 because of OOM is short.
39 seconds to allocate 1855979520 bytes of VM.
40 seconds to allocate 1916602941 bytes and kill with OOM handler.
# Only mtest01 killed by OOM
root# ps axu|wc -l
64
In the experiment above, it appears the rc2aa2 VM shrinks slabs and
page/buffer caches in a reasonable way when a process needs a lot
of memory.
--
Randy Hron
You're right ... no one does an *out-of-core* 2D FFT using VM. What I am
saying is that a large page cache can turn an *in-core* 2D FFT -- a 4 GB
case on an 8 GB machine, for example -- into an out-of-core one!
One other data point: on my stock Red Hat 7.2 box with 512 MB of RAM, I ran
a Perl script that builds a 512 MByte hash, a second Perl script which
creates a 512 MByte disk file, and the check pass of FFTW concurrently. As I
expected, the two Perl scripts competed for RAM and slowed down FFTW. What
was even more interesting, though, was that the VM apparently functions
correctly in this instance. All three of the processes were getting CPU
cycles. And I never saw "kswapd" or "kupdated" take over the system.
Although the page cache did get large at one point, once the hash builder
got to about 400 MBytes in size, the "cached" piece shrunk to about 10
MBytes and most of the RAM got allocated to the hash builder, as did
appropriate amounts of swap. In short, the kernel in Red Hat 7.2 with under
1 GByte of memory is behaving well under memory pressure. It looks like it's
kernels beyond that one that have the problems, and also systems with more
than 1 GByte. If I had the money, I'd stuff some more RAM in the machine and
see if I could isolate this a little further. If anyone wants my Perl
scripts, which are trivial, let me know.
--
M. Edward Borasky
[email protected]
http://www.borasky-research.net
On Sat, 5 Jan 2002, Stephan von Krawczynski wrote:
> I am pretty impressed by Martins test case where nearly all VM patches
> fail with the exception of his own :-)
No big wonder if both -aa and -rmap only get tested without swap ;)
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Sun, 6 Jan 2002, Rik van Riel wrote:
> On Sat, 5 Jan 2002, Stephan von Krawczynski wrote:
>
> > I am pretty impressed by Martins test case where nearly all VM patches
> > fail with the exception of his own :-)
>
> No big wonder if both -aa and -rmap only get tested without swap ;)
To be clear ... -aa and -rmap should of course also work
nicely without swap, no excuses for the bad behaviour
shown in Martin's test, but at the moment they simply
don't seem tuned for it.
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On Sat, Jan 05, 2002 at 04:44:05PM +0100, Stephan von Krawczynski wrote:
> On Sat, 5 Jan 2002 01:24:42 -0800
> Petro <[email protected]> wrote:
>
> > "We" (Auctionwatch.com) are experiencing problems that appear to be
> > related to VM, I realize that this question was not directed at me:
>
> And what exactly do the problems look like?
After some time, ranging from 1 to 48 hours, mysql quits in an
unclean fashion (dies leaving tables improperly closed) with a dump
in the mysql log file that looks like:
> Here is the stack dump:
> 0x807b75f handle_segfault__Fi + 383
> 0x812bcaa pthread_sighandler + 154
> 0x815059c chunk_free + 596
> 0x8152573 free + 155
> 0x811579c my_no_flags_free + 16
> 0x80764d5 _._5ilink + 61
> 0x807b48d end_thread__FP3THDb + 53
> 0x80809cc handle_one_connection__FPv + 996
Which the Mysql support team says appears to be memory corruption.
Since this has happened on 4 different machines, and one of them had
memtest86 run on it (coming up clean), they seem (witness Sasha's
post) to think this may have something to do with the memory
handling in the kernel.
I haven't run it on a kernel that has debugging enabled yet,
partially because I've been tracing a completely unrelated problem
with our hard drives (IBM GXP 75G drives made in Hungary during the
first 3 months of 2001), and partially because the only way to get
this to happen is to put the database in production, which results
in a crash, which takes our site offline, which costs us money and
pisses off our users. Right now we're running on a sun e4500, and
it's stable, so until we get the other problem worked out, we're
waiting to see on this one.
--
Share and Enjoy.
On Mon, 7 Jan 2002 00:22:09 -0200 (BRST)
Rik van Riel <[email protected]> wrote:
> On Sun, 6 Jan 2002, Rik van Riel wrote:
> > On Sat, 5 Jan 2002, Stephan von Krawczynski wrote:
> >
> > > I am pretty impressed by Martins test case where nearly all VM patches
> > > fail with the exception of his own :-)
> >
> > No big wonder if both -aa and -rmap only get tested without swap ;)
>
> To be clear ... -aa and -rmap should of course also work
> nicely without swap, no excuses for the bad behaviour
> shown in Martin's test, but at the moment they simply
> don't seem tuned for it.
Good to hear we agree it _should_ work. When does it (rmap)?
;-)
Regards,
Stephan
On Sun, 6 Jan 2002 23:15:31 -0800
Petro <[email protected]> wrote:
> On Sat, Jan 05, 2002 at 04:44:05PM +0100, Stephan von Krawczynski wrote:
> > On Sat, 5 Jan 2002 01:24:42 -0800
> > Petro <[email protected]> wrote:
> >
> > > "We" (Auctionwatch.com) are experiencing problems that appear to be
> > > related to VM, I realize that this question was not directed at me:
> >
> > And what exactly do the problems look like?
>
> After some time, ranging from 1 to 48 hours, mysql quits in an
> unclean fashion (dies leaving tables improperly closed) with a dump
> in the mysql log file that looks like:
mysql question: is this a binary from some distro or self-compiled? If
self-compiled can you show your ./configure paras, please?
> Which the Mysql support team says appears to be memory corruption.
> Since this has happened on 4 different machines, and one of them had
> memtest86 run on it (coming up clean), they seem (witness Sasha's
> post) to think this may have something to do with the memory
> handling in the kernel.
There is a big difference between memory _corruption_ and a VM deficiency. No
app can cope with a _corruption_ and is perfectly allowed to core dump or exit
(or trash your disk). But this should not happen on allocation failures.
Unless all your RAM is from the same series I do not really believe in mem
corruption. I would try Martins small VM patch, as it looks like being a bit
more efficient in low mem conditions and this may well be the case you are
running into. This means 2.4.17 standard + patch.
Regards,
Stephan
On Sun, 6 Jan 2002 15:38:54 -0500
[email protected] wrote:
> In the experiment above, it appears the rc2aa2 VM shrinks slabs and
> page/buffer caches in a reasonable way when a process needs a lot
> of memory.
Hello Randy,
can you please try The Same Thing while copying large files around in the
background (let's say 100MB files) and re-comment.
Thanks,
Stephan
> Re: [2.4.17/18pre] VM and swap - it's really unusable
>
>
> I did some tests with different VM-patches. I tested one ac-patch, too.
> I observed the same as you described - but the memory consumption and
> the behaviour overall isn't better. If you want to, you can test another
> patch, which worked best in my test. It's nearly as good as kernel
> 2.2.x. Ask M.H.vanLeeuwen ([email protected]) for his oom-patch to
> kernel 2.4.17.
> But beware: maybe this strategy doesn't fit your applications. And
> it's not for productive use.
> I, and surely some others too, would be interested in your experience
> with this patch.
>
Hi,
so I took the M.H.vL vmscan.c patch for 2.4.17 and it is a definite
winner for me. Sounds like 2.4.18 material.
Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759
On Mon, 07 Jan 2002 18:41:29 +0100
Martin Knoblauch <[email protected]> wrote:
> Hi,
>
> so I took the M.H.vL vmscan.c patch for 2.4.17 and it is a definite
> winner for me. Sounds like 2.4.18 material.
What issues have been solved or become remarkably better for you with Martins
patch?
Regards,
Stephan
Stephan von Krawczynski wrote:
>
> On Mon, 07 Jan 2002 18:41:29 +0100
> Martin Knoblauch <[email protected]> wrote:
>
> > Hi,
> >
> > so I took the M.H.vL vmscan.c patch for 2.4.17 and it is a definite
> > winner for me. Sounds like 2.4.18 material.
>
> What issues have been solved or become remarkably better for you with Martins
> patch?
>
> Regards,
> Stephan
massive reduction of swapped out applications in the light of a 220+
MiB Cache. This on a 320 MiB System with about 2x swap. As a result,
interactive behaviour (this is a notebook) is much better. Before the
patch I had about 50-60 MB in swap for my workload (vmware + netscape +
kernel compile + updatedb). Now it is below 5 MB, although this is
still too much (yes, I know, I could turn off swap completely :-)
I also took the OOM-Killer patch, although it never was a real concern
for me.
Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759
On Mon, Jan 07, 2002 at 03:33:48PM +0100, Stephan von Krawczynski wrote:
> On Sun, 6 Jan 2002 23:15:31 -0800
> Petro <[email protected]> wrote:
> > On Sat, Jan 05, 2002 at 04:44:05PM +0100, Stephan von Krawczynski wrote:
> > > On Sat, 5 Jan 2002 01:24:42 -0800
> > > Petro <[email protected]> wrote:
> > > > "We" (Auctionwatch.com) are experiencing problems that appear to be
> > > > related to VM, I realize that this question was not directed at me:
> > > And what exactly do the problems look like?
> > After some time, ranging from 1 to 48 hours, mysql quits in an
> > unclean fashion (dies leaving tables improperly closed) with a dump
> > in the mysql log file that looks like:
> mysql question: is this a binary from some distro or self-compiled? If
> self-compiled can you show your ./configure paras, please?
It's the binary from mysql.com.
> > Which the Mysql support team says appears to be memory corruption.
> > Since this has happened on 4 different machines, and one of them had
> > memtest86 run on it (coming up clean), they seem (witness Sasha's
> > post) to think this may have something to do with the memory
> > handling in the kernel.
> There is a big difference between memory _corruption_ and a VM deficiency. No
> app can cope with a _corruption_ and is perfectly allowed to core dump or exit
> (or trash your disk). But this should not happen on allocation failures.
> Unless all your RAM is from the same series I do not really believe in mem
> corruption. I would try Martins small VM patch, as it looks like being a bit
> more efficient in low mem conditions and this may well be the case you are
> running into. This means 2.4.17 standard + patch.
Is there a reasonable chance that martins patch will get mainlined
in the near future? One of the big reasons I chose to upgrade to a
later kernel version (from 2.4.8ac<something>+LVMpatches+...) was
to get away from having to apply patches (and document which
patches and where to get them etc).
If this is the route I have to go, I'll do it but, well, I'm not
that comfortable with it.
--
Share and Enjoy.
On Mon, 7 Jan 2002, Stephan von Krawczynski wrote:
> > To be clear ... -aa and -rmap should of course also work
> > nicely without swap, no excuses for the bad behaviour
> > shown in Martin's test, but at the moment they simply
> > don't seem tuned for it.
>
> Good to hear we agree it _should_ work. When does it (rmap)?
> ;-)
I integrated Ed Tomlinson's patch today and have made
one more small change. With the patches I ran here, things
worked fine, the system avoids OOM now.
Problem is, it doesn't seem to want to run the OOM
killer when needed, at least not any time soon. I need
to check out this code again later.
Anyway, rmap-11 should work fine for your test. ;)
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
> On Mon, Jan 07, 2002 at 03:33:48PM +0100, Stephan von Krawczynski wrote:
> > mysql question: is this a binary from some distro or self-compiled? If
> > self-compiled can you show your ./configure paras, please?
>
> It's the binary from mysql.com.
Beta or stable release?
> > [...] I would try Martins small VM patch, as it looks like being a bit
> > more efficient in low mem conditions and this may well be the case you are
> > running into. This means 2.4.17 standard + patch.
>
> Is there a reasonable chance that martins patch will get mainlined
> in the near future?
I really can't know. But to me the results look interesting enough to
give it a try on certain problem situations (like yours) to find out
if it is any better than the stock version. If you and others can
confirm that things get better then I have no real doubts that Marcelo
can pick it up.
> One of the big reasons I chose to upgrade to a
> later kernel version (from 2.4.8ac<something>+LVMpatches+...) was
> to get away from having to apply patches (and document which
> patches and where to get them etc).
Well, there is really nothing wrong with upgrading mainline kernels,
as they are getting better with every release, so I would always
suggest to take the releases up, let's say, a week after being out. Only
your situation maybe can help to improve more, if you input some of
your experiences in LKML with a patch like Martins. Feedback _is_
required to find a solution to an existing problem.
> If this is the route I have to go, I'll do it but, well, I'm not
> that comfortable with it.
Well, my suggestions: don't patch around too much, but try single
patches on stock kernel and evaluate them here.
Regards,
Stephan
> can you please try The Same Thing while copying large files around in the
> background (let's say 100MB files) and re-comment.
>
> Thanks,
> Stephan
I used the script below to create ten 330-megabyte files, then cpio'd them
over to another filesystem while another process was constantly allocating
and writing to 85% of VM. It did 77 iterations allocating about 1.6GB of
RAM during the time (10 * 330 * 2) MB worth of files were created and cpio'd.
The test took 53 minutes.
I do most of my work from console.
http://home.earthlink.net/~rwhron/hardware/matrox.html
There was an obvious but not painful delay for interactive use during
the test. An X desktop would be less kind to the user.
Kernel: 2.4.17rc2aa2
Memory: 901804k/917504k available
1027 MB swap.
Athlon 1333
(1) 40GB 7200 rpm IDE disk.
#!/bin/bash
# What does the VM do when copying big files and a memory hog is running.
cmd=${0##*/}
kern=$(uname -r)
typeset -i i
typeset -i j
log=${cmd}-${kern}.log
src=/usr/src/sources
dest=/opt/testing
mtest01=$src/l/ltp*206/testcases/bin/mtest01
percent=85 # vm to fill
uname -a > $log
# create memhog script to run during test
cat>memhog<<EOF
#!/bin/bash
while :
do # allocate and write to $percent of virtual memory
uptime
head -3 /proc/meminfo
egrep 'inode|dentry|buffer' /proc/slabinfo
time $mtest01 -p $percent -w 2>&1
head -3 /proc/meminfo
egrep 'inode|dentry|buffer' /proc/slabinfo
done
EOF
# execute memhog in background
chmod +x memhog
./memhog >> $log 2>&1 &
mpid=$!
# timer
SECONDS=0
vmstat 15 >> $log &
updatedb
rm -rf $src/?/bigfile $dest
mkdir $dest
df -k /opt /usr/src >> $log
# create 10 big files
for ((i=0; i<=9; i++))
do cat $src/[gbl]/*.{gz,bz2} > $src/$i/bigfile
done
ls -l $src/?/bigfile >> $log
size=$(ls -l $src/?/bigfile|awk '{print $5}'|uniq)
num=$(ls $src/?/bigfile|wc -w)
# copy the files
cd $src
find [0-9] -name bigfile|cpio -pdm $dest >> $log
sync
df -k /opt /usr/src
echo $SECONDS to create and copy $num $size byte files >> $log
# kill memhog and vmstat
kill $! $mpid
rm -rf $src/?/bigfile $dest
echo "log is in $log"
--
Randy Hron
Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
-rmap?
Andrew Morton's read-latency.patch is a clear winner for me, too.
What about 00_nanosleep-5 and bootmem?
The O(1) scheduler?
Maybe preemption? It is disengageable so nobody should be harmed but we get
the chance for wider testing.
Any comments?
Thanks,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Tue, Jan 08, 2002 at 02:43:42AM +0100, Stephan von Krawczynski wrote:
> > On Mon, Jan 07, 2002 at 03:33:48PM +0100, Stephan von Krawczynski wrote:
> > > mysql question: is this a binary from some distro or self-compiled? If
> > > self-compiled can you show your ./configure paras, please?
> >
> > It's the binary from mysql.com.
>
> Beta or stable release?
Stable.
> > > more efficient in low mem conditions and this may well be the case you are
> > > running into. This means 2.4.17 standard + patch.
> > Is there a reasonable chance that martins patch will get mainlined
> > in the near future?
>
> I really can't know. But to me the results look interesting enough to
> give it a try on certain problem situations (like yours) to find out
> if it is any better than the stock version. If you and others can
> confirm that things get better then I have no real doubts that Marcelo
> can pick it up.
Out of ignorance and laziness, where is it again that I can get this
kernel?
> > One of the big reasons I chose to upgrade to a
> > later kernel version (from 2.4.8ac<something>+LVMpatches+...) was
> > to get away from having to apply patches (and document which
> > patches and where to get them etc).
>
> Well, there is really nothing wrong with upgrading mainline kernels,
Funny, I went from a working 2.4.8-ac<x> to a non-working
2.4.13+patches when I started getting these crashes. At first I
thought they were Mysql, so I called them. They said "Re-install
windows", er, I mean upgrade my kernel to 2.4.16, which would "fix
the problem", so I did, and it didn't. So they said to go to
2.4.17rc2 as that would fix my problem, only it didn't.
> as they are getting better with every release, so I would always
> suggest to take the releases up, let's say, a week after being out. Only
Yeah, and build a debian package, distribute it to (looks behind me)
100+ linux servers, including 6 mission critical heavily loaded DB
machines.
Not to be a complete asswipe, but no. While I like playing with
computers and all that, I don't have enough hours in the day to be
rolling out new kernels every couple weeks and still have time left
over to see my wife, shoot my guns, ride my motorcycles and drink my
scotch.
> your situation maybe can help to improve more, if you input some of
> your experiences in LKML with a patch like Martins. Feedback _is_
> required to find a solution to an existing problem.
I understand completely, I'm just trying to figure out a way to test
this that doesn't impact my site as drastically. See, we've only got
two databases that will cause this fault, and of course they are the
two most important ones, and the only way we can generate this fault
is to put them live and wait for them to crash.
> > If this is the route I have to go, I'll do it but, well, I'm not
> > that comfortable with it.
>
> Well, my suggestions: don't patch around too much, but try single
> patches on stock kernel and evaluate them here.
There are 2 other patches I need to apply, the first is the LVM
1.0.1 patch, and the second is the VFS-lock patch. We need these to
do snapshots. Which isn't bad, but I'm about the only one still here
who can do it (violates hit-by-a-bus rule).
--
Share and Enjoy.
On Mon, 7 Jan 2002 00:22:09 -0200 (BRST)
Rik van Riel <[email protected]> wrote:
> On Sun, 6 Jan 2002, Rik van Riel wrote:
> > On Sat, 5 Jan 2002, Stephan von Krawczynski wrote:
> >
> > > I am pretty impressed by Martins test case where nearly all VM patches
> > > fail with the exception of his own :-)
> >
> > No big wonder if both -aa and -rmap only get tested without swap ;)
>
> To be clear ... -aa and -rmap should of course also work
> nicely without swap, no excuses for the bad behaviour
> shown in Martin's test, but at the moment they simply
> don't seem tuned for it.
Hi Rik,
My original goal was to find a VM that worked well w/o swap, not to show how each
one acted with and without swap. Failing to find what I thought satisfactory I
started my venture into VM by grepping the source for "Out of Memory:" and worked
backwards from there not so very long ago...
I did not swap test any VM until asked how my patch benched w/ respect to
the stock kernel w/ swap; others weren't left out intentionally, I just was never asked.
Now that you "implied ;)" that it would be nice to see what other VM solutions look
like with my test case, I've attached an updated "boring column of numbers".
Here is how I would rank current VM options w/ respect to the _one_ test case I run.
(Stephan has asked for an NFS server variant w/ my patch but I haven't tested it yet)
          1st  2nd    3rd    4th
w/o swap  MH   AA     STOCK  RMAP   (ranked on VM pressure)
w/ swap   AA   STOCK  MH     RMAP   (ranked on sys time)
Observations:
a. rmap 10c, failed earlier than stock kernel thus doesn't qualify for use in my "no
swap" systems and it is still undergoing change and isn't mainline kernel yet; will
revisit at intervals.
b. rc2aa2, while much better than stock has unexpected behaviour(s) I didn't like.
Namely, the konsole window running the tests was gone and the KDE task bar was missing,
as well as some other windows, and there was no indication by the kernel that these
processes were killed, only "VM: killing process cc1" and a few 0 order allocation
failures (or something like that) at the end of the test.
c. mh patch probably doesn't swap enough but thus far there have been no bad reports
that can be directly attributed to this rather small change to vmscan.c
1. it works well under extreme VM pressure
2. its performance is on par with the stock kernel and aa2
I glanced at rc2aa2 and it seems to be doing essentially what my patch does with a _lot_
of extra knobs, some of which I don't like, and I think I know why rc2aa2 doesn't perform
quite as well under extreme pressure, but I don't have the luxury of time to [dis]prove
my ideas yet.
Haven't looked at rmap code yet...
Martin
On Sun, Jan 06, 2002 at 04:48:13PM +0100, Stephan von Krawczynski wrote:
[...]
| I read all your LKML mails since beginning of November, could find a lot about
| cpu, configs,tops etc but not a single "cat /proc/interrupts" together with
| uptime.
http://web.irridia.com/info/linux/APIC/
This was published back in the beginning (4/2001), and additional stuff
sent to Alan and Manfred for debugging. I was pushing my problem on
LKML for a couple of weeks, but without much feedback I'm sticking to my
workaround.
This also feeds back to my earlier thoughts on some kind of LKML summary
page of patches and problem reports for those disinclined to wade
through the high LKML traffic. It's hard for me, much less you, to go
back through the archives manually...
| Judging from your mails you seem to be trying really a lot of things
| to make it work out. I recommend not intermixing the patches a lot. I would
| stay close to marcelo's tree and try _single_ small patches on top of that. If
| you mix them up (even only two of them) you won't be able to track down very
| well what is really better or worse.
Actually, that's why I don't test -aa. Whatever Marcelo chooses to
include, I'll trust it in its entirety. But I've tested, for example,
Linus' locked memory patch, and a couple of Andrew's isolated patches,
all applied to mainline with nothing else. I can't try -aa because it
has interdependencies and unintentional (I assume) backouts of code.
| One thing I would like to ask here is this (as you are dealing with oracle
| stuff): why does oracle recommend to compile the kernel in 486 mode? I talked
| to someone who uses oracle on 2.4.x and he told me it is even in the latest
| docs. What is the voodoo behind that? Btw he has no freezes or the like, but
| occasional coredumps from oracle processes, which he states as "not nice, but
| no showstopper" as his clients reconnect/retransmit with only a slight delay.
| This may be related to VM, thats why I will try to convince him of some patches
| :-) and have a look at the coredump-frequency.
I haven't had any problems with Oracle at all since Linus' locked memory
patch back in the 2.4.14-15ish days. This on a 4GB 6-way Xeon with
ext2, reiser, couple of other complications, with the kernel compiled
for P3. I really don't know what would cause Oracle to misbehave with
an i686 kernel that wouldn't be a kernel bug.
Perhaps a gcc-related bug? I'm still using 2.91.66 for kernels,
although I've used 2.95.x with no problems. I'm not touching 2.96.x
with a ten-foot pole, waiting instead for a sane 3.x one of these years.
I think Oracle (the company) is a little short of tooth on Linux
experience, since for example and AFAIK they never discovered the fatal
2.4 locked memory problem -- that took Google's report and to a much
lesser extent my later discovery of the same problem.
--
Ken.
[email protected]
On Mon, Jan 07, 2002 at 07:10:01PM -0800, Petro wrote:
> > can pick it up.
>
> Out of ignorance and laziness, where is it again that I can get this
> kernel?
Let me rephrase that.
Out of ignorance and laziness, exactly which patch is it that I
need, and where can I find it?
--
Share and Enjoy.
On Tue, 8 Jan 2002, Dieter Nützel wrote (passim):
> Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
> -rmap?
[...]
> Maybe preemption? It is disengageable so nobody should be harmed but we get
> the chance for wider testing.
>
> Any comments?
preemption?? this is eventually 2.5 stuff, and should not be integrated
into 2.4 stable tree. Of course a backport is possible, when/if it will be
quite well tested and well working on 2.5
On Tue, Jan 08, 2002 at 11:55:59AM +0100, Luigi Genoni wrote:
>
>
> On Tue, 8 Jan 2002, Dieter Nützel wrote (passim):
>
> > Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
> > -rmap?
> [...]
> > Maybe preemption? It is disengageable so nobody should be harmed but we get
> > the chance for wider testing.
> >
> > Any comments?
> preemption?? this is eventually 2.5 stuff, and should not be integrated
indeed ("eventually" in the italian sense btw, obvious to me, but not
for l-k).
I'm not against preemption (I can see the benefits for the mean
latency for real time DSP) but the claims about preemption making the
kernel faster don't make sense to me. More frequent scheduling,
the overhead of branches in the locks (you have to conditional_schedule after
the last preemption lock is released, plus the cachelines for the per-cpu
preemption locks) and the other preemption stuff can only make the
kernel slower. Furthermore, for multimedia playback any sane kernel out
there with lowlatency fixes applied will work as well as a preemption
kernel that pays for all the preemption overhead.
As for the other claim that as the kernel becomes more granular
performance will increase with preemption in the kernel, that's obviously
wrong as well, it's clearly the other way around. Maybe what was meant was
"latency will decrease further"; that's right, but performance will also
decrease, if anything.
So yes, mean latency will decrease with preemptive kernel, but your CPU
is definitely paying something for it.
> into 2.4 stable tree. Of course a backport is possible, when/if it will be
> quite well tested and well working on 2.5
>
>
>
>
Andrea
> So yes, mean latency will decrease with preemptive kernel, but your CPU
> is definitely paying something for it.
And Andrew Morton's work suggests he can do a much better job of
reducing latency than -preempt.
Anton
On 20020108 Dieter Nützel wrote:
>Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
>-rmap?
>Andrew Morton's read-latency.patch is a clear winner for me, too.
>What about 00_nanosleep-5 and bootmem?
>The O(1) scheduler?
>Maybe preemption? It is disengageable so nobody should be harmed but we get
>the chance for wider testing.
>
>Any comments?
>
I would prefer the ton of small, useful and safe bits in Andrea's kernel
(vm-21, cache-aligned-spinlocks, compiler, gcc3, rwsem, highmem fixes...)
--
J.A. Magallon # Let the source be with you...
mailto:[email protected]
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.18-pre2-beo #1 SMP Tue Jan 8 03:18:18 CET 2002 i686
Just to throw in my EUR 0.02:
We're using 2.4.17-rc2aa2 on a quite huge database server. This is the most
stable and performant kernel IMHO since we tested the new 2.4.>10 series.
The machine has 4 GB main memory, the DB process mallocs about 3 GB memory
spawning about 150 threads. "Older" kernels had the problem that shortly after
logging in the machine started swapping heavily, leaving the system unusable
for minutes.
2.4.17-rc2aa2 does not have this problem (so far) - even under DB loads >
15 the system is still responsive.
Good Work!
--
Markus Doehr AUBI Baubschlaege GmbH
IT Admin/SAP R/3 Basis Zum Grafenwald
fon: +49 6503 917 152 54411 Hermeskeil
fax: +49 6503 917 190 http://www.aubi.de
On January 8, 2002 02:33 pm, Anton Blanchard wrote:
> Andrea Arcangeli [apparently] wrote:
> > So yes, mean latency will decrease with preemptive kernel, but your CPU
> > is definitely paying something for it.
>
> And Andrew Morton's work suggests he can do a much better job of
> reducing latency than -preempt.
That's not a particularly clueful comment, Anton. Obviously, any
latency-busting hacks that Andrew does could also be patched into a
-preempt kernel.
What a preemptible kernel can do that a non-preemptible kernel can't is:
reschedule exactly as often as necessary, instead of having lots of extra
schedule points inserted all over the place, firing when *they* think the
time is right, which may well be earlier than necessary.
The preemptible approach is much less of a maintenance headache, since
people don't have to be constantly doing audits to see if something changed,
and going in to fiddle with scheduling points.
Finally, with preemption, rescheduling can be forced with essentially zero
latency in response to an arbitrary interrupt such as IO completion, whereas
the non-preemptive kernel will have to 'coast to a stop'. In other words,
the non-preemptive kernel will have little lags between successive IOs,
whereas the preemptive kernel can submit the next IO immediately. So there
are bound to be loads where the preemptive kernel turns in better latency
*and throughput* than the scheduling point hack.
Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
it's good to be aware of why that approach can't equal the latency-busting
performance of the preemptive approach.
--
Daniel
On Tue, 8 Jan 2002, Dieter Nützel wrote:
> Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
> -rmap?
-rmap is 2.5 stuff.
I would really like to integrate -aa stuff as soon as I can understand
_why_ Andrea is doing those changes.
Note that people will _always_ complain about VM: It will always be
possible to optimize it to some case and cause harm to other cases.
I'm not saying that VM is perfect right now: It for sure has problems.
> Andrew Morton's read-latency.patch is a clear winner for me, too.
AFAIK Andrew's code simply adds schedule points around the kernel, right?
If so, nope, I do not plan to integrate it.
> What about 00_nanosleep-5 and bootmem?
What is 00_nanosleep-5 and bootmem ?
> The O(1) scheduler?
2.5 stuff.
> Maybe preemption? It is disengageable so nobody should be harmed but we get
> the chance for wider testing.
2.5 too.
I stayed at work all night banging out tests on a few of our machines
here. I took 2.4.18-pre2 and 2.4.18-pre2 with the vmscan patch from
"M.H.VanLeeuwen" <[email protected]>.
My sustained test consisted of this type of load:
ls -lR / > /dev/null &
/usr/bin/slocate -u -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net" &
dd if=/dev/sda3 of=/sda3 bs=1024k &
# Hit TUX on this machine repeatedly; html page with 1000 images
# Wait for memory to be mostly used by buff/page cache
./a.out &
# repeat finished commands -- keep all commands running
# after a.out finishes, alow buff/page to refill before repeating
The a.out in this case is a little program (attached, c.c) to allocate
and write to an amount of memory equal to physical RAM. The example I
chose below is from a 2xP3/600 with 1GB of RAM and 2GB swap.
This was not a formal benchmark -- I think benchmarks have been
presented before by other folks, and looking at benchmarks does not
necessarily indicate the real-world problems that exist. My intent was
to reproduce the issues I've been seeing, and then apply the MH (and
only the MH) patch and observe.
2.4.18-pre2
Once slocate starts and gets close to filling RAM with buffer/page
cache, kupdated and kswapd have periodic spikes of 50-100% CPU.
When a.out starts, kswapd and kupdated begin to eat significant portions
of CPU (20-100%) and I/O becomes more and more sluggish as a.out
allocates.
When a.out uses all free RAM and should begin eating cache, significant
swapping begins and cache is not decreased significantly until the
machine goes 100-200MB into swap.
Here are two readprofile outputs, sorted by ticks and load.
229689 default_idle 4417.0962
4794 file_read_actor 18.4385
405 __rdtsc_delay 14.4643
3763 do_anonymous_page 14.0410
3796 statm_pgd_range 9.7835
1535 prune_icache 6.9773
153 __free_pages 4.7812
1420 create_bounce 4.1765
583 sym53c8xx_intr 3.9392
221 atomic_dec_and_lock 2.7625
5214 generic_file_write 2.5659
273464 total 0.1903
234168 default_idle 4503.2308
5298 generic_file_write 2.6073
4868 file_read_actor 18.7231
3799 statm_pgd_range 9.7912
3763 do_anonymous_page 14.0410
1535 prune_icache 6.9773
1526 shrink_cache 1.6234
1469 create_bounce 4.3206
643 rmqueue 1.1320
591 sym53c8xx_intr 3.9932
505 __make_request 0.2902
2.4.18-pre2 with MH
With the MH patch applied, the issues I witnessed above did not seem to
reproduce. Memory allocation under pressure seemed faster and smoother.
kswapd never went above 5-15% CPU. When a.out allocated memory, it did
not begin swapping until buffer/page cache had been nearly completely
cannibalized.
And when a.out caused swapping, it was controlled and behaved like you
would expect the VM to behave -- slowly swapping out unused pages
instead of large swap write-outs without the patch.
Martin, have you done throughput benchmarks with MH/rmap/aa, BTW?
But both kernels still seem to be sluggish when it comes to doing small
I/O operations (vi, ls, etc) during heavy swapping activity.
Here are the readprofile results:
206243 default_idle 3966.2115
6486 file_read_actor 24.9462
409 __rdtsc_delay 14.6071
2798 do_anonymous_page 10.4403
185 __free_pages 5.7812
1846 statm_pgd_range 4.7577
469 sym53c8xx_intr 3.1689
176 atomic_dec_and_lock 2.2000
349 end_buffer_io_async 1.9830
492 refill_inactive 1.8358
94 system_call 1.8077
245776 total 0.1710
216238 default_idle 4158.4231
6486 file_read_actor 24.9462
2799 do_anonymous_page 10.4440
1855 statm_pgd_range 4.7809
1611 generic_file_write 0.7928
839 __make_request 0.4822
820 shrink_cache 0.7374
540 rmqueue 0.9507
534 create_bounce 1.5706
492 refill_inactive 1.8358
487 sym53c8xx_intr 3.2905
There may be significant differences in the profile outputs for those
with VM fu.
Summary: MH swaps _after_ cache has been properly cannibalized, and
swapping activity starts when expected and is properly throttled.
kswapd and kupdated don't seem to go into berserk 100% CPU mode.
At any rate, I now have the MH patch (and Andrew Morton's mini-ll and
read-latency2 patches) in production, and I like what I see so far. I'd
vote for them to go into 2.4.18, IMHO. Maybe the full low-latency patch
if it's not truly 2.5 material.
My next cook-off will be with -aa and rmap, although if the rather small
MH patch fixes my last issues it may be worth putting all VM effort into
a 2.5 VM cook-off. :) Hopefully the useful stuff in -aa can get pulled
in at some point soon, though.
Thanks much to Martin H. VanLeeuwen for his patch and Stephan von
Krawczynski for his recommendations. I'll let MH cook for a while and
I'll follow up later.
--
Ken.
[email protected]
c.c:
#include <stdio.h>
#include <stdlib.h>	/* malloc(), exit() */
#include <string.h>	/* memcpy() */
#include <unistd.h>	/* sleep() */

#define MB_OF_RAM 1024

int
main()
{
    long stuffsize = MB_OF_RAM * 1048576 ;
    char *stuff ;

    if ( stuff = (char *)malloc( stuffsize ) ) {
        long chunksize = 1048576 ;
        long c ;

        /* write the first 1 MB chunk byte by byte ... */
        for ( c=0 ; c<chunksize ; c++ )
            *(stuff+c) = '\0' ;
        /* ... then replicate it to dirty the rest of the allocation */
        /* hack; last chunk discarded if stuffsize%chunksize != 0 */
        for ( ; (c+chunksize)<stuffsize ; c+=chunksize )
            memcpy( stuff+c, stuff, chunksize );
        sleep( 120 );
    }
    else
        printf("OOPS\n");
    exit( 0 );
}
On Sat, Jan 05, 2002 at 04:08:33PM +0100, Stephan von Krawczynski wrote:
[...]
| I am pretty impressed by Martins test case where nearly all VM patches fail
| with the exception of his own :-) The thing is, this test is not of a
| "very special" nature but more like "a system driven to its limit by normal
| processes". And that is the really interesting part about it.
[...]
On Tue, Jan 08, 2002 at 04:00:11PM +0100, Daniel Phillips wrote:
> On January 8, 2002 02:33 pm, Anton Blanchard wrote:
> > Andrea Arcangeli [apparently] wrote:
> > > So yes, mean latency will decrease with preemptive kernel, but your CPU
> > > is definitely paying something for it.
> >
> > And Andrew Morton's work suggests he can do a much better job of
> > reducing latency than -preempt.
>
> That's not a particularly clueful comment, Anton. Obviously, any
> latency-busting hacks that Andrew does could also be patched into a
> -preempt kernel.
>
> What a preemptible kernel can do that a non-preemptible kernel can't is:
> reschedule exactly as often as necessary, instead of having lots of extra
> schedule points inserted all over the place, firing when *they* think the
> time is right, which may well be earlier than necessary.
"extra schedule points all over the place", that's the -preempt kernel
not the lowlatency kernel! (oh yeah, you don't see them in the source
but ask your CPU if it sees them)
> The preemptible approach is much less of a maintenance headache, since
> people don't have to be constantly doing audits to see if something changed,
> and going in to fiddle with scheduling points.
this yes, it requires less maintenance, but still you should keep in
mind the details about the spinlocks, things like the checks the VM does
in shrink_cache are needed also with preemptive kernel.
> Finally, with preemption, rescheduling can be forced with essentially zero
> latency in response to an arbitrary interrupt such as IO completion, whereas
> the non-preemptive kernel will have to 'coast to a stop'. In other words,
> the non-preemptive kernel will have little lags between successive IOs,
> whereas the preemptive kernel can submit the next IO immediately. So there
> are bound to be loads where the preemptive kernel turns in better latency
> *and throughput* than the scheduling point hack.
The I/O pipeline is big enough that a few msec earlier or later in a
submit_bh shouldn't make a difference, the batch logic in the
ll_rw_block layer also tries to reduce the reschedules, and last but not
least if the task is I/O bound a preemptive kernel or not won't make
any difference in the submit_bh latency because no task is eating cpu
and latency will be the one of a pure schedule call.
> Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> it's good to be aware of why that approach can't equal the latency-busting
> performance of the preemptive approach.
I also don't want to devalue the preemptive kernel approach (the mean
latency it can reach is lower than the one of the lowlat kernel, however
I personally care only about worst case latency and this is why I don't
feel the need of -preempt), but I just wanted to make clear that the
idea floating around that the preemptive kernel is all goodness is
very far from reality, you get very low mean latency but at a price.
Andrea
> > Andrew Morton's read-latency.patch is a clear winner for me, too.
>
> AFAIK Andrew's code simply adds schedule points around the kernel, right?
>
> If so, nope, I do not plan to integrate it.
Yep. It has the most wonderful effect on system latency without actually
breaking any semantics. Pre-empt is a trickier one because it does change
actual behaviour a lot more, although it should be preserving locking
rules.
Alan
On Tue, Jan 08, 2002 at 11:59:36AM -0200, Marcelo Tosatti wrote:
>
>
> On Tue, 8 Jan 2002, Dieter Nützel wrote:
>
> > Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
> > -rmap?
>
> -rmap is 2.5 stuff.
>
> I would really like to integrate -aa stuff as soon as I can understand
> _why_ Andrea is doing those changes.
>
> Note that people will _always_ complain about VM: It will always be
> possible to optimize it to some case and cause harm to other cases.
At the moment I've one showstopper thing to fix (an enterprise thing only,
though; normal boxes don't care about this), then I will try to split
out those bits for integration and they should become much more readable.
>
> I'm not saying that VM is perfect right now: It for sure has problems.
>
> > Andrew Morton's read-latency.patch is a clear winner for me, too.
>
> AFAIK Andrew's code simply adds schedule points around the kernel, right?
>
> If so, nope, I do not plan to integrate it.
Note that some of them are bugfixes; without them a luser can hang the
machine for several seconds on any box with a few gigabytes of RAM by
simply reading and writing into a large enough buffer. I think we never
had time to get those bits merged into mainline yet, and this is probably
the main reason they're not integrated, but it's something that should be
in mainline IMHO.
> > What about 00_nanosleep-5 and bootmem?
>
> What is 00_nanosleep-5 and bootmem ?
nanosleep gives usec resolution to the rest-of-time returned by
nanosleep; this avoids glibc userspace starvation when nanosleep is
interrupted by a flood of signals. It was requested by the glibc people.
I know nanosleep triggers LTP failures (both kindly reported by Paul
Larson and Randy Hron), but I really suspect a false positive in the LTP
testsuite. I didn't have time to check it yet, but I certainly tested the
nanosleep retval usec resolution with a testcase written by hand
(compared against gettimeofday output) after writing it long ago, and it
was apparently working fine; I've never had problems with it yet.
Andrea
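For readers who have not seen the glibc problem being referred to here, the following is a minimal userspace sketch (all names and values are illustrative, nothing is taken from the patch) of the usual restart idiom whose behaviour depends on how precisely the kernel reports the remaining time:

/* Illustrative only: the common pattern restarts nanosleep() with the
 * reported remainder after a signal.  If that remainder is coarse and
 * rounded up, a flood of signals can keep pushing the wakeup further
 * out; with usec resolution the loop converges quickly. */
#define _POSIX_C_SOURCE 199309L
#include <errno.h>
#include <stdio.h>
#include <time.h>

static void sleep_robustly(struct timespec req)
{
        struct timespec rem;

        while (nanosleep(&req, &rem) == -1 && errno == EINTR) {
                printf("interrupted, %ld.%09ld s left\n",
                       (long)rem.tv_sec, rem.tv_nsec);
                req = rem;      /* restart with the reported remainder */
        }
}

int main(void)
{
        struct timespec t = { 0, 50 * 1000 * 1000 };    /* ~50 ms */

        sleep_robustly(t);
        return 0;
}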
On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > The preemptible approach is much less of a maintainance headache, since
> > people don't have to be constantly doing audits to see if something changed,
> > and going in to fiddle with scheduling points.
>
> this yes, it requires less maintainance, but still you should keep in
> mind the details about the spinlocks, things like the checks the VM does
> in shrink_cache are needed also with preemptive kernel.
Yes of course, the spinlock regions still have to be analyzed and both
patches have to be maintained for that. Long duration spinlocks are bad
by any measure, and have to be dealt with anyway.
> > Finally, with preemption, rescheduling can be forced with essentially zero
> > latency in response to an arbitrary interrupt such as IO completion, whereas
> > the non-preemptive kernel will have to 'coast to a stop'. In other words,
> > the non-preemptive kernel will have little lags between successive IOs,
> > whereas the preemptive kernel can submit the next IO immediately. So there
> > are bound to be loads where the preemptive kernel turns in better latency
> > *and throughput* than the scheduling point hack.
>
> The I/O pipeline is big enough that a few msec before or later in a
> submit_bh shouldn't make a difference, the batch logic in the
> ll_rw_block layer also try to reduce the reschedule, and last but not
> the least if the task is I/O bound preemptive kernel or not won't make
> any difference in the submit_bh latency because no task is eating cpu
> and latency will be the one of pure schedule call.
That's not correct. For one thing, you don't know that no task is eating
CPU, or that nobody is hogging the kernel. Look at the above, and consider
the part about the little lags between IOs.
> > Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> > it's good to be aware of why that approach can't equal the latency-busting
> > performance of the preemptive approach.
>
> I also don't want to devaluate the preemptive kernel approch (the mean
> latency it can reach is lower than the one of the lowlat kernel, however
> I personally care only about worst case latency and this is why I don't
> feel the need of -preempt),
This is exactly the case that -preempt handles well. On the other hand,
trying to show that scheduling hacks satisfy any given latency bound is
equivalent to solving the halting problem.
I thought you had done some real time work?
> but I just wanted to make clear that the
> idea that is floating around that preemptive kernel is all goodness is
> very far from reality, you get very low mean latency but at a price.
A price lots of people are willing to pay.
By the way, have you measured the cost of -preempt in practice?
--
Daniel
From: Marcelo Tosatti <[email protected]>
Date: Tue, 8 Jan 2002 11:59:36 -0200 (BRST)
On Tue, 8 Jan 2002, Dieter Nützel wrote:
> Andrew Morton's read-latency.patch is a clear winner for me, too.
AFAIK Andrew's code simply adds schedule points around the kernel, right?
If so, nope, I do not plan to integrate it.
No, these are the block I/O scheduler changes he did, which up the
priority of reads wrt. writes so they do not sit around forever.
Franks a lot,
David S. Miller
[email protected]
On Tue, Jan 08, 2002 at 04:54:58PM +0100, Daniel Phillips wrote:
> On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > > The preemptible approach is much less of a maintainance headache, since
> > > people don't have to be constantly doing audits to see if something changed,
> > > and going in to fiddle with scheduling points.
> >
> > this yes, it requires less maintainance, but still you should keep in
> > mind the details about the spinlocks, things like the checks the VM does
> > in shrink_cache are needed also with preemptive kernel.
>
> Yes of course, the spinlock regions still have to be analyzed and both
> patches have to be maintained for that. Long duration spinlocks are bad
> by any measure, and have to be dealt with anyway.
>
> > > Finally, with preemption, rescheduling can be forced with essentially zero
> > > latency in response to an arbitrary interrupt such as IO completion, whereas
> > > the non-preemptive kernel will have to 'coast to a stop'. In other words,
> > > the non-preemptive kernel will have little lags between successive IOs,
> > > whereas the preemptive kernel can submit the next IO immediately. So there
> > > are bound to be loads where the preemptive kernel turns in better latency
> > > *and throughput* than the scheduling point hack.
> >
> > The I/O pipeline is big enough that a few msec before or later in a
> > submit_bh shouldn't make a difference, the batch logic in the
> > ll_rw_block layer also try to reduce the reschedule, and last but not
> > the least if the task is I/O bound preemptive kernel or not won't make
> > any difference in the submit_bh latency because no task is eating cpu
> > and latency will be the one of pure schedule call.
>
> That's not correct. For one thing, you don't know that no task is eating
> CPU, or that nobody is hogging the kernel. Look at the above, and consider
> the part about the little lags between IOs.
We agree. Actually, by "if the task is I/O bound" I meant "if nobody is
hogging the CPU"; I wanted to make exactly the example of no task hogging
the CPU in general.
> > > Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> > > it's good to be aware of why that approach can't equal the latency-busting
> > > performance of the preemptive approach.
> >
> > I also don't want to devaluate the preemptive kernel approch (the mean
> > latency it can reach is lower than the one of the lowlat kernel, however
> > I personally care only about worst case latency and this is why I don't
> > feel the need of -preempt),
>
> This is exactly the case that -preempt handles well. On the other hand,
> trying to show that scheduling hacks satisfy any given latency bound is
> equivalent to solving the halting problem.
>
> I thought you had done some real time work?
>
> > but I just wanted to make clear that the
> > idea that is floating around that preemptive kernel is all goodness is
> > very far from reality, you get very low mean latency but at a price.
>
> A price lots of people are willing to pay.
I'm not convinced that all those people know exactly what they're
buying then 8).
>
> By the way, have you measured the cost of -preempt in practice?
Dropping the lock in spin_unlock made a difference in the numbers, so
the overhead must definitely be visible; just load the system with
threaded kernel computation (webserving, etc.) to see it.
Andrea
On Tue, 8 Jan 2002, Andrea Arcangeli wrote:
> On Tue, Jan 08, 2002 at 11:55:59AM +0100, Luigi Genoni wrote:
> >
> >
> > On Tue, 8 Jan 2002, Dieter Nützel wrote (passim):
> >
> > > Is it possible to decide, now what should go into 2.4.18 (maybe -pre3) -aa or
> > > -rmap?
> > [...]
> > > Maybe preemption? It is disengageable so nobody should be harmed but we get
> > > the chance for wider testing.
> > >
> > > Any comments?
> > preemption?? this is eventually 2.5 stuff, and should not be integrated
>
> indeed ("eventually" in the italian sense btw, obvious to me, but not
> for l-k).
>
> I'm not against preemption (I can see the benefits about the mean
> latency for real time DSP) but the claims about preemption making the
> kernel faster doesn't make sense to me. more frequent scheduling,
> overhead of branches in the locks (you've to conditional_schedule after
> the last preemption lock is released and the cachelines for the per-cpu
> preemption locks) and the other preemption stuff can only make the
> kernel slower. Furthmore for multimedia playback any sane kernel out
> there with lowlatency fixes applied will work as well as a preemption
> kernel that pays for all the preemption overhead.
I would add that preemption simply gives a feeling of more speed with
interactive usage (with one single user on the system), and also has some
advantages for dedicated servers, but outside of those conditions it has
never shown itself, in my experience, to be a real and decisive advantage.
Of course we are supposing that the preemptive scheduler is very well
done, because otherwise (a badly working scheduler) there is nothing worse
than preemption.
>
> About the other claim that as the kernel becomes more granular
> performance will increase with preemption in kernel, that's obviously
> wrong as well, it's clearly the other way around. Maybe it was meant
> "latency will decrease further", that's right, but also performance will
> decrease if something.
>
> So yes, mean latency will decrease with preemptive kernel, but your CPU
> is definitely paying something for it.
Agreed. Obviously this choice depends on what you want to do with your
system. If you have more than a couple of interactive users (and here I
also have 50 interactive users at the same time on every single system),
preemption is not what you want, period. If you have a desktop system,
well, it is a different situation.
>
> > into 2.4 stable tree. Of course a backport is possible, when/if it will be
> > quite well tested and well working on 2.5
> >
> Andrea
Luigi
Daniel Phillips wrote:
>
> On January 8, 2002 02:33 pm, Anton Blanchard wrote:
> > Andrea Arcangeli [apparently] wrote:
> > > So yes, mean latency will decrease with preemptive kernel, but your CPU
> > > is definitely paying something for it.
> >
> > And Andrew Morton's work suggests he can do a much better job of
> > reducing latency than -preempt.
>
> That's not a particularly clueful comment, Anton. Obviously, any
> latency-busting hacks that Andrew does could also be patched into a
> -preempt kernel.
Yes. The important part is the implicit dropping of the BKL across
schedule().
> What a preemptible kernel can do that a non-preemptible kernel can't is:
> reschedule exactly as often as necessary, instead of having lots of extra
> schedule points inserted all over the place, firing when *they* think the
> time is right, which may well be earlier than necessary.
Nope. `if (current->need_resched)' -> the time is right (beyond right,
actually).
> The preemptible approach is much less of a maintainance headache, since
> people don't have to be constantly doing audits to see if something changed,
> and going in to fiddle with scheduling points.
Except it doesn't work. The full-on low-latency patch has ~60 rescheduling
points. Of these, ~40 involve popping spinlocks. Really, the only significant
latency sources which the preemptible kernel solves are generic_file_read()
and generic_file_write().
So the preemptible kernel needs lock-break to be useful. And then it's basically
the same thing, with the same maintainability problems. And believe me, these
are considerable. Mainly because the areas which need busting up exactly
coincide with the areas where there has been the most churn in the kernel.
> Finally, with preemption, rescheduling can be forced with essentially zero
> latency in response to an arbitrary interrupt such as IO completion, whereas
> the non-preemptive kernel will have to 'coast to a stop'. In other words,
> the non-preemptive kernel will have little lags between successive IOs,
> whereas the preemptive kernel can submit the next IO immediately. So there
> are bound to be loads where the preemptive kernel turns in better latency
> *and throughput* than the scheduling point hack.
Latency yes. Throughput no.
I don't think the "preempt slows down the kernel" argument is very valid
really. Let's invert the argument - Linux is multitasking, and that has a
cost. There's no reason why certain bits of the kernel need to violate that
just to get a bit more throughput. If it really worries you, set HZ=10 and
increase all the timeslices, etc.
Now, there *may* be overheads added due to losing the implicit locking which
per-CPU data gives you.
The main cost of preempt IMO is in complexity and stability risks.
(BTW: I took a weird oops testing the preempt patch on an SMP NFS client.
The fault address was 0x0aXXXXXX. No useful backtrace, unfortunately).
> Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> it's good to be aware of why that approach can't equal the latency-busting
> performance of the preemptive approach.
There's no point in just merging the preempt patch and saying "there,
that's done". It doesn't do anything.
Instead, a decision needs to be made: "Linux will henceforth be a
low-latency kernel". Now, IF we can come to this decision, then
internal preemption is the way to do it. But it affects ALL kernel
developers. Because we'll need to introduce a new rule: "it is a
bug to spend more than five milliseconds holding any locks".
So. Do we want a low-latency kernel? Are we prepared to mandate
the five-millisecond rule? It can be done, but won't be easy, and
we'll never get complete coverage. But I don't see the will around
here.
-
On January 8, 2002 08:47 pm, Andrew Morton wrote:
> There's no point in just merging the preempt patch and saying "there,
> that's done". It doesn't do anything.
>
> Instead, a decision needs to be made: "Linux will henceforth be a
> low-latency kernel".
I thought the intention was to make it a config option?
> Now, IF we can come to this decision, then
> internal preemption is the way to do it. But it affects ALL kernel
> developers. Because we'll need to introduce a new rule: "it is a
> bug to spend more than five milliseconds holding any locks".
>
> So. Do we we want a low-latency kernel? Are we prepared to mandate
> the five-millisecond rule? It can be done, but won't be easy, and
> we'll never get complete coverage. But I don't see the will around
> here.
At least the flaming has gotten a little less ;-)
--
Daniel
Marcelo Tosatti wrote:
>
> > Andrew Morton's read-latency.patch is a clear winner for me, too.
>
> AFAIK Andrew's code simply adds schedule points around the kernel, right?
>
> If so, nope, I do not plan to integrate it.
I haven't sent it to you yet :) It improves the kernel. That's
good, isn't it? (There are already forty or fifty open-coded
rescheduling points in the kernel. That patch just adds the
missing (and most important) ten).
BTW, with regard to the "preempt and low-lat improve disk throughput"
argument. I have occasionally seen small throughput improvements,
but I think these may be just request-merging flukes. Certainly
they were very small.
The one area where it sometimes makes a huuuuuge throughput
improvement is software RAID.
Much of the VM and dirty buffer writeout code assumes that
submit_bh() starts I/O. Guess what? RAID's submit_bh()
sometimes *doesn't* start I/O. Because the IO is started
by a different thread.
With the Riel VM I had a test case in which software RAID
completely and utterly collapsed because of this. The machine
was spending huge amounts of time spinning in page_launder(), madly
submitting I/O, but never yielding, so the I/O wasn't being started.
-aa VM has an open-coded yield in shrink_cache() which prevents
that particular collapse. But I had a report yesterday that
the mini-ll patch triples throughput on a complex RAID stack in
2.4.17. Same reason.
Arguably, this is a RAID problem - raidN_make_request() should
be yielding. But it's better to do this in one nice, single,
reviewable place - submit_bh(). However that won't prevent
wait_for_buffers() from starving the raid thread.
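For illustration, the "open-coded yield" referred to above boils down to the usual 2.4-era idiom sketched below (the helper name is generic; the -aa shrink_cache() check and the mini-ll scheduling points are variations on it). It may only be called where no spinlocks are held:

#include <linux/sched.h>

/* Generic conditional-yield idiom: if someone with higher priority is
 * waiting, give up the CPU here instead of running on until the next
 * return to user space.  Safe only when no spinlocks are held. */
static inline void conditional_schedule(void)
{
        if (current->need_resched) {
                current->state = TASK_RUNNING;
                schedule();
        }
}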
RAID is not alone. ksoftirqd, keventd and loop_thread() also
need reasonably good response times.
But given the number of people who have been providing feedback
on this patch, and on the disk-read-latency patch, none of this
is going anywhere, and mine will be the only Linux machines which
don't suck. (Takes ball, goes home).
-
Andrea Arcangeli wrote:
>
> ...
> > What is 00_nanosleep-5 and bootmem ?
>
> nanosleep gives usec resolution to the rest-of-time returned by
> nanosleep, this avoids glibc userspace starvation on nanosleep
> interrupted by a flood of signals. It was requested by glibc people.
>
It would be really nice to do something about that two millisecond
busywait for RT tasks in sys_nanosleep(). It's really foul.
The rudest thing about it is that programmers think "oh, let's
be nice to the machine and use usleep(1000)". Boy, do they
get a shock when they switch their pthread app to use an
RT policy.
Does anyone have any clever ideas?
-
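The busy-wait being complained about here is a special case in sys_nanosleep() for short sleeps requested by real-time tasks; sketched from the behaviour described above (not quoted verbatim from kernel/sched.c), it looks roughly like:

        /* Short sleeps (<= 2 ms) from real-time tasks are "served" by
         * spinning in udelay() rather than by scheduling a wakeup, so a
         * SCHED_FIFO thread calling usleep(1000) burns a full CPU. */
        if (t.tv_sec == 0 && t.tv_nsec <= 2000000L &&
            current->policy != SCHED_OTHER) {
                udelay((t.tv_nsec + 999) / 1000);
                return 0;
        }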
> low-latency kernel". Now, IF we can come to this decision, then
> internal preemption is the way to do it. But it affects ALL kernel
The pre-empt patches just make things much much harder to debug. They
remove some of the predictability, and the normal call-chain following
goes out of the window because you end up seeing crashes in a thread with
no idea what ran the microsecond before.
Some of that happens now, but this makes it vastly worse.
The low latency patches don't change the basic predictability and
debuggability but allow you to hit a 1mS pre-empt target for the general
case.
Alan
On Tue, 2002-01-08 at 10:29, Andrea Arcangeli wrote:
> "extra schedule points all over the place", that's the -preempt kernel
> not the lowlatency kernel! (on yeah, you don't see them in the source
> but ask your CPU if it sees them)
How so? The branch on drop of the last lock? It's not a factor in
profiles I've seen. And it is marked unlikely. The other change is the
usual check for reschedule on return from interrupt, but that is the
case already, we just allow it when in-kernel, too.
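To make the overhead being argued about concrete: in the preempt-kernel design, spin_lock maps onto a per-task counter increment and spin_unlock onto a decrement plus one branch, roughly as sketched below (names approximate the patch; this is not its actual text):

/* Sketch of the preempt-kernel locking hooks (approximate, not the
 * patch itself): taking any spinlock disables preemption by bumping a
 * per-task counter; releasing the last lock re-enables it and, if a
 * reschedule became pending meanwhile, preempts right there. */
#define preempt_disable()                                       \
        do { current->preempt_count++; barrier(); } while (0)

#define preempt_enable()                                        \
        do {                                                    \
                barrier();                                      \
                if (--current->preempt_count == 0 &&            \
                    current->need_resched)                      \
                        preempt_schedule();                     \
        } while (0)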
This makes me think the end conclusion would be that preemptive
multitasking in general is bad. Why don't we increase the timeslice and
tick period, in that case?
One can argue the complexity degrades performance, but tests show
otherwise, in both throughput and latency. Besides, like I always say, it's
an option that uses existing kernel (SMP lock) infrastructure. You
don't have to use it ;)
> > The preemptible approach is much less of a maintainance headache, since
> > people don't have to be constantly doing audits to see if something changed,
> > and going in to fiddle with scheduling points.
>
> this yes, it requires less maintainance, but still you should keep in
> mind the details about the spinlocks, things like the checks the VM does
> in shrink_cache are needed also with preemptive kernel.
They impact SMP in the same way they impact preempt-kernel. Long-held
locks are never good. Weird locking rules are never good.
> The I/O pipeline is big enough that a few msec before or later in a
> submit_bh shouldn't make a difference, the batch logic in the
> ll_rw_block layer also try to reduce the reschedule, and last but not
> the least if the task is I/O bound preemptive kernel or not won't make
> any difference in the submit_bh latency because no task is eating cpu
> and latency will be the one of pure schedule call.
Yet throughput tests show a marked increase. I believe this is true with
Andrew's patch, too. We multitask better. We dispatch waiting tasks
faster. Without the patch, a queued I/O task may stall for some time
waiting for some hog to get out of the kernel.
Although the patch's goal is to improve interactivity (*), no one says
the higher-priority task we preempt in favor of has to be CPU-bound. A
woken-up I/O-bound task benefits from the patch just as much.
(*) this is, IMO, where we benefit most though. By far the most
pleasing benchmark isn't to see some x% decrease in latency or y more
MB/s in bonnie, but to feel the improvement in interactivity. On a
multitasking desktop, it is noticeable.
> I also don't want to devaluate the preemptive kernel approch (the mean
> latency it can reach is lower than the one of the lowlat kernel, however
> I personally care only about worst case latency and this is why I don't
> feel the need of -preempt), but I just wanted to make clear that the
> idea that is floating around that preemptive kernel is all goodness is
> very far from reality, you get very low mean latency but at a price.
Andrea, I don't want you or anyone to believe preemption is a free
ride. On the other hand, the patch has a _huge_ userbase and you can't
question that. You also can't question the benchmarks that show
improvements in average _and_ worst case latency _and_ throughput.
I don't expect you to use the patch. If it were merged, it would be an
option. It provides a framework for continuing to improve latency. It
is a solution to the problem (i.e. latency is poor because the kernel is
non-preemptible) instead of a hack. I agree worst-case latency is
important, and I agree the patch shines more in the average case. But we
do affect the worst case. And now a framework exists for working on fixing
the worst-case latencies too. And in the end, it's just an option for
some, but a better kernel for others.
Robert Love
On January 8, 2002 08:47 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > reschedule exactly as often as necessary, instead of having lots of extra
> > schedule points inserted all over the place, firing when *they* think the
> > time is right, which may well be earlier than necessary.
>
> Nope. `if (current->need_resched)' -> the time is right (beyond right,
> actually).
Oops, sorry, right.
The preemptible kernel can reschedule, on average, sooner than the
scheduling-point kernel, which has to wait for a scheduling point to roll
around.
And while I'm enumerating differences, the preemptable kernel (in this
incarnation) has a slight per-spinlock cost, while the non-preemptable kernel
has the fixed cost of checking for rescheduling, at intervals throughout all
'interesting' kernel code, essentially all long-running loops. But by clever
coding it's possible to finesse away almost all the overhead of those loop
checks, so in the end, the non-preemptible low-latency patch has a slight
efficiency advantage here, with emphasis on 'slight'.
--
Daniel
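One way to "finesse away" the loop-check overhead mentioned just above (purely illustrative; process_one() and nr are placeholders, not from any patch) is to poll need_resched only every few hundred iterations, so the check costs almost nothing on the common path:

        /* Illustrative only: test need_resched once every 256 iterations
         * of a long-running loop rather than on every pass. */
        for (i = 0; i < nr; i++) {
                process_one(i);
                if (!(i & 255) && current->need_resched)
                        schedule();
        }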
On Tue, 2002-01-08 at 14:47, Andrew Morton wrote:
> > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > reschedule exactly as often as necessary, instead of having lots of extra
> > schedule points inserted all over the place, firing when *they* think the
> > time is right, which may well be earlier than necessary.
>
> Nope. `if (current->need_resched)' -> the time is right (beyond right,
> actually).
Eh, I disagree here. The right time is the moment a high-priority task
becomes runnable. Given your HZ, only a fully preemptible kernel can
come close to meeting that.
> > Finally, with preemption, rescheduling can be forced with essentially zero
> > latency in response to an arbitrary interrupt such as IO completion, whereas
> > the non-preemptive kernel will have to 'coast to a stop'. In other words,
> > the non-preemptive kernel will have little lags between successive IOs,
> > whereas the preemptive kernel can submit the next IO immediately. So there
> > are bound to be loads where the preemptive kernel turns in better latency
> > *and throughput* than the scheduling point hack.
>
> Latency yes. Throughput no.
I bet in _many_ (most?) workloads the preemptible kernel turns in better
throughput. Anytime there is load on the system, there should be a
benefit. I bet the same goes for your patch. I've certainly verified
it for both in various loads myself.
> I don't think the "preempt slows down the kernel" argument is very valid
> really. Let's invert the argument - Linux is multitasking, and that has a
> cost. There's no reason why certain bits of the kernel need to violate that
> just to get a bit more throughput. If it really worries you, set HZ=10 and
> increase all the timeslices, etc.
Very well said. I always find the "more complexity, more
context switching, blah blah" arguments to be ultimately
tantamount to "preemptive multitasking sucks".
> Now, there *may* be overheads added due to losing the implicit locking which
> per-CPU data gives you.
Perhaps, but note what preempt enable and disable statements effectively
are: an inc and a dec. Not even atomic.
Yes, there is a branch on reenable. This may be an interesting change
to look into. FWIW, we have a construct that doesn't check for
reschedule on reenable, too.
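The construct mentioned here is a "no-resched" variant of the re-enable path; sketched in the same style as the earlier macros (the exact spelling in the 2.4 patch may differ):

/* Re-enable preemption without immediately checking for a pending
 * reschedule, for paths that are about to schedule() anyway or that
 * must not preempt right here (sketch; name as in later kernels). */
#define preempt_enable_no_resched()                             \
        do { barrier(); current->preempt_count--; } while (0)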
> The main cost of preempt IMO is in complexity and stability risks.
>
> (BTW: I took a weird oops testing the preempt patch on an SMP NFS client.
> The fault address was 0x0aXXXXXX. No useful backtrace, unfortunately).
Should have sent me the oops :)
> > Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> > it's good to be aware of why that approach can't equal the latency-busting
> > performance of the preemptive approach.
>
> There's no point in just merging the preempt patch and saying "there,
> that's done". It doesn't do anything.
>
> Instead, a decision needs to be made: "Linux will henceforth be a
> low-latency kernel". Now, IF we can come to this decision, then
> internal preemption is the way to do it. But it affects ALL kernel
> developers. Because we'll need to introduce a new rule: "it is a
> bug to spend more than five milliseconds holding any locks".
>
> So. Do we we want a low-latency kernel? Are we prepared to mandate
> the five-millisecond rule? It can be done, but won't be easy, and
> we'll never get complete coverage. But I don't see the will around
> here.
I agree here, but then I do have three points:
- proper lock use that benefits SMP scalability benefits preempt-kernel
induced latency improvements. In other words, things like short lock
durations, fine-grained locking, and ditching the BKL benefit both
worlds.
- with the preemptible kernel in place, we can look at the long-held
locks and figure ways to combat them. The preempt-stats patch helps us
find them. Then, we can take a lock-break approach. We can look into
finer-grained locks. We can localize the lock if it is global or
the BKL. Or we can do something radical like make long-held spinlocks
priority-inheriting when preempt-kernel is enabled. In other words,
preempt-kernel becomes a framework for proper solutions in the future.
- finally, the usual: it's an option.
Robert Love
On Tue, 8 Jan 2002, Daniel Phillips wrote:
> The preemptible kernel can reschedule, on average, sooner than the
> scheduling-point kernel, which has to wait for a scheduling point to
> roll around.
The preemptible kernel ALSO has to wait for a scheduling point
to roll around, since it cannot preempt with spinlocks held.
Considering this, I don't see much of an advantage to adding
kernel preemption.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, 2002-01-08 at 16:08, Rik van Riel wrote:
> The preemptible kernel ALSO has to wait for a scheduling point
> to roll around, since it cannot preempt with spinlocks held.
>
> Considering this, I don't see much of an advantage to adding
> kernel preemption.
It only has to wait if locks are held and then only until the locks are
dropped. Otherwise it will preempt on the next return from interrupt.
Future work would be to look into long-held locks and see what we can
do.
Without preempt-kernel, we have none of this: either run until
completion or explicit scheduling points.
Robert Love
On Tue, 2002-01-08 at 15:18, Daniel Phillips wrote:
> > Instead, a decision needs to be made: "Linux will henceforth be a
> > low-latency kernel".
>
> I thought the intention was to make it a config option?
It was originally, it is now, and I intend it to be.
Further, since it uses the existing SMP locks, it doesn't introduce new
design decisions (the only new one being protection of implicitly locked
per-CPU data on preempt).
Robert Love
On Tue, 2002-01-08 at 15:59, Daniel Phillips wrote:
> And while I'm enumerating differences, the preemptable kernel (in this
> incarnation) has a slight per-spinlock cost, while the non-preemptable kernel
> has the fixed cost of checking for rescheduling, at intervals throughout all
> 'interesting' kernel code, essentially all long-running loops. But by clever
> coding it's possible to finesse away almost all the overhead of those loop
> checks, so in the end, the non-preemptible low-latency patch has a slight
> efficiency advantage here, with emphasis on 'slight'.
True (re spinlock weight in preemptible kernel) but how is that not
comparable to explicit scheduling points? Worse, the preempt-kernel
typically does its preemption on a branch on return from interrupt
(similar to user space's preemption). What better time to check and
reschedule if needed?
Robert Love
Daniel Phillips wrote:
>
> On January 8, 2002 08:47 pm, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > > reschedule exactly as often as necessary, instead of having lots of extra
> > > schedule points inserted all over the place, firing when *they* think the
> > > time is right, which may well be earlier than necessary.
> >
> > Nope. `if (current->need_resched)' -> the time is right (beyond right,
> > actually).
>
> Oops, sorry, right.
>
> The preemptible kernel can reschedule, on average, sooner than the
> scheduling-point kernel, which has to wait for a scheduling point to roll
> around.
That's theory. Practice (ie: instrumentation) says that the preempt
patch makes little improvement over conditional yields in generic_file_read()
and generic_file_write(). Four lines. Additional yields in wait_for_buffers()
(where we move zillions of buffers from BUF_LOCKED to BUF_CLEAN) and in submit_bh()
and bread() are cream.
Preemptability is global in its impact, and in its effect. It requires
global changes to make it useful. If we're prepared to make those
changes (scan_swap_map, truncate_inode_pages, etc) then fine. Go for
it. We'll end up with a better kernel.
> And while I'm enumerating differences, the preemptable kernel (in this
> incarnation) has a slight per-spinlock cost, while the non-preemptable kernel
> has the fixed cost of checking for rescheduling, at intervals throughout all
> 'interesting' kernel code, essentially all long-running loops. But by clever
> coding it's possible to finesse away almost all the overhead of those loop
> checks, so in the end, the non-preemptible low-latency patch has a slight
> efficiency advantage here, with emphasis on 'slight'.
>
As I said: I don't buy the efficiency worries at all. If scheduling pressure
is so high that either approach impacts performance, then scheduling pressure
is too high. We need to fix the context switch rate and/or speed up context
switches. The overhead of conditional_schedule() or preempt will be zilch.
-
On 8 Jan 2002, Robert Love wrote:
> On Tue, 2002-01-08 at 16:08, Rik van Riel wrote:
>
> > The preemptible kernel ALSO has to wait for a scheduling point
> > to roll around, since it cannot preempt with spinlocks held.
> >
> > Considering this, I don't see much of an advantage to adding
> > kernel preemption.
>
> It only has to wait if locks are held and then only until the locks are
> dropped. Otherwise it will preempt on the next return from interrupt.
So what exactly _is_ the difference between an explicit
preemption point and a place where we need to explicitly
drop a spinlock ?
From what I can see, there really isn't a difference.
> Future work would be to look into long-held locks and see what we can
> do.
One thing we could do is download Andrew Morton's patch ;)
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Tue, 2002-01-08 at 16:24, Rik van Riel wrote:
> So what exactly _is_ the difference between an explicit
> preemption point and a place where we need to explicitly
> drop a spinlock ?
In that case, nothing, except that when we drop the lock and check, it is
the earliest place where preemption is allowed. In the normal scenario,
however, we have a check for reschedule on return from interrupt (e.g.
the timer) and thus preempt in the same manner as for user space, and
that is the key.
> > Future work would be to look into long-held locks and see what we can
> > do.
>
> One thing we could do is download Andrew Morton's patch ;)
That is certainly one option, and Andrew's patch is very good.
Nonetheless, I think we need a more general framework that tackles the
problem itself. Preemptible kernel does this, yields results now, and
allows for greater return later on.
Robert Love
> And while I'm enumerating differences, the preemptable kernel
> (in this
> incarnation) has a slight per-spinlock cost, while the
> non-preemptable kernel
> has the fixed cost of checking for rescheduling, at intervals
> throughout all
> 'interesting' kernel code, essentially all long-running
> loops.
For the general case, that cost is offset by the improvement in scheduling,
by filling the I/O channels better and thus using most resources more
efficiently. I did some dirty tests that showed that the preemptible kernel
performed more or less one second better than the normal one when unzipping
and compiling a kernel [a dirty general case]. The std deviation is around
the time difference, so we can pretty much conclude the impact is zero --
aside from the improvement in responsiveness.
Please see my message to the mailing list at
http://www.geocrawler.com/archives/3/14905/2001/11/0/7074067/ [the excel
spreadsheet is available at request].
Iñaky Pérez González -- (503) 677 6807
I do not speak for Intel Corp, opinions are my own.
On January 8, 2002 10:08 pm, Rik van Riel wrote:
> On Tue, 8 Jan 2002, Daniel Phillips wrote:
>
> > The preemptible kernel can reschedule, on average, sooner than the
> > scheduling-point kernel, which has to wait for a scheduling point to
> > roll around.
>
> The preemptible kernel ALSO has to wait for a scheduling point
> to roll around, since it cannot preempt with spinlocks held.
Think about the relative amount of time spent inside spinlocks vs regular
kernel.
> Considering this, I don't see much of an advantage to adding
> kernel preemption.
And now?
--
Daniel
On January 8, 2002 10:17 pm, Robert Love wrote:
> On Tue, 2002-01-08 at 15:59, Daniel Phillips wrote:
>
> > And while I'm enumerating differences, the preemptable kernel (in this
> > incarnation) has a slight per-spinlock cost, while the non-preemptable kernel
> > has the fixed cost of checking for rescheduling, at intervals throughout all
> > 'interesting' kernel code, essentially all long-running loops. But by clever
> > coding it's possible to finesse away almost all the overhead of those loop
> > checks, so in the end, the non-preemptible low-latency patch has a slight
> > efficiency advantage here, with emphasis on 'slight'.
>
> True (re spinlock weight in preemptible kernel) but how is that not
> comparable to explicit scheduling points? Worse, the preempt-kernel
> typically does its preemption on a branch on return to interrupt
> (similar to user space's preemption). What better time to check and
> reschedule if needed?
The per-spinlock cost I was referring to is the cost of the inc/dec per
spinlock. I guess this cost is small enough to be hard to measure, but
I have not tried, so I don't know. Curiously, none of the people I've heard
making pronouncements on the overhead of your preempt patch seem to have
measured it either.
--
Daniel
On Tue, 2002-01-08 at 16:57, Daniel Phillips wrote:
> > True (re spinlock weight in preemptible kernel) but how is that not
> > comparable to explicit scheduling points? Worse, the preempt-kernel
> > typically does its preemption on a branch on return to interrupt
> > (similar to user space's preemption). What better time to check and
> > reschedule if needed?
>
> The per-spinlock cost I was refering to is the cost of the inc/dec per
> spinlock. I guess this cost is small enough as to be hard to measure, but
> I have not tried so I don't know. Curiously, none of the people I've heard
> making pronouncements on the overhead of your preempt patch seem to have
> measured it either.
;-)
If they did, I suspect it would be minimal. Andrew's point on complexity
and overhead in this manner is exact -- such things are just not an
issue.
I see two valid arguments against kernel preemption, and I'll be the
first to admit them:
- we introduce new problems with kernel programming. specifically, the
issue with implicitly locked per-CPU data. honestly, this isn't a huge
deal. I've been working on preempt-kernel for a while now and the
problems we have found and fixed are minimal. admittedly, however,
especially wrt the future, preempt-kernel may introduce new concerns. I
say let's rise to meet them.
- we don't do enough for the worst-case latency. this is where future
work is useful and where preempt-kernel provides the framework for a
better kernel.
I want a better kernel. Hell, I want the best kernel. In my opinion,
one factor of that is having a preemptible kernel.
Robert Love
On Tuesday, 8 January 2002 21:13, Alan Cox wrote:
> > low-latency kernel". Now, IF we can come to this decision, then
> > internal preemption is the way to do it. But it affects ALL kernel
>
> The pre-empt patches just make things much much harder to debug. They
> remove some of the predictability and the normal call chain following
> goes out of the window because you end up seeing crashes in a thread with
> no idea what ran the microsecond before
>
> Some of that happens now but this makes it vastly worse.
>
> The low latency patches don't change the basic predictability and
> debuggability but allow you to hit a 1mS pre-empt target for the general
> case.
>
Yes, it does make things much much harder to debug - but:
* If you get a problem on a preemptive UP kernel, it is likely to be a problem
on SMP too - and those are hard to debug as well. But the positive aspect
is that you get more people who can do the debugging... :-)
(One CPU gets delayed handling an IRQ while the other runs into the critical
section)
* It is optional at compile time.
And it could even be made runtime optional per CPU! Just set a big enough
value on the counter and it will never reschedule...
/RogerL
--
Roger Larsson
Skellefteå
Sweden
Daniel Phillips wrote:
>
> On January 8, 2002 08:47 pm, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > > reschedule exactly as often as necessary, instead of having lots of extra
> > > schedule points inserted all over the place, firing when *they* think the
> > > time is right, which may well be earlier than necessary.
> >
> > Nope. `if (current->need_resched)' -> the time is right (beyond right,
> > actually).
>
> Oops, sorry, right.
>
> The preemptible kernel can reschedule, on average, sooner than the
> scheduling-point kernel, which has to wait for a scheduling point to roll
> around.
>
Yes. It can also fix problematic areas which my testing
didn't cover.
Incidentally, there's the SMP problem. Suppose we
have the code:
        lock_kernel();
        for (lots) {
                do(something sucky);
                if (current->need_resched)
                        schedule();
        }
        unlock_kernel();
This works fine on UP, but not on SMP. The scenario:
- CPU A runs this loop.
- CPU B is spinning on the lock.
- Interrupt occurs, kernel elects to run RT task on CPU B.
CPU A doesn't have need_resched set, and just keeps
on going. CPU B is stuck spinning on the lock.
This is only an issue for the low-latency patch - all the
other approaches still have a sufficiently bad worst case that
this scenario isn't worth worrying about.
I toyed with creating spin_lock_while_polling_resched(),
but ended up changing the scheduler to set need_resched
against _all_ CPUs if an RT task is being woken (yes, yuk).
-
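A sketch of the scheduler workaround described just above (illustrative only, not the actual change; cpu_curr() stands for the scheduler's internal "task currently running on CPU i" accessor and smp_num_cpus for the online CPU count):

/* When waking a real-time task, mark need_resched on every CPU so that
 * a CPU polling need_resched inside a BKL-protected loop notices and
 * reaches a scheduling point promptly (sketch of the idea only). */
static void kick_cpus_for_rt_wakeup(struct task_struct *p)
{
        int i;

        if (p->policy == SCHED_OTHER)
                return;

        for (i = 0; i < smp_num_cpus; i++)
                cpu_curr(i)->need_resched = 1;
}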
Robert Love wrote:
>
> On Tue, 2002-01-08 at 16:24, Rik van Riel wrote:
>
> > So what exactly _is_ the difference between an explicit
> > preemption point and a place where we need to explicitly
> > drop a spinlock ?
>
> In that case nothing, except that when we drop the lock and check it is
> the earliest place where preemption is allowed. In the normal scenario,
> however, we have a check for reschedule on return from interrupt (e.g.
> the timer) and thus preempt in the same manner as with user space and
> that is the key.
One could do:
static inline void spin_unlock(spinlock_t *lock)
{
        __asm__ __volatile__(
                spin_unlock_string
        );
        if (--current->lock_depth == 0 &&
            current->need_resched &&
            current->state == TASK_RUNNING)
                schedule();
}
But I have generally avoided "global" solutions like this, in favour
of nailing the _specific_ code which is causing the problem. Which
is a lot more work, but more useful.
The scheduling points in bread() and submit_bh() in the mini-ll patch
go against this (masochistic) philosophy.
> > > Future work would be to look into long-held locks and see what we can
> > > do.
> >
> > One thing we could do is download Andrew Morton's patch ;)
>
> That is certainly one option, and Andrew's patch is very good.
> Nonetheless, I think we need a more general framework that tackles the
> problem itself. Preemptible kernel does this, yields results now, and
> allows for greater return later on.
We need something which makes 2.4.x not suck.
-
On Tue, 8 Jan 2002, Daniel Phillips wrote:
> On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > > The preemptible approach is much less of a maintainance headache, since
> > > people don't have to be constantly doing audits to see if something changed,
> > > and going in to fiddle with scheduling points.
> >
> > this yes, it requires less maintainance, but still you should keep in
> > mind the details about the spinlocks, things like the checks the VM does
> > in shrink_cache are needed also with preemptive kernel.
>
> Yes of course, the spinlock regions still have to be analyzed and both
> patches have to be maintained for that. Long duration spinlocks are bad
> by any measure, and have to be dealt with anyway.
>
> > > Finally, with preemption, rescheduling can be forced with essentially zero
> > > latency in response to an arbitrary interrupt such as IO completion, whereas
> > > the non-preemptive kernel will have to 'coast to a stop'. In other words,
> > > the non-preemptive kernel will have little lags between successive IOs,
> > > whereas the preemptive kernel can submit the next IO immediately. So there
> > > are bound to be loads where the preemptive kernel turns in better latency
> > > *and throughput* than the scheduling point hack.
> >
> > The I/O pipeline is big enough that a few msec before or later in a
> > submit_bh shouldn't make a difference, the batch logic in the
> > ll_rw_block layer also try to reduce the reschedule, and last but not
> > the least if the task is I/O bound preemptive kernel or not won't make
> > any difference in the submit_bh latency because no task is eating cpu
> > and latency will be the one of pure schedule call.
>
> That's not correct. For one thing, you don't know that no task is eating
> CPU, or that nobody is hogging the kernel. Look at the above, and consider
> the part about the little lags between IOs.
>
> > > Mind you, I'm not devaluing Andrew's work, it's good and valuable. However
> > > it's good to be aware of why that approach can't equal the latency-busting
> > > performance of the preemptive approach.
> >
> > I also don't want to devaluate the preemptive kernel approch (the mean
> > latency it can reach is lower than the one of the lowlat kernel, however
> > I personally care only about worst case latency and this is why I don't
> > feel the need of -preempt),
>
> This is exactly the case that -preempt handles well. On the other hand,
> trying to show that scheduling hacks satisfy any given latency bound is
> equivalent to solving the halting problem.
>
> I thought you had done some real time work?
>
> > but I just wanted to make clear that the
> > idea that is floating around that preemptive kernel is all goodness is
> > very far from reality, you get very low mean latency but at a price.
>
> A price lots of people are willing to pay
Probably sometimes they are not getting a good deal. In reality
preempt is good in many scenarios, as I said, and I agree that for
desktops, and for dedicated servers where just one application runs and
the CPU is probably idle most of the time, users indeed get a feeling of
speed. But please consider heavily loaded servers, with 40 and more
users: some are running gcc, others g77, others g++ compilations, someone
runs pine or mutt or kmail, and netscape, and mozilla, and emacs (some
from an xterm, KDE or GNOME), and so on...
You can also have 4/8 CPUs, but they are not infinite ;) (I am talking
mainly of dual-Athlon systems), and there is a lot of memory and disk I/O.
This is not a strange scenario on the interactive servers used at SNS.
Here preempt has too high a price.
>
> By the way, have you measured the cost of -preempt in practice?
>
Yes, I did a lot of tests, and with the current preempt patch I was
definitely seeing too big a performance loss.
On Tue, 8 Jan 2002, Daniel Phillips wrote:
> On January 8, 2002 08:47 pm, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > > reschedule exactly as often as necessary, instead of having lots of extra
> > > schedule points inserted all over the place, firing when *they* think the
> > > time is right, which may well be earlier than necessary.
> >
> > Nope. `if (current->need_resched)' -> the time is right (beyond right,
> > actually).
>
> Oops, sorry, right.
>
> The preemptible kernel can reschedule, on average, sooner than the
> scheduling-point kernel, which has to wait for a scheduling point to roll
> around.
>
Mmhhh. At what cost? And then anyway, if I hold a spinlock, I still have
to wait for a scheduling point to roll around.
On Wed, Jan 09, 2002 at 12:02:48AM +0100, Luigi Genoni wrote:
| Probably sometimes they are not making a good business. In the reality
| preempt is good in many scenarios, as I said, and I agree that for
| desktops, and dedicated servers where just one application runs, and
| probably the CPU is idle the most of the time, indeed users have a speed
| feeling. Please consider that on eavilly loaded servers, with 40 and more
| users, some are running gcc, others g77, others g++ compilations, someone
| runs pine or mutt or kmail, and netscape, and mozilla, and emacs (someone
| form xterm kde or gnome), and and
| and... You can have also 4/8 CPU butthey are not infinite ;) (but I talk
| mainly thinking of dualAthlon systems).
| there is a lot of memory and disk I/O.
| This is not a strange scenary on the interactive servers used at SNS.
| Here preempt has a too high price
MacOS 9 is the OS for you.
Essentially what the low-latency patches are is cooperative
multitasking. Which has less overhead in some cases than preemptive as
long as everyone is equally nice and calls WaitNextEvent() within the
right inner loops. In the absence of preemptive, Andrew's patch is the
next best thing. But Bad Things happen without preemptive. Just try
using Mac OS 9. ;)
Preemptive gives better interactivity under load, which is the whole
point of multitasking (think about it). If you don't want the overhead
(which also exists without preemptive) run #processes == #processors.
Whether or not preemptive is applied, having a large number of processes
active is a performance hit from context switches, cache thrashing, etc.
Preemptive punishes (and rewards) everyone equally, thus better latency.
I'm really surprised that people are still actually arguing _against_
preemptive multitasking in this day and age. This is a no-brainer in
the long run, where current corner cases aren't holding us back.
At least IMVHO.
--
Ken.
[email protected]
| > By the way, have you measured the cost of -preempt in practice?
| >
| Yes, I did a lot of tests, and with current preempt patch definitelly
| I was seeing a too big performance loss.
On Tue, 2002-01-08 at 18:32, Ken Brownfield wrote:
> On Wed, Jan 09, 2002 at 12:02:48AM +0100, Luigi Genoni wrote:
> | Probably sometimes they are not making a good business. In the reality
> | preempt is good in many scenarios, as I said, and I agree that for
> | desktops, and dedicated servers where just one application runs, and
> | probably the CPU is idle the most of the time, indeed users have a speed
> | feeling. Please consider that on eavilly loaded servers, with 40 and more
> | users, some are running gcc, others g77, others g++ compilations, someone
> | runs pine or mutt or kmail, and netscape, and mozilla, and emacs (someone
> | form xterm kde or gnome), and and
> | and... You can have also 4/8 CPU butthey are not infinite ;) (but I talk
> | mainly thinking of dualAthlon systems).
> | there is a lot of memory and disk I/O.
> | This is not a strange scenary on the interactive servers used at SNS.
> | Here preempt has a too high price
>
> MacOS 9 is the OS for you.
>
> Essentially what the low-latency patches are is cooperative
> multitasking. Which has less overhead in some cases than preemptive as
> long as everyone is equally nice and calls WaitNextEvent() within the
> right inner loops. In the absence of preemptive, Andrew's patch is the
> next best thing. But Bad Things happen without preemptive. Just try
> using Mac OS 9. ;)
>
> Preemptive gives better interactivity under load, which is the whole
> point of multitasking (think about it). If you don't want the overhead
> (which also exists without preemptive) run #processes == #processors.
>
> Whether or not preemptive is applied, having a large number of processes
> active is a performance hit from context switches, cache thrashing, etc.
> Preemptive punishes (and rewards) everyone equally, thus better latency.
>
> I'm really surprised that people are still actually arguing _against_
> preemptive multitasking in this day and age. This is a no-brainer in
> the long run, where current corner cases aren't holding us back.
Amen.
Of course, the counter-argument is that we, as kernel programmers, can
design everything to cooperate nicely. I reply "now that the
kernel is SMP safe it is trivial to become preemptible", but some still
don't take the patch. I'll keep trucking along.
Robert Love
On Tue, 8 Jan 2002, Ken Brownfield wrote:
> On Wed, Jan 09, 2002 at 12:02:48AM +0100, Luigi Genoni wrote:
> | Probably sometimes they are not making a good business. In the reality
> | preempt is good in many scenarios, as I said, and I agree that for
> | desktops, and dedicated servers where just one application runs, and
> | probably the CPU is idle the most of the time, indeed users have a speed
> | feeling. Please consider that on eavilly loaded servers, with 40 and more
> | users, some are running gcc, others g77, others g++ compilations, someone
> | runs pine or mutt or kmail, and netscape, and mozilla, and emacs (someone
> | form xterm kde or gnome), and and
> | and... You can have also 4/8 CPU butthey are not infinite ;) (but I talk
> | mainly thinking of dualAthlon systems).
> | there is a lot of memory and disk I/O.
> | This is not a strange scenary on the interactive servers used at SNS.
> | Here preempt has a too high price
>
> MacOS 9 is the OS for you.
>
> Essentially what the low-latency patches are is cooperative
> multitasking. Which has less overhead in some cases than preemptive as
> long as everyone is equally nice and calls WaitNextEvent() within the
> right inner loops. In the absence of preemptive, Andrew's patch is the
> next best thing. But Bad Things happen without preemptive. Just try
> using Mac OS 9 :)
Not exactly what I was thinking about.
I have heard some horror stories from the Mac sysadmins at SNS.
>
> Preemptive gives better interactivity under load, which is the whole
> point of multitasking (think about it). If you don't want the overhead
> (which also exists without preemptive) run #processes == #processors.
>
> Whether or not preemptive is applied, having a large number of processes
> active is a performance hit from context switches, cache thrashing, etc.
> Preemptive punishes (and rewards) everyone equally, thus better latency.
You are supposing that I want them to be punished equally. But there are
cases when that is not what you want ;). Think of one user running a
Monte Carlo code for a test on the server I was describing. This job could
run, let's say, a couple of hours, and even under nice 20 it can suck a
lot.
>
> I'm really surprised that people are still actually arguing _against_
> preemptive multitasking in this day and age. This is a no-brainer in
> the long run, where current corner cases aren't holding us back.
>
> At least IMVHO.
What I am talking about is some tests I did some weeks ago. The initial post
of this thread, I think, was very clear about that. In the long run, yes,
with a very well tested implementation. Currently it is not a good idea to
insert preempt inside the 2.4 stable tree, because there is a lot of work
to do to get a very WELL TESTED implementation.
> Preemptive gives better interactivity under load, which is the whole
> point of multitasking (think about it). If you don't want the overhead
> (which also exists without preemptive) run #processes == #processors.
That is generally not true. Pre-emption is used in user space to prevent
applications doing very stupid things. Pre-emption in a trusted environment
can often be most efficient if done by the programs themselves.
Userspace is not a trusted environment.
> I'm really surprised that people are still actually arguing _against_
> preemptive multitasking in this day and age. This is a no-brainer in
> the long run, where current corner cases aren't holding us back.
Andrew's patches give you 1mS worst-case latency for normal situations, which
is below human perception, and below scheduling granularity. In other words,
without the efficiency loss and the debugging problems, you can push the
latency far enough below other effects that it isn't worth attacking any more.
Alan
On Wednesday, 9. January 2002 00:02, Luigi Genoni wrote:
> On Tue, 8 Jan 2002, Daniel Phillips wrote:
> > On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
[-]
> > > I also don't want to devaluate the preemptive kernel approch (the mean
> > > latency it can reach is lower than the one of the lowlat kernel,
> > > however I personally care only about worst case latency and this is why
> > > I don't feel the need of -preempt),
> >
> > This is exactly the case that -preempt handles well. On the other hand,
> > trying to show that scheduling hacks satisfy any given latency bound is
> > equivalent to solving the halting problem.
> >
> > I thought you had done some real time work?
> >
> > > but I just wanted to make clear that the
> > > idea that is floating around that preemptive kernel is all goodness is
> > > very far from reality, you get very low mean latency but at a price.
> >
> > A price lots of people are willing to pay
>
> Probably sometimes they are not making a good business. In the reality
> preempt is good in many scenarios, as I said, and I agree that for
> desktops, and dedicated servers where just one application runs, and
> probably the CPU is idle the most of the time,
OK, good. You are much on the same line as I am.
Should we start to differentiate not only between UP and SMP systems but
also between desktops and (big) servers?
I remember one saying: "Think, this patch is only worth it for ~0.05% of the
Linux users..." (He meant the multi-CPU SMP system users.)
Almost 99.95% of the Linux users are running desktops and I am somewhat tired
of saying, "sorry, Linux is under development..."
Look at the artwork in the famous German c't magazine (they are not exactly
known as Linux bashers... ;-). It shows little penguins falling like dominoes
(starting with 2.4.17).
Let me rephrase it:
I appreciate all your great work, and I know "only" some (little) internals of
it, but we should do some interactivity improvements for the 2.4 kernel, too.
I know what Andrew's (low-latency) patch and Robert's (and George
Anzinger's) preempt patch are worth. In short, the system (a bigger desktop)
flies.
The holy grail would be a combination of preempt+lock-break plus lowlatency
and Ingo's O(1) scheduler.
My main focus lies on 3D graphics, not the kernel, and I use KDE (yes, a little
luxury :-) 'cause KDE is C++ and most visualization systems are C and later
C++.
Without the above patches even my 1 GHz Athlon II, 640 MB, feels sluggish.
But I don't forget to think about throughput, which is also useful for
"heavy" compiler runs...
> indeed users have a speed
> feeling. Please consider that on heavily loaded servers, with 40 and more
> users, some are running gcc, others g77, others g++ compilations, someone
> runs pine or mutt or kmail, and netscape, and mozilla, and emacs (some
> from xterm, KDE or GNOME), and and
> and... You can also have 4/8 CPUs but they are not infinite ;) (but I talk
> mainly thinking of dual Athlon systems).
> There is a lot of memory and disk I/O.
> This is not a strange scenario on the interactive servers used at SNS.
> Here preempt has too high a price
That's why preempt is a compile time option, btw.
> > By the way, have you measured the cost of -preempt in practice?
>
> Yes, I did a lot of tests, and with current preempt patch definitelly
> I was seeing a too big performance loss.
Have you tried with stock 2.4.17 or with additional patches?
2.4.17-rc2aa2 (10_vm-21)?
The latter makes a big difference in throughput for me (with and without
preempt).
I am preparing some numbers.
Anybody want some special tests?
dbench (yes, I know...) with and without MP3 during run
latencytest0.42-png
bonnie++
getc_putc
Thank you for all your serious answers. This was definitely not intended as a
flamewar start.
-Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Wed, 9 Jan 2002 00:10:38 +0000 (GMT), Alan Cox
<[email protected]> wrote:
>> Preemptive gives better interactivity under load, which is the whole
>> point of multitasking (think about it). If you don't want the overhead
>> (which also exists without preemptive) run #processes == #processors.
>
>That is generally not true. Pre-emption is used in user space to prevent
>applications doing very stupid things. Pre-emption in a trusted environment
>can often be most efficient if done by the programs themselves.
>
>Userspace is not a trusted environment
The best part about planned preemption points is that there is minimal
state to save when an interruption occurs.
>
>> I'm really surprised that people are still actually arguing _against_
>> preemptive multitasking in this day and age. This is a no-brainer in
>> the long run, where current corner cases aren't holding us back.
>
>Andrew's patches give you 1 ms worst case latency for normal situations, which
>is below human perception, and below scheduling granularity. In other words,
>without the efficiency loss and the debugging problems, you can push the
>latency far enough below other effects that it isn't worth attacking any more.
Incidentally, human visual perception runs around 200 milliseconds
minimum and hearing/touch perception around 100 milliseconds if the
signal has to go through the brain. Of course we extend our
perceptions with tools/programs etc.
john
On Tue, 8 Jan 2002, Andrew Morton wrote:
> > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > reschedule exactly as often as necessary, instead of having lots of extra
> > schedule points inserted all over the place, firing when *they* think the
> > time is right, which may well be earlier than necessary.
>
> Nope. `if (current->need_resched)' -> the time is right (beyond right,
> actually).
Have we ever considered making rescheduling work like get_user? That is,
make current->need_resched be a pointer, and if we need to reschedule,
make it an INVALID pointer that causes us to fault and call schedule in
its fault path?
Orthogonally, for rescheduling points with locks, we could build a version
of the spinlocks that know when they're blocking other processes and can
do a spin_yield(&lock) in places where they can safely give up a lock. On
single processor, spin_yield could translate to a scheduling point.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Tue, 2002-01-08 at 19:29, John Alvord wrote:
> The best part about planned preemption points is that there is minimal
> state to save when an interruption occurs.
Actually, both preempt-kernel and low-latency do about the same amount
of work re saving state.
With preempt-kernel, when a task is preempted in-kernel we AND a flag
value into the preempt count. That is all we need to keep track of
things.
With low-latency, the task state is set to TASK_RUNNING (which is a
precautionary measure). So it is about the same, although low-latency
(and lock-break) often also have to do various setup with the locks and
all.
Robert Love
On Wed, 9 Jan 2002, Dieter Nützel wrote:
> On Wednesday, 9. January 2002 00:02, Luigi Genoni wrote:
> > On Tue, 8 Jan 2002, Daniel Phillips wrote:
> > > On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> [-]
> > > > I also don't want to devaluate the preemptive kernel approch (the mean
> > > > latency it can reach is lower than the one of the lowlat kernel,
> > > > however I personally care only about worst case latency and this is why
> > > > I don't feel the need of -preempt),
> > >
> > > This is exactly the case that -preempt handles well. On the other hand,
> > > trying to show that scheduling hacks satisfy any given latency bound is
> > > equivalent to solving the halting problem.
> > >
> > > I thought you had done some real time work?
> > >
> > > > but I just wanted to make clear that the
> > > > idea that is floating around that preemptive kernel is all goodness is
> > > > very far from reality, you get very low mean latency but at a price.
> > >
> > > A price lots of people are willing to pay
> >
> > Probably sometimes they are not making a good business. In the reality
> > preempt is good in many scenarios, as I said, and I agree that for
> > desktops, and dedicated servers where just one application runs, and
> > probably the CPU is idle the most of the time,
>
> OK, good. You are much on the same line as I am.
>
> Should we start to differentiate not only between UP and SMP systems but
> also between desktops and (big) servers?
> I remember one saying: "Think, this patch is only worth it for ~0.05% of the
> Linux users..." (He meant the multi-CPU SMP system users.)
Linux is suitable for many uses, with a lot of hardware.
There are a lot of things you MUST differentiate, not just desktops and
servers, or uniprocessor and multi-CPU SMP.
Just think of this: Linux runs on
sparc64, alpha, ia64, pa-risc, ppc and so on, and all those platforms
work well with it.
Don't you see?
And please, lkml is the place to differentiate, because here people talk
about development, and how to do it best.
>
> Almost 99.95% of the Linux users are running desktops and I am somewhat tired
> of saying, "sorry, Linux is under development..."
> Look at the artwork in the famous German c't magazine (they are not exactly
> known as Linux bashers... ;-). It shows little penguins falling like dominoes
> (starting with 2.4.17).
Yes and no. Maybe 95% of the Linux boxes out there are desktops, and they would
benefit from preemption. For the others, it's an option not to enable it.
But what we were discussing here was whether preemption is a panacea or
not. And if not, why. And then, how things could be improved. It is
important to discuss those topics here.
>
> Let me rephrase it:
> I appreciate all your great work and I know "only" some (little) internals of
> it but we should do some interactivity improvements for the 2.4 kernel, too.
> I know what it's worth Andrew's (lowlatency patch) and Robert's (George
> Anzinger's) preempt patch. In short the system (bigger desktop) flies.
yes, but this is mostly 2.5 stuff. Then it can be backported.
>
>
> > indeed users have a speed
> > feeling. Please consider that on heavily loaded servers, with 40 and more
> > users, some are running gcc, others g77, others g++ compilations, someone
> > runs pine or mutt or kmail, and netscape, and mozilla, and emacs (some
> > from xterm, KDE or GNOME), and and
> > and... You can also have 4/8 CPUs but they are not infinite ;) (but I talk
> > mainly thinking of dual Athlon systems).
> > There is a lot of memory and disk I/O.
> > This is not a strange scenario on the interactive servers used at SNS.
> > Here preempt has too high a price
>
> That's why preempt is a compile time option, btw.
>
> > > By the way, have you measured the cost of -preempt in practice?
> >
> > Yes, I did a lot of tests, and with current preempt patch definitelly
> > I was seeing a too big performance loss.
>
> Have you tried with stock 2.4.17 or with additional patches?
> 2.4.17-rc2aa2 (10_vm-21)?
Obviously. This is an important part of my work (and hobby).
>
> The later make big differences in throughput for me (with and without
> preempt).
For me too; 2.4.17+rc2aa2 is especially good with medium-sized databases.
>
> I am under preparation of some numbers.
> Anybody want some special tests?
> dbench (yes, I know...) with and without MP3 during run
> latencytest0.42-png
> bonnie++
> getc_putc
very interested in what you will get.
>
> Thank you for all your serious answers. This was definitely not intended as a
> flamewar start.
and this was not a flamewar...
Alan Cox wrote:
>
> Andrew's patches give you 1mS worst case latency for normal situations, that
> is below human perception, and below scheduling granularity.
The full ll patch is pretty gruesome though.
The high-end audio synth guys claim that two milliseconds is getting
to be too much. They are generating real-time audio and they do
have more than one round-trip through the software. It adds up.
Linux is being used in so many different applications now. You are,
I think, one of the stronger recognisers of the fact that we do not
only use Linux to squirt out html and to provide shell prompts to snotty
students. Good scheduling responsiveness is a valuable feature.
I haven't seen any figures for embedded XP, but it is said that
if you bend over backwards you can get 10 milliseconds out of NT4,
and 4-5 out of the fabled BeOS. This is one area where we can
fairly easily be very much the best. It's low-hanging fruit.
Internal preemptability is, in my opinion, the best way to deliver
this.
I accept your point about it making debugging harder - I would
suggest that the preempt code be altered so that it can be disabled
at runtime, rather than via a rebuild. I suspect this can be
done at zero cost by setting init_task's preempt count to 1000000
via a kernel boot option. And at almost-zero cost via a sysctl.
I would further suggest that support be added to the kernel to
allow general users to both detect and find the source of latency
problems. That's actually pretty easy - realfeel running at
2 kHz only consumes 2-3% of the CPU. It can just be left ticking
over in the background.
With preemptability merged in 2.5 we can then work to fix the
long-held locks. Most of them are simple. Some of them are
very much not. I'll gladly help with that.
-
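As an aside, a latency probe in the spirit of the realfeel measurement Andrew
mentions can be approximated in plain user space. The sketch below is not
realfeel itself (which samples /dev/rtc interrupts); it merely asks for a 2 kHz
periodic wakeup with clock_nanosleep() and reports the worst overshoot, which
is enough to notice multi-millisecond stalls.

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <time.h>

static long long ns_of(const struct timespec *t)
{
        return (long long)t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(void)
{
        const long period_ns = 500000;          /* 2 kHz tick, like the realfeel run */
        struct timespec next, now;
        long long worst = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < 20000; i++) {       /* roughly ten seconds of sampling */
                next.tv_nsec += period_ns;
                if (next.tv_nsec >= 1000000000L) {
                        next.tv_nsec -= 1000000000L;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                clock_gettime(CLOCK_MONOTONIC, &now);
                long long late = ns_of(&now) - ns_of(&next);
                if (late > worst)
                        worst = late;
        }
        printf("worst wakeup latency: %lld usec\n", worst / 1000);
        return 0;
}

Something like this, left ticking over in the background as Andrew suggests,
costs very little CPU and immediately shows whether a workload is producing
multi-millisecond scheduling stalls.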
On Wed, 2002-01-09 at 00:08, Andrew Morton wrote:
> [snip]
> With preemptability merged in 2.5 we can then work to fix the
> long-held locks. Most of them are simple. Some of them are
> very much not. I'll gladly help with that.
Amen. Thank you, Andrew.
Let's work together and make a better kernel.
Robert Love
On January 9, 2002 12:02 am, Luigi Genoni wrote:
> On Tue, 8 Jan 2002, Daniel Phillips wrote:
> > On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > > but I just wanted to make clear that the
> > > idea that is floating around that preemptive kernel is all goodness is
> > > very far from reality, you get very low mean latency but at a price.
> >
> > A price lots of people are willing to pay
>
> Probably sometimes they are not making a good business.
Perhaps. But they are happy customers and their music sounds better.
Note: the dominating cost of -preempt is not Robert's patch, but the fact
that you need to have CONFIG_SMP enabled, even for uniprocessor, turning all
those stub macros into real spinlocks. For a dual processor you have to have
this anyway and it just isn't an issue.
Personally, I don't intend to ever get another single-processor machine,
except maybe a laptop, and that's only if Transmeta doesn't come up with a
dual-processor laptop configuration.
> > By the way, have you measured the cost of -preempt in practice?
>
> Yes, I did a lot of tests, and with current preempt patch definitelly
> I was seeing a too big performance loss.
Was this on uniprocessor machines, or your dual Athlons? How did you measure
the performance?
--
Daniel
On January 9, 2002 12:26 am, Luigi Genoni wrote:
> On Tue, 8 Jan 2002, Daniel Phillips wrote:
>
> > On January 8, 2002 08:47 pm, Andrew Morton wrote:
> > > Daniel Phillips wrote:
> > > > What a preemptible kernel can do that a non-preemptible kernel can't is:
> > > > reschedule exactly as often as necessary, instead of having lots of extra
> > > > schedule points inserted all over the place, firing when *they* think the
> > > > time is right, which may well be earlier than necessary.
> > >
> > > Nope. `if (current->need_resched)' -> the time is right (beyond right,
> > > actually).
> >
> > Oops, sorry, right.
> >
> > The preemptible kernel can reschedule, on average, sooner than the
> > scheduling-point kernel, which has to wait for a scheduling point to roll
> > around.
>
> mmhhh. At which cost? And then anyway if I have a spinlock, I still have
> to wait for a scheduling point to roll around.
Did you read the thread? Think about the relative amount of time spent in
spinlocks vs the amount of time spent in the regular kernel.
--
Daniel
(the subject has been wrong for some time now...)
On Wednesday den 9 January 2002 07.26, Daniel Phillips wrote:
> On January 9, 2002 12:02 am, Luigi Genoni wrote:
> > On Tue, 8 Jan 2002, Daniel Phillips wrote:
> > > On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > > > but I just wanted to make clear that the
> > > > idea that is floating around that preemptive kernel is all goodness
> > > > is very far from reality, you get very low mean latency but at a
> > > > price.
> > >
> > > A price lots of people are willing to pay
> >
> > Probably sometimes they are not making a good business.
>
> Perhaps. But they are happy customers and their music sounds better.
>
> Note: the dominating cost of -preempt is not Robert's patch, but the fact
> that you need to have CONFIG_SMP enabled, even for uniprocessor, turning
> all those stub macros into real spinlocks. For a dual processor you have
> to have this anyway and it just isn't an issue.
>
Well, you don't - the first versions used the SMP spinlock macros but
replaced them with their own code (basically an INC on entry and a DEC-and-test
when leaving).
Think about what happens on a UP machine. There are two cases:
- the processor is in the critical section; it cannot be preempted, so no
other process can take the CPU away from it.
- the processor is not in the critical section; then no process can be
executing inside it, so the lock can never be busy.
=> no real spinlocks are needed on UP
/RogerL
--
Roger Larsson
Skellefteå
Sweden
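A tiny user-space model of the counter scheme Roger describes (INC on entry,
DEC-and-test on exit) may make the point concrete. Everything here is
illustrative: the names preempt_disable/preempt_enable only mirror the idea,
schedule() is a stub, and this is not the actual preempt-kernel code.

#include <stdio.h>

static int preempt_count;       /* > 0 means "do not preempt me now" */
static int need_resched;        /* set asynchronously by "the scheduler" */

static void schedule(void)
{
        need_resched = 0;
        puts("rescheduled");
}

static void preempt_disable(void)       /* the INC on entry */
{
        preempt_count++;
}

static void preempt_enable(void)        /* the DEC and test on exit */
{
        if (--preempt_count == 0 && need_resched)
                schedule();             /* deferred preemption happens here */
}

int main(void)
{
        preempt_disable();      /* "take the UP spinlock"                 */
        need_resched = 1;       /* a preemption request arrives meanwhile */
        preempt_enable();       /* "drop the lock" -> schedule() runs     */
        return 0;
}

Andrew's suggestion earlier in the thread of disabling preemption at runtime
amounts to initializing such a count to a huge value (he suggested 1000000)
from a boot option, so the DEC-and-test path never reaches zero.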
On January 9, 2002 08:25 am, Roger Larsson wrote:
> (the subject has been wrong for some time now...)
>
> On Wednesday den 9 January 2002 07.26, Daniel Phillips wrote:
> > On January 9, 2002 12:02 am, Luigi Genoni wrote:
> > > On Tue, 8 Jan 2002, Daniel Phillips wrote:
> > > > On January 8, 2002 04:29 pm, Andrea Arcangeli wrote:
> > > > > but I just wanted to make clear that the
> > > > > idea that is floating around that preemptive kernel is all goodness
> > > > > is very far from reality, you get very low mean latency but at a
> > > > > price.
> > > >
> > > > A price lots of people are willing to pay
> > >
> > > Probably sometimes they are not making a good business.
> >
> > Perhaps. But they are happy customers and their music sounds better.
> >
> > Note: the dominating cost of -preempt is not Robert's patch, but the fact
> > that you need to have CONFIG_SMP enabled, even for uniprocessor, turning
> > all those stub macros into real spinlocks. For a dual processor you have
> > to have this anyway and it just isn't an issue.
>
> Well you don't - the first versions used the SMP spinlocks macros but
> replaced them with own code. (basically an INC on entry and a DEC and test
> when leaving)
>
> Think about what happens on a UP
> There are two cases
> - the processor is in the critical section, it can not be preempted = no
> other process can take the CPU away from it.
> - the processor is not in a critical section, no process can be executing
> inside it = can never be busy.
> => no real spinlocks needed on a UP
Right, thanks, it was immediately obvious when you pointed out that the
macros are just used to find the bounds of the critical regions. So the cost
of -preempt is somewhat less than I had imagined.
--
Daniel
Andrew Morton wrote:
[...]
> I haven't seen any figures for embedded XP, but it is said that
> if you bend over backwards you can get 10 milliseconds out of NT4,
> and 4-5 out of the fabled BeOS. This is one area where we can
> fairly easily be very much the best. It's low-hanging fruit.
>
> Internal preemptability is, in my opinion, the best way to deliver
> this.
>
> I accept your point about it making debugging harder - I would
> suggest that the preempt code be altered so that it can be disabled
> at runtime, rather than via a rebuild. I suspect this can be
> done at zero cost by setting init_task's preempt count to 1000000
> via a kernel boot option. And at almost-zero cost via a sysctl.
And with some bad luck, the bug goes away when you
do this. The bug of the missing lock...
Helge Hafting
On January 8, 2002 11:21 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > The preemptible kernel can reschedule, on average, sooner than the
> > scheduling-point kernel, which has to wait for a scheduling point to roll
> > around.
>
> Yes. It can also fix problematic areas which my testing
> didn't cover.
I bet, with a minor hack, it can help you *find* those problem areas too.
You compile the two patches together and automatically log any event along
with the execution address, where your explicit schedule points failed to
reschedule in time. Sort of like a profile but suited exactly to your
problem.
This just detects the problem areas in normal kernel execution, not
spinlocks, but that is probably where most of the maintenance will be anyway.
By the way, did you check for latency in directory operations?
> Incidentally, there's the SMP problem. Suppose we
> have the code:
>
> lock_kernel();
> for (lots) {
> do(something sucky);
> if (current->need_resched)
> schedule();
> }
> unlock_kernel();
>
> This works fine on UP, but not on SMP. The scenario:
>
> - CPU A runs this loop.
>
> - CPU B is spinning on the lock.
>
> - Interrupt occurs, kernel elects to run RT task on CPU B.
> CPU A doesn't have need_resched set, and just keeps
> on going. CPU B is stuck spinning on the lock.
>
> This is only an issue for the low-latency patch - all the
> other approaches still have sufficiently bad worse-case that
> this scenario isn't worth worrying about.
>
> I toyed with creating spin_lock_while_polling_resched(),
> but ended up changing the scheduler to set need_resched
> against _all_ CPUs if an RT task is being woken (yes, yuk).
Heh, subtle. Thanks for pointing that out and making my head hurt.
--
Daniel
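The workaround Andrew describes above (waking an RT task sets need_resched on
_all_ CPUs) can be mimicked in user space with threads standing in for CPUs.
The sketch below is only an illustration under those assumptions; NR_CPUS,
need_resched[] and wake_up_rt_task() are invented names, and sched_yield()
plays the role of schedule().

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CPUS 2

static atomic_int need_resched[NR_CPUS];

/* The fix Andrew describes: don't just mark the CPU the RT task was queued
 * on, mark them all, so a CPU busy inside a lock_kernel() loop notices too. */
static void wake_up_rt_task(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                atomic_store(&need_resched[cpu], 1);
}

/* One thread per "CPU", each running the kind of loop quoted above. */
static void *busy_kernel_loop(void *arg)
{
        int cpu = (int)(long)arg;

        for (long i = 0; i < 50000000L; i++) {
                /* ... do(something sucky) ... */
                if (atomic_exchange(&need_resched[cpu], 0)) {
                        printf("cpu %d: yielding to the urgent task\n", cpu);
                        sched_yield();
                }
        }
        return NULL;
}

int main(void)
{
        pthread_t cpus[NR_CPUS];

        for (long cpu = 0; cpu < NR_CPUS; cpu++)
                pthread_create(&cpus[cpu], NULL, busy_kernel_loop, (void *)cpu);
        usleep(100000);                 /* let the loops get going...            */
        wake_up_rt_task();              /* ...then the "RT task is woken" moment */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                pthread_join(cpus[cpu], NULL);
        return 0;
}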
Daniel Phillips wrote:
>
> On January 8, 2002 11:21 pm, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > > The preemptible kernel can reschedule, on average, sooner than the
> > > scheduling-point kernel, which has to wait for a scheduling point to roll
> > > around.
> >
> > Yes. It can also fix problematic areas which my testing
> > didn't cover.
>
> I bet, with a minor hack, it can help you *find* those problem areas too.
> You compile the two patches together and automatically log any event along
> with the execution address, where your explicit schedule points failed to
> reschedule in time. Sort of like a profile but suited exactly to your
> problem.
Well, one of the instrumentation patches which I use detects a
scheduling overrun at interrupt time and emits an all-CPU backtrace.
You just feed the trace into ksymoops or gdb then go stare at
the offending code.
That's the easy part - the hard part is getting sufficient coverage.
There are surprising places. close_files(), exit_notify(), ...
> This just detects the problem areas in normal kernel execution, not
> spinlocks, but that is probably where most of the maintainance will be anyway.
>
> By the way, did you check for latency in directory operations?
Yes. They can be very bad for really large directories. Scheduling
on the found-in-cache case in bread() kills that one easily for most
local filesystems. There may still be a problem in ext2.
-
On January 9, 2002 10:26 am, Andrew Morton wrote:
> Daniel Phillips wrote:
> > By the way, did you check for latency in directory operations?
>
> Yes. They can be very bad for really large directories. Scheduling
> on the found-in-cache case in bread() kills that one easily for most
> local filesystems. There may still be a problem in ext2.
An indexed directory won't have that problem - I'll get to finishing off the
htree patch pretty soon[1]. In any event, the analogous technique will work:
a schedule point in ext2_bread.
[1] Wli's hash work is happening at a convenient time.
--
Daniel
Oliver Xymoron wrote:
[...]
> Have we ever considered making rescheduling work like get_user? That is,
> make current->need_resched be a pointer, and if we need to reschedule,
> make it an INVALID pointer that causes us to fault and call schedule in
> its fault path?
Elegant perhaps, but now you take the time to do a completely unnecessary
page fault when rescheduling. This has a cost which is high on
some architectures. But the point of rescheduling was to improve
interactive performance and I/O latency.
Every page fault may have to check for this case.
Helge Hafting
On Tue, Jan 08, 2002 at 12:13:57PM -0800, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> > > Andrew Morten`s read-latency.patch is a clear winner for me, too.
> >
> > AFAIK Andrew's code simply adds schedule points around the kernel, right?
> >
> > If so, nope, I do not plan to integrate it.
>
> I haven't sent it to you yet :) It improves the kernel. That's
> good, isn't it? (There are already forty or fifty open-coded
> rescheduling points in the kernel. That patch just adds the
> missing (and most important) ten).
Yes, only make sure not to add livelocks.
> reviewable place - submit_bh(). However that won't prevent
Agreed, I also merged this part of mini-ll in 18pre2aa1 in fact.
Andrea
On Tue, Jan 08, 2002 at 12:24:32PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> > > What is 00_nanosleep-5 and bootmem ?
> >
> > nanosleep gives usec resolution to the rest-of-time returned by
> > nanosleep, this avoids glibc userspace starvation on nanosleep
> > interrupted by a flood of signals. It was requested by glibc people.
> >
>
> It would be really nice to do something about that two millisecond
> busywait for RT tasks in sys_nanosleep(). It's really foul.
>
> The rudest thing about it is that programmers think "oh, let's
> be nice to the machine and use usleep(1000)". Boy, do they
> get a shock when they switch their pthread app to use an
> RT policy.
>
> Does anyone have any clever ideas?
Actually, with a design proposal I did today, I think for x86-64 we'll be
able to reach something on the order of 1 msec/100 usec of precision in
nanosleep without the need for RT privileges (note this has nothing to
do with the nanosleep patch in -aa; the nanosleep patch is about the
retval, not about "how time will pass"). However, the whole timer
handling will have to be revisited by the time we start to code this.
Andrea
On Tue, Jan 08, 2002 at 03:55:38PM -0500, Robert Love wrote:
> On Tue, 2002-01-08 at 10:29, Andrea Arcangeli wrote:
>
> > "extra schedule points all over the place", that's the -preempt kernel
> > not the lowlatency kernel! (on yeah, you don't see them in the source
> > but ask your CPU if it sees them)
>
> How so? The branch on drop of the last lock? It's not a factor in
Exactly, this is the reschedule point I meant. Oh, note that it's an
unlikely branch in the lowlatency patch too. Please count the number of times
you add this branch in -preempt, and how many times we add this
branch in lowlat, and then tell me who is adding rescheduling points
all over the place in the kernel.
> This makes me think the end conclusion would be that preemptive
> multitasking in general is bad. Why don't we increase the timeslice and
> and tick period, in that case?
That would increase performance, but we'd lose interactivity.
> One can argue the complexity degrades performance, but tests show
> otherwise. In throughput and latency. Besides, like I always say, its
Which benchmarks? You should make sure the CPU spends all its cycles in
the kernel to benchmark the performance degradation (this is the normal
case for web serving with a few gigabit ethernet cards using sendfile).
> ride. On the other hand, the patch has a _huge_ userbase and you can't
I question this because it is too risky to apply. There is no way any
distribution or production system could ever consider applying the
preempt kernel and shipping it in its next 2.4 kernel update. You never know
if a driver will deadlock because it is doing a test-and-set-bit busy
loop by hand instead of using spin_lock, and you cannot audit all the
device drivers out there. It is not like the VM, which is self-contained
and can be replaced without any caller noticing; this instead
impacts every single driver out there and you'd need to audit all of
them, which I think is not feasible, and which should be done by giving
everybody the time to test. This is also what makes a preempt config
option risky: if we go preempt we should force everybody to use it, at
least during 2.5, so we get useful feedback from testers of all the
hardware, or nobody could trust -preempt.
NOTE: I trust that your work with spinlocks, locks around per-CPU data
structures etc. is perfect; I trust that part. As said, it's the drivers
doing test-and-set-bit by hand, which you cannot audit, that are the problem
here and that make it potentially unstable, not your changes. Also, the
per-CPU data structures sound a little risky (but for UP, for example,
that's not an issue).
> question that. You also can't question the benchmarks that show
> improvements in average _and_ worst case latency _and_ throughput.
I don't question that some benchmark is faster with -preempt; the interesting
thing is to find out why, because it shouldn't be the case. Andrew for
example mentioned software RAID: there are good reasons why
-preempt could be faster there, so we added a single schedule point and
we have that case covered in 18pre2aa1. We don't need reschedule
points all over the place like in -preempt to cover things like that.
It is good to find them out so we can fix those bugs - I consider them
bugs :).
Again: I'm not completely against preempt. It can reach a mean latency
much lower than mainline (it can reschedule immediately in the middle of
long copy_user operations, for example), so it definitely has value; it's just
that I'm not sure it is worth it.
Andrea
On January 9, 2002 02:55 pm, Andrea Arcangeli wrote:
> On Wed, Jan 09, 2002 at 12:56:50PM +0100, Daniel Phillips wrote:
> > BTW, I find your main argument confusing. First you don't want -preempt with
> > CONFIG_EXPERIMENTAL because it might not get wide enough testing, so you want
> > to enable it by default in the mainline kernel, then you argue it's too risky
> > because everybody will use it and it might break some obscure driver. Sorry,
> > you lost me back there.
>
> the point I am making is very simple: _if_ we include it, it should _not_
> be a config option.
That doesn't make any sense to me. Why should _SMP be a config option and not
_PREEMPT?
--
Daniel
----- Original Message -----
From: "Andrea Arcangeli" <[email protected]>
To: "Robert Love" <[email protected]>
Cc: "Daniel Phillips" <[email protected]>; "Anton Blanchard"
<[email protected]>; "Luigi Genoni" <[email protected]>; "Dieter N?tzel"
<[email protected]>; "Marcelo Tosatti" <[email protected]>;
"Rik van Riel" <[email protected]>; "Linux Kernel List"
<[email protected]>; "Andrew Morton" <[email protected]>
Sent: Wednesday, January 09, 2002 6:24 AM
Subject: Re: [2.4.17/18pre] VM and swap - it's really unusable
> On Tue, Jan 08, 2002 at 03:55:38PM -0500, Robert Love wrote:
> > On Tue, 2002-01-08 at 10:29, Andrea Arcangeli wrote:
> >
> > > "extra schedule points all over the place", that's the -preempt kernel
> > > not the lowlatency kernel! (on yeah, you don't see them in the source
> > > but ask your CPU if it sees them)
> >
> > How so? The branch on drop of the last lock? It's not a factor in
>
> exactly, this is the reschedule point I meant. Oh note that it's
> unlikely also in the lowlatecy patch. Please count the number of time
> you add this branch in the -preempt, and how many times we add this
> branch in the lowlat and then tell me who is adding rescheduling points
> in the kernel all over the place.
>
> > This makes me think the end conclusion would be that preemptive
> > multitasking in general is bad. Why don't we increase the timeslice and
> > and tick period, in that case?
>
> that would increase performance, but we'd lost interactivity.
>
> > One can argue the complexity degrades performance, but tests show
> > otherwise. In throughput and latency. Besides, like I always say, its
>
> which benchmarks? you should make sure the CPU spend all its cycles in
> the kernel to benchmark the perfrormance degradation (this is the normal
> case of webserving with a few gigabit ethernet cards using sendfile).
I haven't seen any interactive tests that showed worse results than the
vanilla kernel with the preempt patch. The only cases where it gives
worse performance are in a single-tasking environment, such as running bonnie
or dbench and apps like that which throttle the system. This is
obviously expected behavior, though. Performance degradation might be seen
on a per-app basis, but when looking at the system as a whole, performance
has never degraded with the patch as far as I've seen. Better overall
performance is what has led to better "benchmark" performance on the tests
being run by people.
> > ride. On the other hand, the patch has a _huge_ userbase and you can't
>
> I question this because it is too risky to apply. There is no way any
> distribution or production system could ever consider applying the
> preempt kernel and ship it in its next kernel update 2.4. You never know
> if a driver will deadlock because it is doing a test and set bit busy
> loop by hand instead of using spin_lock and you cannot audit all the
> device drivers out there. It is not like the VM that is self contained
> and that can be replaced without any caller noticing, this instead
> impacts every single driver out there and you'd need to audit all of
> them, which is not feasible I think and that should be done by giving
> everybody the time to test. This is also what makes preempt config
> option risky, if we go preempt we should force everybody to use it, at
> least during 2.5, so we get the useful feedback from testers of all the
> hardware, or nobody could trust -preempt.
I disagree. Red Hat shipped gcc 2.96 when it was producing incompatible
binaries and was buggy as all hell; why not ship a kernel that is "unstable"
and "risky" if it promises better performance?
If scheduling points are ugly in 2.4, then they'd be ugly in 2.5. The only
solution to the problem you see with it is making 2.5 fully preemptible from
the ground up instead of having to add fixes. If nobody wants to do things
the hard way (assuming there is a better way), is it better to leave it
unfixed rather than fix it? Of course I'm assuming that the idea of a fully
preemptible kernel is better than the current version we have now.
> NOTE: I trust your work with spinlocks, locks around per-cpu data
> structures etc.. is perfect, I trust that part, as said it's the driver
> doing test and set bit that you cannot audit that is the problem here
> and that makes it potentially unstable, not your changes. And also the
> per-cpu data structures sounds a little risky (but for example for UP
> that's not an issue).
>
> > question that. You also can't question the benchmarks that show
> > improvements in average _and_ worst case latency _and_ throughput.
>
> I don't question some benchmark is faster with -preempt, the interesting
> thing is to find why because it shouldn't be the case, Andrew for
> example mentioned software raid, there are good reasons for which
> -preempt could be faster there, so we added a single sechdule point and
> we just have that case covered in 18pre2aa1, we don't need reschedule
> points all over the place like in -preempt to cover things like that.
> It is good to find them out so we can fix those bugs, I consider them
> bugs :).
I think Robert Love is trying to give the kernel the highest flexibility.
Making it flexible in key areas will improve your worst cases, but a lot of
the time, during normal use, it's the multitude of smaller cases that is
noticeable.
> Again: I'm not completly against preempt, it can reach an mean latency
> much lower than mainline (it can reschedule immediatly in the middle of
> long copy-users for example), so it definitely has a value, it's just
> that I'm not sure if it worth it.
>
> Andrea
OK, so the medicine is worse than the disease. I take it that you only want
some key rescheduling points instead of the full preempt patch by
Robert. That seems logical enough. The only issue I see is that, for the
most part, people don't like the idea of needing to add scheduling points. So
how would the kernel need to be fixed in order to not need them and still be
fully preemptible, like it's becoming with Robert's patch? If it just can't,
then is it really best to hang out somewhere on the edge of preemptible
multitasking because some people are in denial about how much the kernel needs
to be patched to work correctly, and for the sake of single-tasking
performance?
Now, in just my own opinion, I think of Linux as a multitasking kernel, and as
such it should perform that function as well as possible. If you want to
run a single program as fast as possible, then absolutely don't run anything
else, and nothing can preempt it to degrade its performance. But being able to
run a single program at full speed while other apps (at the
same priority) have to wait longer than they should is a bug, if we want Linux
to be a multitasking kernel. Just to sum things up: if there is a way for Linux
to be fully preemptible without scheduling points, then perhaps that should be
a major focus for 2.5 instead of picking and choosing (ugly) scheduling points;
but if not, then the argument about them not being elegant is moot, because
the kernel itself is far from elegant already, so what exactly are
you saving?
- Formerly safemode
On Wed, Jan 09, 2002 at 03:07:40PM +0100, Daniel Phillips wrote:
> On January 9, 2002 02:55 pm, Andrea Arcangeli wrote:
> > On Wed, Jan 09, 2002 at 12:56:50PM +0100, Daniel Phillips wrote:
> > > BTW, I find your main argument confusing. First you don't want -preempt with
> > > CONFIG_EXPERIMENTAL because it might not get wide enough testing, so you want
> > > to enable it by default in the mainline kernel, then you argue it's too risky
> > > because everybody will use it and it might break some obscure driver. Sorry,
> > > you lost me back there.
> >
> > the point I am making is very simple: _if_ we include it, it should _not_
> > be a config option.
>
> That doesn't make any sense to me. Why should _SMP be a config option and not
Getting the drivers tested with preempt enabled makes lots of sense to
me.
> _PREEMPT?
SMP in 2.1 wasn't a config option.
Andrea
On Wed, Jan 09, 2002 at 09:07:55AM -0500, Ed Sweetman wrote:
> Ok so the medicine is worse than the disease. I take it that you only want
> some key points made for rescheduling instead of the full preempt patch by
> Robert. That seems logical enough. The only issue i see is that for the
My ideal is to have the kernel reach as low a worst-case latency as -preempt,
but without being preemptive. That's possible to achieve; I don't think
we're that far off.
Mean latency is another matter, but I personally don't care much about mean
latency and I much prefer to save CPU cycles instead.
Andrea
Andrea Arcangeli wrote:
>
> On Wed, Jan 09, 2002 at 09:07:55AM -0500, Ed Sweetman wrote:
> > Ok so the medicine is worse than the disease. I take it that you only want
> > some key points made for rescheduling instead of the full preempt patch by
> > Robert. That seems logical enough. The only issue i see is that for the
>
> My ideal is to have the kernel to be as low worst latency as -preempt,
> but without being preemptive. that's possible to achieve, I don't think
> we're that far.
>
> mean latency is another matter, but I personally don't mind about mean
> latency and I much prefer to save cpu cycles instead.
hear hear!
The akpm patch is achieving a MUCH better latency than pure -preempt, and
only has 40 or so coded preemption points instead of a few hundred (e.g. every
spin_unlock)...
And if with 40 we can get <= 1 ms, then everybody will be happy; if you want,
say, 50 usec latency instead, you need RTLinux anyway. With 1 ms _worst case_
latency the "mean" latency is obviously also very good...
> The high-end audio synth guys claim that two milliseconds is getting
> to be too much. They are generating real-time audio and they do
> have more than one round-trip through the software. It adds up.
Most of the stuff I've seen from high-end audio people consists of
overthreaded chains of code written without any consideration for the
actual cost of execution. There are exceptions - including
people dynamically compiling filters to get ideal cache and latency
behaviour - but not enough.
Alan
On Tue, Jan 08, 2002 at 04:29:59PM -0800, John Alvord wrote:
> Incidentally, human visual perception runs around 200 milliseconds
> minimum and hearing/touch perception around 100 milliseconds if the
> signal has to go through the brain. Of course we extend our
> perceptions with tools/programs etc.
Cool! That means movies don't need to run faster than 5
frames/second. Maybe 10 frames/second for plenty of overkill. No
need to look at keyboard and mice any more frequently either, what a
relief. (And why do silly gamers want to go so much higher?)
Sarcasm mode "off" now...just because some experiments show it takes
humans a long time to push the correct button when you show them a
picture of a banana doesn't mean there is no reason to have a user
interface do anything any faster. (I can come up with plenty of
examples if you would like.)
OK, now that I have pissed off a big hunk of the folks on the list,
let me bring up a different question:
How does all this fit into doing a tick-less kernel?
There is something appealing about doing stuff only when there is
stuff to do, like: respond to input, handle some device that becomes
ready, or let another process run for a while. Didn't IBM do some
nice work on this for Linux? (*Was* it nice work?) I was under the
impression that the current kernel isn't that far from being tickless.
A tickless kernel would be wonderful for battery-powered devices that
could literally shut off when there is nothing to do, and it seems it
would (trivially?) help performance on high-end power hogs too.
Why do we have regular HZ ticks? (Other than I think I remember Linus
saying that he likes them.)
Thanks,
-kb, the Kent who knows more about user interfaces than he does
preemption.
On Wednesday den 9 January 2002 15.51, Arjan van de Ven wrote:
> Andrea Arcangeli wrote:
> > On Wed, Jan 09, 2002 at 09:07:55AM -0500, Ed Sweetman wrote:
> > > Ok so the medicine is worse than the disease. I take it that you only
> > > want some key points made for rescheduling instead of the full preempt
> > > patch by Robert. That seems logical enough. The only issue i see is
> > > that for the
> >
> > My ideal is to have the kernel to be as low worst latency as -preempt,
> > but without being preemptive. that's possible to achieve, I don't think
> > we're that far.
> >
> > mean latency is another matter, but I personally don't mind about mean
> > latency and I much prefer to save cpu cycles instead.
>
> hear hear!
>
> The akpm patch is achieving a MUCH better latency than pure -preempt,
> and only has 40
> or so coded preemption points instead of a few hundred (eg every
> spin_unlock)....
The difference is that the preemptive kernel mostly uses existing
infrastructure. When SMP scalability gets better due to holding locks
for a shorter time then the preemptive kernel will improve as well!
AND it can be used on a UP computer to "simulate" SMP and that
should help the quality of the total code base...
This is my idea:
* Add the preemptive kernel
* "Remove" reschedule points from main kernel.
note: reschedule points that do nothing more than
test and schedule can be NOOPed, since they will never trigger in a
preemptive kernel...
>
> and if with 40 we can get <= 1ms then everybody will be happy; if you
> want, say, 50 usec
> latency instead you need RTLinux anyway. With 1ms _worst case_ latency
> the "mean" latency
> is obviously also very good.......
Worst case latency... is VERY hard to prove if you rely on schedule points.
Since they are typically added after the fact...
If the code suddenly ends up on a road less travelled...
With a preemptive kernel your worst latency is the longest held spinlock.
PERIOD.
(you can of course be delayed by an even higher priority process)
* Make sure that there are no "infinite" loops inside any spinlock.
"infinite" == over ALL or ALL/x of something, since someone, somewhere
will have ALL close to infinite... (infinity/x is still infinity... :-)
Example code is looping through the LRU list to find a victim page...
once it was not infinite due to the small number of pages...
(see the sketch after this message)
Note: the akpm patches usually have a "do not do this" list with known
problem spots (OK, usually in hard-to-break spinlocks).
/RogerL
--
Roger Larsson
Skellefteå
Sweden
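Here is a small, self-contained C sketch of the batching rule Roger describes:
never walk an unbounded list while holding the lock; scan a bounded batch and
then let waiters in. The struct page, SCAN_BATCH and find_victim() names are
made up for the illustration, a pthread mutex stands in for the spinlock, and
a real kernel would also have to revalidate its list position after re-taking
the lock.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

#define SCAN_BATCH 32

struct page {
        struct page *next;
        int referenced;
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page *lru_head;

static int can_reclaim(const struct page *p)
{
        return !p->referenced;
}

/* Walk the LRU in bounded batches so the worst-case lock hold is about
 * SCAN_BATCH entries rather than the whole list.  (This single-threaded
 * demo skips the revalidation a real kernel would need.) */
static struct page *find_victim(void)
{
        struct page *p;
        int scanned = 0;

        pthread_mutex_lock(&lru_lock);
        for (p = lru_head; p != NULL; p = p->next) {
                if (can_reclaim(p))
                        break;                          /* found a victim */
                if (++scanned == SCAN_BATCH) {
                        pthread_mutex_unlock(&lru_lock);
                        sched_yield();  /* breathing room; on UP this is just a
                                           scheduling point */
                        pthread_mutex_lock(&lru_lock);
                        scanned = 0;
                }
        }
        pthread_mutex_unlock(&lru_lock);
        return p;
}

int main(void)
{
        /* Build a 1000-entry list whose oldest page is the only reclaimable one. */
        for (int i = 0; i < 1000; i++) {
                struct page *p = calloc(1, sizeof(*p));
                p->referenced = (i != 0);
                p->next = lru_head;
                lru_head = p;
        }
        printf("victim: %p\n", (void *)find_victim());
        return 0;
}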
On Wed, Jan 09, 2002 at 06:02:53PM +0100, Roger Larsson wrote:
> The difference is that the preemptive kernel mostly uses existing
> infrastructure. When SMP scalability gets better due to holding locks
> for a shorter time then the preemptive kernel will improve as well!
Ehm. Holding locks for a shorter time is not guaranteed to improve smp
scalability. In fact it can completely kill it due to cacheline pingpong
effects.
> > and if with 40 we can get <= 1ms then everybody will be happy; if you
> > want, say, 50 usec
> > latency instead you need RTLinux anyway. With 1ms _worst case_ latency
> > the "mean" latency
> > is obviously also very good.......
>
> Worst case latency... is VERY hard to prove if you rely on schedule points.
Agreed. It's "worst case" in the soft real time sence. But we've beaten the
kernel quite hard during such tests....
> With preemptive kernel your worst latency is the longest held spinlock.
> PERIOD.
Yes and without the same stuff akpm does that's about 80 to 90 ms right now.
> Note: that akpm patches usually hava a - "do not do this list" with known
> problem spots (ok, usually in a hard to break spinlocks).
Usually in hardware related parts. Even with -preempt you'll get this.
Hopefully only during hardware initialisation, but there are just cases
where you need to go WAAAY too far if you want to go below, say, 5ms during
device init.
On January 9, 2002 03:51 pm, Arjan van de Ven wrote:
> Andrea Arcangeli wrote:
> >
> > On Wed, Jan 09, 2002 at 09:07:55AM -0500, Ed Sweetman wrote:
> > > Ok so the medicine is worse than the disease. I take it that you only want
> > > some key points made for rescheduling instead of the full preempt patch by
> > > Robert. That seems logical enough. The only issue i see is that for the
> >
> > My ideal is to have the kernel to be as low worst latency as -preempt,
> > but without being preemptive. that's possible to achieve, I don't think
> > we're that far.
> >
> > mean latency is another matter, but I personally don't mind about mean
> > latency and I much prefer to save cpu cycles instead.
>
> hear hear!
> The akpm patch is achieving a MUCH better latency than pure -preempt,
Can you please point us at the benchmark results that support your claim?
> and only has 40
> or so coded preemption points instead of a few hundred (eg every
> spin_unlock)....
So? The cost of this is, in theory, a dec and a branch normally not taken.
Robert hasn't coded it that way in the current incarnation, and personally,
I'd rather see the correctness proven before the microoptimizations are
done, but that's where it's going in theory. Big deal.
On the other hand, I just did a test for myself that pretty well makes up
my mind about this patch. I'm typing this right now on a 64 Meg laptop with
a slow disk, dma turned off. On this machine, debian apt-get dist-upgrade
is essentially a DoS - once it gets to unpacking packages and configuring,
for whatever reason, the machine becomes almost unusable. Changing windows,
for example, can take 10-15 seconds. Updatedb, while not quite as bad, is
definitely an irritant as far as interactive use goes.
With Robert's patch, the machine is a little sluggish during apt-get, but
quite usable. This is a *huge* difference. And during updatedb, well, I
hardly notice it's happening, except for the disk light.
So I like this patch. What was your complaint again? If you've got hard
numbers and repeatable benchmarks, please trot them out.
--
Daniel
On Wednesday 09 January 2002 12:00 pm, Alan Cox wrote:
> > The high-end audio synth guys claim that two milliseconds is getting
> > to be too much. They are generating real-time audio and they do
> > have more than one round-trip through the software. It adds up.
>
> Most of the stuff I've seen from high end audio people consists of
> overthreaded, chains of code written without any consideration for the
> actual cost of execution. There are exceptions - including
> people dynamically compiling filters to get ideal cache and latency
> behaviour, but not enough.
>
> Alan
News flash: people are writing sub-optimal apps in user space.
Do you want an operating system capable of running real-world code written by
people who know more about their specific problem domain (audio) than about
optimal coding in general, or do you want an operating system intended to
only run well-behaved applications designed and implemented by experts?
Rob
Rob Landley wrote:
>
> On Wednesday 09 January 2002 12:00 pm, Alan Cox wrote:
> > > The high-end audio synth guys claim that two milliseconds is getting
> > > to be too much. They are generating real-time audio and they do
> > > have more than one round-trip through the software. It adds up.
> >
> > Most of the stuff I've seen from high end audio people consists of
> > overthreaded, chains of code written without any consideration for the
> > actual cost of execution. There are exceptions - including
> > people dynamically compiling filters to get ideal cache and latency
> > behaviour, but not enough.
> >
> > Alan
>
> News flash: people are writing sub-optimal apps in user space.
Not only in user-space :)
> Do you want an operating system capable of running real-world code written by
> people who know more about their specific problem domain (audio) than about
> optimal coding in general, or do you want an operating system intended to
> only run well-behaved applications designed and implemented by experts?
The people I dealt with (Benno Senoner, Dave Philips, Paul
Barton-Davis) are deeply clueful about this stuff.
I'll quote from an email Paul sent me a year ago. I don't think
he'll mind. This is, of course, a quite specialised application
area:
....
There are two kinds of situations where its needed:
1) real-time effects ("FX") processing
2) real-time synthesis influenced by external controllers
In (1), we have an incoming audio signal that is to be processed in
some way (echo/flange/equalization/etc. etc.) and then delivered back
to the output audio stream. If the delay between the input and output
is more than a few msecs, there are several possible
consequences:
* if the original source was acoustic (non-electronic), and
the processed material is played back over monitors
close to the acoustic source, you will get interesting
filtering effects as the two signals interfere with each
other.
* the musician will get confused by material in the
processed stream arriving "late"
* the result may be useless.
In (2), a musician is using, for example, a keyboard that sends MIDI
"note on/note off" messages to the computer which are intended to
cause the synthesis engine to start/stop generating certain sounds. If
the delay between the musician pressing the key and hearing the sound
exceeds about 5msec, the system will feel difficult to play; worse, if
there is actual jitter in the +/- 5msec range, it will feel impossible
to play.
....
Without LL, Linux cannot reasonably be used for professional audio
work that involves real time FX or real time synthesis. The default
kernel has worst-case latencies noticeably worse than Windows, and
most people are reluctant to use that system already, not just because
of instability but latencies also. Its not a matter of it being "a bit
of a problem" - the 100msec worst case latencies visible in the
standard kernel make it totally implausible that you would ever deploy
Linux in a situation where RT FX/synthesis were going to happen.
By contrast, if we get LL in place, then we can potentially use Linux
in "black box"/"embedded" systems designed specifically for audio
users; all the flexibility of Linux, but if they choose to ignore most
of that, they'll still have a black rack-mounted box capable of doing
everything (or mostly everything) currently done by dedicated
hardware. As general purpose CPU performance continues to increase,
this becomes more and more overwhelmingly obvious as the way forward
for audio processing.
....
> Do you want an operating system capable of running real-world code written by
> people who know more about their specific problem domain (audio) than about
> optimal coding in general, or do you want an operating system intended to
> only run well-behaved applications designed and implemented by experts?
I want an OS where a reasonably cluefully written audio program works. That
to me means aiming at the 1 ms latency mark. Which doesn't seem to need
pre-empt. Beyond a typical 1 ms latency you have hardware fun to worry about,
and the BIOS SMM code eating you.
I'm a fan of the fans of low latency. :)
A version of Andrew Morton's patch is in 2.4.18pre2aa2.
A test on 2.4.18pre2aa2 that lasted 57 minutes:
Simultaneously:
Loop that allocates and writes to 85% of VM with mtest01.
Create and copy 10 330 MB files.
Start nice -19 setiathome half way through.
Listen to 9 mp3s.
This is similar to something I did on 2.4.17rc2aa2, but
I added mp3blaster and setiathome this time.
Results:
Allocated 125.5 GB of VM.
All mtest01 allocations passed. (no OOM)
6.6 GB of file copying took place.
For the first 10-15 seconds of each mp3, there was some skipping
from mp3blaster. After 15 seconds, the mp3 played smoothly.
Adding setiathome to the load did not make mp3blaster skip more.
There is 880MB of RAM in machine and 1027 MB swap, so I
calculate each mtest01 iteration wrote about 680MB to swap.
6600 MB copy + (680 * 84) MB swapping = 63,720MB disk activity.
Divide by 57 minutes = 18.6 MB / second to the disk.
Comments:
From 2.4.10 - 2.4.17pre I was testing just playing mp3blaster
and running an 80% VM allocation loop. There was a lot of
improvement during that time. Adding I/O and CPU bound jobs
to the load makes the results above quite amazing to me.
There's a lot of magic in 2.4.18pre2aa2 that hasn't made it
into the other mainline kernels yet.
BTW: Some more changelog details on 2.4.18pre2aa2 are at
http://home.earthlink.net/~rwhron/kernel/2.4.18pre2aa2.html
System is Athlon 1333 with 1 40GB 7200 rpm IDE disk.
--
Randy Hron
On Wed, Jan 09, 2002 at 12:10:38AM +0000, Alan Cox wrote:
| That is generally not true. Pre-emption is used in user space to prevent
| applications doing very stupid things. Pre-emption in a trusted environment
| can often be most efficient if done by the programs themselves.
| Userspace is not a trusted environment
That's true, but at some point in the future I think the work involved
in making sure all new additional kernel code and all new intra-kernel
interactions are "tuned" becomes larger than going preemptive all the
way down.
Apple had its arguments for cooperative, along the same lines as what
you've mentioned I believe. And while I agree that the kernel is a much
_more_ trusted environment, I think the possibilities easily remain for
abuse given that there are A) more and more people contributing kernel
code every day, and B) countless unspeakably evil modules out there.
And the preempt tunability that has been mentioned sounds like it would
go a long way.
| Andrew's patches give you 1 ms worst case latency for normal situations, which
| is below human perception, and below scheduling granularity. In other words,
| without the efficiency loss and the debugging problems, you can push the
| latency far enough below other effects that it isn't worth attacking any more.
It sounds like the LL patches are easier and less prone to locking
issues with a lot of the benefit. But I can't help but feel that it's
not using the right tool for the job. I think the end result of
stabilizing a preemptive kernel (in 2.5?) is worth the price, IMHO.
--
Ken.
[email protected]
> That's true, but at some point in the future I think the work involved
> in making sure all new additional kernel code and all new intra-kernel
> interactions are "tuned" becomes larger than going preemptive all the
> way down.
It makes no difference to the kernel; the work is the same in all cases
because you cannot pre-empt while holding a lock. Therefore you have to do
all the lock analysis anyway.
On Wednesday 09 January 2002 13:57, Andrew Morton wrote:
[snip]
>
> Without LL, Linux cannot reasonably be used for professional audio
> work that involves real time FX or real time synthesis. The default
> kernel has worst-case latencies noticeably worse than Windows, and
> most people are reluctant to use that system already, not just because
> of instability but latencies also. Its not a matter of it being "a bit
> of a problem" - the 100msec worst case latencies visible in the
> standard kernel make it totally implausible that you would ever deploy
> Linux in a situation where RT FX/synthesis were going to happen.
>
> By contrast, if we get LL in place, then we can potentially use Linux
> in "black box"/"embedded" systems designed specifically for audio
> users; all the flexibility of Linux, but if they choose to ignore most
> of that, they'll still have a black rack-mounted box capable of doing
> everything (or mostly everything) currently done by dedicated
> hardware. As general purpose CPU performance continues to increase,
> this becomes more and more overwhelmingly obvious as the way forward
> for audio processing.
>
I keep on seeing these "black box" apps like TiVo using Linux, but the
fact remains that average folks cannot get any reasonable kind of A/V
performance and support under Linux. That's what we need.
Needing to save money and get some fast cash (I'm unemployed), yesterday
I swapped out the dual P-III motherboard in my BeOS box for a Via C-III
(700MHz) based system. And I got my first real hiccups while using the OS
when I was playing MP3s and _launching_ the TV program _full screen_
(640x480 on a 640x480 virtual desktop window).
Obviously, when this stuff is done right, more CPU power can only help,
but it still has to be done right. As I am sure that you know, BeOS claims
average latency of 250 microseconds.
--
[email protected].
On Wednesday 09 January 2002 09:25 pm, Alan Cox wrote:
> > Do you want an operating system capable of running real-world code
> > written by people who know more about their specific problem domain
> > (audio) than about optimal coding in general, or do you want an operating
> > system intended to only run well-behaved applications designed and
> > implemented by experts?
>
> I want an OS where a reasonably cluefully written audio program works. That
> to me means aiming at the 1mS latency mark. Which doesn't seem to be
> needing pre-empt. Beyond a typical 1mS latency you have hardware fun to
> worry about, and the BIOS SMM code eating you.
I don't know what BIOS SMM code is, or what you mean by "hardware fun". But
the worst audio dropouts I have are "cp file.wav /dev/audio" when I forgot to
kill cron and updatedb started up. (This is considerably WORSE than mp3
playing.) I take it "cp" is badly written? :)
And a sound card with only 1mS of buffer in it is definitely not usable on
windoze; the minimum buffer in the cheapest $12 PCI sound card I've seen is
about 1/4 second (250ms). (Is this what you mean by "hardware fun"?) Even
if the app was taking half that, it's still a >100ms gap where the OS
leaves it hanging before you get a dropout. (Okay, some of that's watermark
policy, not sending more data to the card until half the buffer is
exhausted...) What sound output device DOESN'T have this much cache? (You
mentioned USB speakers in your diary at one point, which seemed to be like
those old "parallel port cable plus a few resistors equals sound output"
hacks...)
Now VIDEO is a slightly more interesting problem. (Or synchronizing audio
and video by sending really tiny chunks of audio.) There's no hardware
buffer there to cover our latency sins. Then again, dropping frames is
considered normal in the video world, isn't it? :)
Rob
> I don't know what BIOS SMM code is, or what you mean by "hardware fun". But
> the worst audio dropouts I have are "cp file.wav /dev/audio" when I forgot to
> kill cron and updatedb started up. (This is considerably WORSE than mp3
> playing.) I take it "cp" is badly written? :)
Those are ones that Andrew's patch should fix nicely. You might need a
decent VM as well though.
The fun below 1mS comes from
1. APM bios calls where the bios decides to take >1mS to have
a chat with your batteries
2. Video cards pulling borderline legal PCI tricks to get
better benchmarketing by stalling the entire bus
> And a sound card with only 1mS of buffer in it is definitely not useable on
> windoze, the minimum buffer in the cheapest $12 PCI sound card I've seen is
> about 1/4 second (250ms). (Is this what you mean by "hardware fun"?) Even
For video conferencing and for real world audio mixing you can't use
that 250ms. Not even for games. If your audio is 150mS late in quake you
will notice it, really notice it. And the buffers on the audio card are
btw generally in RAM, not the FIFO on the chip, so they don't help when the
PCI bus loads up.
> exhausted...) What sound output device DOESN'T have this much cache? (You
> mentioned USB speakers in your diary at one point, which seemed to be like
> those old "paralell port cable plus a few resistors equals sound output"
> hacks...)
Umm, no, USB audio is rather good. USB sends isochronous, time-guaranteed
sample streams down the USB bus to the speakers, where the D-to-A is clear
of the machine proper.
> Now VIDEO is a slightly more interesting problem. (Or synchronizing audio
> and video by sending really tiny chunks of audio.) There's no hardware
> buffer there to cover our latency sins. Then again, dropping frames is
> considered normal in the video world, isn't it? :)
You'll see those too. Pure playback is OK because you only have to buffer
adequately rather than reliably hit deadlines.
John Alvord wrote:
>
> Incidentally, human visual perception runs around 200 milliseconds
> minimum and hearing/touch perception around 100 milliseconds if the
> signal has to go through the brain. Of course we extend our
> perceptions with tools/programs etc.
Have you ever tried to play piano with 100 ms latency from pressing the key
to sound? I can tell you that it's pretty difficult...
- Jussi Laako
--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers
Rob Landley wrote:
> And a sound card with only 1mS of buffer in it is definitely not useable on
> windoze, the minimum buffer in the cheapest $12 PCI sound card I've seen is
> about 1/4 second (250ms). (Is this what you mean by "hardware fun"?) Even
> if the app was taking half that, it's still a > 100ms big gap where the OS
> leaves it hanging before you get a dropout. (Okay, some of that's watermark
> policy, not sending more data to the card until half the buffer is
> exhausted...) What sound output device DOESN'T have this much cache?
Imagine taking an input, doing dsp-type calculations on it, and sending it back
as output. Now...imagine doing it in realtime with the output being fed back to
a monitor speaker. Think about what would happen if the output of the monitor
speaker is 1/4 second behind the input at the mike. Now do you see the
problem? A few ms of delay might be okay. A few hundred ms definitely is not.
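For a feel of the numbers involved: the delay a playback or monitoring buffer
adds is simply its size in frames divided by the sample rate. A small
illustrative program (figures chosen for the example, not measured anywhere
in this thread):

#include <stdio.h>

/* Buffering delay for a few period sizes at 44.1 kHz.  Real monitoring
 * latency also includes the converters, any DSP and the return path. */
int main(void)
{
        const double rate = 44100.0;                    /* samples/second  */
        const long frames[] = { 64, 256, 1024, 11025 }; /* 11025 ~= 250 ms */
        int i;

        for (i = 0; i < 4; i++)
                printf("%6ld frames -> %7.2f ms\n",
                       frames[i], 1000.0 * frames[i] / rate);
        return 0;
}

At 64 frames the buffer itself only adds about 1.5 ms, which is why
sub-millisecond kernel latencies start to matter for monitoring, while a
250 ms buffer is perfectly fine for pure playback.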
> Now VIDEO is a slightly more interesting problem. (Or synchronizing audio
> and video by sending really tiny chunks of audio.) There's no hardware
> buffer there to cover our latency sins. Then again, dropping frames is
> considered normal in the video world, isn't it? :)
If I'm trying to watch a DVD on my computer, and assuming my CPU is powerful
enough to decode in realtime, then I want the DVD player to take
priority--dropping frames just because I'm starting up netscape is not
acceptable.
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
On Thu, Jan 10, 2002 at 01:34:17PM -0500, Chris Friesen wrote:
> Rob Landley wrote:
>
> > And a sound card with only 1mS of buffer in it is definitely not useable on
> > windoze, the minimum buffer in the cheapest $12 PCI sound card I've seen is
> > about 1/4 second (250ms). (Is this what you mean by "hardware fun"?) Even
> > if the app was taking half that, it's still a > 100ms big gap where the OS
> > leaves it hanging before you get a dropout. (Okay, some of that's watermark
> > policy, not sending more data to the card until half the buffer is
> > exhausted...) What sound output device DOESN'T have this much cache?
>
> Imagine taking an input, doing dsp-type calculations on it, and sending it back
> as output. Now...imagine doing it in realtime with the output being fed back to
> a monitor speaker. Think about what would happen if the output of the monitor
> speaker is 1/4 second behind the input at the mike. Now do you see the
> problem? A few ms of delay might be okay. A few hundred ms definately is not.
>
> > Now VIDEO is a slightly more interesting problem. (Or synchronizing audio
> > and video by sending really tiny chunks of audio.) There's no hardware
> > buffer there to cover our latency sins. Then again, dropping frames is
> > considered normal in the video world, isn't it? :)
>
> If I'm trying to watch a DVD on my computer, and assuming my CPU is powerful
> enough to decode in realtime, then I want the DVD player to take
> priority--dropping frames just because I'm starting up netscape is not
> acceptable.
Ummm, and you couldn't consider refraining from firing up Netscape
while watching the DVD, could you?!
I get your point, but the example was poorly chosen, imho.
Regards: David Weinehall
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </
David Weinehall wrote:
>
> On Thu, Jan 10, 2002 at 01:34:17PM -0500, Chris Friesen wrote:
> > If I'm trying to watch a DVD on my computer, and assuming my CPU is powerful
> > enough to decode in realtime, then I want the DVD player to take
> > priority--dropping frames just because I'm starting up netscape is not
> > acceptable.
>
> Ummm, and you couldn't consider refraining from firing up Netscape
> while watching the DVD, could you?!
>
> I get your point, but the example was poorly chosen, imho.
I chose netscape because it is probably the largest single app that I have on my
machine. Other possibilities would be running a kernel compile, a recursive
search for specific file content through the entire filesystem, or anything else
that is likely to cause problems. It might even be someone else in the house
logged into it and running stuff over the network.
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]
Chris Friesen wrote:
>
> machine. Other possibilities would be running a kernel compile, a
> recursive search for specific file content through the entire filesystem,
> or anything else that is likely to cause problems. It might even be
> someone else in the house logged into it and running stuff over the
> network.
It's not enjoyable late-night DVD watching when updatedb fires up in the
middle of the movie. Nor when you are trying to record some audio to the disk.
Vanilla kernel really chokes up on those situations. Lowlatency patches help
a lot on this.
- Jussi Laako
--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers
In article <[email protected]> you wrote:
> Imagine taking an input, doing dsp-type calculations on it, and sending it back
> as output. Now...imagine doing it in realtime with the output being fed back to
> a monitor speaker. Think about what would happen if the output of the monitor
> speaker is 1/4 second behind the input at the mike. Now do you see the
> problem? A few ms of delay might be okay.
What kind of signal run time do you normally have in digital sound processing
equipment? AFAIK one can expect a few frames' worth of delay (n x 13ms).
Just don't feed back the processed signal to the singer's monitor box.
> If I'm trying to watch a DVD on my computer, and assuming my CPU is powerful
> enough to decode in realtime, then I want the DVD player to take
> priority--dropping frames just because I'm starting up netscape is not
> acceptable.
You do not start up netscape while you do realtime A/V processing in a
professional environment.
Well, an easy fix is to have the LL patch and not use swap. Then you only
need reliable/predictable hardware (which is not so easy to get for a PC).
Greetings
Bernd
On Thu, 10 Jan 2002, Alan Cox wrote:
> The fun below 1mS comes from
>
> 1. APM bios calls where the bios decides to take >1mS to have
> a chat with your batteries
> 2. Video cards pulling borderline legal PCI tricks to get
> better benchmarketing by stalling the entire bus
Don't forget the embedded space, where the hardware vendor can ensure
that their hardware is well-behaved. Even on a PC, it is possible for
someone who cares about realtime to spec a reasonable system.
On good hardware, we can easily do much better than 1ms latency with a
preemptible kernel and a spinlock cleanup. I don't think the
limitations of some PC hardware should limit our goals for Linux.
Nigel Gamble [email protected]
Mountain View, CA, USA. http://www.nrg.org/
Nigel Gamble wrote:
>
> On Thu, 10 Jan 2002, Alan Cox wrote:
> > The fun below 1mS comes from
> >
> > 1. APM bios calls where the bios decides to take >1mS to have
> > a chat with your batteries
> > 2. Video cards pulling borderline legal PCI tricks to get
> > better benchmarketing by stalling the entire bus
>
> Don't forget the embedded space, where the hardware vendor can ensure
> that their hardware is well-behaved. Even on a PC, it is possible for
> someone who cares about realtime to spec a reasonable system.
>
> On good hardware, we can easily do much better than 1ms latency with a
> preemptible kernel and a spinlock cleanup. I don't think the
> limitations of some PC hardware should limit our goals for Linux.
>
On 700MHz x86 running Cerberus we can do 50 microseconds average
and 1300 microseconds worst-case today.
Below 1000 uSec, the required changes get exponentially larger
and more complex. I doubt that it's sane to try to go below
a millisecond on a desktop-class machine with desktop-class
workload, disk, memory and swap capacities.
On a more constrained system, which is what I expect you're
referring to, 250 microseconds should be achievable. Whether
or not that is achieved via preemptability is pretty irrelevant.
-
> On good hardware, we can easily do much better than 1ms latency with a
> preemptible kernel and a spinlock cleanup. I don't think the
> limitations of some PC hardware should limit our goals for Linux.
It's more than a spinlock cleanup at that point. To do anything useful you have
to tackle both priority inversion and some kind of at least semi-formal
validation of the code itself. At the point it comes down to validating the
code I'd much rather validate rtlinux than the entire kernel
Alan
On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> Its more than a spinlock cleanup at that point. To do anything useful you have
> to tackle both priority inversion and some kind of at least semi-formal
> validation of the code itself. At the point it comes down to validating the
> code I'd much rather validate rtlinux than the entire kernel
The preemptible kernel plus the spinlock cleanup could really take us
far. Having looked at a lot of the long-held locks in the kernel, I am
confident at least reasonable progress could be made.
Beyond that, yah, we need a better locking construct. Priority
inversion could be solved with a priority-inheriting mutex, which we can
tackle if and when we want to go that route. Not now.
I want to lay the groundwork for a better kernel. The preempt-kernel
patch gives real-world improvements, it provides a smoother user desktop
experience -- just look at the positive feedback. Most importantly,
however, it provides a framework for superior response with our standard
kernel in its standard programming model.
Robert Love
After more testing, my original observations seem to be holding up,
except that under heavy VM load (e.g., "make -j bzImage") the machine's
overall performance seems far lower. For instance, without the patch
the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
I haven't had the patience to let it finish after more than an hour.
This is perhaps because the vmscan patch is too aggressively shrinking
the caches, or causing thrashing in another area? I'm also noticing
that the amount of swap used is nearly an order of magnitude higher,
which doesn't make sense at first glance... Also, there are extended
periods where idle CPU is 50-80%.
Maybe the patch or at least its intent can be merged with Andrea's work
if applicable?
Thanks,
--
Ken.
[email protected]
On Thu, Jan 03, 2002 at 02:23:01PM -0600, Ken Brownfield wrote:
| Unfortunately, I lost the response that basically said "2.4 looks stable
| to me", but let me count the ways in which I agree with Andreas'
| sentiment:
|
| A) VM has major issues
| 1) about a dozen recent OOPS reports in VM code
| 2) VM falls down on large-memory machines with a
| high inode count (slocate/updatedb, i/dcache)
| 3) Memory allocation failures and OOM triggers
| even though caches remain full.
| 4) Other bugs fixed in -aa and others
| B) Live- and dead-locks that I'm seeing on all 2.4 production
| machines > 2.4.9, possibly related to A. But how will I
| ever find out?
| C) IO-APIC code that requires noapic on any and all SMP
| machines that I've ever run on.
|
| I don't have anything against anyone here -- I think everyone is doing a
| fine job. It's an issue of acceptance of the problem and focus. These
| issues are all showstoppers for me, and while I don't represent the 90%
| of the Linux market that is UP desktops, IMHO future work on the kernel
| will be degraded by basic functionality that continues to cause
| problems.
|
| I think seeing some of Andrea's and Andrew's et al patches actually
| *happen* would be a good thing, since 2.4 kernels are decidedly not
| ready for production here. I am forced to apply 26 distinct patch sets
| to my kernels, and I am NOT the right person to make these judgements.
| Which is why I was interested in an LKML summary source, though I
| haven't yet had a chance to catch up on that thread of comment.
|
| Having a glitch in the radeon driver is one thing; having persistent,
| fatal, and reproducible failures in universal kernel code is entirely
| another.
|
| --
| Ken.
| [email protected]
|
|
| On Fri, Dec 28, 2001 at 09:16:38PM +0100, Andreas Hartmann wrote:
| | Hello all,
| |
| | Again, I did a rsync-operation as described in
| | "[2.4.17rc1] Swapping" MID <[email protected]>.
| |
| | This time, the kernel had a swappartition which was about 200MB. As the
| | swap-partition was fully used, the kernel killed all processes of knode.
| | Nearly 50% of RAM had been used for buffers at this moment. Why is there
| | so much memory used for buffers?
| |
| | I know I repeat it, but please:
| |
| | Fix the VM-management in kernel 2.4.x. It's unusable. Believe
| | me! As comparison: kernel 2.2.19 didn't need nearly any swap for
| | the same operation!
| |
| | Please consider that I'm using 512 MB of RAM. This should, or better:
| | must be enough to do the rsync-operation nearly without any swapping -
| | kernel 2.2.19 does it!
| |
| | The performance of kernel 2.4.18pre1 is very poor, which is no surprise,
| | because the machine swaps nearly nonstop.
| |
| |
| | Regards,
| | Andreas Hartmann
| |
> overall performance seems far lower. For instance, without the patch
> the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
please, PLEASE stop using "make -j"
for anything except the fork-bomb that it is.
pretending that it's a benchmark, especially one
to guide kernel tuning, is a travesty!
if you want to simulate VM load, do something sane like
boot with mem=32M, or a simple "mmap(lots); mlockall" tool.
regards, mark hahn.
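For reference, the kind of "mmap(lots); mlockall" tool Mark suggests might
look roughly like the sketch below (hypothetical, not a blessed benchmark;
the size comes from the command line and mlockall() needs root):

/* vmload.c - grab and lock a given number of megabytes, then sit on it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
        size_t mb = (argc > 1) ? (size_t)atol(argv[1]) : 256;
        size_t len = mb * 1024 * 1024;
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0xaa, len);           /* force the pages to be instantiated */
        if (mlockall(MCL_CURRENT))      /* pin them; requires root privileges */
                perror("mlockall");
        printf("holding %lu MB, sleeping...\n", (unsigned long)mb);
        sleep(300);
        return 0;
}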
Andrew Morton kindly pointed out that my crack pipe is dangerously empty
and I didn't specify what patch I was talking about. In my defense, I
was up all last night tracking down the ext3 bug that Andrew fixed right
under me. ;)
I replied to the wrong message, which I've pasted below. This is wrt
Martin's VM patch per the previous discussion.
Apologies,
--
Ken.
[email protected]
On Fri, Jan 11, 2002 at 02:41:17PM -0600, Ken Brownfield wrote:
| After more testing, my original observations seem to be holding up,
| except that under heavy VM load (e.g., "make -j bzImage") the machine's
| overall performance seems far lower. For instance, without the patch
| the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
| I haven't had the patience to let it finish after more than an hour.
|
| This is perhaps because the vmscan patch is too aggressively shrinking
| the caches, or causing thrashing in another area? I'm also noticing
| that the amount of swap used is nearly an order of magnitude higher,
| which doesn't make sense at first glance... Also, there are extended
| periods where idle CPU is 50-80%.
|
| Maybe the patch or at least its intent can be merged with Andrea's work
| if applicable?
|
| Thanks,
| --
| Ken.
| [email protected]
What I SHOULD have replied to:
| Date: Tue, 8 Jan 2002 09:19:57 -0600
| From: Ken Brownfield <[email protected]>
| To: Stephan von Krawczynski <[email protected]>,
| "M.H.VanLeeuwen" <[email protected]>, [email protected]
| Cc: [email protected]
| Subject: Update Re: [2.4.17/18pre] VM and swap - it's really unusable
| User-Agent: Mutt/1.2.5.1i
| In-Reply-To: <[email protected]>; from [email protected]
| on Sat, Jan 05, 2002 at 04:08:33PM +0100
| Precedence: bulk
| X-Mailing-List: [email protected]
|
| I stayed at work all night banging out tests on a few of our machines
| here. I took 2.4.18-pre2 and 2.4.18-pre2 with the vmscan patch from
| "M.H.VanLeeuwen" <[email protected]>.
|
| My sustained test consisted of this type of load:
|
| ls -lR / > /dev/null &
| /usr/bin/slocate -u -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net" &
| dd if=/dev/sda3 of=/sda3 bs=1024k &
| # Hit TUX on this machine repeatedly; html page with 1000 images
| # Wait for memory to be mostly used by buff/page cache
| ./a.out &
| # repeat finished commands -- keep all commands running
| # after a.out finishes, allow buff/page to refill before repeating
|
| The a.out in this case is a little program (attached, c.c) to allocate
| and write to an amount of memory equal to physical RAM. The example I
| chose below is from a 2xP3/600 with 1GB of RAM and 2GB swap.
|
| This was not a formal benchmark -- I think benchmarks have been
| presented before by other folks, and looking at benchmarks does not
| necessarily indicate the real-world problems that exist. My intent was
| to reproduce the issues I've been seeing, and then apply the MH (and
| only the MH) patch and observe.
|
| 2.4.18-pre2
|
| Once slocate starts and gets close to filling RAM with buffer/page
| cache, kupdated and kswapd have periodic spikes of 50-100% CPU.
|
| When a.out starts, kswapd and kupdated begin to eat significant portions
| of CPU (20-100%) and I/O becomes more and more sluggish as a.out
| allocates.
|
| When a.out uses all free RAM and should begin eating cache, significant
| swapping begins and cache is not decreased significantly until the
| machine goes 100-200MB into swap.
|
| Here are two readprofile outputs, sorted by ticks and load.
|
| 229689 default_idle 4417.0962
| 4794 file_read_actor 18.4385
| 405 __rdtsc_delay 14.4643
| 3763 do_anonymous_page 14.0410
| 3796 statm_pgd_range 9.7835
| 1535 prune_icache 6.9773
| 153 __free_pages 4.7812
| 1420 create_bounce 4.1765
| 583 sym53c8xx_intr 3.9392
| 221 atomic_dec_and_lock 2.7625
| 5214 generic_file_write 2.5659
|
| 273464 total 0.1903
| 234168 default_idle 4503.2308
| 5298 generic_file_write 2.6073
| 4868 file_read_actor 18.7231
| 3799 statm_pgd_range 9.7912
| 3763 do_anonymous_page 14.0410
| 1535 prune_icache 6.9773
| 1526 shrink_cache 1.6234
| 1469 create_bounce 4.3206
| 643 rmqueue 1.1320
| 591 sym53c8xx_intr 3.9932
| 505 __make_request 0.2902
|
|
| 2.4.18-pre2 with MH
|
| With the MH patch applied, the issues I witnessed above did not seem to
| reproduce. Memory allocation under pressure seemed faster and smoother.
| kswapd never went above 5-15% CPU. When a.out allocated memory, it did
| not begin swapping until buffer/page cache had been nearly completely
| cannibalized.
|
| And when a.out caused swapping, it was controlled and behaved like you
| would expect the VM to behave -- slowly swapping out unused pages
| instead of large swap write-outs without the patch.
|
| Martin, have you done throughput benchmarks with MH/rmap/aa, BTW?
|
| But both kernels still seem to be sluggish when it comes to doing small
| I/O operations (vi, ls, etc) during heavy swapping activity.
|
| Here are the readprofile results:
|
| 206243 default_idle 3966.2115
| 6486 file_read_actor 24.9462
| 409 __rdtsc_delay 14.6071
| 2798 do_anonymous_page 10.4403
| 185 __free_pages 5.7812
| 1846 statm_pgd_range 4.7577
| 469 sym53c8xx_intr 3.1689
| 176 atomic_dec_and_lock 2.2000
| 349 end_buffer_io_async 1.9830
| 492 refill_inactive 1.8358
| 94 system_call 1.8077
|
| 245776 total 0.1710
| 216238 default_idle 4158.4231
| 6486 file_read_actor 24.9462
| 2799 do_anonymous_page 10.4440
| 1855 statm_pgd_range 4.7809
| 1611 generic_file_write 0.7928
| 839 __make_request 0.4822
| 820 shrink_cache 0.7374
| 540 rmqueue 0.9507
| 534 create_bounce 1.5706
| 492 refill_inactive 1.8358
| 487 sym53c8xx_intr 3.2905
|
|
| There may be significant differences in the profile outputs for those
| with VM fu.
|
| Summary: MH swaps _after_ cache has been properly cannibalized, and
| swapping activity starts when expected and is properly throttled.
| kswapd and kupdated don't seem to go into berserk 100% CPU mode.
|
| At any rate, I now have the MH patch (and Andrew Morton's mini-ll and
| read-latency2 patches) in production, and I like what I see so far. I'd
| vote for them to go into 2.4.18, IMHO. Maybe the full low-latency patch
| if it's not truly 2.5 material.
|
| My next cook-off will be with -aa and rmap, although if the rather small
| MH patch fixes my last issues it may be worth putting all VM effort into
| a 2.5 VM cook-off. :) Hopefully the useful stuff in -aa can get pulled
| in at some point soon, though.
|
| Thanks much to Martin H. VanLeeuwen for his patch and Stephan von
| Krawczynski for his recommendations. I'll let MH cook for a while and
| I'll follow up later.
| --
| Ken.
| [email protected]
|
| c.c:
|
| /* c.c: touch an amount of memory equal to physical RAM (here 1024 MB) */
| #include <stdio.h>
| #include <stdlib.h>
| #include <string.h>
| #include <unistd.h>
|
| #define MB_OF_RAM 1024
|
| int
| main()
| {
| 	long stuffsize = MB_OF_RAM * 1048576L ;
| 	char *stuff ;
|
| 	if ( ( stuff = (char *)malloc( stuffsize ) ) != NULL ) {
| 		long chunksize = 1048576 ;
| 		long c ;
|
| 		/* dirty the first chunk, then replicate it across the rest */
| 		for ( c=0 ; c<chunksize ; c++ )
| 			*(stuff+c) = '\0' ;
| 		/* hack; last chunk discarded if stuffsize%chunksize != 0 */
| 		for ( ; (c+chunksize)<stuffsize ; c+=chunksize )
| 			memcpy( stuff+c, stuff, chunksize );
|
| 		sleep( 120 );	/* hold the memory so the VM has to deal with it */
| 	}
| 	else
| 		printf("OOPS\n");
|
| 	exit( 0 );
| }
|
|
| On Sat, Jan 05, 2002 at 04:08:33PM +0100, Stephan von Krawczynski wrote:
| [...]
| | I am pretty impressed by Martin's test case where nearly all VM patches fail
| | with the exception of his own :-) The thing is, this test is not of nature
| | "very special" but more like "system driven to limit by normal processes". And
| | this is the real interesting part about it.
| [...]
|
I don't think I made the claim that this was a benchmark -- I certainly
realize that "make -j bzImage" is not real-world, but it is clearly
indicative of heavy VM/CPU/context load. Since I don't believe this
patch is currently in the running for inclusion, I'm just giving general
feedback to the patch author rather than making a case.
For instance, "make -j bzImage" reproduced the ext3 bug that Andrew
found where my other VM-intensive apps did not. I doubt we should keep
the bug in the kernel because the situation isn't real-world enough.
But yes, a bug is worse than a behavior flaw, granted.
--
Ken.
[email protected]
On Fri, Jan 11, 2002 at 04:13:00PM -0500, Mark Hahn wrote:
| > overall performance seems far lower. For instance, without the patch
| > the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
|
| please, PLEASE stop using "make -j"
| for anything except the fork-bomb that it is.
| pretending that it's a benchmark, especially one
| to guide kernel tuning, is a travesty!
|
| if you want to simulate VM load, so something sane like
| boot with mem=32M, or a simple "mmap(lots); mlockall" tool.
|
| regards, mark hahn.
|
On Fri, 11 Jan 2002, Mark Hahn wrote:
> > overall performance seems far lower. For instance, without the patch
> > the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
>
> please, PLEASE stop using "make -j"
> for anything except the fork-bomb that it is.
> pretending that it's a benchmark, especially one
> to guide kernel tuning, is a travesty!
Actually, it's as good a benchmark as any. Knowing
how well the system is able to recover from heavy
overload situations is useful if your
server gets heavily overloaded at times.
If one VM falls over horribly under half the load
it takes to make another VM go slower, I know which
one I'd want on my server.
> if you want to simulate VM load, so something sane like
> boot with mem=32M, or a simple "mmap(lots); mlockall" tool.
... and then you come up with something WAY less
realistic than 'make -j' ;)))
cheers,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
Ken Brownfield wrote:
>
> After more testing, my original observations seem to be holding up,
> except that under heavy VM load (e.g., "make -j bzImage") the machine's
> overall performance seems far lower. For instance, without the patch
> the -j build finishes in ~10 minutes (2x933P3/256MB) but with the patch
> I haven't had the patience to let it finish after more than an hour.
>
> This is perhaps because the vmscan patch is too aggressively shrinking
> the caches, or causing thrashing in another area? I'm also noticing
> that the amount of swap used is nearly an order of magnitude higher,
> which doesn't make sense at first glance... Also, there are extended
> periods where idle CPU is 50-80%.
>
> Maybe the patch or at least its intent can be merged with Andrea's work
> if applicable?
>
> Thanks,
> --
> Ken.
> [email protected]
>
Ken,
Attached is an update to my previous vmscan.patch.2.4.17.c
Version "d" fixes a BUG due to a race in the old code _and_
is much less aggressive at cache shrinkage or, conversely, more
willing to swap out, though not as much as the stock kernel.
It continues to work well wrt high vm pressure.
Give it a whirl to see if it changes your "-j" symptoms.
If you like, you can change the one line in the patch from "DEF_PRIORITY"
(which is 6) to progressively smaller values to "tune" whatever kind of
swap_out behaviour you like.
Martin
On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> The preemptible kernel plus the spinlock cleanup could really take us
> far. Having locked at a lot of the long-held locks in the kernel, I am
> confident at least reasonable progress could be made.
>
> Beyond that, yah, we need a better locking construct. Priority
> inversion could be solved with a priority-inheriting mutex, which we can
> tackle if and when we want to go that route. Not now.
Backing the car up to the edge of the cliff really gives us
good results. Beyond that, we could jump off the cliff
if we want to go that route.
Preempt leads to inheritance and inheritance leads to disaster.
All the numbers I've seen show Morton's low latency just works better. Are
there other numbers I should look at?
On Friday 11 January 2002 09:50 pm, [email protected] wrote:
> On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> > On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> > The preemptible kernel plus the spinlock cleanup could really take us
> > far. Having locked at a lot of the long-held locks in the kernel, I am
> > confident at least reasonable progress could be made.
> >
> > Beyond that, yah, we need a better locking construct. Priority
> > inversion could be solved with a priority-inheriting mutex, which we can
> > tackle if and when we want to go that route. Not now.
>
> Backing the car up to the edge of the cliff really gives us
> good results. Beyond that, we could jump off the cliff
> if we want to go that route.
> Preempt leads to inheritance and inheritance leads to disaster.
If preempt leads to disaster then Linux can't do SMP. Are you saying that's
the case?
The preempt patch is really "SMP on UP". If pre-empt shows up a problem,
then it's a problem SMP users will see too. If we can't take advantage of
the existing SMP locking infrastructure to improve latency and interactive
feel on UP machines, then SMP for Linux DOES NOT WORK.
> All the numbers I've seen show Morton's low latency just works better. Are
> there other numbers I should look at.
This approach is basically a collection of heuristics. The kernel has been
profiled and everywhere a latency spike was found, a band-aid was put on it
(an explicit scheduling point). This doesn't say there aren't other latency
spikes, just that with the collection of hardware and software being
benchmarked, the latency spikes that were found have each had a band-aid
individually applied to them.
This isn't a BAD thing. If the benchmarks used to find latency spikes are at
all like real-world use, then it helps real-world applications. But of
COURSE the benchmarks are going to look good, since tuning the kernel to
those benchmarks is the way the patch was developed!
The majority of the original low latency scheduling point work is handled
automatically by the SMP on UP kernel. You don't NEED to insert scheduling
points anywhere you aren't inside a spinlock. So the SMP on UP patch makes
most of the explicit scheduling point patch go away, accomplishing the same
thing in a less intrusive manner. (Yes, it makes all kernels act like SMP
kernels for debugging purposes. But you can turn it off for debugging if you
want to, that's just another toggle in the magic sysreq menu. And this isn't
entirely a bad thing: applying the enormous UP userbase to the remaining SMP
bugs is bound to squeeze out one or two more obscure ones, but those bugs DO
exist already on SMP.)
However, what's left of the explicit scheduling work is still very useful.
When you ARE inside a spinlock, you can't just schedule, you have to save
state, drop the lock(s), schedule, re-acquire the locks, and reload your
state in case somebody else diddled with the structures you were using. This
is a lot harder than just scheduling, but breaking up long-held locks like
this helps SMP scalability, AND helps latency in the SMP-on-UP case.
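A hedged sketch of the pattern being described, in the style of the
low-latency/lock-break patches; the lock, list and helper names are invented
for illustration:

/* Illustrative only: break up a long-held lock by yielding when
 * someone else needs the CPU, then revalidating our state. */
spin_lock(&some_list_lock);
while (!list_empty(&work_list)) {
        process_one_entry(&work_list);
        if (current->need_resched) {
                spin_unlock(&some_list_lock);   /* drop the lock  */
                schedule();                     /* let others run */
                spin_lock(&some_list_lock);     /* reacquire...   */
                /* ...and note that the list may have changed underneath
                 * us while the lock was dropped, so the loop condition
                 * is re-checked rather than trusting saved state. */
        }
}
spin_unlock(&some_list_lock);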
So the best approach is a combination of the two patches. SMP-on-UP for
everything outside of spinlocks, and then manually yielding locks that cause
problems. Both Robert Love and Andrew Morton have come out in favor of each
other's patches on lkml just in the past few days. The patches work together
quite well, and each wants to see the other's patch applied.
Rob
On Fri, Jan 11, 2002 at 03:22:08PM -0500, Rob Landley wrote:
> I preempt leads to disaster than Linux can't do SMP. Are you saying that's
> the case?
>
> The preempt patch is really "SMP on UP". If pre-empt shows up a problem,
People keep repeating this, it must feel reassuring.
/* in kernel mode: does it need a lock? */
m = next_free_page_from_per_cpu_cache();
To start, preemptive means that the optimizations for SMP that reduce
locking by per cpu localization do not work.
So, as I understand it:
the preempt patch is really "crappy SMP on UP"
may be correct. But what you wrote is not.
Did I miss something? That's not a rhetorical question - I recall
being wrong before so go ahead and explain what's wrong with my logic.
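Spelling out the hazard in the two-line example above: under kernel
preemption a task can lose the CPU in the middle of touching per-CPU data,
and another task can then run on the same CPU and touch the same structure,
so the lock-free per-CPU trick needs an explicit preemption guard. A
hypothetical sketch, all names invented:

/* Hypothetical per-CPU free-page cache.  Safe without a lock on a
 * non-preemptive kernel because only code running on CPU N touches
 * cache[N] and interrupts never use it.  Under kernel preemption the
 * preempt_disable()/preempt_enable() pair (or a lock) becomes
 * necessary -- which is the extra cost being argued about here. */
struct pcp_page { struct pcp_page *next; /* ... */ };
struct pcp_cache { struct pcp_page *head; } cache[NR_CPUS];

struct pcp_page *next_free_page_from_per_cpu_cache(void)
{
        struct pcp_page *page;

        preempt_disable();      /* no-op on a non-preemptive kernel */
        page = cache[smp_processor_id()].head;
        if (page)
                cache[smp_processor_id()].head = page->next;
        preempt_enable();
        return page;
}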
> then it's a problem SMP users will see too. If we can't take advantage of
> the existing SMP locking infrastructure to improve latency and interactive
> feel on UP machines, than SMP for linux DOES NOT WORK.
>
> > All the numbers I've seen show Morton's low latency just works better. Are
> > there other numbers I should look at.
>
> This approach is basically a collection of heuristics. The kernel has been
One patch makes the numbers look good (sort of)
One patch does not, but "improves feel" and breaks an exceptionally useful
rule: per-cpu data in the kernel that is not touched by interrupt code does
not need to be locked.
The basic assumption of the preempt trick is that locking for SMP is
based on the same principles as locking for preemption and that's completely
false.
I believe that the preempt path leads inexorably to
mutex-with-stupid-priority-trick and that would be very unfortunate indeed.
It's unavoidable because sooner or later someone will find that preempt +
SCHED_FIFO leads to
niced app 1 in K mode gets Sem A
SCHED_FIFO app preempts and blocks on Sem A
whoops! app 2 in K mode preempts niced app 1
Hey, my DVD player has stalled, let's add sem_with_revolting_priority_trick!
Why the hell is UP Windows XP3 blowing away my Linux box on DVD playing while
Linux now runs with the grace and speed of IRIX?
And has anyone fixed all those mysterious hangs caused by the interesting
interaction of hundreds of preempted semaphores?
> profiled and everywhere a latency spike was found, a band-aid was put on it
> (an explicit scheduling point). This doesn't say there aren't other latency
> spikes, just that with the collection of hardware and software being
> benchmarked, the latency spikes that were found have each had a band-aid
> individually applied to them.
>
> This isn't a BAD thing. If the benchmarks used to find latency spikes are at
> all like real-world use, then it helps real-world applications. But of
> COURSE the benchmarks are going to look good, since tuning the kernel to
> those benchmarks is the way the patch was developed!
>
> The majority of the original low latency scheduling point work is handled
> automatically by the SMP on UP kernel. You don't NEED to insert scheduling
> points anywhere you aren't inside a spinlock. So the SMP on UP patch makes
> most of the explicit scheduling point patch go away, accomplishing the same
> thing in a less intrusive manner. (Yes, it makes all kernels act like SMP
> kernels for debugging purposes. But you can turn it off for debugging if you
> want to, that's just another toggle in the magic sysreq menu. And this isn't
> entirely a bad thing: applying the enormous UP userbase to the remaining SMP
> bugs is bound to squeeze out one or two more obscure ones, but those bugs DO
> exist already on SMP.)
This is the logic of _every_ variant of "let's put X windows in the kernel
and let the kernel hackers fix it". Konqueror crashed when I used it
yesterday. Let's put it in the kernel too and apply that enormous UP
userbase to the remaining bugs.
>
> However, what's left of the explicit scheduling work is still very useful.
> When you ARE inside a spinlock, you can't just schedule, you have to save
> state, drop the lock(s), schedule, re-acquire the locks, and reload your
> state in case somebody else diddled with the structures you were using. This
> is a lot harder than just scheduling, but breaking up long-held locks like
> this helps SMP scalability, AND helps latency in the SMP-on-UP case.
>
> So the best approach is a combination of the two patches. SMP-on-UP for
> everything outside of spinlocks, and then manually yielding locks that cause
> problems. Both Robert Love and Andrew Morton have come out in favor of each
> other's patches on lkml just in the past few days. The patches work together
> quite well, and each wants to see the other's patch applied.
>
> Rob
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
Rob Landley wrote:
>
> On Friday 11 January 2002 09:50 pm, [email protected] wrote:
> > On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> > > On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> > > The preemptible kernel plus the spinlock cleanup could really take us
> > > far. Having locked at a lot of the long-held locks in the kernel, I am
> > > confident at least reasonable progress could be made.
> > >
> > > Beyond that, yah, we need a better locking construct. Priority
> > > inversion could be solved with a priority-inheriting mutex, which we can
> > > tackle if and when we want to go that route. Not now.
> >
> > Backing the car up to the edge of the cliff really gives us
> > good results. Beyond that, we could jump off the cliff
> > if we want to go that route.
> > Preempt leads to inheritance and inheritance leads to disaster.
>
> I preempt leads to disaster than Linux can't do SMP. Are you saying that's
> the case?
Victor is referring to priority inheritance, to solve priority inversion.
Priority inheritance seems undesirable for Linux - these applications are
already in the minority. A realtime application on Linux should simply
avoid complex system calls which can lead to blockage on a SCHED_OTHER
thread.
If the app is well-designed, the only place in which it is likely to
be unexpectedly blocked inside the kernel is in the page allocator.
My approach to this problem is to cause non-SCHED_OTHER processes
to perform atomic (non-blocking) memory allocations, with a fallback
to non-atomic.
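A rough sketch of the allocation strategy described above, using standard
2.4 allocator flags; this is illustrative, not Andrew's actual code:

/* Try a non-blocking allocation first for realtime tasks, and only
 * fall back to a normal (possibly blocking) allocation if that fails. */
struct page *rt_friendly_alloc(void)
{
        struct page *page;

        if (current->policy != SCHED_OTHER) {
                page = alloc_pages(GFP_ATOMIC, 0);      /* won't sleep */
                if (page)
                        return page;
        }
        return alloc_pages(GFP_KERNEL, 0);              /* may sleep   */
}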
> The preempt patch is really "SMP on UP". If pre-empt shows up a problem,
> then it's a problem SMP users will see too. If we can't take advantage of
> the existing SMP locking infrastructure to improve latency and interactive
> feel on UP machines, than SMP for linux DOES NOT WORK.
>
> > All the numbers I've seen show Morton's low latency just works better. Are
> > there other numbers I should look at.
>
> This approach is basically a collection of heuristics. The kernel has been
> profiled and everywhere a latency spike was found, a band-aid was put on it
> (an explicit scheduling point). This doesn't say there aren't other latency
> spikes, just that with the collection of hardware and software being
> benchmarked, the latency spikes that were found have each had a band-aid
> individually applied to them.
The preempt patch needs all this as well.
> This isn't a BAD thing. If the benchmarks used to find latency spikes are at
> all like real-world use, then it helps real-world applications. But of
> COURSE the benchmarks are going to look good, since tuning the kernel to
> those benchmarks is the way the patch was developed!
>
> The majority of the original low latency scheduling point work is handled
> automatically by the SMP on UP kernel.
No it is not.
The preempt code only obsoletes a handful of the low-latency patch's
resceduling. The most trivial ones. generic_file_read, generic_file_write
and a couple of /proc functions.
Of the sixty or so rescheduling points in the low-latency patch, about
fifty are inside locks. Many of these are just lock_kernel(); about
half are not.
> You don't NEED to insert scheduling
> points anywhere you aren't inside a spinlock.
I know of only four or five places in the kernel where large amounts of
time are spent in unlocked code. All the other problem areas are inside locks.
> So the SMP on UP patch makes
> most of the explicit scheduling point patch go away,
s/most/a trivial minority/
> accomplishing the same
> thing in a less intrusive manner.
s/less/more/
> (Yes, it makes all kernels act like SMP
> kernels for debugging purposes. But you can turn it off for debugging if you
> want to, that's just another toggle in the magic sysreq menu. And this isn't
> entirely a bad thing: applying the enormous UP userbase to the remaining SMP
> bugs is bound to squeeze out one or two more obscure ones, but those bugs DO
> exist already on SMP.)
Saying "it's a config option" is a cop-out. The kernel developers should
be aiming at producing a piece of software which can be shrink-wrap
deployed to millions of people.
Arguably, enabling it on UP and disabling it on SMP may be a sensible
approach, merely because SMP tends to map onto applications which
do not require lower latencies.
> However, what's left of the explicit scheduling work is still very useful.
> When you ARE inside a spinlock, you can't just schedule, you have to save
> state, drop the lock(s), schedule, re-acquire the locks, and reload your
> state in case somebody else diddled with the structures you were using. This
> is a lot harder than just scheduling, but breaking up long-held locks like
> this helps SMP scalability, AND helps latency in the SMP-on-UP case.
Yes, it _may_ help SMP scalability. But a better approach is to replace
spinlocks with rwlocks when a lock is found to have this access pattern.
> So the best approach is a combination of the two patches. SMP-on-UP for
> everything outside of spinlocks, and then manually yielding locks that cause
> problems.
Well the ideal approach is to simply make the long-running locked code
faster, by better choice of algorithm and data structure. Unfortunately,
in the majority of cases, this isn't possible.
-
On Fri, 2002-01-11 at 15:22, Rob Landley wrote:
> So the best approach is a combination of the two patches. SMP-on-UP for
> everything outside of spinlocks, and then manually yielding locks that cause
> problems. Both Robert Love and Andrew Morton have come out in favor of each
> other's patches on lkml just in the past few days. The patches work together
> quite well, and each wants to see the other's patch applied.
Right. Here is what I want for 2.5 as a _general_ step towards a better
kernel that will yield better performance:
Merge the preemptible kernel patch. A version is now out for
2.5.2-pre11 with support for Ingo's scheduler:
ftp://ftp.kernel.org/pub/linux/kernel/people/rml/preempt-kernel
Next, make available a tool for profiling kernel latencies. I have one
available now, preempt-stats, at the above url. Andrew has some
excellent tools available at his website, too. Something like this
could even be merged. Daniel Phillips suggested a passive tool on IRC.
Preempt-stats works like this. It is off-by-default and, when enabled,
measures time between lock and unlock, reporting the top 20 worst-cases.
Begin working on the worst-case locks. Solutions like Andrew's
low-latency and my lock-break are a start. Better (at least in general)
solutions are to analyze the locks. Localize them; make them finer
grained. Analyze the algorithms. Find the big problems. Anyone look
at the tty layer lately? Ugh. Using the preemptive kernel as a base
and the analysis of the locks as a list of culprits, clean this cruft
up. This would benefit SMP, too. Perhaps a better locking construct is
useful.
The immediate result is good; the future is better.
Robert Love
In article <20020112042404.WCSI23959.femail47.sdc1.sfba.home.com@there> you wrote:
> The preempt patch is really "SMP on UP".
Ok I've seen this misconception quite a lot now. THIS IS NOT TRUE. For one,
constructs that are ok on SMP are not automatically ok with the -preempt
patch, like per-cpu data. And there's a LOT more of that than you think.
Basically with preempt you change the locking rules from under all existing
code. Most will work, even more will appear to work as preemption isn't
an event THAT common (by this I mean the chance of getting preempted in your
4 lines of C code where you have per cpu data).
Also, once you add locks around the per-cpu data, for the core code it
might be close to smp. For drivers it's not though. Drivers assume that when
they do
outb(foo,bar);
outb(foo2,bar2);
that those happen "close" to each other in time. Especially in initialisation
paths (where the driver thread is the only thread that can see the
datastructures/device) there are no spinlocks held, so preempt can trigger
here. Sure, in the current situation you can get an interrupt, but the Linux
interrupt delay is not more than, say, 1ms, while a schedule-out can take a
second or two easily. Do we know all devices can stand such delays?
I dare to say we don't as the hardware requirements currently aren't coded
in the drivers.
Add to that that there's no actual benefit of -preempt over the -lowlat
patch latency wise (you REALLY need to combine them or -preempt sucks raw
eggs for latency)....
Greetings,
Arjan van de Ven
On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
>
> > Its more than a spinlock cleanup at that point. To do anything useful you have
> > to tackle both priority inversion and some kind of at least semi-formal
> > validation of the code itself. At the point it comes down to validating the
> > code I'd much rather validate rtlinux than the entire kernel
>
> The preemptible kernel plus the spinlock cleanup could really take us
> far. Having locked at a lot of the long-held locks in the kernel, I am
> confident at least reasonable progress could be made.
>
> Beyond that, yah, we need a better locking construct. Priority
> inversion could be solved with a priority-inheriting mutex, which we can
> tackle if and when we want to go that route. Not now.
>
> I want to lay the groundwork for a better kernel. The preempt-kernel
> patch gives real-world improvements, it provides a smoother user desktop
> experience -- just look at the positive feedback. Most importantly,
> however, it provides a framework for superior response with our standard
I don't know how to tell you: positive feedback compared to the mainline
kernel is totally irrelevant; mainline has broken read/write/sendfile
syscalls that can hang the machine, etc. That was fixed ages ago in
many ways, and the current way is very lightweight. If you can get positive
feedback compared to -aa, _that_ will matter.
Andrea
Hi,
[email protected] wrote:
> I believe that the preempt path leads inexorably to
> mutex-with-stupid-priority-trick and that would be very unfortunate indeed.
> It's unavoidable because sooner or later someone will find that preempt +
> SCHED_FIFO leads to
> niced app 1 in K mode gets Sem A
> SCHED_FIFO app prempts and blocks on Sem A
> whoops! app 2 in K more preempts niced app 1
Please explain what's different without the preempt patch.
> Hey my DVD player has stalled, lets add sem_with_revolting_priority_trick!
> Why the hell is UP Windows XP3 blowing away my Linux box on DVD playing while
> Linux now runs with the grace and speed of IRIX?
Because the IRIX implementation sucks, every implementation has to suck?
Somehow I have the suspicion you're trying to discourage everyone from
even trying, because if they succeeded you'd lose a big chunk of
potential RTLinux customers.
bye, Roman
On Sat, Jan 12, 2002 at 12:53:06PM +0100, Roman Zippel wrote:
> Hi,
>
> [email protected] wrote:
>
> > I believe that the preempt path leads inexorably to
> > mutex-with-stupid-priority-trick and that would be very unfortunate indeed.
> > It's unavoidable because sooner or later someone will find that preempt +
> > SCHED_FIFO leads to
> > niced app 1 in K mode gets Sem A
> > SCHED_FIFO app prempts and blocks on Sem A
> > whoops! app 2 in K more preempts niced app 1
>
> Please explain what's different without the preempt patch.
See that "preempt" in line 2. Linux does not
preempt kernel mode processes otherwise. The beauty of the
non-preemptive kernel is that "in K mode every process makes progress"
and even the "niced app" will complete its use of SemA and
release it in one run. If you have a reasonably fair scheduler you
can make very useful analysis with Linux now of the form
Under 50 active processes in the system means that in every
2 second interval every process
will get at least 10ms of time to run.
That's a very valuable property and it goes away in a preemptive kernel
to get you something vague.
>
> > Hey my DVD player has stalled, lets add sem_with_revolting_priority_trick!
> > Why the hell is UP Windows XP3 blowing away my Linux box on DVD playing while
> > Linux now runs with the grace and speed of IRIX?
>
> Because the IRIX implementation sucks, every implementation has to suck?
> Somehow I have the suspicion you're trying to discourage everyone from
> even trying, because if he'd succeeded you'd loose a big chunk of
> potential RTLinux customers.
So your argument is that I'm advocating Andrew Morton's patch which
reduces latencies more than the preempt patch because I have a
financial interest in not reducing latencies? Subtle.
In any case, motive has no bearing on a technical argument.
Your motive could be to make the 68K look better by reducing
performance on other processors for all I know.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Sat, Jan 12, 2002 at 01:01:39AM -0500, Robert Love wrote:
> could even be merged. Daniel Phillips suggested a passive tool on IRC.
> Preempt-stats works like this. It is off-by-default and, when enabled,
> measures time between lock and unlock, reporting the top 20 worst-cases.
I think one of the problems with this entire debate is lack of meaningful
numbers. Not for the first time, I propose that you test with something
that tests application benefits instead of internal numbers that may not
mean anything. For example, there is a simple test
/* user code */
get time.
count = 200*3600; /* one hour */
while(count--){
read cycle timer
clock_nanosleep(5 milliseconds)
read cycle timer
compute actual delay and difference from 5 milliseconds
store the worst case
}
get time.
printf("After one hour the worst deviation is %d clock ticks\n",worst);
printf("This was supposed to take one hour and it took %d", compute_elapsed());
>
> Begin working on the worst-case locks. Solutions like Andrew's
> low-latency and my lock-break are a start. Better (at least in general)
> solutions are to analyze the locks. Localize them; make them finer
> grained. Analyze the algorithms. Find the big problems. Anyone look
The theory that "fine grained = better" is not proved. It's obvious that
"fine grained = more time spent in the overhead of locking and unlocking locks and
potentially more time spent in lock contention
and lots more opportunities of cache ping-pong in real smp
and much harder to debug"
But the performance gain that is supposed to balance that is often elusive.
> at the tty layer lately? Ugh. Using the preemptive kernel as a base
> and the analysis of the locks as a list of culprits, clean this cruft
> up. This would benefit SMP, too. Perhaps a better locking construct is
> useful.
>
> The immediate result is good; the future is better.
Removing synchronization by removing contention
is better engineering than fiddling about with synchronization
primitives, but it is much harder.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
Hi,
[email protected] wrote:
> > > SCHED_FIFO leads to
> > > niced app 1 in K mode gets Sem A
> > > SCHED_FIFO app prempts and blocks on Sem A
> > > whoops! app 2 in K more preempts niced app 1
> >
> > Please explain what's different without the preempt patch.
>
> See that "preempt" in line 2 . Linux does not
> preempt kernel mode processes otherwise. The beauty of the
> non-preemptive kernel is that "in K mode every process makes progress"
> and even the "niced app" will complete its use of SemA and
> release it in one run.
The point of using semaphores is that one can sleep while holding them,
whether this is forced by preemption or voluntary makes no difference.
> If you have a reasonably fair scheduler you
> can make very useful analysis with Linux now of the form
>
> Under 50 active proceses in the system means that in every
> 2 second interval every process
> will get at least 10ms of time to run.
>
> That's a very valuable property and it goes away in a preemptive kernel
> to get you something vague.
How is that changed? AFAIK inserting more schedule points does not
change the behaviour of the scheduler. The niced app will still get its
time.
> So your argument is that I'm advocating Andrew Morton's patch which
> reduces latencies more than the preempt patch because I have a
> financial interest in not reducing latencies? Subtle.
Andrew's patch requires constant auditing, and Andrew can't audit all
drivers for possible problems. That doesn't mean Andrew's work is
wasted, since it identifies problems which preemption can't solve, but
it will always be a hunt for the worst cases, whereas preemption goes for
the general case.
> In any case, motive has no bearing on a technical argument.
> Your motive could be to make the 68K look better by reducing
> performance on other processors for all I know.
I am more than busy enough keeping it running (together with the few others
who are left), and more importantly, I make no money off it.
bye, Roman
On Sat, Jan 12, 2002 at 02:25:03PM +0100, Roman Zippel wrote:
> Hi,
>
> [email protected] wrote:
>
> > > > SCHED_FIFO leads to
> > > > niced app 1 in K mode gets Sem A
> > > > SCHED_FIFO app preempts and blocks on Sem A
> > > > whoops! app 2 in K mode preempts niced app 1
> > >
> > > Please explain what's different without the preempt patch.
> >
> > See that "preempt" in line 2 . Linux does not
> > preempt kernel mode processes otherwise. The beauty of the
> > non-preemptive kernel is that "in K mode every process makes progress"
> > and even the "niced app" will complete its use of SemA and
> > release it in one run.
>
> The point of using semaphores is that one can sleep while holding them,
> whether this is forced by preemption or voluntary makes no difference.
No. The point of using semaphores is that one can sleep while
_waiting_ for the resource. Sleeping while holding semaphores is
a different kettle of lampreys entirely.
And it makes a very big difference.
A:
get sem on memory pool
do something horrible to pool
release sem on memory pool
In a preemptive kernel this can cause a deadlock. In a non-preemptive
kernel it cannot. You are correct in that
B:
get sem on memory pool
do potentially blocking operations
release sem
is also dangerous - but I don't think that helps your case.
To fix B, we can enforce a coding rule - one of the reasons why
we have all those atomic ops in the kernel is to be able to
avoid this problem.
To fix A in a preemptive kernel we need to start messing about with
priorities and that's a major error.
"The current kernel has too many places where processes
can sleep while holding semaphores so we should always have the
potential of blocking with held semaphores" is, to me, a backwards
argument.
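Spelled out with the 2.4-era semaphore primitives, case A is roughly the
following (an illustrative sketch only; the pool and function names are
invented):

#include <linux/sched.h>
#include <asm/semaphore.h>      /* down()/up(), DECLARE_MUTEX() in 2.4 */

static DECLARE_MUTEX(pool_sem); /* guards a hypothetical memory pool */

static void pool_do_something_horrible(void)
{
	down(&pool_sem);
	/* manipulate the pool - no blocking calls inside this window,
	 * so on a non-preemptive UP kernel the holder always runs
	 * straight through to up() once it gets the CPU. */
	up(&pool_sem);
}

/*
 * With kernel preemption, a SCHED_FIFO task can preempt the holder here,
 * block in down(&pool_sem), and then other runnable high-priority tasks
 * keep the low-priority holder off the CPU - the inversion scenario
 * quoted at the top of this thread.
 */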
> > If you have a reasonably fair scheduler you
> > can make very useful analysis with Linux now of the form
> >
> > Under 50 active processes in the system means that in every
> > 2 second interval every process
> > will get at least 10ms of time to run.
> >
> > That's a very valuable property and it goes away in a preemptive kernel
> > to get you something vague.
>
> How is that changed? AFAIK inserting more schedule points does not
> change the behaviour of the scheduler. The niced app will still get its
> time.
How many times can an app be preempted? In a non-preempt kernel
it can be preempted during user mode at timer frequency and no more,
and it cannot be preempted during kernel mode. So
while(1){
read mpeg data
process
write bitmap
}
Assuming Andrew does not get too ambitious about read/write granularity, once this
process is scheduled on a non-preempt system it will always make progress. The
non-preempt kernel says, "your kernel request will complete - if we have resources".
A preempt kernel says: "well, if nobody more important activates, you get some time."
Now you do the analysis based on the computation of "goodness" to show that there is
a bound on preemption count during an execution of this process. I don't want to
have to think that hard.
Let's suppose the Gnome desktop constantly creates and
destroys fresh i/o-bound tasks to do something. So with the old-fashioned non-preempt
kernel (ignoring Andrew) we get
wait no more than 1 second
I'm scheduled and start a read
wait no more than one second
I'm scheduled and in user mode for at least 10milliseconds
wait no more than 1 second
I'm scheduled and do my write
...
with preempt we get
wait no more than 1 second
I'm scheduled and start a read
I'm preempted
read not done
come back for 2 microseconds
preempted again
haven't issued the damn read request yet
ok a miracle happens, I finish the read request
go to usermode and an interrupt happens
well it would be stupid to have a goodness
function in a preempt kernel that lets a low
priority task finish its time slice so preempt
...
>
> > So your argument is that I'm advocating Andrew Morton's patch which
> > reduces latencies more than the preempt patch because I have a
> > financial interest in not reducing latencies? Subtle.
>
> Andrew's patch requires constant auditing, and Andrew can't audit all
> drivers for possible problems. That doesn't mean Andrew's work is
> wasted, since it identifies problems which preemption can't solve, but
> it will always be a hunt for the worst cases, whereas preemption goes for
> the general case.
the preempt patch requires constant auditing too - and more complex auditing.
After all, a missed audit in Andrew's patch will simply increase worst-case timing.
A missed audit in preempt will hang the system.
>
> > In any case, motive has no bearing on a technical argument.
> > Your motive could be to make the 68K look better by reducing
> > performance on other processors for all I know.
>
> I am more than busy enough keeping it running (together with the few others
> who are left), and more importantly, I make no money off it.
Come on! First of all, you are causing me a great deal of pain by making
me struggle not to make some bad joke about the economics of Linux companies.
More important, not making money has nothing to do with purity of motivation -
don't you read this list?
And how do I know that you haven't got a stockpile of 68K boards that may
be worth big money once it's known that 68K linux is at the top of the heap?
Much less plausible money-making schemes have been tried.
Seriously: for our business, a Linux kernel that can reliably run at millisecond-level
latencies is only good. If you could get a Linux kernel to run at
latencies of 100 microseconds worst case on a 486, I'd be a little more
worried, but even then ...
On an 800MHz Athlon, RTLinux scheduling jitter is 17 microseconds worst case right now.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Sat, Jan 12, 2002 at 12:13:15PM +0100, Andrea Arcangeli wrote:
> On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> > On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> >
> > > Its more than a spinlock cleanup at that point. To do anything useful you have
> > > to tackle both priority inversion and some kind of at least semi-formal
> > > validation of the code itself. At the point it comes down to validating the
> > > code I'd much rather validate rtlinux than the entire kernel
> >
> > The preemptible kernel plus the spinlock cleanup could really take us
> > far. Having looked at a lot of the long-held locks in the kernel, I am
> > confident at least reasonable progress could be made.
> >
> > Beyond that, yah, we need a better locking construct. Priority
> > inversion could be solved with a priority-inheriting mutex, which we can
> > tackle if and when we want to go that route. Not now.
> >
> > I want to lay the groundwork for a better kernel. The preempt-kernel
> > patch gives real-world improvements, it provides a smoother user desktop
> > experience -- just look at the positive feedback. Most importantly,
> > however, it provides a framework for superior response with our standard
>
> I don't know how to tell you this: positive feedback compared to the mainline
> kernel is totally irrelevant; mainline has broken read/write/sendfile
> syscalls that can hang the machine, etc. That was fixed ages ago in
> many ways (the current way is very lightweight). If you can get positive
> feedback compared to -aa, _that_ will matter.
Hello Andrea,
I did my usual compile tests (untar the kernel archive, apply patches,
make -j<value>, ...).
Here are some results (Wall time + Percent cpu) for each of the consecutive five runs:
13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp
j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83%
j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83%
j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87%
j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87%
j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86%
j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93%
j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88%
j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91%
j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91%
j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88%
* build incomplete (OOM killer killed several cc1 ... )
So far 2.4.13-pre5aa1 had been the king of the block in compile times.
But this has changed. Now the (by far) fastest kernel is 2.4.18-pre
+ Ingo's scheduler patch (s) + the preemptive patch (p). I did not test the
preemptive patch alone so far, since I don't know if the one I have
applies cleanly against -pre3 without Ingo's patch. I used the
following patches:
s: sched-O1-2.4.17-H6.patch
p: preempt-kernel-rml-2.4.18-pre3-ingo-1.patch
I hope this info is useful to someone.
Kind regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> On Sat, Jan 12, 2002 at 12:13:15PM +0100, Andrea Arcangeli wrote:
> > On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> > > On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> > >
> > > > Its more than a spinlock cleanup at that point. To do anything useful you have
> > > > to tackle both priority inversion and some kind of at least semi-formal
> > > > validation of the code itself. At the point it comes down to validating the
> > > > code I'd much rather validate rtlinux than the entire kernel
> > >
> > > The preemptible kernel plus the spinlock cleanup could really take us
> > > far. Having looked at a lot of the long-held locks in the kernel, I am
> > > confident at least reasonable progress could be made.
> > >
> > > Beyond that, yah, we need a better locking construct. Priority
> > > inversion could be solved with a priority-inheriting mutex, which we can
> > > tackle if and when we want to go that route. Not now.
> > >
> > > I want to lay the groundwork for a better kernel. The preempt-kernel
> > > patch gives real-world improvements, it provides a smoother user desktop
> > > experience -- just look at the positive feedback. Most importantly,
> > > however, it provides a framework for superior response with our standard
> >
> > I don't know how to tell you this: positive feedback compared to the mainline
> > kernel is totally irrelevant; mainline has broken read/write/sendfile
> > syscalls that can hang the machine, etc. That was fixed ages ago in
> > many ways (the current way is very lightweight). If you can get positive
> > feedback compared to -aa, _that_ will matter.
>
> Hello Andrea,
>
> I did my usual compile testings (untar kernel archive, apply patches,
> make -j<value> ...
>
> Here are some results (Wall time + Percent cpu) for each of the consecutive five runs:
>
> 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp
> j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83%
> j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83%
> j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87%
> j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87%
> j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86%
>
> j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93%
> j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88%
> j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91%
> j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91%
> j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88%
>
> * build incomplete (OOM killer killed several cc1 ... )
>
> So far 2.4.13-pre5aa1 had been the king of the block in compile times.
> But this has changed. Now the (by far) fastest kernel is 2.4.18-pre
> + Ingos scheduler patch (s) + preemptive patch (p). I did not test
> preemptive patch alone so far since I don't know if the one I have
> applies cleanly against -pre3 without Ingos patch. I used the
> following patches:
>
> s: sched-O1-2.4.17-H6.patch
> p: preempt-kernel-rml-2.4.18-pre3-ingo-1.patch
>
> I hope this info is useful to someone.
the improvement of "sp" compared to "s" is quite visible, not sure how
can a little different time spent in kernel make such a difference on
the final numbers, also given compilation is mostly an userspace task, I
assume you were swapping out or running out of cache at the very least,
right?
btw, I'd be curious if you could repeat the same test with -j1 or -j2?
(actually real world)
Still the other numbers remains interesting for a trashing machine, but
a few percent difference with a trashing box isn't a big difference, vm
changes can infulence those numbers more than any preempt or scheduler
number (of course if my guess that you're swapping out is really right :).
I guess "p" helps because we simply miss some schedule point in some vm
routine. Hints?
Andrea
On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> I did my usual compile testings (untar kernel archive, apply patches,
> make -j<value> ...
If I understand your test,
you are testing different loads - you are compiling kernels that may differ
in size and makefile organization, not to mention different layout on the
file system and disk.
What happens when you do the same test, compiling one kernel under multiple
different kernels?
On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > I did my usual compile testings (untar kernel archive, apply patches,
> > make -j<value> ...
>
> If I understand your test,
> you are testing different loads - you are compiling kernels that may differ
> in size and makefile organization, not to mention different layout on the
> file system and disk.
Ouch, I had indeed assumed this wasn't the case.
>
> What happens when you do the same test, compiling one kernel under multiple
> different kernels?
Andrea
Hi,
[email protected] wrote:
> No. The point of using semaphores is that one can sleep while
> _waiting_ for the resource.
> [...]
> In a preemptive kernel this can cause a deadlock. In a non
> preemptive it cannot. You are correct in that
> B:
> get sem on memory pool
> do potentially blocking operations
> release sem
> is also dangerous - but I don't think that helps your case.
> To fix B, we can enforce a coding rule - one of the reasons why
> we have all those atomic ops in the kernel is to be able to
> avoid this problem.
Sorry, I can't follow you. First, one can sleep while waiting for the
semaphore _and_ while holding it. Second, we use atomic ops (e.g. for
resource management) exactly because they are not protected by any
semaphore/spinlock.
> Let's suppose the Gnome desktop constantly creates and
> destroys new fresh i/o bound tasks to do something. So with the old fashioned non
> preempt (ignoring Andrew) we get
> [...]
There is no priority problem! If there is a more important task to run,
the less important one simply has to wait, but it will still get its
time. Your deadlock situation does not exist. The average time a
process has to wait for a lower-priority process might be increased, but
the worst-case behaviour is still the same.
The problem that does exist is the coarse time slice accounting, which
is easier to exploit with the preempt kernel, but it's not a new
problem. On the other hand it's a solvable problem, which requires no
priority inversion.
> > Andrew's patch requires constant auditing, and Andrew can't audit all
> > drivers for possible problems. That doesn't mean Andrew's work is
> > wasted, since it identifies problems which preemption can't solve, but
> > it will always be a hunt for the worst cases, whereas preemption goes for
> > the general case.
>
> the preempt requires constant auditing too - and more complex auditing.
> After all, a missed audit in Andrew will simply increase worst case timing.
> A missed audit in preempt will hang the system.
As long as the scheduler isn't changed, this isn't true, and as I said,
there are latency problems which preemption can't solve, but it will
automatically take care of the rest.
> Come on! First of all, you are causing me a great deal of pain by making
> me struggle not to make some bad joke about the economics of Linux companies.
Feel free, I'm not a big believer in the economics of software companies
in general, anyway.
> More important, not making money has nothing to do with purity of motivation -
> don't you read this list?
Everyone has their own motivation and I do respect that, but I'm getting
suspicious as soon as money is involved. If people disagree, they can
still get along nicely and do their thing independently. But if they
have to make a living by getting a share of a cake, it usually only
works as long as there is enough cake, otherwise it can get nasty very
quickly (and usually there is never enough cake).
bye, Roman
Andrew Morton wrote:
>
> Priority inheritance seems undesirable for Linux - these applications are
> already in the minority. A realtime application on Linux should simply
> avoid complex system calls which can lead to blockage on a SCHED_OTHER
> thread.
I think it's very common to have a SCHED_FIFO thread communicating with
various other processes through a pipe/fifo/socket or some other IPC
mechanism.
It would be great to have priority inheritance, where a process receiving data
through a fifo from a SCHED_FIFO process would have a raised priority for the
duration of the transfer (see QNX's priority-inheriting message queues). Too bad
we don't have message queues, so that we could have send/receive/reply-time
priority inheritance.
So we could have
Process 1 at SCHED_FIFO sending data to two processes.
Process 2 at SCHED_FIFO receiving data from process 1.
Process 3 at SCHED_OTHER receiving data from process 1.
Process 4 at SCHED_OTHER sending data to process 5.
Process 5 at SCHED_OTHER receiving data from process 4.
And (2) would get the data first from (1), and then (3). And if (1) starts
sending data to (2), the system would immediately start running (1/2) and even
pre-empt the ongoing system call of (3). Also, (1/3) would take over/pre-empt
(4/5), because (3) inherits its priority from the sending process (1).
If this is currently _not_ done, I think it's very strange.
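For what it's worth, POSIX does define a priority-inheritance protocol for
mutexes (not for pipes or message queues, which is what is being asked for
here). A minimal userspace sketch of that primitive - purely illustrative,
and not something the stock 2.4 kernel provided at the time:

#include <stdio.h>
#include <pthread.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;

	pthread_mutexattr_init(&attr);
	/* A low-priority holder is boosted to the priority of the
	 * highest-priority thread blocked on the mutex. */
	if (pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) != 0)
		fprintf(stderr, "PTHREAD_PRIO_INHERIT not supported here\n");
	pthread_mutex_init(&lock, &attr);

	pthread_mutex_lock(&lock);
	/* ... critical section shared with a SCHED_FIFO thread ... */
	pthread_mutex_unlock(&lock);

	pthread_mutex_destroy(&lock);
	pthread_mutexattr_destroy(&attr);
	return 0;
}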
But I think I have misunderstood the whole point of the original message... :)
- Jussi Laako
--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers
> > Hey my DVD player has stalled, lets add sem_with_revolting_priority_trick!
> > Why the hell is UP Windows XP3 blowing away my Linux box on DVD playing while
> > Linux now runs with the grace and speed of IRIX?
>
> Because the IRIX implementation sucks, every implementation has to suck?
> Somehow I have the suspicion you're trying to discourage everyone from
> even trying, because if he'd succeeded you'd lose a big chunk of
> potential RTLinux customers.
Victor has had the same message for years, as have others like Larry McVoy
(in fact, if Larry and Victor agree on something, it's unusual enough to
remember). So I can vouch for the fact Victor hasn't changed his tune from
before rtlinux was ever any real commercial toy. I think you owe him an
apology.
Now rtlinux and low latency in the main kernel are two different things. One
gives you effectively a small embedded system to program for which talks
to Linux. From that you draw extremely reliable behaviour and very bounded
delay times. Its small enough you can validate it too
RtLinux isn't going to help you one bit when it comes to smooth movie playback
because DVD playback is dependent on the Linux file system layers and a
whole pile of other code. Low-latency does this quite nicely, and it takes
you to the point where hardware becomes the biggest latency cause for the
general case. Pre-empt doesn't buy you anything more. You can spend a
millisecond locked in an I/O instruction to an irritating device.
Alan
Another example is in the network drivers. The 8390 core, for one example,
carefully disables an IRQ on the card so that it can avoid spinlocking on
uniprocessor boxes.
So with pre-empt this happens
driver magic
disable_irq(dev->irq)
PRE-EMPT:
[large periods of time running other code]
PRE-EMPT:
We get back and we've missed 300 packets, the serial port sharing
the IRQ has dropped our internet connection completely.
["Don't do that then" isnt a valid answer here. If I did hold a lock
it would be for several milliseconds at a time anyway and would reliably
trash performance this time]
There are numerous other examples in the kernel tree where the current code
knows that there is a small bounded time between two actions in kernel space
that do not have a sleep. They are not spin locked, and putting spin locks
everywhere will just trash performance. They are pure hardware interactions
so you can't automatically detect them.
That is why the pre-empt code is a much much bigger problem and task than the
low latency code.
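The pattern being described is roughly the following (an illustrative sketch,
not the actual 8390 code; the function name is made up, and header locations
vary between kernel versions):

#include <linux/interrupt.h>    /* disable_irq()/enable_irq() */
#include <linux/netdevice.h>

static void nic_touch_registers(struct net_device *dev)
{
	disable_irq(dev->irq);  /* mask only this card's interrupt line */
	/*
	 * Poke the chip.  The driver assumes this window lasts a few
	 * microseconds.  Under kernel preemption another task can run
	 * here for many milliseconds while the line stays masked, so
	 * the card - and a serial port sharing the IRQ - is starved.
	 */
	enable_irq(dev->irq);
}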
Alan
> Right. Here is what I want for 2.5 as a _general_ step towards a better
> kernel that will yield better performance:
I see absolutely _no_ evidence to support this repeated claim. I'm still
waiting to see any evidence that low latency patches are not sufficient, or
an explanation of who is going to fix all the drivers you break in subtle
ways
> On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> > On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > > I did my usual compile testings (untar kernel archive, apply patches,
> > > make -j<value> ...
> >
> > If I understand your test,
> > you are testing different loads - you are compiling kernels that may differ
> > in size and makefile organization, not to mention different layout on the
> > file system and disk.
Can someone tell me why we're "testing" the preempt kernel by running
make -j on a build? What exactly is this going to show us? The only thing
I can think of is showing us that throughput is not damaged when you want to
run single apps by using preempt. You don't get to see the effects of the
kernel preemption because all the damn thing is doing is preempting itself.
If you want to test the preempt kernel you're going to need something that
can find the mean latency or "time to action" for a particular program or
all programs being run at the time, and then run multiple programs that you
would find on various people's systems. That is the "feel" people talk
about when they praise the preempt patch. make -j'ing something and not
testing anything else but that will show you nothing important except "does
throughput get screwed by the preempt patch." Perhaps checking the
latencies on a common program on people's systems like mozilla or konqueror
while doing a 'make -j N bzImage' would be a better idea.
> Ouch, I assumed this wasn't the case indeed.
>
> >
> > What happens when you do the same test, compiling one kernel under multiple
> > different kernels?
>
> Andrea
You should _always_ use the same kernel tree at the same point each time you
rerun the test under a different kernel. Always make clean before rebooting
to the next kernel. Setting up the test bed should be pretty straightforward:
make sure the build tree is clean, then make dep it. Reboot to
the next kernel. Load up mozilla but nothing else (mozilla should be
modified a bit to display the time it takes to do certain functions such as
displaying drop-down menus, loading, and opening a new window. Also, you should
make the homepage something on the drive, or blank). Start make -j 4
bzImage, then load mozilla (no other gnome/gtk libraries, or having them
loaded via running gnome, doesn't matter, just as long as it's the same each
time). Mozilla should then output the times it takes to do certain things, and
that should give you a good idea of how the preempt patch is performing,
assuming everything is running at the same priority and your memory isn't
being maxed out and your hdd isn't eating the majority of the CPU time.
But I really think make -j'ing and only testing that, or reporting those
numbers, is a complete waste of time if you're trying to look at the
preempt patch's performance. I like using mozilla in this example because
it's a big, bulky app that most people have (kde users possibly excluded)
where an improvement in latency or "time to action" is actually important to
people, and can't be easily ignored.
Well, those are just my two cents. I'd do it myself but I'm waiting for
hardware to replace the broken crap I have now. But if nobody has done it
by then, I'll set that up.
-formerly safemode
> Another example is in the network drivers. The 8390 core for one example
> carefully disables an IRQ on the card so that it can avoid spinlocking on
> uniprocessor boxes.
>
> So with pre-empt this happens
>
> driver magic
> disable_irq(dev->irq)
> PRE-EMPT:
> [large periods of time running other code]
> PRE-EMPT:
> We get back and we've missed 300 packets, the serial port sharing
> the IRQ has dropped our internet connection completely.
>
> ["Don't do that then" isnt a valid answer here. If I did hold a lock
> it would be for several milliseconds at a time anyway and would reliably
> trash performance this time]
> There are numerous other examples in the kernel tree where the current code
> knows that there is a small bounded time between two actions in kernel space
> that do not have a sleep. They are not spin locked, and putting spin locks
> everywhere will just trash performance. They are pure hardware interactions
> so you can't automatically detect them.
Hardware-to-hardware interactions could have a higher priority than normal programs being
run. That way they're not preempted by simple programs; they would have to
be purposely preempted by the user.
> That is why the pre-empt code is a much much bigger problem and task than the
> low latency code.
Lowering the latency - sure, the low latency code probably does nearly as well
as the preempt patch. That's fine. Shortening the time locks are held by
better code can help to a certain extent (unless a lot of the kernel code is
poorly written, which I doubt). In its present state, though, my idea to
fix the kernel would be to give the parts of the kernel where locks are taken,
and that shouldn't normally be broken into, higher priorities. That way we can
distinguish between locks that are safe to preempt at and the ones that can do harm.
But those people who require their app to be treated specially can run it
at -20 and preempt everything. To me that makes sense. Is there a
reason why it doesn't, besides aesthetics? The only way the aesthetic-argument
people are going to be pleased is if the kernel is designed from
the ground up to be better latency- and lock-wise. A lot of people would
like to not have to wait until then.
On Sat, 2002-01-12 at 13:54, Alan Cox wrote:
> Another example is in the network drivers. The 8390 core for one example
> carefully disables an IRQ on the card so that it can avoid spinlocking on
> uniprocessor boxes.
>
> So with pre-empt this happens
>
> driver magic
> disable_irq(dev->irq)
> PRE-EMPT:
> [large periods of time running other code]
> PRE-EMPT:
> We get back and we've missed 300 packets, the serial port sharing
> the IRQ has dropped our internet connection completely.
We don't preempt while IRQs are disabled.
Robert Love
On Sat, Jan 12, 2002 at 06:48:28PM +0100, Roman Zippel wrote:
> Hi,
>
> [email protected] wrote:
>
> > No. The point of using semaphores is that one can sleep while
> > _waiting_ for the resource.
> > [...]
> > In a preemptive kernel this can cause a deadlock. In a non
> > preemptive it cannot. You are correct in that
> > B:
> > get sem on memory pool
> > do potentially blocking operations
> > release sem
> > is also dangerous - but I don't think that helps your case.
> > To fix B, we can enforce a coding rule - one of the reasons why
> > we have all those atomic ops in the kernel is to be able to
> > avoid this problem.
>
> Sorry I can't follow you. First, one can sleep while waiting for the
We're having a write-only discussion - time to stop.
On Sat, Jan 12, 2002 at 02:23:00PM -0500, Ed Sweetman wrote:
> hardware to hardware could have a higher priority than normal programs being
> run. That way they're not preempted by simple programs, it would have to
> be purposely preempted by the user.
Priority is currently, and sensibly, by process. A process may run user
code, do sys-calls, or field interrupts both soft and hard. Now do you want to
adjust the priority at every transition?
> Lowering the latency, sure the low latency code probably does nearly as well
> as the preempt patch. that's fine. Shortening the time locks are held by
> better code can help to a certain extent (unless a lot of the kernel code is
> poorly written, which i doubt). at it's present state though, my idea to
> fix the kernel would be to give parts of the kernel where locks are made,
"Fix" what? What is the objective of your fix?
> that shouldn't be broken normally, higher priorities. That way we can
> distinguish between safe locks to preempt at and the ones that can do harm.
> But those people who require their app to be treated special can run it
> with -20 and preempt everything. To me that makes sense. Is there a
So:
get semaphore on slab memory and raise priority
get preempted by "treated special" app that then
does an operation on the slab queues
Is that what you want?
> reason why it doesn't? Besides ethstetics. the only way the ethsetic
It doesn't work? Is that a sufficient reason?
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Sat, Jan 12, 2002 at 02:26:27PM -0500, Robert Love wrote:
> On Sat, 2002-01-12 at 13:54, Alan Cox wrote:
> > Another example is in the network drivers. The 8390 core for one example
> > carefully disables an IRQ on the card so that it can avoid spinlocking on
> > uniprocessor boxes.
> >
> > So with pre-empt this happens
> >
> > driver magic
> > disable_irq(dev->irq)
> > PRE-EMPT:
> > [large periods of time running other code]
> > PRE-EMPT:
> > We get back and we've missed 300 packets, the serial port sharing
> > the IRQ has dropped our internet connection completely.
>
> We don't preempt while IRQ are disabled.
You read the mask map? And somehow figure out which masked IRQs correspond to
active devices?
>
> Robert Love
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
> > PRE-EMPT:
> > We get back and we've missed 300 packets, the serial port sharing
> > the IRQ has dropped our internet connection completely.
>
> We don't preempt while IRQ are disabled.
I must have missed that in the code. I can see you check __cli() status but
I didn't see anywhere that you check disable_irq(). Even if you did, it doesn't
help when I mask the irq on the chip rather than using disable_irq() calls.
Alan
> hardware to hardware could have a higher priority than normal programs being
> run. That way they're not preempted by simple programs, it would have to
> be purposely preempted by the user.
How do you know they are there? How do you detect the situation, or do you
plan to audit every driver?
> Lowering the latency, sure the low latency code probably does nearly as well
> as the preempt patch. that's fine. Shortening the time locks are held by
Not nearly as well. In the tests I've seen it runs _better_ than just pre-empt,
and pre-empt + low latency is the same as pure low latency - 1ms.
Alan
On Sat, 2002-01-12 at 15:07, Alan Cox wrote:
> > We don't preempt while IRQ are disabled.
>
> I must have missed that in the code. I can see you check __cli() status but
> I didn't see anywhere you check disable_irq(). Even if you did it doesnt
> help when I mask the irq on the chip rather than using disable_irq() calls.
Well, if IRQs are disabled we won't have the timer... would not the
system panic anyhow if schedule() was called while in an interrupt
handler?
Robert Love
> > I didn't see anywhere you check disable_irq(). Even if you did it doesnt
> > help when I mask the irq on the chip rather than using disable_irq() calls.
>
> Well, if IRQs are disabled we won't have the timer... would not the
> system panic anyhow if schedule() was called while in an interrupt
> handler?
You completely misunderstand.
disable_irq(n)
I disable a single specific interrupt, I don't disable the timer interrupt.
Your code doesn't seem to handle that. It's just one of the examples of where
you really need priority handling, and that's a horrible, dark and slippery
slope.
Alan
Roman Zippel wrote:
>
> Andrew's patch requires constant auditing, and Andrew can't audit all
> drivers for possible problems. That doesn't mean Andrew's work is
> wasted, since it identifies problems which preemption can't solve, but
> it will always be a hunt for the worst cases, whereas preemption goes for
> the general case.
Guys,
I've heard this so many times, and it just ain't so. The overwhelming
majority of problem areas are inside locks. All the complexity and
maintainability difficulties to which you refer exist in the preempt
patch as well. There just is no difference.
Ed Sweetman wrote:
>
> If you want to test the preempt kernel you're going to need something that
> can find the mean latancy or "time to action" for a particular program or
> all programs being run at the time and then run multiple programs that you
> would find on various peoples' systems. That is the "feel" people talk
> about when they praise the preempt patch.
Right. And that is precisely why I created the "mini-ll" patch. To
give the improved "feel" in a way which is acceptable for merging into
the 2.4 kernel.
And guess what? Nobody has tested the damn thing, so it's going
nowhere.
Here it is again:
--- linux-2.4.18-pre3/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Jan 12 12:22:29 2002
@@ -249,12 +249,19 @@ static int wait_for_buffers(kdev_t dev,
struct buffer_head * next;
int nr;
- next = lru_list[index];
nr = nr_buffers_type[index];
+repeat:
+ next = lru_list[index];
while (next && --nr >= 0) {
struct buffer_head *bh = next;
next = bh->b_next_free;
+ if (dev == NODEV && current->need_resched) {
+ spin_unlock(&lru_list_lock);
+ conditional_schedule();
+ spin_lock(&lru_list_lock);
+ goto repeat;
+ }
if (!buffer_locked(bh)) {
if (refile)
__refile_buffer(bh);
@@ -1174,8 +1181,10 @@ struct buffer_head * bread(kdev_t dev, i
bh = getblk(dev, block, size);
touch_buffer(bh);
- if (buffer_uptodate(bh))
+ if (buffer_uptodate(bh)) {
+ conditional_schedule();
return bh;
+ }
ll_rw_block(READ, 1, &bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
--- linux-2.4.18-pre3/fs/dcache.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/dcache.c Sat Jan 12 12:22:29 2002
@@ -71,7 +71,7 @@ static inline void d_free(struct dentry
* d_iput() operation if defined.
* Called with dcache_lock held, drops it.
*/
-static inline void dentry_iput(struct dentry * dentry)
+static void dentry_iput(struct dentry * dentry)
{
struct inode *inode = dentry->d_inode;
if (inode) {
@@ -84,6 +84,7 @@ static inline void dentry_iput(struct de
iput(inode);
} else
spin_unlock(&dcache_lock);
+ conditional_schedule();
}
/*
--- linux-2.4.18-pre3/fs/jbd/commit.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/jbd/commit.c Sat Jan 12 12:22:29 2002
@@ -212,6 +212,16 @@ write_out_data_locked:
__journal_remove_journal_head(bh);
refile_buffer(bh);
__brelse(bh);
+ if (current->need_resched) {
+ if (commit_transaction->t_sync_datalist)
+ commit_transaction->t_sync_datalist =
+ next_jh;
+ if (bufs)
+ break;
+ spin_unlock(&journal_datalist_lock);
+ conditional_schedule();
+ goto write_out_data;
+ }
}
}
if (bufs == ARRAY_SIZE(wbuf)) {
--- linux-2.4.18-pre3/fs/proc/array.c Thu Oct 11 09:00:01 2001
+++ linux-akpm/fs/proc/array.c Sat Jan 12 12:22:29 2002
@@ -415,6 +415,8 @@ static inline void statm_pte_range(pmd_t
pte_t page = *pte;
struct page *ptpage;
+ conditional_schedule();
+
address += PAGE_SIZE;
pte++;
if (pte_none(page))
--- linux-2.4.18-pre3/fs/proc/generic.c Fri Sep 7 10:53:59 2001
+++ linux-akpm/fs/proc/generic.c Sat Jan 12 12:22:29 2002
@@ -98,7 +98,9 @@ proc_file_read(struct file * file, char
retval = n;
break;
}
-
+
+ conditional_schedule();
+
/* This is a hack to allow mangling of file pos independent
* of actual bytes read. Simply place the data at page,
* return the bytes, and set `start' to the desired offset
--- linux-2.4.18-pre3/include/linux/condsched.h Thu Jan 1 00:00:00 1970
+++ linux-akpm/include/linux/condsched.h Sat Jan 12 12:22:29 2002
@@ -0,0 +1,18 @@
+#ifndef _LINUX_CONDSCHED_H
+#define _LINUX_CONDSCHED_H
+
+#ifndef __LINUX_COMPILER_H
+#include <linux/compiler.h>
+#endif
+
+#ifndef __ASSEMBLY__
+#define conditional_schedule() \
+do { \
+ if (unlikely(current->need_resched)) { \
+ __set_current_state(TASK_RUNNING); \
+ schedule(); \
+ } \
+} while(0)
+#endif
+
+#endif
--- linux-2.4.18-pre3/include/linux/sched.h Fri Dec 21 11:19:23 2001
+++ linux-akpm/include/linux/sched.h Sat Jan 12 12:22:29 2002
@@ -13,6 +13,7 @@ extern unsigned long event;
#include <linux/times.h>
#include <linux/timex.h>
#include <linux/rbtree.h>
+#include <linux/condsched.h>
#include <asm/system.h>
#include <asm/semaphore.h>
--- linux-2.4.18-pre3/mm/filemap.c Thu Jan 10 13:39:50 2002
+++ linux-akpm/mm/filemap.c Sat Jan 12 12:22:29 2002
@@ -296,10 +296,7 @@ static int truncate_list_pages(struct li
page_cache_release(page);
- if (current->need_resched) {
- __set_current_state(TASK_RUNNING);
- schedule();
- }
+ conditional_schedule();
spin_lock(&pagecache_lock);
goto restart;
@@ -609,6 +606,7 @@ void filemap_fdatasync(struct address_sp
UnlockPage(page);
page_cache_release(page);
+ conditional_schedule();
spin_lock(&pagecache_lock);
}
spin_unlock(&pagecache_lock);
@@ -1392,6 +1390,9 @@ page_ok:
offset &= ~PAGE_CACHE_MASK;
page_cache_release(page);
+
+ conditional_schedule();
+
if (ret == nr && desc->count)
continue;
break;
@@ -3025,6 +3026,8 @@ unlock:
SetPageReferenced(page);
UnlockPage(page);
page_cache_release(page);
+
+ conditional_schedule();
if (status < 0)
break;
--- linux-2.4.18-pre3/drivers/block/ll_rw_blk.c Thu Jan 10 13:39:49 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Sat Jan 12 12:22:29 2002
@@ -917,6 +917,7 @@ void submit_bh(int rw, struct buffer_hea
kstat.pgpgin += count;
break;
}
+ conditional_schedule();
}
/**
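The recurring pattern in the hunks above is: notice need_resched, drop the
spinlock, call conditional_schedule(), reacquire, and restart the scan. In
outline (hypothetical lock and helper functions, following the
wait_for_buffers() hunk):

#include <linux/spinlock.h>
#include <linux/sched.h>        /* current->need_resched; pulls in condsched.h above */

extern spinlock_t some_list_lock;               /* hypothetical */
extern int more_entries_to_process(void);       /* hypothetical */
extern void process_one_entry(void);            /* hypothetical */

static void walk_list_preemptibly(void)
{
	spin_lock(&some_list_lock);
restart:
	while (more_entries_to_process()) {
		process_one_entry();
		if (current->need_resched) {
			spin_unlock(&some_list_lock);
			conditional_schedule();
			spin_lock(&some_list_lock);
			goto restart;   /* the list may have changed meanwhile */
		}
	}
	spin_unlock(&some_list_lock);
}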
Alan Cox wrote:
> > > PRE-EMPT:
> > > We get back and we've missed 300 packets, the serial port sharing
> > > the IRQ has dropped our internet connection completely.
> >
> > We don't preempt while IRQ are disabled.
>
> I must have missed that in the code. I can see you check __cli() status but
> I didn't see anywhere you check disable_irq(). Even if you did it doesnt
> help when I mask the irq on the chip rather than using disable_irq() calls.
>
> Alan
But you get interrupted by other interrupts then, so you have the same problem
regardless of any preemption patch; you hopefully lose the CPU for a much
shorter time, but it's still the same problem.
Hi,
Alan Cox wrote:
> > Because the IRIX implementation sucks, every implementation has to suck?
> > Somehow I have the suspicion you're trying to discourage everyone from
> > even trying, because if he'd succeeded you'd loose a big chunk of
> > potential RTLinux customers.
>
> Victor has had the same message for years, as have others like Larry McVoy
> (in fact if Larry and Victor agree on something its unusual enough to
> remember). So I can vouch for the fact Victor hasn't changed his tune from
> before rtlinux was ever any real commercial toy. I think you owe him an
> apology.
Did I really say something that bad? I would actually be surprised if
Victor didn't act in the best interest of his company. The other
possibility is that Victor must have had such a terrible experience with
IRIX that he thinks any attempt to add better soft-realtime or even
hard-realtime capabilities (not just as an add-on) must be doomed to fail.
> RtLinux isn't going to help you one bit when it comes to smooth movie playback
> because the DVD playback is dependant on the Linux file system layers and a
> whole pile of other code. Low-latency does this quite nicely, and it takes
> you to the point where hardware becomes the biggest latency cause for the
> general case. Pre-empt doesn't buy you anything more. You can spend a
> millisecond locked in an I/O instruction to an irritating device.
Preemption doesn't, of course, solve every problem. It's mainly useful to
get an event as fast as possible from kernel to user space. This can be
the mouse click or the buffer your process is waiting for. Latencies can
quickly add up here to a noticeable amount.
bye, Roman
Hi,
Alan Cox wrote:
> So with pre-empt this happens
>
> driver magic
> disable_irq(dev->irq)
> PRE-EMPT:
> [large periods of time running other code]
> PRE-EMPT:
> We get back and we've missed 300 packets, the serial port sharing
> the IRQ has dropped our internet connection completely.
But it shouldn't deadlock as Victor is suggesting.
> There are numerous other examples in the kernel tree where the current code
> knows that there is a small bounded time between two actions in kernel space
> that do not have a sleep. They are not spin locked, and putting spin locks
> everywhere will just trash performance. They are pure hardware interactions
> so you can't automatically detect them.
Why should spin locks trash performance, while an expensive disable_irq()
doesn't?
bye, Roman
On Sat Jan 12, 2002 at 12:23:09PM -0800, Andrew Morton wrote:
> Ed Sweetman wrote:
> >
> > If you want to test the preempt kernel you're going to need something that
> > can find the mean latancy or "time to action" for a particular program or
> > all programs being run at the time and then run multiple programs that you
> > would find on various peoples' systems. That is the "feel" people talk
> > about when they praise the preempt patch.
>
> Right. And that is precisely why I created the "mini-ll" patch. To
> give the improved "feel" in a way which is acceptable for merging into
> the 2.4 kernel.
>
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
I've tested it. I've been running it on my box for the last
several days. Works just great. My box has been quite solid
with it and I've not seen anything to prevent your sending it
to Marcelo for 2.4.18...
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
On Sat, 12 Jan 2002 12:23:09 -0800
Andrew Morton <[email protected]> wrote:
> Ed Sweetman wrote:
> >
> > If you want to test the preempt kernel you're going to need something that
> > can find the mean latancy or "time to action" for a particular program or
> > all programs being run at the time and then run multiple programs that you
> > would find on various peoples' systems. That is the "feel" people talk
> > about when they praise the preempt patch.
>
> Right. And that is precisely why I created the "mini-ll" patch. To
> give the improved "feel" in a way which is acceptable for merging into
> the 2.4 kernel.
Hm, I am not quite sure what you expect to hear about it, but:
a) It applies cleanly to 2.4.18-pre3.
b) It compiles.
c) During a load of around 150, produced by (of course :-) "make -j bzImage" and
concurrent XMMS playing while my mail client and mozilla are open, I cannot
"feel" a real big difference in interactivity compared to the vanilla kernel. XMMS
hiccups sometimes, the mouse does kangaroo'ing, and switching around different
X-screens and screen refresh (especially mozilla, of course) are no big hit.
This is a dual PIII-1GHz/2 GB RAM and some swap. During make, no swapping is
going on.
Sorry, but I cannot see (feel) the difference in _this_ test (if this is really
a test for what you intend to do). Compile time, btw, makes no difference either.
Perhaps this is rather something for Ingo and the scheduler...
Regards,
Stephan
On Sat, 12 Jan 2002 14:02:13 -0700
Erik Andersen <[email protected]> wrote:
> > And guess what? Nobody has tested the damn thing, so it's going
> > nowhere.
>
> I've tested it. I've been running it on my box for the last
> several days. Works just great. My box has been quite solid
> with it and I've not seen anything to prevent your sending it
> to Marcelo for 2.4.18...
Sorry for this dumb question:
What exactly is the difference from vanilla in your setup? Better
interactive feel? Throughput?
Regards,
Stephan
Hi,
[email protected] wrote:
> We're having a write only discussion - time to stop.
Sorry, but I'm still waiting for the proof that preempting deadlocks the
system.
If n running processes together have m time slices, then after m ticks every
process will have run its full share of the time, no matter how often
you schedule. (I assume here correct time accounting, which is
currently not the case, but that's a different (and not new) problem.)
So even the low-priority process will have the same time as before to do
its job; it will be delayed, but it will not be delayed forever, so I'm
failing to see how preempting Linux should deadlock.
bye, Roman
On Sat, Jan 12, 2002 at 09:42:26PM +0100, Roman Zippel wrote:
> Hi,
>
> Alan Cox wrote:
>
> > > Because the IRIX implementation sucks, every implementation has to suck?
> > > Somehow I have the suspicion you're trying to discourage everyone from
> > > even trying, because if he'd succeeded you'd loose a big chunk of
> > > potential RTLinux customers.
> >
> > Victor has had the same message for years, as have others like Larry McVoy
> > (in fact if Larry and Victor agree on something its unusual enough to
> > remember). So I can vouch for the fact Victor hasn't changed his tune from
> > before rtlinux was ever any real commercial toy. I think you owe him an
> > apology.
>
> Did I really say something that bad? I would be actually surprised, if
> Victor wouldn't act in the best interest of his company. The other
> possibility is that Victor must have had such a terrible experience with
> IRIX, so that he thinks any attempts to add better soft realtime or even
> hard realtime capabilities (not just as addon) must be doomed to fail.
Well, how about a third possibility - that I see a problem you have not
seen, and that you should try to argue on technical terms instead of psychoanalyzing
me or looking for financial motives?
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
[Cc: trimmed]
Andrew Morton <[email protected]> :
[mini-ll]
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
It allows me to del^W read NFS-mounted mail behind a linux router while I
copy files locally on the router. If I don't apply mini-ll to the router,
it's a "server foo not responding, still trying" fest. You know what
"interactivity feel" means when it happens.
If someone suspects the hardware is crap, it's a PIV motherboard with
built-in Promise20265 and four IBM IC35L060AVER07-0 on their own channel.
Each disk has been able to behave normally during RAID1 rebuild.
Without mini-ll:
well-chosen file I/O => no file I/O, no networking, no console, *big pain*.
With mini-ll:
well-chosen file I/O => *only* those I/Os suck (less than before, btw).
--
Ueimor
On Sat, 2002-01-12 at 15:36, Kenneth Johansson wrote:
> > I must have missed that in the code. I can see you check __cli() status but
> > I didn't see anywhere you check disable_irq(). Even if you did it doesnt
> > help when I mask the irq on the chip rather than using disable_irq() calls.
> >
> > Alan
>
> But you get interrupted by other interrups then so you have the same problem
> reagardless of any preemtion patch you hopefully lose the cpu for a much
> shorter time but still the same problem.
Agreed. Further, you can't put _any_ upper bound on the number of
interrupts that could occur, preempt or not. Sure, preempt can make it
worse, but I don't see it. I have no bug reports to correlate.
Robert Love
On Sat Jan 12, 2002 at 10:18:35PM +0100, Stephan von Krawczynski wrote:
> On Sat, 12 Jan 2002 14:02:13 -0700
> Erik Andersen <[email protected]> wrote:
>
> > > And guess what? Nobody has tested the damn thing, so it's going
> > > nowhere.
> >
> > I've tested it. I've been running it on my box for the last
> > several days. Works just great. My box has been quite solid
> > with it and I've not seen anything to prevent your sending it
> > to Marcelo for 2.4.18...
>
> Sorry for this dumb question:
> What is the difference to vanilla exactly like in your setup? Better
> interactive feeling? Throughput?
To be honest, not a _huge_ difference. I've been mostly doing
development, and when I happen to have, for example, a kernel
compile and a gcc compile going on, xmms isn't skipping for me
at all, while previously I would hear skips every so often.
-Erik
--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--
Robert Love wrote:
>Agreed. Further, you can't put _any_ upper bound on the number of interrupts that could occur, preempt or not. Sure, preempt can make it worse, but I don't see it. I have no bug reports to correlate.
>
OTOH we do have a pile of user reports which
say the low latency patches give better results.
From my view here, low latency provides a
silkier feel when e.g. playing RtCW or Q3A.
BTW, I have checked out 2.4.18pre2-aa2 and
am now running 2.4.18-pre3 + mini low latency.
* -aa absolutely kicks major booty in benchmarks.
* -mini-low-latency seems to do no worse than
stock kernel benchmark-wise, but seems to be
somehow smoother. I played some mp3s while
running dbench 16 and heard no hitches. Also
the RtCW test was successful, e.g. movement
was fluid and I was victorious in most skirmishes
with win32 opponents.
Regards,
jjs
On Sat, 2002-01-12 at 14:00, Alan Cox wrote:
> I see absolutely _no_ evidence to support this repeated claim. I'm still
> waiting to see any evidence that low latency patches are not sufficient, or
> an explanation of who is going to fix all the drivers you break in subtle
> ways
I'll work on fixing things the patch breaks. I don't think it will be
that bad. I've been working on preemption for a long long time, and
before me others have been working for a long long time, and I just
don't see the hordes of broken drivers or the tons of race-conditions
due to per-CPU data. I have seen some, and I have fixed them.
For a solution to latency concerns, I'd much prefer to lay a framework
down that provides a proper solution and then work on fine tuning the
kernel to get the desired latency out of it.
Robert Love
> So even the low priority process will have the same time as before to do
> it's job, it will be delayed, but it will not be delayed forever, so I'm
> failing to see how preempting Linux should deadlock.
The first task scheduled takes a resource that a second task needs. 150 other
threads schedule via pre-emption; the one it should share the resource
with cannot run, but the rest do. Repeat. It doesn't deadlock, but it gets
massively unfair.
> > everywhere will just trash performance. They are pure hardware interactions
> > so you can't automatically detect them.
>
> Why should spin locks trash perfomance, while an expensive disable_irq()
> doesn't?
disable_irq only blocks _one_ interrupt line; spin_lock_irqsave locks the
interrupt off on a uniprocessor, and 50% of the time off on a
dual processor.
If I use a spin lock you can't run a modem and an NE2000 card together on
Linux 2.4. That's why I had to do that work on the code. It's one of the myriad
of basic, obvious cases that the pre-empt patch gets wrong.
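Side by side, the two primitives being contrasted look like this (an
illustrative sketch; SPIN_LOCK_UNLOCKED is the 2.4-style static initializer):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static spinlock_t card_lock = SPIN_LOCK_UNLOCKED;

static void poke_card_with_spinlock(void)
{
	unsigned long flags;

	spin_lock_irqsave(&card_lock, flags);   /* ALL local interrupts masked */
	/* ... touch the hardware ... */
	spin_unlock_irqrestore(&card_lock, flags);
}

static void poke_card_with_disable_irq(int irq)
{
	disable_irq(irq);       /* only this card's line is masked; the timer,
				   serial port, etc. keep being serviced */
	/* ... touch the hardware ... */
	enable_irq(irq);
}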
> Preemption doesn't solve of course every problem. It's mainly useful to
> get an event as fast as possible from kernel to user space. This can be
> the mouse click or the buffer your process is waiting for. Latencies can
> quickly sum up here to be sensible.
The pre-emption patch doesn't change the average latencies. Go run some real
benchmarks. It's lost in the noise after the low latency patch. A single inw
from some I/O cards can cost as much as the latency target we hit.
It's not a case of 90% of the result with 10% of the work; the pre-empt
patch is firmly in the all-pain, no-gain camp.
> > I must have missed that in the code. I can see you check __cli() status but
> > I didn't see anywhere you check disable_irq(). Even if you did it doesnt
> > help when I mask the irq on the chip rather than using disable_irq() calls.
>
> But you get interrupted by other interrups then so you have the same problem
> reagardless of any preemtion patch you hopefully lose the cpu for a much
> shorter time but still the same problem.
Interrupt paths are well sub-millisecond; a pre-empt might mean I don't get
the CPU back for measurable chunks of a second. They are totally different
guarantees.
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
I've been testing it. It works for me; it's not as good as the full one,
and it seems obviously correct. What else am I supposed to say?
I'm pretty much exclusively running Andre's new IDE code too. In fact I'm
back to a page-long list of applied diffs versus 2.4.18pre3, most of which
I need to feed to Marcelo.
> > But you get interrupted by other interrups then so you have the same problem
> > reagardless of any preemtion patch you hopefully lose the cpu for a much
> > shorter time but still the same problem.
>
> Agreed. Further, you can't put _any_ upper bound on the number of
> interrupts that could occur, preempt or not. Sure, preempt can make it
> worse, but I don't see it. I have no bug reports to correlate.
How many full benchmark sets have you done on an NE2000? It's quite obvious
from your earlier mail that you hadn't even considered problems like this.
Let me ask you the _right_ questions instead:
- Prove to me that there are no cases that pre-empt screws up
like this.
- Prove to me that pre-empt is better than the big low latency patch.
All I have seen so far is benchmarks that say low latency is better as is,
and evidence that preempt patches cause far more problems than they solve
and have complex and subtle side effects nobody yet understands.
Furthermore, it's obvious that the only way to fix these side effects is to
implement full priority handling to avoid priority inversion issues (which
is precisely what the IRQ problem is); that means implementing interrupt
handlers as threads, heavyweight locks, and an end result I'm really not
interested in using.
Alan
> For a solution to latency concerns, I'd much prefer to lay a framework
> down that provides a proper solution and then work on fine tuning the
> kernel to get the desired latency out of it.
As the low latency patch proves, the framework has always been there; the
ll patches do the rest.
On 20020112 Andrew Morton wrote:
>Ed Sweetman wrote:
>>
>> If you want to test the preempt kernel you're going to need something that
>> can find the mean latancy or "time to action" for a particular program or
>> all programs being run at the time and then run multiple programs that you
>> would find on various peoples' systems. That is the "feel" people talk
>> about when they praise the preempt patch.
>
>Right. And that is precisely why I created the "mini-ll" patch. To
>give the improved "feel" in a way which is acceptable for merging into
>the 2.4 kernel.
>
>And guess what? Nobody has tested the damn thing, so it's going
>nowhere.
>
I have been running mini-ll on -pre3 for a while, and have just booted pre3
with full-ll. I see no marvelous difference between them, but I am not pushing
my box to its knees.
I can get numbers for you, but is there any test out there that gives them?
Something like 'under this damned test your system just delayed as much as xxx us'.
That kind of 'my xmms does not skip' does not look like a very serious measure.
And could you tell me if some of these patches can interfere with the results?
This is what I am running just now:
- 2.4.18-pre3
- vm fixes from aa (vm-22, vm-raend, truncate-garbage)
- ext3-0.9.17 update
- ide-20011210 (hint: plz, make it in mainline for the time of .18...)
- irqrate-A1
- interrupts-seq-file
- spinlock-cacheline + fast-pte from -aa
- scalable timers
- sensors-cvs
- bproc 3.1.5
On that I have run full-ll+ll_fixes (from -aa) or mini-ll+ll_fixes.
(If someone is interested, patches are at
http://giga.cps.unizar.es/~magallon/linux/ )
TIA.
--
J.A. Magallon # Let the source be with you...
mailto:[email protected]
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.18-pre3-beo #5 SMP Sun Jan 13 02:14:04 CET 2002 i686
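For what it's worth, a measurement of the kind asked for above can be roughly approximated in userspace. The following is only a sketch (not an existing tool): it sleeps for a fixed 10 ms period in a loop and reports the worst-case oversleep in microseconds. Run under load, the printed maximum gives a crude scheduling-latency figure that can be compared across kernels.

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
        struct timeval before, after;
        long delta, worst = 0;
        int i;

        for (i = 0; i < 6000; i++) {
                gettimeofday(&before, NULL);
                usleep(10000);                  /* ask for 10 ms of sleep */
                gettimeofday(&after, NULL);
                delta = (after.tv_sec - before.tv_sec) * 1000000L
                      + (after.tv_usec - before.tv_usec) - 10000;
                if (delta > worst)              /* record the worst oversleep */
                        worst = delta;
        }
        printf("worst-case oversleep: %ld us\n", worst);
        return 0;
}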
On Sat, 2002-01-12 at 15:21, Alan Cox wrote:
> You completely misunderstand.
>
> disable_irq(n)
>
> I disable a single specific interrupt, I don't disable the timer interrupt.
> Your code doesn't seem to handle that.
It can if we increment the preempt_count in disable_irq_nosync and
decrement it on enable_irq.
Robert Love
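To make the suggestion concrete, here is a minimal sketch (not Robert's actual patch) of the idea, assuming the preempt patch's preempt_disable()/preempt_enable() helpers; the my_* wrapper names are only illustrative, the real change would go into disable_irq_nosync()/enable_irq() themselves.

/* treat a masked interrupt line like a held spinlock */
static inline void my_disable_irq_nosync(unsigned int irq)
{
        preempt_disable();              /* no preemption while the line is masked */
        disable_irq_nosync(irq);
}

static inline void my_enable_irq(unsigned int irq)
{
        enable_irq(irq);
        preempt_enable();               /* a pending preemption may happen here */
}

As Russell King points out further down, this only balances if every disable really is paired with an enable_irq().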
Hi,
[email protected] wrote:
> Well, how about a third possibility - that I see a problem you have not
> seen and that you should try to argue on technical terms
I just don't see any problem that is really new. Alan's example is one
of the more extreme ones, but the only effect is that an operation can be
delayed far more than usual, not indefinitely.
If you think preemption can cause a deadlock, maybe you could give me a
hint: which of the conditions for a deadlock is changed by preemption?
> instead of psychoanalyzing
> me or looking for financial motives?
If I had known how easily people are offended by the implication that they
could act out of financial interest, I wouldn't have made that comment. Sorry,
but I'm just annoyed at how you attack any attempt to add realtime
capabilities to the kernel, mostly with the argument that it sucks under
IRIX. If people want to try it, let them. I prefer to see patches, and if
they really suck, I would be the first one to say so.
bye, Roman
On Sun, Jan 13, 2002 at 04:33:44AM +0100, Roman Zippel wrote:
> Hi,
>
> [email protected] wrote:
>
> > Well, how about a third possibility - that I see a problem you have not
> > seen and that you should try to argue on technical terms
>
> I just don't see any problem that is really new. Alan's example is one
> of the more extreme ones, but the only effect is that an operation can be
> delayed far more than usual, not indefinitely.
> If you think preemption can cause a deadlock, maybe you could give me a
> hint: which of the conditions for a deadlock is changed by preemption?
>
> > instead of psychoanalyzing
> > me or looking for financial motives?
>
> If I had known how easily people are offended by the implication that they
> could act out of financial interest, I wouldn't have made that comment. Sorry,
> but I'm just annoyed at how you attack any attempt to add realtime
> capabilities to the kernel, mostly with the argument that it sucks under
> IRIX. If people want to try it, let them. I prefer to see patches, and if
> they really suck, I would be the first one to say so.
I'm annoyed that you take a comment in which I said that the Morton approach
was much preferable to the preempt patch and respond by saying I "attack
any attempt to add realtime capabilities to the kernel".
I'm all in favor of people trying all sorts of things. My original comment
was that the numbers I'd seen all favored the Morton patch and I still
haven't seen any evidence to the contrary.
I also made two very simple and specific comments:
1) I don't see how processor-specific caching, which seems
essential for SMP performance and will be more essential
with NUMA, works with this patch (see the sketch below)
2) preempt seems to lead inescapably to priority inheritance. If this
is true, people had better understand the ramifications now before they
commit.
Of course, I think there are strong limits to what you can get for RT
performance in the kernel - I think the RTLinux method is far superior.
Believe what you want - it won't change the numbers.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
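For point 1, a small sketch (mine, not Victor's) of the per-CPU-data hazard: with kernel preemption a task can sample smp_processor_id(), be preempted, migrate to another CPU, and then touch the wrong slot, so the usual cure is to hold preemption off around the access - which is exactly the extra cost being debated. The array and function names here are hypothetical, and preempt_disable()/preempt_enable() are assumed to come from the preempt patch.

#include <linux/threads.h>      /* NR_CPUS */
#include <linux/smp.h>          /* smp_processor_id() */

static int my_percpu_counter[NR_CPUS];  /* hypothetical per-CPU cache */

static void bump_local_counter(void)
{
        int cpu;

        preempt_disable();              /* pin ourselves to this CPU          */
        cpu = smp_processor_id();
        my_percpu_counter[cpu]++;       /* safe: we cannot migrate in between */
        preempt_enable();
}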
On Saturday 12 January 2002 03:53 pm, Roman Zippel wrote:
> Hi,
>
> Alan Cox wrote:
> > So with pre-empt this happens
> >
> > driver magic
> > disable_irq(dev->irq)
> > PRE-EMPT:
> > [large periods of time running other code]
> > PRE-EMPT:
> > We get back and we've missed 300 packets, the serial port sharing
> > the IRQ has dropped our internet connection completely.
>
> But it shouldn't deadlock as Victor is suggesting.
Um, hang on...
Obviously, Alan, you know more about the networking stack than I do. :) But
could you define "large periods of time running other code"?
The real performance KILLER is when your code gets swapped out, and that
simply doesn't happen in-kernel. Yes, the niced app may get swapped out, but
the syscall it's making is pinned in ram and it will only block on disk
access when it returns. So we're talking what kind of delay here, one second?
As for scheduling, even a nice 19 task will still get SOME time, and we can
find out exactly what the worst case is since we hand out time slices and we
don't hand out more until EVERYBODY exhausts theirs, including seriously
niced processes. So this isn't exactly non-deterministic behavior, is it?
There IS an upper bound here...
There ISN'T an upper bound on interrupts. We've got some nasty interrupts in
the system. How long does the PCI bus get tied up with spinning video cards
flooding the bus to make their benchmarks look 5% better? How long of a
latency spike did we (until recently) get switching between graphics and text
consoles? (I heard that got fixed, moved into a tasklet or some such.
Haven't looked at it yet.) Without Andre's IDE patches, how much latency can
the disk insert at random?
Yes, it's possible that if you have a fork bomb trying to take down your
system, and you're using an old 10baseT ethernet driver built with some
serious assumptions about how the kernel works, you could drop some
packets. But I find it interesting that make -j can be considered a fairly
unrealistic test intentionally overloading the system, yet an example with
150 active threads all eating CPU time is NOT considered an example of how
your process's receive buffer could easily fill up and drop packets no matter
HOW fast the interrupt is working. Even 10baseT feeds you 1.1 megabytes
per second, and with a one-second delay we might have to swap stuff out to make
room for them if we don't read from the socket for that long...
One other fun little thing about the scheduler: a process that is submitting
network packets probably isn't entirely CPU bound, is it? It's doing I/O.
So even if it's niced, if it's competing with totally CPU bound tasks isn't
it likely to get promoted? How real-world is your overload-induced failure
case example?
As for dropping 300 packets killing your connection, are you saying 802.11
cards can't have a burst of static that blocks your connection for half a
second? I've had full-second gaps in network traffic on my cable modem, all
the time, and the current overload behavior of most routers is dropping lots
and lots of packets on the floor. (My in-house network is still using an
ancient 10baseT half-duplex hub. I'm lazy, and it's still way faster than my
upstream connection to the internet.) Datagram delivery is not guaranteed.
It never has been. Maybe it will be once ECN comes in, but that's not yet.
What's one of the other examples you were worried about, besides NE2K (which
can't do 100baseT, even on PCI, and a 100baseT PCI card is now $9 new. Just
a data point.)
Rob
(P.S. The only behavior difference between preempt and SMP, apart from
contention for per-cpu data, is the potential insertion of latency spikes in
kernel code, which interrupts do anyway. You're saying it can matter when
something disables an interrupt. Robert Love suggested the macro that
disables an interrupt can count as a preemption guard just like a spinlock.
Is this NOT enough to fix the objection?)
On Sat, 12 Jan 2002 12:23:09 -0800
Andrew Morton <[email protected]> wrote:
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
Haven't had latency problems, to be honest. Maybe I should start playing mp3s
while I code?
1) conditional_schedule? Hmmm... Why the __set_current_state? I think I prefer
an explicit "if (need_schedule()) schedule()" (sketched below), with
#define need_schedule() unlikely(current->need_resched)
2) I hate condsched.h: Use sched.h please!
3) Why this:
> +#ifndef __LINUX_COMPILER_H
> +#include <linux/compiler.h>
> +#endif
Other than that, I like this patch. Linus?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
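A sketch of the spelling Rusty argues for in point 1, assuming the mini-ll helper is essentially a need_resched test followed by schedule() (the patch apparently also touches current->state, hence his question); the loop-break function name is made up.

#include <linux/sched.h>
#include <linux/compiler.h>     /* unlikely() */

#define need_schedule() unlikely(current->need_resched)

/* called at a known-safe point inside a long-running kernel loop */
static inline void my_long_loop_break(void)
{
        if (need_schedule())
                schedule();
}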
On Sat, Jan 12, 2002 at 10:10:55PM -0500, Robert Love wrote:
> It can if we increment the preempt_count in disable_irq_nosync and
> decrement it on enable_irq.
Who says you're going to be enabling IRQs any time soon? AFAIK, there is
nothing that requires enable_irq to be called after disable_irq_nosync.
In fact, you could well have the following in a driver:
/* initial shutdown of device */
disable_irq_nosync(i); /* or disable_irq(i); */
/* other shutdown stuff */
free_irq(i, private);
and you would have to audit all drivers to find out if they did this -
this would seriously damage your preempt_count.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
Hi,
Alan Cox wrote:
> disable_irq only blocks _one_ interrupt line, spin_lock_irqsave locks the
> interrupt off on a uniprocessor, and 50% of the time off on a
> dual processor.
>
> If I use a spin lock you can't run a modem and an NE2000 card together on
> Linux 2.4. That's why I had to do that work on the code. It's one of the myriad
> basic, obvious cases that the pre-empt patch gets wrong
I wouldn't say it gets it wrong, the driver also has to take a non irq
spinlock anyway, so the window is quite small and even then the packet
is only delayed.
But now I really have to look at that driver and try a more optimistic
irq disabling approach, otherwise it will happily disable the most
important shared interrupt on my Amiga for ages.
bye, Roman
> I wouldn't say it gets it wrong, the driver also has to take a non irq
> spinlock anyway, so the window is quite small and even then the packet
> is only delayed.
Or you lose a pile of them
> But now I really have to look at that driver and try a more optimistic
> irq disabling approach, otherwise it will happily disable the most
> important shared interrupt on my Amiga for ages.
If you play with the code, remember that irq delivery on x86 is
asynchronous. You can disable the irq on the chip, synchronize_irq() on
the result, and very occasionally still get the irq delivered after all of that.
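A minimal sketch of the hazard Alan describes, using a hypothetical driver (my_dev, my_interrupt and the shutting_down flag are made up, and the no-argument synchronize_irq() form is the 2.4/x86 one used above):

#include <linux/sched.h>        /* 2.4-era: free_irq() and friends  */
#include <linux/interrupt.h>    /* interrupt handler prototype      */
/* disable_irq()/synchronize_irq() come via the arch irq headers    */

struct my_dev { unsigned int irq; };
static int shutting_down;

static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        if (shutting_down)
                return;                 /* late delivery: just bail out */
        /* ... normal interrupt work ... */
}

static void my_close(struct my_dev *dev)
{
        shutting_down = 1;
        disable_irq(dev->irq);          /* mask the line at the controller */
        synchronize_irq();              /* wait for in-flight handlers     */
        /*
         * Even now a queued delivery can occasionally still fire once,
         * which is why the handler has to tolerate being called this late.
         */
        free_irq(dev->irq, dev);
}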
Hi,
Alan Cox wrote:
> All I have seen so far is benchmarks that say low latency is better as is,
If Andrew did a good job (which he obviously did), I don't doubt that.
> and evidence that preempt patches cause far more problems than they solve
> and have complex and subtle side effects nobody yet understands.
I'm aware of two side effects:
- preempt exposes already existing problems, which are worth fixing
independent of preempt.
- it can cause unexpected delays, which should be nonfatal, otherwise
worth fixing as well.
What somehow got lost in this discussion is that both patches don't
necessarily conflict with each other, they both attack the same problem
with different approaches, which complement each other. I prefer to get
the best of both patches.
The ll patch identifies problems which preempt alone can't fix; on the
other hand, the ll patch inserts schedule calls all over the place where
preempt can handle this transparently. So far I haven't seen any
evidence that preempt introduces any _new_ serious problems, so I'd
rather get the best out of both.
> Furthermore it's obvious that the only way to fix these side effects is to
> implement full priority handling to avoid priority inversion issues (which
> is precisely what the IRQ problem is); that means implementing interrupt
> handlers as threads, heavyweight locks, and an end result I'm really not
> interested in using.
It's not really necessary to go that far; it's generally a good idea to
keep interrupt handlers as short as possible, and we use bh or tasklets for
exactly that reason. I don't think we need to work around broken
hardware, but halfway decent hardware should not be a problem to get
decent latency.
bye, Roman
On Sat, Jan 12, 2002 at 02:00:17PM -0500, Ed Sweetman wrote:
>
>
> > On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> > > On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > > > I did my usual compile testings (untar kernel archive, apply patches,
> > > > make -j<value> ...
> > >
> > > If I understand your test,
> > > you are testing different loads - you are compiling kernels that may
> differ
> > > in size and makefile organization, not to mention different layout on
> the
> > > file system and disk.
>
> Can someone tell me why we're "testing" the preempt kernel by running
> make -j on a build? What exactly is this going to show us? The only thing
> i can think of is showing us that throughput is not damaged when you want to
> run single apps by using preempt. You dont get to see the effects of the
> kernel preemption because all the damn thing is doing is preempting itself.
>
> If you want to test the preempt kernel you're going to need something that
> can find the mean latency or "time to action" for a particular program or
> all programs being run at the time and then run multiple programs that you
> would find on various peoples' systems. That is the "feel" people talk
> about when they praise the preempt patch. make -j'ing something and not
> testing anything else but that will show you nothing important except "does
> throughput get screwed by the preempt patch." Perhaps checking the
> latencies on a common program on people's systems like mozilla or konqueror
> while doing a 'make -j N bzImage' would be a better idea.
That's the second test I am normally running. Just running xmms while
doing the kernel compile. I just wanted to check if the system slows
down because of preemption but instead it compiled the kernel even
faster :-) But so far I was not able to test the latency and furthermore
it is very difficult to "measure" skipping of xmms ...
> > Ouch, I assumed this wasn't the case indeed.
Sorry for not answering immediately, but I am compiling the same kernel
source with the same .config and everything else I could think of being the
same! I even do a 'rm -rf linux' after every run and untar the same
sources *every* time.
Regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > I did my usual compile testings (untar kernel archive, apply patches,
> > make -j<value> ...
>
> If I understand your test,
> you are testing different loads - you are compiling kernels that may differ
> in size and makefile organization, not to mention different layout on the
> file system and disk.
No, I use a script which is run in single user mode after a reboot. So
there are only a few processes running when I start the script (see
attachment) and the jobs should start from the same environment.
> What happens when you do the same test, compiling one kernel under multiple
> different kernels?
That is exactly what I am doing. I even try my best to have the exact
same starting environment ...
Regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On Sat, Jan 12, 2002 at 05:05:28PM +0100, Andrea Arcangeli wrote:
> On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
[...]
> > Hello Andrea,
> >
> > I did my usual compile testings (untar kernel archive, apply patches,
> > make -j<value> ...
> >
> > Here are some results (Wall time + Percent cpu) for each of the consecutive five runs:
> >
> > 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp
> > j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83%
> > j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83%
> > j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87%
> > j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87%
> > j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86%
> >
> > j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93%
> > j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88%
> > j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91%
> > j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91%
> > j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88%
> >
> > * build incomplete (OOM killer killed several cc1 ... )
> >
> > So far 2.4.13-pre5aa1 had been the king of the block in compile times.
> > But this has changed. Now the (by far) fastest kernel is 2.4.18-pre
> > + Ingos scheduler patch (s) + preemptive patch (p). I did not test
> > preemptive patch alone so far since I don't know if the one I have
> > applies cleanly against -pre3 without Ingos patch. I used the
> > following patches:
> >
> > s: sched-O1-2.4.17-H6.patch
> > p: preempt-kernel-rml-2.4.18-pre3-ingo-1.patch
> >
> > I hope this info is useful to someone.
>
> the improvement of "sp" compared to "s" is quite visible, not sure how
> can a little different time spent in kernel make such a difference on
> the final numbers, also given compilation is mostly an userspace task, I
> assume you were swapping out or running out of cache at the very least,
> right?
The system is *heavily* swapping. Plain 2.4.18-pre3 can not even finish
the jobs because it runs out of memory. That's why I used j75 or j100
initially. Otherwise there was not even a difference between the 2.4.10+
vm and the 2.4.9-ac+ vm. All I want to test with this "benchmark" is how
well the system reacts when I throw *lots* of compilation jobs at it ...
> btw, I'd be curious if you could repeat the same test with -j1 or -j2?
> (actually real world)
Using just -j1 or -j2 will probably make no difference (I will test it anyway
and post the results).
> Still the other numbers remain interesting for a thrashing machine, but
> a few percent difference with a thrashing box isn't a big difference; vm
> changes can influence those numbers more than any preempt or scheduler
> change (of course, if my guess that you're swapping out is really right :).
> I guess "p" helps because we simply miss some schedule point in some vm
> routine. Hints?
But what *I* like most about the preemptive results is that the results
for all runs do not vary that much. Looking at plain 2.4.18-pre3, there
is a huge difference in runtime between the fastest and the slowest run.
Regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
> What somehow got lost in this discussion is that both patches don't
> necessarily conflict with each other, they both attack the same problem
> with different approaches, which complement each other. I prefer to get
> the best of both patches.
When you look at the benchmark there is no difference between ll and
ll+pre-empt. ll alone takes you to the 1ms point. pre-empt takes you no
further and to get much out of pre-emption requires you go and do all the
hideously slow and complex priority inversion stuff.
> exactly that reason. I don't think we need to work around broken
> hardware, but halfway decent hardware should not be a problem to get
> decent latency.
We have to work around common hardware not designed for SMP - the 8390 isn't
a broken chip in that sense, it's just from a different era, and there are a
lot of them.
On Sun, Jan 13, 2002 at 04:18:29PM +0100, Roman Zippel wrote:
> What somehow got lost in this discussion is that both patches don't
> necessarily conflict with each other, they both attack the same problem
> with different approaches, which complement each other. I prefer to get
> the best of both patches.
If you do this (and I've seen the results of doing both at once vs only
either of them vs pure) then there's NO benefit left for the preemption.
Sure, AVERAGE latency goes down slightly, but this is talking in the usec
range, since the worst case is already 1msec or less. Below the 1msec range it
really doesn't matter anymore, however.
At that point you're adding all the complexity for the negligible-to-no-gain
case...
Greetings,
Arjan van de Ven
> Obviously, Alan, you know more about the networking stack than I do. :) But
> could you define "large periods of time running other code"?
Tenths of a second, if there is a lot to run instead of this thread. Even
1/100th is bad news.
> There ISN'T an upper bound on interrupts. We've got some nasty interrupts in
> the system. How long does the PCI bus get tied up with spinning video cards
> flooding the bus to make their benchmarks look 5% better? How long of a
They don't flood the bus with interrupts, they tie the bus up for several
milliseconds worst case. Which, btw, you'll note means that low latency already
achieves the best value you can get.
> latency spike did we (until recently) get switching between graphics and text
> consoles? (I heard that got fixed, moved into a tasklet or some such.
> Haven't looked at it yet.) Without Andre's IDE patches, how much latency can
Been fixed in -ac for ages, and it finally made Linus' tree.
> the disk insert at random?
IDE, with or without Andre's changes, can insert multiple-millisecond delays
on the bus in some situations. Again, the pre-empt patch can offer you nothing,
because the hardware limit is easily met by low latency.
> One other fun little thing about the scheduler: a process that is submitting
> network packets probably isn't entirely CPU bound, is it? It's doing I/O.
Network packets get submitted from _outside_
Alan
> > I disable a single specific interrupt, I don't disable the timer interrupt.
> > Your code doesn't seem to handle that.
>
> It can if we increment the preempt_count in disable_irq_nosync and
> decrement it on enable_irq.
A driver that knows how its irq is handled and that it is the sole
user (e.g. ISA) may - and some do - leave it disabled for hours at a time.
On Sat, Jan 12, 2002 at 12:23:09PM -0800, Andrew Morton wrote:
> Ed Sweetman wrote:
> >
> > If you want to test the preempt kernel you're going to need something that
> > can find the mean latency or "time to action" for a particular program or
> > all programs being run at the time and then run multiple programs that you
> > would find on various peoples' systems. That is the "feel" people talk
> > about when they praise the preempt patch.
>
> Right. And that is precisely why I created the "mini-ll" patch. To
> give the improved "feel" in a way which is acceptable for merging into
> the 2.4 kernel.
>
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
Ok, as promised, here are the results:
13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
Regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On Sun, Jan 13, 2002 at 04:18:23PM +0100, [email protected] wrote:
> On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> > On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > > I did my usual compile testings (untar kernel archive, apply patches,
> > > make -j<value> ...
> >
> > If I understand your test,
> > you are testing different loads - you are compiling kernels that may differ
> > in size and makefile organization, not to mention different layout on the
> > file system and disk.
>
> No, I use a script which is run in single user mode after a reboot. So
> there are only a few processes running when I start the script (see
> attachment) and the jobs should start from the same environment.
But your description makes it sound like you do
untar kernel X
apply patches Y
make -j Tree
I'm sorry if I'm getting you wrong, but each of these steps is
variable.
Even if X and Y are the same each time, "Tree" is different.
The test should be
reboot
N times
make clean
time make -j Tree
Am I misunderstanding your test?
>
> > What happens when you do the same test, compiling one kernel under multiple
> > different kernels?
>
> That is exactly what I am doing. I even try my best to have the exact
> same starting environment ...
>
> Regards,
>
> Jogi
>
> --
>
> Well, yeah ... I suppose there's no point in getting greedy, is there?
>
> << Calvin & Hobbes >>
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Sun, 2002-01-13 at 10:18, [email protected] wrote:
> No, I use a script which is run in single user mode after a reboot. So
> there are only a few processes running when I start the script (see
> attachment) and the jobs should start from the same environment.
>
> > What happens when you do the same test, compiling one kernel under multiple
> > different kernels?
>
> That is exactly what I am doing. I even try my best to have the exact
> same starting environment ...
So there you go, his testing is accurate. Now we have results that
preempt works and is best and it is still refuted. Everyone is running
around with these "ll is best" or "preempt sucks throughput" and that is
not true. Further, with preempt we can improve things cleanly, and I
don't think that necessarily implies priority inversion problems.
Robert Love
On Sun, Jan 13, 2002 at 10:51:04AM -0700, [email protected] wrote:
> On Sun, Jan 13, 2002 at 04:18:23PM +0100, [email protected] wrote:
> > On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> > > On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > > > I did my usual compile testings (untar kernel archive, apply patches,
> > > > make -j<value> ...
> > >
> > > If I understand your test,
> > > you are testing different loads - you are compiling kernels that may differ
> > > in size and makefile organization, not to mention different layout on the
> > > file system and disk.
> >
> > No, I use a script which is run in single user mode after a reboot. So
> > there are only a few processes running when I start the script (see
> > attachment) and the jobs should start from the same environment.
>
> But your description makes it sound like you do
> untar kernel X
> apply patches Y
> make -j Tree
>
> I'm sorry if I'm getting you wrong, but each of these steps is
> variable.
> Even if X and Y are the same each time, "Tree" is different.
X and Y are the same. But I don't really get why this is still
"different" ... If you think this could be because of fs
fragmentation, then I will enhance my test. I think I have a spare
partition somewhere which I can format each time before untarring the
kernel sources and so on. But then why can I reproduce the results?
Ok, not exactly, but the results do get close ...
Furthermore I am timing not only the make -j<value> but also the
complete untar and applying of patches. So basically I am timing the
following:
tar xvf linux-x.y.z.tar
patch -p0 < some_patches
cd linux; cp ../config-x.y.z .config
make oldconfig dep
make -j $PAR bzImage modules
and afterwards
cd .. ; rm -rf linux
and start again. It's just the same as doing 'rpm --rebuild' with
MAKE="make -j $PAR".
> The test should be
> reboot
> N times
> make clean
> time make -j Tree
>
> Am I misunderstaning your test?
No, but I don't understand why this should make any difference. I do not
propose my way of testing as *the* benchmark. It's just a benchmark of
something which I do most of the time on my system (compiling) in an
extreme way ...
Kind regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On Sun, 2002-01-13 at 10:18, Roman Zippel wrote:
> What somehow got lost in this discussion is that both patches don't
> necessarily conflict with each other, they both attack the same problem
> with different approaches, which complement each other. I prefer to get
> the best of both patches.
> The ll patch identifies problems which preempt alone can't fix; on the
> other hand, the ll patch inserts schedule calls all over the place where
> preempt can handle this transparently. So far I haven't seen any
> evidence that preempt introduces any _new_ serious problems, so I'd
> rather get the best out of both.
Good point. In fact, I have an "ll patch" for preempt-kernel, it is
called lock-break and available at
ftp://ftp.kernel.org/pub/linux/kernel/people/rml/lock-break
While I am not so sure this sort of explicit work is the answer -- I'd
much prefer to work on the algorithms to shorten lock hold times or split
locking across different locks -- it does work. The work is based heavily on Andrew's
ll patch but designed for use with preempt-kernel. This means we can
drop some of the conditional schedules that aren't needed, and in others
we don't need to call schedule (just drop the locks).
Robert Love
On Sun, 2002-01-13 at 12:42, [email protected] wrote:
> 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
> j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
> j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
> j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
> j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
> j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
>
> j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
> j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
> j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
> j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
> j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
Again, preempt seems to reign supreme. Where is all the information
correlating preempt is inferior? To be fair, however, we should bench a
mini-ll+s test.
But I stand by my original point that none of this matters all too
much. A preemptive kernel will allow for future latency reduction
_without_ using explicit scheduling points everywhere there is a
problem. This means we can tackle the problem and not provide a million
bandaids.
Robert Love
On Sun, 2002-01-13 at 10:59, Alan Cox wrote:
> > I disable a single specific interrupt, I don't disable the timer interrupt.
> > Your code doesn't seem to handle that.
>
> It can if we increment the preempt_count in disable_irq_nosync and
> decrement it on enable_irq.
OK, Alan, you spooked me with the disable_irq mess and admittedly my
initial solution wasn't ideal for a few reasons.
But it isn't a problem after all. In hw_irq.h we bump the count in the
interrupt path. This should handle any handler, however we end up in
it.
I realized it because if we did not have a global solution to interrupt
request handlers, dropping spinlocks in the handler, even with IRQs
disabled, would cause a preemptive schedule. All interrupts are
properly protected.
Robert Love
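A conceptual, plain-C rendering of what "bump the count in the interrupt path" means (the real change lives in the low-level hw_irq.h/entry macros; handle_one_irq and the wrapper name are hypothetical, and preempt_disable()/preempt_enable() are assumed from the preempt patch):

void my_do_IRQ(int irq, struct pt_regs *regs)
{
        preempt_disable();              /* interrupt context: never preempt  */
        handle_one_irq(irq, regs);      /* hypothetical dispatch to handlers */
        preempt_enable();               /* re-allow preemption; the real patch
                                           defers the actual reschedule to the
                                           interrupt return path              */
}

With the count held across the whole interrupt path, dropping a spinlock inside a handler can no longer trigger a preemptive schedule, which is the point Robert makes above.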
On Sun, 2002-01-13 at 06:39, Russell King wrote:
> On Sat, Jan 12, 2002 at 10:10:55PM -0500, Robert Love wrote:
> > It can if we increment the preempt_count in disable_irq_nosync and
> > decrement it on enable_irq.
>
> Who says you're going to be enabling IRQs any time soon? AFAIK, there is
> nothing that requires enable_irq to be called after disable_irq_nosync.
>
> In fact, you could well have the following in a driver:
>
> /* initial shutdown of device */
>
> disable_irq_nosync(i); /* or disable_irq(i); */
>
> /* other shutdown stuff */
>
> free_irq(i, private);
>
> and you would have to audit all drivers to find out if they did this -
> this would seriously damage your preempt_count.
I wasn't thinking. Anytime we are in an interrupt handler, preemption
is disabled. Regardless of how (or even if) interrupts are disabled.
We bump preempt_count on the entry path. So, no problem.
Robert Love
On Sun, Jan 13, 2002 at 01:24:20PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 06:39, Russell King wrote:
> > On Sat, Jan 12, 2002 at 10:10:55PM -0500, Robert Love wrote:
> > > It can if we increment the preempt_count in disable_irq_nosync and
> > > decrement it on enable_irq.
> >
> > Who says you're going to be enabling IRQs any time soon? AFAIK, there is
> > nothing that requires enable_irq to be called after disable_irq_nosync.
> >
> > In fact, you could well have the following in a driver:
> >
> > /* initial shutdown of device */
> >
> > disable_irq_nosync(i); /* or disable_irq(i); */
> >
> > /* other shutdown stuff */
> >
> > free_irq(i, private);
> >
> > and you would have to audit all drivers to find out if they did this -
> > this would seriously damage your preempt_count.
>
> I wasn't thinking. Anytime we are in an interrupt handler, preemption
> is disabled. Regardless of how (or even if) interrupts are disabled.
> We bump preempt_count on the entry path. So, no problem.
Err. This isn't *inside* an interrupt handler. This could well be in
the driver shutdown code (eg, when fops->release is called).
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
> Again, preempt seems to reign supreme. Where is all the information
> correlating preempt is inferior? To be fair, however, we should bench a
> mini-ll+s test.
How about some actual latency numbers ?
> I wasn't thinking. Anytime we are in an interrupt handler, preemption
> is disabled. Regardless of how (or even if) interrupts are disabled.
> We bump preempt_count on the entry path. So, no problem.
The code path isnt in an interrupt handler.
The problem here is that when people report
that the low latency patch works better for them
than the preempt patch, they aren't talking about
benchmarking the time to compile a kernel, they
are talking about interactive feel and smoothness.
You're speaking to a peripheral issue.
I've no agenda other than wanting to see linux
as an attractive option for the multimedia and
gaming crowds - and in my experience, the low
latency patches simply give a much smoother
feel and a more pleasant experience. Kernel
compilation time is the farthest thing from my
mind when e.g. playing Q3A!
I'd be happy to check out the preempt patch
again and see if anything's changed, if the
problem of tux+preempt oopsing has been
dealt with -
Regards,
jjs
Robert Love wrote:
>On Sun, 2002-01-13 at 12:42, [email protected] wrote:
>
>> 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
>>j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
>>j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
>>j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
>>j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
>>j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
>>
>>j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
>>j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
>>j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
>>j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
>>j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
>>
>
>Again, preempt seems to reign supreme. Where is all the information
>correlating preempt is inferior? To be fair, however, we should bench a
>mini-ll+s test.
>
>But I stand by my original point that none of this matters all too
>much. A preemptive kernel will allow for future latency reduction
>_without_ using explicit scheduling points everywhere there is a
>problem. This means we can tackle the problem and not provide a million
>bandaids.
>
> Robert Love
>
Robert Love wrote:
>
> Again, preempt seems to reign supreme. Where is all the information
> correlating preempt is inferior? To be fair, however, we should bench a
> mini-ll+s test.
I can't say that I have ever seen any significant change in throughput
of anything with any of this stuff.
Benchmarks are well and good, but until we have a solid explanation for
the throughput changes which people are seeing, it's risky to claim
that there is a general benefit.
-
On Sun, 2002-01-13 at 14:46, Andrew Morton wrote:
> I can't say that I have ever seen any significant change in throughput
> of anything with any of this stuff.
I can send you some numbers. It is typically 5-10% throughput increase
under load. Obviously this work won't help a single task on a single
user system. But things like (ack!) dbench 16 show a marked
improvement.
> Benchmarks are well and good, but until we have a solid explanation for
> the throughput changes which people are seeing, it's risky to claim
> that there is a general benefit.
I have an explanation. We can schedule quicker off a woken task. When
an event occurs that allows an I/O-blocked task to run, its time-to-run
is shorter. Same event/response improvement that helps interactivity.
Robert Love
> The problem here is that when people report
> that the low latency patch works better for them
> than the preempt patch, they aren't talking about
> bebnchmarking the time to compile a kernel, they
> are talking about interactive feel and smoothness.
>
> You're speaking to a peripheral issue.
Yes, but I did latency testing for Robert for several months, now.
> I've no agenda other than wanting to see linux
> as an attractive option for the multimedia and
> gaming crowds
I am, too. But more for 3D visualization/simulation (with audio).
> - and in my experience, the low
> latency patches simply give a much smoother
> feel and a more pleasant experience.
Not for me. Even when lock-break is applied.
> Kernel
> compilation time is the farthest thing from my
> mind when e.g. playing Q3A!
Q3A is _NOT_ changed in any case. I even get a somewhat smoother system "feel" with
Q3A and UT 436 running in parallel on a UP 1 GHz Athlon II, 640 MB.
seen something on any Win box?
> I'd be happy to check out the preempt patch
> again and see if anything's changed, if the
> problem of tux+preempt oopsing has been
> dealt with -
You told me before that TUX showed some problems with preempt.
What problems? Are they TUX-specific?
Some latency numbers coming soon.
-Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Sun, Jan 13, 2002 at 01:22:57PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 12:42, [email protected] wrote:
>
> > 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
> > j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
> > j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
> > j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
> > j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
> > j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
> >
> > j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
> > j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
> > j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
> > j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
> > j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
>
> Again, preempt seems to reign supreme. Where is all the information
> correlating preempt is inferior? To be fair, however, we should bench a
> mini-ll+s test.
Your wish is granted. Here are the results for mini-ll + scheduler:
j100: 8:26.54
j100: 7:50.35
j100: 6:49.59
j100: 6:39.30
j100: 6:39.70
j75: 6:01.02
j75: 6:12.16
j75: 6:04.60
j75: 6:24.58
j75: 6:28.00
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
Hi,
Alan Cox wrote:
> > What somehow got lost in this discussion is that both patches don't
> > necessarily conflict with each other, they both attack the same problem
> > with different approaches, which complement each other. I prefer to get
> > the best of both patches.
>
> When you look at the benchmark there is no difference between ll and
> ll+pre-empt. ll alone takes you to the 1ms point.
I don't doubt that, but would you seriously consider the ll patch for
inclusion into the main kernel?
It's a useful patch for anyone who needs good latencies now, but it's
still a quick&dirty solution. Preempt offers a clean solution for a
certain part of the problem, as it's possible to cleanly localize the
needed changes for preemption (at least for UP). That means the ll patch
becomes smaller and future work on ll becomes simpler, since a certain
type of latency problem is handled automatically (and transparently),
so you do gain something by it.
The remaining places pointed out in the ll patch are worth a closer look
as well, as they are mostly spots where we hold a spinlock for too long. These should be
fixed anyway, as they mean possible contention problems on SMP.
> pre-empt takes you no
> further and to get much out of pre-emption requires you go and do all the
> hideously slow and complex priority inversion stuff.
The possibility of priority inversion problems is not new; it was
already discussed before. It was not considered a serious problem, since
all processes will still make progress. Preempt now increases the
likelihood that such a situation occurs, but nonetheless the processes will
still make progress. I can't remember any past report that
indicated a problem caused by priority inversion, and so I simply can't
believe it should become a massive problem now with preempt.
> > exactly that reason. I don't think we need to work around broken
> > hardware, but halfway decent hardware should not be a problem to get
> > decent latency.
>
> We have to work around common hardware not designed for SMP - the 8390 isn't
> a broken chip in that sense, it's just from a different era, and there are a
> lot of them.
Please let me rephrase: I just don't expect terribly good latency
numbers with non-DMA hardware.
bye, Roman
Robert Love wrote:
>
> ...
> > Benchmarks are well and good, but until we have a solid explanation for
> > the throughput changes which people are seeing, it's risky to claim
> > that there is a general benefit.
>
> I have an explanation. We can schedule quicker off a woken task. When
> an event occurs that allows an I/O-blocked task to run, its time-to-run
> is shorter. Same event/response improvement that helps interactivity.
>
Sounds more like handwaving than an explanation :)
The way to speed up dbench is to allow the processes which want to delete
files to actually do that. This reduces the total amount of IO which the
test performs. Another way is to increase usable memory (or at least to
delay the onset of balance_dirty going synchronous). Possibly it's something
to do with letting kswapd schedule earlier. Or bdflush.
In the swapstorm case, it's again not clear to me. Perhaps it's due to prompter
kswapd activity, perhaps due somehow to improved request merging.
As I say, without a precise and detailed understanding of the mechanisms
I wouldn't be prepared to claim more than "speeds up dbench and swapstorms
for some reason".
(I'd _like_ to know the complete reason - that way we can stare at it
and maybe make things even better. Doing a binary search through the
various chunks of the mini-ll patch would be instructive).
-
> I don't doubt that, but would you seriously consider the ll patch for
> inclusion into the main kernel?
The mini ll patch definitely. The full ll one needs some head scratching to
be sure it's correct. pre-empt is a 2.5 thing which in some ways is easier
because it doesn't matter if it breaks something.
> Please let me rephrase: I just don't expect terribly good latency
> numbers with non-DMA hardware.
Expect the same with DMA hardware too at times.
On January 12, 2002 09:21 pm, Alan Cox wrote:
> > > I didn't see anywhere you check disable_irq(). Even if you did, it doesn't
> > > help when I mask the irq on the chip rather than using disable_irq() calls.
> >
> > Well, if IRQs are disabled we won't have the timer... would not the
> > system panic anyhow if schedule() was called while in an interrupt
> > handler?
>
> You completely misunderstand.
>
> disable_irq(n)
>
> I disable a single specific interrupt, I don't disable the timer interrupt.
> Your code doesn't seem to handle that. It's just one of the examples of where
> you really need priority handling, and that's a horribly dark and slippery
> slope
He just needs to disable preemption there, it's just a slight mod to
disable/enable_irq. You probably have a few more of those, though...
--
Daniel
On January 12, 2002 07:54 pm, Alan Cox wrote:
> Another example is in the network drivers. The 8390 core for one example
> carefully disables an IRQ on the card so that it can avoid spinlocking on
> uniprocessor boxes.
>
> So with pre-empt this happens
>
> driver magic
> disable_irq(dev->irq)
inc task's preempt inhibit
> PRE-EMPT:
> [large periods of time running other code]
> PRE-EMPT:
> We get back and we've missed 300 packets, the serial port sharing
> the IRQ has dropped our internet connection completely.
We are ok
dec task's preempt inhibit
jmp if nonzero
--
Daniel
On Sun, Jan 13, 2002 at 03:35:08PM -0500, Mark Hahn wrote:
> > > > 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
> > > > j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
> > > > j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
> > > > j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
> > > > j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
> > > > j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
> > > >
> > > > j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
> > > > j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
> > > > j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
> > > > j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
> > > > j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
> > >
> > > Again, preempt seems to reign supreme. Where is all the information
> > > correlating preempt is inferior? To be fair, however, we should bench a
> > > mini-ll+s test.
> >
> > Your wish is granted. Here are the results for mini-ll + scheduler:
> >
> > j100: 8:26.54
> > j100: 7:50.35
> > j100: 6:49.59
> > j100: 6:39.30
> > j100: 6:39.70
> > j75: 6:01.02
> > j75: 6:12.16
> > j75: 6:04.60
> > j75: 6:24.58
> > j75: 6:28.00
>
> how about a real benchmark like -j2 or so (is this a dual machine?)
Why does everybody think this is not a *real* benchmark? When I remember
the good old days at university, the systems I tried to compile
applications on were *always* overloaded. Would it make a difference for
you if I ran
for a in lots_of.srpm; do
rpm --rebuild $a &
done
Basically this gives the same result: lots of compile jobs running in
parallel. All *I* am doing is taking it to a bit of an extreme, since running
the compile with make -j2 does not make a *noticeable* difference at all.
And as I said previously, my idea was to get the system into high memory
pressure and test the different vms (AA and RvR) ...
Furthermore, some people think this combination (sched+preempt) is only
good for latency (if at all); all I can say is that this works *very*
well for me latency-wise. Since I don't know how to measure latency
exactly, I tried to run my compile script (make -j50) while running my
usual desktop + xmms. Result: xmms was *not* skipping, although the
system was ~70MB into swap and the load was >50. Changing workspaces
worked immediately all the time. But I was able to get xmms to skip for
a short while by starting galeon, StarOffice, and gimp with ~10 pictures
all at the same time. But when all applications came up, xmms was not
skipping any more and the system was ~130MB into swap. This is the best
result for me so far, but I have to admit that I did not test mini-ll
+sched in this area (I can test this on Wednesday at the earliest, sorry).
Since it is a little while since I posted my system specs here they are:
- Athlon 1.2GHz (single proc)
- 256 MB
- IDE drive (quantum)
> also, I've often found the user/sys/elapsed components to all be interesting;
> how do they look? (I'd expect preempt to have more sys, for instance.)
13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3smini
(sys) (user)
j100: 30.78 297.07 32.40 294.38 * 27.74 296.02 27.55 292.95 28.30 297.67
j100: 30.92 297.11 33.04 295.15 * 29.14 296.25 26.88 292.77 28.13 296.44
j100: 29.58 297.90 34.01 294.16 * 27.56 295.76 26.25 293.79 27.96 296.47
j100: 30.62 297.13 32.00 294.30 * 28.47 296.46 27.64 293.42 27.50 297.47
j100: 30.48 299.43 32.28 295.42 * 27.77 296.44 27.53 292.10 27.23 297.24
As expected the system and the user times are almost identical. The "fastest"
compile results are always where the job gets the most %cpu time. So I guess
it would be more interesting to see how much cpu time e.g. kswapd gets.
Probably I have to enhance my script to run vmstat in the background ...
Would this provide useful data?
Regards,
Jogi
--
Well, yeah ... I suppose there's no point in getting greedy, is there?
<< Calvin & Hobbes >>
On January 13, 2002 02:41 am, Alan Cox wrote:
> [somebody wrote]
> > For a solution to latency concerns, I'd much prefer to lay a framework
> > down that provides a proper solution and then work on fine tuning the
> > kernel to get the desired latency out of it.
>
> As the low latency patch proves, the framework has always been there, the
> ll patches do the rest
For that matter, the -preempt patch proves that the framework has always -
i.e., since the genesis of SMP - been there for a preemptible kernel.
--
Daniel
On Sun, 2002-01-13 at 17:55, Daniel Phillips wrote:
> I'd like to add my 'me too' to those who have requested a re-run of this test, building
> the *identical* kernel tree every time, starting from the same initial conditions.
> Maybe that's what you did, but it's not clear from your post.
He later said he did in fact build the same tree, from the same initial
condition, in single user mode, etc etc ... sounded like good testing
methodology to me.
I later asked for a test of Ingo's sched with ll (to compare to Ingo's
sched with preempt). In this test, like the others, preempt gives the
best times.
Robert Love
Dieter Nützel wrote:
>Yes, but I did latency testing for Robert for several months, now.
>
I'm not saying you see no improvements, I'm
saying that most actual latency tests show
that low-latency gets better low latency results.
Perhaps there are differences in how each of
our particular setups responds to the respective
approaches.
>I am, too. But more for 3D visualization/simulation (with audio).
>
Certainly not conflicting goals!
>Not for me. Even when lock-break is applied.
>
That's odd - but IIUC preempt + lock-break was the
RML answer to the low latency patch and seems to
give similar results for a more complex solution.
>>Kernel
>>compilation time is the farthest thing from my
>>mind when e.g. playing Q3A!
>>
>
>Q3A is _NOT_ changed in any case. I even get a somewhat smoother system "feel" with Q3A and UT 436 running in parallel on a UP 1 GHz Athlon II, 640 MB.
>
That's odd - for me the low latency kernels give
not only a smoother feel, but also markedly higher
standing on average at the end of the game.
Perhaps your setup has something that is
mitigating the beneficial effects of the low
latency modifications?
Are you running a non-ext2 filesystem?
Do you have a video card that grabs the
bus for long periods?
And you set /proc/sys/kernel/lowlatency=1...
On my PIII 933 UP w/512 MB it does help.
>Have you seen something on any Win box?
>
I have seen the games played on windoze and
have played at lan parties with win32 opponents
but I do not personally play games on windoze.
Lack of interest, I guess...
>>I'd be happy to check out the preempt patch
>>again and see if anything's changed, if the
>>problem of tux+preempt oopsing has been
>>dealt with -
>>
>
>You told me before that TUX showed some problems with preempt. What problems? Are they TUX-specific?
>
On a kernel with both tux and preempt, upon
access to the tux webserver the kernel oopses
and tux dies. Not good when I depend on tux.
OTOH the low latency patch plays quite well
with tux. As said, I have no anti-preempt agenda,
I just need for whatever solution I use to work,
and not crash programs and services we use.
>Some latency numbers coming soon.
>
Great!
jjs
Well to start with:
1) Maybe I should be more precise: The latency measures I've seen
posted all favor Morton and not preempt. Since the claimed purpose
of both patches is improving latency, isn't that more interesting
than measurements of kernel compiles?
2) In these measurements
the tree is different each time, so the measurement doesn't
seem very stable. It's not exactly a secret that file layout
can have an effect on performance.
3) There is no measure of preempt without Ingo's scheduler
4) This is what I want to see:
Run the periodic SCHED_FIFO task I've posted multiple times (see the
sketch below).
Let's see the worst-case error.
Let's see the effect on the background kernel compile.
All the rest is just so much talk about "interactive feel". I saw
exactly the same claims from the people who wanted kernel graphics.
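For reference, the setup for such a periodic SCHED_FIFO task looks roughly like the sketch below (Victor's actual test program is not reproduced here); the measurement itself would be a periodic sleep/wake loop recording the worst-case wakeup error, as in the userspace sketch earlier in the thread.

#include <stdio.h>
#include <sched.h>
#include <sys/mman.h>

int main(void)
{
        struct sched_param sp;

        sp.sched_priority = 50;                  /* mid-range RT priority */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
                perror("sched_setscheduler (needs root)");
                return 1;
        }
        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
                perror("mlockall");              /* avoid paging delays */

        /* ... periodic sleep/measure loop, report worst-case error ... */
        return 0;
}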
On Sun, Jan 13, 2002 at 01:22:57PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 12:42, [email protected] wrote:
>
> > 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
> > j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
> > j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
> > j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
> > j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
> > j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
> >
> > j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
> > j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
> > j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
> > j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
> > j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
>
> Again, preempt seems to reign supreme. Where is all the information
> correlating preempt is inferior? To be fair, however, we should bench a
> mini-ll+s test.
>
> But I stand by my original point that none of this matters all too
> much. A preemptive kernel will allow for future latency reduction
> _without_ using explicit scheduling points everywhere there is a
> problem. This means we can tackle the problem and not provide a million
> bandaids.
>
> Robert Love
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Monday, 14. January 2002 00:33, J Sloan wrote:
> Dieter Nützel wrote:
> >I am, too. But more for 3D visualization/simulation (with audio).
>
> Certainly not conflicting goals!
Yes.
> >>Kernel
> >>compilation time is the farthest thing from my
> >>mind when e.g. playing Q3A!
> >
> > Q3A is _NOT_ changed in any case. I even get a somewhat smoother system "feel" with
> > Q3A and UT 436 running in parallel on a UP 1 GHz Athlon II, 640 MB.
>
> That's odd - for me the low latency kernels give
> not only a smoother feel, but also markedly higher
> standing on average at the end of the game.
What did you see?
During timedemo or avg fps?
> Perhaps your setup has something that is
> mitigating the beneficial effects of the low
> latency modifications?
>
> Are you running a non-ext2 filesystem?
Of course. All ReiserFS.
I have normally posted that together with my preempt numbers.
lock-break has some code for ReiserFS in it. I did the testing for Robert.
> Do you have a video card that grabs the
> bus for long periods?
Don't know.
Any tools available for measuring?
Latest development stuff for a Voodoo5 5500 AGP (XFree86 DRI CVS,
mesa-4-branch). I am somewhat involved in DRI development.
Glide3/3DNow!
> And you set /proc/sys/kernel/lowlatency=1...
I think I haven't forgotten that.
> >Have you seen something on any Win box?
>
> I have seen the games played on windoze and
> have played at lan parties with win32 opponents
> but I do not personally play games on windoze.
> Lack of interest, I guess...
I meant both running together at the same time.
Never seen that on a Windows box...
> On a kernel with both tux and preempt, upon
> access to the tux webserver the kernel oopses
> and tux dies. Not good when I depend on tux.
>
> OTOH the low latency patch plays quite well
> with tux. As said, I have no anti-preempt agenda,
> I just need for whatever solution I use to work,
> and not crash programs and services we use.
Sure. I only want to know your problem.
> >Some latency numbers coming soon.
>
> Great!
With some luck tonight.
It is 1 o'clock local time, here...
-Dieter
On Sun, Jan 13, 2002 at 09:25:50PM +0100, Roman Zippel wrote:
> I don't doubt that, but would you seriously consider the ll patch for
> inclusion into the main kernel?
> It's a useful patch for anyone, who needs good latencies now, but it's
> still a quick&dirty solution. Preempt offers a clean solution for a
> certain part of the problem, as it's possible to cleanly localize the
> needed changes for preemption (at least for UP). That means the ll patch
> becomes smaller and future work on ll becomes simpler, since a certain
That is exactly what Andrew Morton disputes. So why do you think he is
wrong?
On Sun, Jan 13, 2002 at 05:56:25PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 17:55, Daniel Phillips wrote:
>
> > I'd like to add my 'me too' to those who have requested a re-run of this test, building
> > the *identical* kernel tree every time, starting from the same initial conditions.
> > Maybe that's what you did, but it's not clear from your post.
>
> He later said he did in fact build the same tree, from the same initial
> condition, in single user mode, etc etc ... sounded like good testing
> methodology to me.
Really? You think that
unpack a tar archive
make
is a repeatable benchmark?
On Monday, 13. January 2002 17:56, yodaiken wrote:
> > He later said he did in fact build the same tree, from the same initial
> > condition, in single user mode, etc etc ... sounded like good testing
> > methodology to me.
>
> Really? You think that
> unpack a tar archive
> make
>
> is a repeatable benchmark?
Do it three times and send the geometric mean. Where's the problem?
Even with disk fragmentation. Maybe use an empty disk...
What about latencytest0.42-png?
We used it even before Ingo's O(1) existed.
Where can I find your latency test? Link?
-Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
> > I don't doubt that, but would you seriously consider the ll patch
for
> > inclusion into the main kernel?
>
> The mini ll patch definitely.
Huh?
Can you point at anyone who experienced a significant benefit from it?
I can see a lot of interesting patches ahead if you let this go.
Tell me honestly that the idea behind this patch is not _crap_. You
can only make this basic idea work if you patch a tremendous number of
those conditional_schedule()s throughout the kernel. We already saw it
starting off in some graphics drivers and network drivers. Why not just
all of it? In the end you will not be far from the roughly 4000 I
already stated in an earlier post.
I do believe Robert's preempt is a lot cleaner in its idea of _how_ to
achieve basically the same goal, although I am at least as sceptical as
you about a race-free implementation.
> The full ll one needs some head scratching to
> be sure its correct.
You may simply call it _counting_ (the files to patch).
> pre-empt is a 2.5 thing which in some ways is easier
> because it doesn't matter if it breaks something.
So I understand you agree somehow with me in the answer to "what idea
is really better?"...
Regards,
Stephan
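(For readers keeping count: the conditional_schedule() Stephan refers to is, in the low-latency patches, essentially just a need_resched check; a sketch, modulo the exact spelling in Andrew's 2.4 patch:)

#include <linux/sched.h>

/* Yield the CPU only if somebody is actually waiting for it. */
static inline void conditional_schedule(void)
{
    if (current->need_resched)
        schedule();
}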
>I question this because it is too risky to apply. There is no way any
>distribution or production system could ever consider applying the
>preempt kernel and ship it in its next kernel update 2.4. You never know
>if a driver will deadlock because it is doing a test and set bit busy
>loop by hand instead of using spin_lock and you cannot audit all the
>device drivers out there.
Quick question from a kernel newbie.
Could this audit be partially automated by the Stanford Checker? Or would
there be too many false positives from other similar looping code?
-Robert
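For reference, the hand-rolled locking pattern Andrea warns about looks roughly like this (a purely illustrative fragment, not taken from any real driver):

#include <asm/bitops.h>

/* A home-made "lock": nothing here raises the preempt count, so a
 * preemptible kernel may preempt the holder between test_and_set_bit()
 * and clear_bit().  If a higher-priority (e.g. SCHED_FIFO) task then
 * enters the same loop on a UP box, it spins forever and the holder
 * never runs again -- the deadlock Andrea describes, and exactly the
 * kind of code an automated audit would have to recognise. */
static unsigned long card_busy;                 /* hypothetical flag word */

static void hypothetical_card_op(void)
{
    while (test_and_set_bit(0, &card_busy))
        ;                                       /* busy wait, no schedule() */

    /* ... talk to the hardware ... */

    clear_bit(0, &card_busy);
}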
> Tell me honestly that the idea behind this patch is not _crap_. You
> can only make this basic idea work if you patch a tremendous number of
> those conditional_schedule()s throughout the kernel. We already saw it
> starting off in some graphics drivers and network drivers. Why not just
> all of it? In the end you will not be far from the roughly 4000 I
> already stated in an earlier post.
There are very few places you need to touch to get a massive benefit. Most
of the kernel already behaves extremely well.
> So I understand you agree somehow with me in the answer to "what idea
> is really better?"...
Do you want a clean simple solution or complex elegance? For 2.4 I definitely
favour clean and simple. For 2.5 it's an open debate
Hi,
[email protected] wrote:
> > It's a useful patch for anyone, who needs good latencies now, but it's
> > still a quick&dirty solution. Preempt offers a clean solution for a
> > certain part of the problem, as it's possible to cleanly localize the
> > needed changes for preemption (at least for UP). That means the ll patch
> > becomes smaller and future work on ll becomes simpler, since a certain
>
> That is exactly what Andrew Morton disputes. So why do you think he is
> wrong?
Please explain, what do you mean?
bye, Roman
> Quick question from a kernel newbie.
>
> Could this audit be partially automated by the Stanford Checker? or would
> there be too many false positives from other similar looping code?
Some of it can probably be audited, but much of this stuff depends on knowing
the hardware. I've yet to meet a gcc that can read manuals, alas.
On Tue, 8 Jan 2002, Andrea Arcangeli wrote:
> I'm not against preemption (I can see the benefits about the mean
> latency for real time DSP) but the claims about preemption making the
> kernel faster don't make sense to me. More frequent scheduling,
> overhead of branches in the locks (you have to conditional_schedule after
> the last preemption lock is released and the cachelines for the per-cpu
> preemption locks) and the other preemption stuff can only make the
> kernel slower. Furthermore, for multimedia playback any sane kernel out
> there with lowlatency fixes applied will work as well as a preemption
> kernel that pays for all the preemption overhead.
I'm not sure I have seen claims that it makes the kernel faster, but it
sure makes the latency lower, and improves performance for systems doing a
lot of network activity (DNS servers) with anything else running on the
system, such as daily reports and backups.
I will try the low latency kernel stuff, but I think intrinsically that if
you want to service the incoming requests quickly you have to dispatch to
them quickly, not at the end of a time slice. Preempt is a way to avoid
having to play with RT processes, and I think it's desirable in general as
an option where the load will benefit from such behaviour.
I'm not sure it "competes" with low latency, since many of the thing LL is
doing are "good things" in general.
Finally, I doubt that any of this will address my biggest problem with
Linux, which is that as memory gets cheap a program doing significant disk
writing can get buffers VERY full (perhaps a whole CD's worth) before the
kernel decides to do the write, at which point the system becomes
non-responsive for seconds at a time while the disk light comes on and
stays on. That's another problem, and I did play with some patches this
weekend without making myself really happy :-( Another topic,
unfortunately.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
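The per-unlock branch Andrea mentions has roughly this shape (a sketch only, with made-up my_* names and assuming kernel context; the real preempt patch is more careful than this):

/* Every spin_unlock() in a preemptible kernel ends up doing something
 * like this: drop the count, and if we are now preemptible and a
 * reschedule is pending, take it immediately. */
#define my_preempt_disable() \
    do { ++current->preempt_count; barrier(); } while (0)

#define my_preempt_enable() \
    do { \
        --current->preempt_count; \
        barrier(); \
        if (!current->preempt_count && current->need_resched) \
            preempt_schedule(); \
    } while (0)

The extra compare-and-branch per unlock, plus touching the count in the task structure, is the overhead being argued about.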
On Tue, 8 Jan 2002, Andrea Arcangeli wrote:
> Note that some of them are bugfixes; without them a luser can hang the
> machine for several seconds on any box with a few gigs of RAM by simply
> reading and writing into a large enough buffer. I think we never had
> time to care merging those bits into mainline yet and this is probably
> the main reason they're not integrated but it's something that should be
> in mainline IMHO.
Or just doing a large write while doing lots of reads... my personal
nemesis is "mkisofs" for backups, which reads lots of small files and
builds a CD image, which suddenly gets discovered by the kernel and
written, seemingly in a monolithic chunk. I MAY be able to improve this
by tuning the bdflush parameters, and I tried some tentative patches
which didn't make a huge gain.
I don't know if the solution lies in forcing write to start when a certain
size of buffers are queued regardless of percentages, or in better
scheduling of reads ahead of writes, or whatever.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Tue, 8 Jan 2002, Daniel Phillips wrote:
> On January 8, 2002 08:47 pm, Andrew Morton wrote:
> > There's no point in just merging the preempt patch and saying "there,
> > that's done". It doesn't do anything.
> >
> > Instead, a decision needs to be made: "Linux will henceforth be a
> > low-latency kernel".
>
> I thought the intention was to make it a config option?
Irrelevant, it has to be implemented in order to be an option, so the
amount of work involved is the same either way. And if you want to make it
a runtime setting you add a slight bit of work and overhead deciding if LL
is wanted.
I'm not advocating that, but it would allow admins to enable LL when the
system was slow and see if it really made a change. Rebooting is bound to
change the load ;-)
> > Now, IF we can come to this decision, then
> > internal preemption is the way to do it. But it affects ALL kernel
> > developers. Because we'll need to introduce a new rule: "it is a
> > bug to spend more than five milliseconds holding any locks".
> >
> > So. Do we we want a low-latency kernel? Are we prepared to mandate
> > the five-millisecond rule? It can be done, but won't be easy, and
> > we'll never get complete coverage. But I don't see the will around
> > here.
Really? You have people working on low latency, people working on preempt,
and at least a few of us trying to characterize the problems with large
memory and i/o. I would say latency has become a real issue, and you only
need enough "will" to have one person write useful code, this is a
committee.
Since changes of this type don't need to be perfect and address all cases,
just help some and not make others worse, I think we will see improvement
in 2.4.xx without waiting for 2.5 or 2.6. No one is complaining that
Linux's overall throughput is bad, that network performance is bad, etc. But
responsiveness has become an issue, and I'm sure there's enough will to
solve it. "Solve" means getting most of the delays to be caused by
hardware capacity instead of kernel ineptitude.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Mon, Jan 14, 2002 at 01:41:35AM +0100, Roman Zippel wrote:
> Hi,
>
> [email protected] wrote:
>
> > > It's a useful patch for anyone, who needs good latencies now, but it's
> > > still a quick&dirty solution. Preempt offers a clean solution for a
> > > certain part of the problem, as it's possible to cleanly localize the
> > > needed changes for preemption (at least for UP). That means the ll patch
> > > becomes smaller and future work on ll becomes simpler, since a certain
> >
> > That is exactly what Andrew Morton disputes. So why do you think he is
> > wrong?
>
> Please explain, what do you mean?
I mean, that these conversations are not very useful if you don't
read what the other people write.
Here's a prior response by Andrew to a post by you.
From [email protected] Sat Jan 12 13:15:22 2002
Roman Zippel wrote:
>
> Andrew's patch requires constant audition and Andrew can't audit all
> drivers for possible problems. That doesn't mean Andrew's work is
> wasted, since it identifies problems, which preempting can't solve, but
> it will always be a hunt for the worst cases, where preempting goes for
> the general case.
Guys,
I've heard this so many times, and it just ain't so. The overwhelming
majority of problem areas are inside locks. All the complexity and
maintainability difficulties to which you refer exist in the preempt
patch as well. There just is no difference.
>
> bye, Roman
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Sun, 2002-01-13 at 19:50, Alan Cox wrote:
> Do you want a clean simple solution or complex elegance ? For 2.4 I definitely
> favour clean and simple. For 2.5 its an open debate
Make no mistake, I do not intend to see preempt-kernel in 2.4. I will,
however, continue to maintain the patch for end users and others who use
it. A proper in-kernel solution for 2.4, in my opinion, is mini-ll,
perhaps extended with any other obviously-completely-utterly sane bits
from full-ll.
For 2.5, however, I tout preempt as the answer. This does not mean just
preempt. It means a preemptible kernel as a basis for beginning
low-latency works in manners other than explicit scheduling statements.
Robert Love
On Sun, 2002-01-13 at 19:41, Roman Zippel wrote:
> > That is exactly what Andrew Morton disputes. So why do you think he is
> > wrong?
Victor is saying that Andrew contends the hard parts of his low-latency
patch are just as hard to maintain with a preemptive kernel. This is
true, for the places where spinlocks are held anyway, but it assumes we
continue to treat lock breaking and explicit scheduling as our only
solution. It isn't under a preemptible kernel.
Robert Love
On Tue, 8 Jan 2002, Robert Love wrote:
> On Tue, 2002-01-08 at 15:59, Daniel Phillips wrote:
>
> > And while I'm enumerating differences, the preemptable kernel (in this
> > incarnation) has a slight per-spinlock cost, while the non-preemptable kernel
> > has the fixed cost of checking for rescheduling, at intervals throughout all
> > 'interesting' kernel code, essentially all long-running loops. But by clever
> > coding it's possible to finesse away almost all the overhead of those loop
> > checks, so in the end, the non-preemptible low-latency patch has a slight
> > efficiency advantage here, with emphasis on 'slight'.
>
> True (re spinlock weight in preemptible kernel) but how is that not
> comparable to explicit scheduling points? Worse, the preempt-kernel
> typically does its preemption on a branch on return to interrupt
> (similar to user space's preemption). What better time to check and
> reschedule if needed?
I'm not sure that preempt and low latency really are attacking the same
problem. What I am finding is that LL improves overall performance when a
process does something which is physically slow, like a find in a
directory with 20k files. On the other hand, PK makes the system respond
better to changes. In particular I see that DNS servers which have
other work running, even backups or reports, are more responsive with PK,
as are usenet news servers. I find it hard to measure "feels faster" with
either approach, although like the supreme court "I know it when I see
it."
I'd like to hope that some of each will get in the main kernel, PK has
been stable for me for a while, LL has never been unstable but I've run it
less.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Wed, 9 Jan 2002, Kent Borg wrote:
> How does all this fit into doing a tick-less kernel?
>
> There is something appealing about doing stuff only when there is
> stuff to do, like: respond to input, handle some device that becomes
> ready, or let another process run for a while. Didn't IBM do some
> nice work on this for Linux? (*Was* it nice work?) I was under the
> impression that the current kernel isn't that far from being tickless.
>
> A tickless kernel would be wonderful for battery powered devices that
> could literally shut off when there is nothing to do, and it seems it
> would (trivially?) help performance on high end power hogs too.
>
> Why do we have regular HZ ticks? (Other than I think I remember Linus
> saying that he likes them.)
Feel free to quantify the savings over the current setup with max power
saving enabled in the kernel. I just don't see how "wonderful" it would
be, given that an idle system currently uses very little battery if you
setup the options to save power.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
On Sun, Jan 13, 2002 at 07:46:54PM -0500, Bill Davidsen wrote:
> Finally, I doubt that any of this will address my biggest problem with
> Linux, which is that as memory gets cheap a program doing significant disk
> writing can get buffers VERY full (perhaps a whole CD's worth) before the
> kernel decides to do the write, at which point the system becomes
> non-responsive for seconds at a time while the disk light comes on and
> stays on. That's another problem, and I did play with some patches this
> weekend without making myself really happy :-( Another topic,
> unfortunately.
I think this is a critical problem. I'd like to be able to have some
assurance that a task with a buffer of size N doing read-disk->write-disk
will maintain data flow at some minimal rate over intervals of 1 or 2
seconds or something like that.
> Feel free to quantify the savings over the current setup with max power
> saving enabled in the kernel. I just don't see how "wonderful" it would
> be, given that an idle system currently uses very little battery if you
> setup the options to save power.
IBM have a tickless kernel patch set for the S/390. Here it's not battery at
stake but the VM overhead of sending timer interrupts to hundreds of otherwise
idle virtual machines.
On Sun, 13 Jan 2002, Roman Zippel wrote:
> So far I haven't seen any evidence, that preempt introduces any _new_
> serious problems, so I'd rather like to see to get the best out of
> both.
Are you seriously suggesting you haven't read a single
email in this thread yet ?
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Sun, 2002-01-13 at 20:50, Rik van Riel wrote:
> > So far I haven't seen any evidence, that preempt introduces any _new_
> > serious problems, so I'd rather like to see to get the best out of
> > both.
>
> Are you seriously suggesting you haven't read a single
> email in this thread yet ?
No, I think he is suggesting he doesn't consider any of the problems
serious. A lot of it is just smoke. What is "bad" wrt 2.5?
Robert Love
Bill Davidsen wrote:
>
> Finally, I doubt that any of this will address my biggest problem with
> Linux, which is that as memory gets cheap a program doing significant disk
> writing can get buffers VERY full (perhaps a whole CD's worth) before the
> kernel decides to do the write, at which point the system becomes
> non-responsive for seconds at a time while the disk light comes on and
> stays on. That's another problem, and I did play with some patches this
> weekend without making myself really happy :-( Another topic,
> unfortunately.
/proc/sys/vm/bdflush: Decreasing the kupdate interval from five
seconds and decreasing the nfract and nfract_sync settings in there
should smooth this out. The -aa patches add start and stop
levels for bdflush as well, which means that bdflush can be the
one who blocks on IO rather than your process. And it means that
the request queue doesn't get 100% drained as soon as the writer
hits nfract_sync.
All very interesting and it will be fun to play with when it
*finally* gets merged.
But with the current elevator design, disk read latencies will
still be painful.
-
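For anyone who wants to poke at the knobs Andrew mentions, a trivial way to look at the current values first (the field order differs between 2.4 versions, so check Documentation/sysctl/vm.txt for the running kernel before echoing anything back):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/bdflush", "r");
    char line[256];

    if (!f) {
        perror("/proc/sys/vm/bdflush");
        return 1;
    }
    if (fgets(line, sizeof(line), f))
        printf("current bdflush parameters: %s", line);
    fclose(f);

    /* Writing is just "echo <full parameter list> > /proc/sys/vm/bdflush";
     * lowering nfract, nfract_sync and the kupdate interval is what is
     * being suggested above. */
    return 0;
}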
> Or just doing a large write while doing lots of reads... my personal
> nemesis is "mkisofs" for backups, which reads lots of small files and
> builds a CD image, which suddenly gets discovered by the kernel and
> written, seemingly in a monolithic chunk. I MAY be able to improve this
> by tuning the bdflush parameters, and I tried some tentative patches
> which didn't make a huge gain.
>
> I don't know if the solution lies in forcing write to start when a certain
> size of buffers are queued regardless of percentages, or in better
> scheduling of reads ahead of writes, or whatever.
Have you observed it with -rmap or -aa, too?
I bet you have.
Try Andrew's read-latency.patch then.
I use it on top of O(1) and preempt all the time.
It should be one of the next 2.4.18-preX/2.4.19-preX patches.
Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On January 12, 2002 04:07 pm, [email protected] wrote:
> Hello Andrea,
>
> I did my usual compile testings (untar kernel archive, apply patches,
> make -j<value> ...
>
> Here are some results (Wall time + Percent cpu) for each of the consecutive five runs:
>
> 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp
> j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83%
> j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83%
> j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87%
> j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87%
> j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86%
>
> j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93%
> j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88%
> j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91%
> j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91%
> j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88%
>
> * build incomplete (OOM killer killed several cc1 ... )
>
> So far 2.4.13-pre5aa1 had been the king of the block in compile times.
> But this has changed. Now the (by far) fastest kernel is 2.4.18-pre
> + Ingos scheduler patch (s) + preemptive patch (p). I did not test
> preemptive patch alone so far since I don't know if the one I have
> applies cleanly against -pre3 without Ingos patch. I used the
> following patches:
>
> s: sched-O1-2.4.17-H6.patch
> p: preempt-kernel-rml-2.4.18-pre3-ingo-1.patch
>
> I hope this info is useful to someone.
I'd like to add my 'me too' to those who have requested a re-run of this test, building
the *identical* kernel tree every time, starting from the same initial conditions.
Maybe that's what you did, but it's not clear from your post.
Thanks,
Daniel
On January 13, 2002 04:36 pm, Arjan van de Ven wrote:
> On Sun, Jan 13, 2002 at 04:18:29PM +0100, Roman Zippel wrote:
>
> > What somehow got lost in this discussion, that both patches don't
> > necessarily conflict with each other, they both attack the same problem
> > with different approaches, which complement each other. I prefer to get
> > the best of both patches.
>
> If you do this (and I've seen the results of doing both at once vs only
> > either of them vs pure) then there's NO benefit for the preemption left.
Sorry, that's incorrect. I stated why earlier in this thread and akpm signed
off on it. With preempt you get ASAP (i.e., as soon as the outermost
spinlock is done) process scheduling. With hand-coded scheduling points you
get 'as soon as it happens to hit a scheduling point'.
That is not the only benefit, just the most obvious one.
--
Daniel
Daniel Phillips wrote:
>
> On January 13, 2002 04:36 pm, Arjan van de Ven wrote:
> > On Sun, Jan 13, 2002 at 04:18:29PM +0100, Roman Zippel wrote:
> >
> > > What somehow got lost in this discussion, that both patches don't
> > > necessarily conflict with each other, they both attack the same problem
> > > with different approaches, which complement each other. I prefer to get
> > > the best of both patches.
> >
> > If you do this (and I've seen the results of doing both at once vs only
> > either of them vs pure) then there's NO benefit for the preemption left.
>
> Sorry, that's incorrect. I stated why earlier in this thread and akpm signed
> off on it. With preempt you get ASAP (i.e., as soon as the outermost
> spinlock is done) process scheduling. With hand-coded scheduling points you
> get 'as soon as it happens to hit a scheduling point'.
With preempt it's "as soon as you hit a lock-break point". They're equivalent,
for the inside-lock case, which is where most of the problems and complexity
lie.
> That is not the only benefit, just the most obvious one.
I'd say the most obvious benefit of preempt is that it catches some
of the cases which the explicit schedules do not - the stuff which
the developer didn't test for, and which is outside locks.
How useful this is, is moot.
But we can *make* it useful. I believe that internal preemption is
the foundation to improve 2.5 kernel latency. But first we need
consensus that we *want* linux to be a low-latency kernel.
Do we have that?
If we do, then as I've said before, holding a lock for more than N milliseconds
becomes a bug to be fixed. We can put tools in the hands of testers to
locate those bugs. Easy.
-
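A very rough sketch of the kind of tester tool meant here: wrap one lock of interest and complain when it was held across timer ticks. The my_* macros are made up, and jiffies granularity is only ~10 ms at HZ=100, so a real tool would read the TSC instead:

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/timer.h>

static unsigned long my_lock_grabbed;           /* jiffies when taken */

#define my_spin_lock(lock) \
    do { \
        spin_lock(lock); \
        my_lock_grabbed = jiffies; \
    } while (0)

#define my_spin_unlock(lock) \
    do { \
        if (time_after(jiffies, my_lock_grabbed + 1)) \
            printk(KERN_WARNING "lock %p held for %lu jiffies\n", \
                   (void *)(lock), jiffies - my_lock_grabbed); \
        spin_unlock(lock); \
    } while (0)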
So now you can take a look at my latencytest0.42-png numbers.
Sorry, no LL or -rmap numbers yet.
As someone reported that O(1)-H7 is somewhat broken, I can confirm that for
the non-preempted case. Preemption hides it most of the time.
latencytest0.42-png-2.4.18-pre3-O1-aa-VM-22.tar.bz2
2.4.18-pre3
sched-O1-2.4.17-H7.patch
10_vm-22
00_nanosleep-5
bootmem-2.4.17-pre6
read-latency.patch
waitq-2.4.17-mainline-1
plus
all 2.4.18-pre3.pending ReiserFS stuff
latencytest0.42-png-2.4.18-pre3-O1-aa-VM-22-preempt-2.tar.bz2
preempt-kernel-rml-2.4.18-pre3-ingo-2.patch
lock-break-rml-2.4.18-pre1-1.patch
2.4.18-pre3
sched-O1-2.4.17-H7.patch
10_vm-22
00_nanosleep-5
bootmem-2.4.17-pre6
read-latency.patch
waitq-2.4.17-mainline-1
plus
all 2.4.18-pre3.pending ReiserFS stuff
Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Mon, Jan 14, 2002 at 06:03:43AM +0100, Daniel Phillips wrote:
> On January 13, 2002 04:36 pm, Arjan van de Ven wrote:
> > On Sun, Jan 13, 2002 at 04:18:29PM +0100, Roman Zippel wrote:
> >
> > > What somehow got lost in this discussion, that both patches don't
> > > necessarily conflict with each other, they both attack the same problem
> > > with different approaches, which complement each other. I prefer to get
> > > the best of both patches.
> >
> > If you do this (and I've seen the results of doing both at once vs only
> > either of them vs pure) then there's NO benefit for the preemption left.
>
> Sorry, that's incorrect. I stated why earlier in this thread and akpm signed
> off on it. With preempt you get ASAP (i.e., as soon as the outermost
> spinlock is done) process scheduling. With hand-coded scheduling points you
> get 'as soon as it happens to hit a scheduling point'.
>
> That is not the only benefit, just the most obvious one.
My understanding of this situation is as follows:
The pure preempt measurements show some improvement on synthetic
latency benchmarks that have not been shown to have any relationship
to any real application
The LL measurements show _better_ results on similar benchmarks.
Some people find preempt improves "feel"
Some people find LL improves "feel"
The interactions of these improvements with Ingo's scheduler, aa mm, and
other recent patches are exceptionally murky
We have one benchmark showing that kernel compiles run on different
untarred trees give a slight advantage to preempt+Ingo via some
unknown mechanism. This benchmark, aside from its dubious repeatability,
tests something that seems to have no relationship to _anything_ at all
by running a huge number of compile processes.
Nobody has answered my question about the conflict between SMP per-cpu caching
and preempt. Since NUMA is apparently the future of MP in the PC world and
the future of Linux servers, it's interesting to consider this tradeoff.
Nobody has answered the question about how to make sure all processes
make progress with preempt.
Nobody has offered a single benchmark of actual application code benefits
from either preempt or LL.
Nobody has clearly explained how to avoid what I claim to be the inevitable
result of preempt -- priority inheritance locks (not just semaphores).
What we have is some "we'll figure that out when we get to it".
It's not even clear how preempt is supposed to interact with SCHED_FIFO.
As far as your "most obvious" "benefit". It's neither obvious that it happens
or obvious that it is a benefit. According to the measurements I've seen, Andrew
reduces latency _more_ than preempt. Andrews argument, as I understand it, is that
the longest latencies are within spinlocks anyways so increasing speed of preempt
outside those locks misses the problem. If he is correct, then if you are correct,
it doesn't matter - preempt is reducing already short latencies.
BTW: there is a detailed explanation of how priority inherit works in Solaris in the
UNIX Internals book. It's worth reading and thinking about.
I'm not at all sure that putting preempt into 2.5 is a bad idea. I think that 2.4
has a long lifetime ahead of it and the debacle that will follow putting preempt into 2.5
will eventually discredit the entire idea for at least a year or two.
But
I think that there are some much more important scheduling issues that are being ignored to
"improve the feel" of DVD playing. The key one is some idea of being able to assure processes
of some rate of progress. This is not classical RT, but it is important to multimedia and
databases and also to some applications we are interested in looking at.
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On January 13, 2002 04:59 pm, Alan Cox wrote:
> > > I disable a single specific interrupt, I don't disable the timer interrupt.
> > > Your code doesn't seem to handle that.
> >
> > It can if we increment the preempt_count in disable_irq_nosync and
> > decrement it on enable_irq.
>
> A driver that knows about how its irq is handled and that it is sole
> user (eg ISA) may and some do leave it disabled for hours at a time
Good point. Preemption would be disabled for that thread if we mindlessly
shut it off for every irq_disable. For that driver we probably just want to
leave preemption enabled, it can't hurt.
--
Daniel
On Monday 14 January 2002 12:59 am, Daniel Phillips wrote:
> On January 13, 2002 04:59 pm, Alan Cox wrote:
> > > > I disable a single specific interrupt, I don't disable the timer interrupt.
> > > > Your code doesn't seem to handle that.
> > >
> > > It can if we increment the preempt_count in disable_irq_nosync and
> > > decrement it on enable_irq.
> >
> > A driver that knows about how its irq is handled and that it is sole
> > user (eg ISA) may and some do leave it disabled for hours at a time
>
> Good point. Preemption would be disabled for that thread if we mindlessly
> shut it off for every irq_disable. For that driver we probably just want
> to leave preemption enabled, it can't hurt.
Once we return to user space, we can preempt again. If preemption is still
disabled upon return from the syscall, I'd say it's okay to switch it back on
now. :)
Unless I'm missing something fundamental...?
Rob
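Spelled out, the idea from the exchange above would look something like this (hypothetical wrappers, assuming the preempt patch's preempt_disable()/preempt_enable() helpers; this is not code from any posted patch):

/* Treat a disabled irq like a held lock, so preemption stays off until
 * the driver re-enables the line.  Alan's objection applies directly:
 * a driver that leaves its irq disabled for hours would then also run
 * with preemption off for hours, so a blanket rule doesn't work. */
static inline void my_disable_irq_nosync(unsigned int irq)
{
    preempt_disable();
    disable_irq_nosync(irq);
}

static inline void my_enable_irq(unsigned int irq)
{
    enable_irq(irq);
    preempt_enable();
}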
On January 13, 2002 08:35 pm, J Sloan wrote:
> The problem here is that when people report
> that the low latency patch works better for them
> than the preempt patch, they aren't talking about
> benchmarking the time to compile a kernel, they
> are talking about interactive feel and smoothness.
Nobody is claiming the low latency patch works better than
-preempt+lock_break, only that low latency can equal -preempt+lock_break,
which is a claim I'm skeptical of, but oh well.
> I've no agenda other than wanting to see linux
> as an attractive option for the multimedia and
> gaming crowds - and in my experience, the low
> latency patches simply give a much smoother
> feel and a more pleasant experience. Kernel
> compilation time is the farthest thing from my
> mind when e.g. playing Q3A!
You need to read the thread *way* more closely ;-)
> I'd be happy to check out the preempt patch
> again and see if anything's changed, if the
> problem of tux+preempt oopsing has been
> dealt with -
Right, useful.
--
Daniel
On Saturday den 12 January 2002 19.54, Alan Cox wrote:
> Another example is in the network drivers. The 8390 core for one example
> carefully disables an IRQ on the card so that it can avoid spinlocking on
> uniprocessor boxes.
>
> So with pre-empt this happens
>
> driver magic
> disable_irq(dev->irq)
> PRE-EMPT:
> [large periods of time running other code]
> PRE-EMPT:
> We get back and we've missed 300 packets, the serial port sharing
> the IRQ has dropped our internet connection completely.
>
> ["Don't do that then" isnt a valid answer here. If I did hold a lock
> it would be for several milliseconds at a time anyway and would reliably
> trash performance this time]
>
./drivers/net/8390.c
I checked the code in ./drivers/net/8390.c - this is how it REALLY looks...
/* Ugly but a reset can be slow, yet must be protected */
disable_irq_nosync(dev->irq);
spin_lock(&ei_local->page_lock);
/* Try to restart the card. Perhaps the user has fixed something. */
ei_reset_8390(dev);
NS8390_init(dev, 1);
spin_unlock(&ei_local->page_lock);
enable_irq(dev->irq);
This should be mostly OK for the preemptive kernel. Swapping the irq and spin
lock lines should be preferred. But I think that is the case in SMP too...
Suppose two processors do the disable_irq_nosync - unlikely but possible...
One gets the spinlock, the other waits
The first runs through the code, exits the spin lock, enables irq
The second starts running the code - without irq disabled!!!
This would work in both cases.
/* Ugly but a reset can be slow, yet must be protected */
spin_lock(&ei_local->page_lock);
disable_irq_nosync(dev->irq);
/* Try to restart the card. Perhaps the user has fixed something. */
ei_reset_8390(dev);
NS8390_init(dev, 1);
enable_irq(dev->irq);
spin_unlock(&ei_local->page_lock);
/RogerL
--
Roger Larsson
Skellefteå
Sweden
On Mon, Jan 14, 2002 at 06:03:43AM +0100, Daniel Phillips wrote:
> Sorry, that's incorrect. I stated why earlier in this thread and akpm signed
> off on it. With preempt you get ASAP (i.e., as soon as the outermost
> spinlock is done) process scheduling. With hand-coded scheduling points you
> get 'as soon as it happens to hit a scheduling point'.
>
> That is not the only benefit, just the most obvious one.
Big duh. So you get there 1 usec sooner. NOBODY will notice that. NOBODY.
> Re: [2.4.17/18pre] VM and swap - it's really unusable
>
>
> Right. And that is precisely why I created the "mini-ll" patch. To
> give the improved "feel" in a way which is acceptable for merging into
> the 2.4 kernel.
>
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
>
> Here it is again:
>
I did. Not standalone, but in the combination of:
2.4.17+preempt+lock-break+vmscan.c+read_latency
Had to merge/drop some of the changes, as they already
were in lock-break. So far, the stuff works -> no hangs/freezes/oopses.
My goal in applying this stuff is to get better interactivity and
responsiveness on my laptop (320 MB, either 2x or no swap).
The biggest improvements I had recently were the patch to vmscan.c by
Martin v. Leuwen and the inclusion of the read_latency stuff from Andrea
(?). That basically removed all the memory problems (cache forcing
excessive swapping out) and IO hangs (vmware doing IO freezing system
for 10s of seconds).
preempt, lock-break and I think mini-ll have further improved the
interactive "feeling". And no, I have no hard data. I am not into
Audio/DVD playback, so ultra-low worst-case latency is not my ultimate
desire. Great VM+IO performance while having great interactivity is :-)
Which probably brings us back to the topic of this thread :-))
Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759
> This should be mostly OK for the preemptive kernel. Swapping the irq and spin
> lock lines should be preferred. But I think that is the case in SMP too...
You deadlock if you swap the two lines over. In this case for pre-empt you
really have to go in and add non pre-emption places to the driver
On Mon, 14 Jan 2002 00:50:54 +0000 (GMT)
Alan Cox <[email protected]> wrote:
> > all of it? You will not be far away in the end from the 'round 4000 I
> > already stated in earlier post.
>
> There are very few places you need to touch to get a massive benefit. Most
> of the kernel already behaves extremely well.
Just a short question: the last (add-on) patch to mini-ll I saw on the list
patches: drivers/net/3c59x.c
drivers/net/8139too.c
drivers/net/eepro100.c
Unfortunately I have none of those. This would mean I cannot benefit from
_these_ patches, but instead would need _others_ (like tulip or
name-one-of-the-rest-of-the-drivers) to see _some_ effect you tell me I
_should_ see (I currently see _none_). How do you argue then against the
statement: we need patches for /drivers/net/*.c ?? I do not expect 3c59x.c to
be particularly bad in comparison to tulip/*.c or let's say via-rhine.c, do you?
> > So I understand you agree somehow with me in the answer to "what idea
> > is really better?"...
>
> Do you want a clean simple solution or complex elegance ? For 2.4 I
definitely> favour clean and simple. For 2.5 its an open debate
Hm, obviously the ll-patches look simple, but their sheer required number makes
me think they are as much stupid as simple. This whole story looks like making
an old Mac do real multitasking: just spread scheduling points
throughout the code ... This is like drilling for water on top of the mountain.
I want the water too, but I say there must be a nice valley somewhere around
...
Regards,
Stephan
> Just a short question: the last (add-on) patch to mini-ll I saw on the list
> patches: drivers/net/3c59x.c
> drivers/net/8139too.c
> drivers/net/eepro100.c
I've seen multiple quite frankly bizarre patches like that. I've not applied
them and don't see the point
On 13 Jan 2002 20:17:11 -0500
Robert Love <[email protected]> wrote:
> For 2.5, however, I tout preempt as the answer. This does not mean just
> preempt. It means a preemptible kernel as a basis for beginning
> low-latency works in manners other than explicit scheduling statements.
Aha, exactly my thoughts...
Regards,
Stephan
Stephan von Krawczynski wrote:
>
> ...
> Unfortunately I have none of those. This would mean I cannot benefit from
> _these_ patches, but instead would need _others_ (like tulip or
> name-one-of-the-rest-of-the-drivers) to see _some_ effect you tell me I
> _should_ see (I currently see _none_). How do you argue then against the
> statement: we need patches for /drivers/net/*.c ?? I do not expect 3c59x.c to
> be particularly bad in comparison to tulip/*.c or lets say via-rhine.c, do you?
>
In 3c59x.c, probably the biggest problem will be the call to issue_and_wait()
in boomerang_start_xmit(). On a LAN which is experiencing heavy collision rates
this can take as long as 2,000 PCI cycles (it's quite rare, and possibly an
erratum). It is called under at least two spinlocks.
In via-rhine, wait_for_reset() can busywait for up to ten milliseconds.
via_rhine_tx_timeout() calls it from under a spinlock.
In eepro100.c, wait_for_cmd_done() can busywait for one millisecond
and is called multiple times under spinlock.
Preemption will help _some_ of this, but by no means all, or enough.
-
Hi,
[email protected] wrote:
> I mean, that these conversations are not very useful if you don't
> read what the other people write.
Oh, I do read everything that's for me, but it happens I'm not answering
everything, but thanks for your kind reminder.
> Here's a prior response by Andrew to a post by you.
>
> I've heard this so many times, and it just ain't so. The overwhelming
> majority of problem areas are inside locks. All the complexity and
> maintainability difficulties to which you refer exist in the preempt
> patch as well. There just is no difference.
There is a difference. There is of course the maintenance work which both
approaches have in common: every kernel has to be tested for new latency
problems. What differs is the amount of problems that need fixing, as
preempt can handle the problems outside of locks automatically, but
inserting schedule points for these cases is usually quite simple. (I
leave it open whether these problems are really only the minority; it
doesn't matter much, it only matters that they do exist.)
The remaining problems need to be examined again under either approach.
Breaking up the locks and inserting (implicit or explicit) schedule points
is often the simpler solution, but analyzing the problem and adjusting the
algorithm usually leads to the cleaner solution. Anyway, this again is
pretty much common to both approaches.
There is an additional cost for maintaining the explicit schedule points,
as they mean additional code all over the kernel, which has to be
maintained and verified every time something is changed in that area.
This work has to be done by someone; if the ll patch were included
in the standard kernel, the burden would be put onto every maintainer of
the subsystems that were changed. The easy approach for these maintainers
could of course be to just drop those schedule points, but then the ll
maintainer has to start from zero again, and it adds to the testing costs above.
This additional cost does not exist for the cases that are automatically
handled by preempting.
bye, Roman
Hi,
On Sun, 13 Jan 2002, Rik van Riel wrote:
> > So far I haven't seen any evidence, that preempt introduces any _new_
> > serious problems, so I'd rather like to see to get the best out of
> > both.
>
> Are you seriously suggesting you haven't read a single
> email in this thread yet ?
Could you please be more explicit?
bye, Roman
Daniel Phillips wrote:
>I'd like to add my 'me too' to those who have requested a re-run of this test, building
>the *identical* kernel tree every time, starting from the same initial conditions.
>Maybe that's what you did, but it's not clear from your post.
>
It's obvious it's the same source code under different kernels. It would make
no sense to do otherwise, and it would require more effort than just booting
another image and running the same script.
Marian
Hi,
On Sun, 13 Jan 2002 [email protected] wrote:
> Nobody has answered my question about the conflict between SMP per-cpu caching
> and preempt. Since NUMA is apparently the future of MP in the PC world and
> the future of Linux servers, it's interesting to consider this tradeoff.
Preempt is a UP feature so far.
> Nobody has answered the question about how to make sure all processes
> make progress with preempt.
The same way as without preempt.
> Nobody has clearly explained how to avoid what I claim to be the inevitable
> result of preempt -- priority inheritance locks (not just semaphores).
> What we have is some "we'll figure that out when we get to it".
So far you haven't given any reason why preempt should lead to this.
(If I missed something, please explain it in a way a mere mortal can
understand it.)
> It's not even clear how preempt is supposed to interact with SCHED_FIFO.
The same way as without preempt.
More of other FUD deleted, Victor, could you please stop this?
bye, Roman
On Sun, Jan 13, 2002 at 01:11:21PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 10:18, [email protected] wrote:
>
> > No, I use a script which is run in single user mode after a reboot. So
> > there are only a few processes running when I start the script (see
> > attachment) and the jobs should start from the same environment.
> >
> > > What happens when you do the same test, compiling one kernel under multiple
> > > different kernels?
> >
> > That is exactly what I am doing. I even try to my best to have the exact
> > same starting environment ...
>
> So there you go, his testing is accurate. Now we have results that
> preempt works and is best and it is still refuted. Everyone is running
> around with these "ll is best" or "preempt sucks throughput" and that is
Assuming the report can be trusted, this is not a test where we can
measure a throughput regression; this is a VM-intensive test and nothing
else. Swap load.
In short, run top and check that you've got 100% system load and the cpus are
never idle or in userspace, and _then_ it will most certainly be an interesting
benchmark for -preempt throughput.
Furthermore, the whole comparison is flawed: plain -O(1) is as broken as
mainline w.r.t. the scheduling points, and -aa has the right scheduling
points but not the -O(1) scheduler, so there's no way to compare those
numbers at all. If you want to make any real comparison you should apply
-preempt on top of -aa.
Assuming it is really -preempt that makes the numbers more repeatable
(and not the fact that -O(1) alone has the broken rescheduling points), this
still doesn't prove anything yet. The lower numbers are most certainly
because the tasks taking the page faults get rescheduled faster; -aa
didn't do more cpu work, it apparently just had the cpus more idle than
-preempt. This may be the indication of an important scheduling point
missing somewhere; if somebody could run a lowlatency measurement during
a swap-intensive load and send me the offending IP, that could probably
be addressed with a one-liner.
Andrea
On Sat, Jan 12, 2002 at 01:12:27PM -0800, J Sloan wrote:
>
> Ah - if it stands a chance of going into 2.4,
> I'll test the heck out of it!
> I'll give it the Q3A test, the RtCW test, the
> xine/xmms/dbench tests, and more - glad
> to be of service.
> jjs
> Andrew Morton wrote:
>
> Ed Sweetman wrote:
>
> If you want to test the preempt kernel you're going to need something that
> can find the mean latency or "time to action" for a particular program or
> all programs being run at the time and then run multiple programs that you
> would find on various peoples' systems. That is the "feel" people talk
> about when they praise the preempt patch.
>
> Right. And that is precisely why I created the "mini-ll" patch. To
> give the improved "feel" in a way which is acceptable for merging into
> the 2.4 kernel.
> And guess what? Nobody has tested the damn thing, so it's going
> nowhere.
> Here it is again:
> --- linux-2.4.18-pre3/fs/buffer.c Fri Dec 21 11:19:14 2001
> +++ linux-akpm/fs/buffer.c Sat Jan 12 12:22:29 2002
> @@ -249,12 +249,19 @@ static int wait_for_buffers(kdev_t dev,
> struct buffer_head * next;
> int nr;
>
> - next = lru_list[index];
> nr = nr_buffers_type[index];
> +repeat:
> + next = lru_list[index];
> while (next && --nr >= 0) {
> struct buffer_head *bh = next;
> next = bh->b_next_free;
>
> + if (dev == NODEV && current->need_resched) {
> + spin_unlock(&lru_list_lock);
> + conditional_schedule();
> + spin_lock(&lru_list_lock);
> + goto repeat;
> + }
> if (!buffer_locked(bh)) {
This introduces the possibility of looping indefinitely, which is why I
rejected it while I merged the other mini-ll points into -aa. If you
want to do anything like that, at the very least you should roll the head
of the list as well, or something like that.
Andrea
> More of other FUD deleted, Victor, could you please stop this?
Insulting people won't make problems go away Roman.
On Sun, Jan 13, 2002 at 01:22:57PM -0500, Robert Love wrote:
> On Sun, 2002-01-13 at 12:42, [email protected] wrote:
>
> > 13-pre5aa1 18-pre2aa2 18-pre3 18-pre3s 18-pre3sp 18-pre3minill
> > j100: 6:59.79 78% 7:07.62 76% * 6:39.55 81% 6:24.79 83% *
> > j100: 7:03.39 77% 8:10.04 66% * 8:07.13 66% 6:21.23 83% *
> > j100: 6:40.40 81% 7:43.15 70% * 6:37.46 81% 6:03.68 87% *
> > j100: 7:45.12 70% 7:11.59 75% * 7:14.46 74% 6:06.98 87% *
> > j100: 6:56.71 79% 7:36.12 71% * 6:26.59 83% 6:11.30 86% *
> >
> > j75: 6:22.33 85% 6:42.50 81% 6:48.83 80% 6:01.61 89% 5:42.66 93% 7:07.56 77%
> > j75: 6:41.47 81% 7:19.79 74% 6:49.43 79% 5:59.82 89% 6:00.83 88% 7:17.15 74%
> > j75: 6:10.32 88% 6:44.98 80% 7:01.01 77% 6:02.99 88% 5:48.00 91% 6:47.48 80%
> > j75: 6:28.55 84% 6:44.21 80% 9:33.78 57% 6:19.83 85% 5:49.07 91% 6:34.02 83%
> > j75: 6:17.15 86% 6:46.58 80% 7:24.52 73% 6:23.50 84% 5:58.06 88% 7:01.39 77%
>
> Again, preempt seems to reign supreme. Where is all the information
Those comparisons are totally flawed. There's nothing to compare in
there.
minill misses the O(1) scheduler, and -aa has a faster vm etc... there's
absolutely nothing to compare in the above numbers; all variables
change at the same time.
I'm amazed I have to say this, but in short:
1) to compare minill with preempt, apply each of them to 18-pre3 as the
only patch applied (no O(1) in the way of preempt!!!!)
2) to compare -aa with preempt, apply -preempt on top of -aa and see
what difference it makes
If you don't follow exactly those simple rules you will change a huge
number of variables at the same time, and it will again be impossible to
make any comparison or deduction from the numbers.
Andrea
On Sun, Jan 13, 2002 at 07:32:18PM +0000, Alan Cox wrote:
> > Again, preempt seems to reign supreme. Where is all the information
> > showing that preempt is inferior? To be fair, however, we should bench a
> > mini-ll+s test.
>
> How about some actual latency numbers ?
With a huge rescheduling rate (huge swapout/swapin load) and the
scheduler walking over 100 tasks at each schedule it is insane to
deduce anything from those numbers (-preempt was using the O(1)
scheduler!!!!). So please don't make any assumptions by just looking at
those numbers.
Andrea
On Mon, 14 Jan 2002 02:04:56 -0800
Andrew Morton <[email protected]> wrote:
> Stephan von Krawczynski wrote:
> >
> > ...
> > Unfortunately I have none of those. This would mean I cannot benefit from
> > _these_ patches, but instead would need _others_
[...]
>
> In 3c59x.c, probably the biggest problem will be the call to issue_and_wait()
> in boomerang_start_xmit(). On a LAN which is experiencing heavy collision
> rates this can take as long as 2,000 PCI cycles (it's quite rare, and possibly
> an erratum). It is called under at least two spinlocks.
>
> In via-rhine, wait_for_reset() can busywait for up to ten milliseconds.
> via_rhine_tx_timeout() calls it from under a spinlock.
>
> In eepro100.c, wait_for_cmd_done() can busywait for one millisecond
> and is called multiple times under spinlock.
Did I get that right: as long as we are spinlocked, there is no sense in
conditional_schedule()?
> Preemption will help _some_ of this, but by no means all, or enough.
Maybe we should really try to shorten the lock-times _first_. You mentioned a
way to find the bad guys?
Regards,
Stephan
On Sun, Jan 13, 2002 at 03:04:35PM -0500, Robert Love wrote:
> user system. But things like (ack!) dbench 16 show a marked
> improvement.
Please try again on top of -aa, and I have to specify this: benchmarked
in a way that can be trusted and compared, so we can make some use of
this information. This means with -18pre2aa2 alone and with only -preempt on
top of -18pre2aa2.
NOTE: I'd be glad to say "preempt rules", "go preempt", "preempt is
cool" like you as soon as I have some proof that it makes _THE_ difference
and that it is worth the mess on SMP (per-cpu, RCU locking, etc...), not
to mention the other architectures. But at the moment there's only a
number of people running xmms on mainline with the broken scheduling
points, and those numbers that cannot be compared in any sane way. I
repeat, I'm not against preempt, I just want to get some real-world
proof and measurement, and at the moment I think preempt isn't worth it.
But if you give us _any_ real-world proof that a mean latency as low as
10/100 usec matters for getting most of the cpu cycles out of
the cpu during thrashing (as one could speculate from the
broken benchmark posted in this thread), and that there's no real
regression from the additional branches in the spin_unlock under 100%
system load, I may change my mind (and of course, only for anything above
2.5; and still I think there are more interesting optimizations to do
rather than requiring everybody to spend lots of time fixing drivers,
auditing, fixing smp, rcu locking etc... but OK, if it is an obviously good
thing [aka no real regression and only benefits long term] it would be
OK to do it early as well). I'm not particularly worried about the
preempt lock around the per-cpu stuff, that's a local cacheline, and it
could go into the schedule_data like I did for the rcu per-cpu
variables, so they're at zero cacheline cost (the RCU_poll patch now costs
only 1 instruction per schedule and zero memory overhead [an incl
instruction, precisely]).
> > Benchmarks are well and good, but until we have a solid explanation for
> > the throughput changes which people are seeing, it's risky to claim
> > that there is a general benefit.
>
> I have an explanation. We can schedule quicker off a woken task. When
> an event occurs that allows an I/O-blocked task to run, its time-to-run
> is shorter. Same event/response improvement that helps interactivity.
That's a nice speculation from a broken comparison. It may really be
the case, but there's no way to be sure; before that sum of usec you should
also sum the seconds spent walking the tasklist in the non-O(1)
scheduler.
Andrea
Hi,
On Mon, 14 Jan 2002, Alan Cox wrote:
> > More of other FUD deleted, Victor, could you please stop this?
>
> Insulting people won't make problems go away Roman.
I'm really trying to avoid this. I'm more than happy to discuss
theoretical or practical problems _if_ they are backed by arguments;
the latter are very thin with Victor. Making pointless claims only triggers
the above reaction. If I really did miss a major argument so far, I will
publicly apologize.
bye, Roman
Rob Landley wrote:
>
> On Friday 11 January 2002 09:50 pm, [email protected] wrote:
> > On Fri, Jan 11, 2002 at 03:33:22PM -0500, Robert Love wrote:
> > > On Fri, 2002-01-11 at 07:37, Alan Cox wrote:
> > > The preemptible kernel plus the spinlock cleanup could really take us
> > > far. Having locked at a lot of the long-held locks in the kernel, I am
> > > confident at least reasonable progress could be made.
> > >
> > > Beyond that, yah, we need a better locking construct. Priority
> > > inversion could be solved with a priority-inheriting mutex, which we can
> > > tackle if and when we want to go that route. Not now.
> >
> > Backing the car up to the edge of the cliff really gives us
> > good results. Beyond that, we could jump off the cliff
> > if we want to go that route.
> > Preempt leads to inheritance and inheritance leads to disaster.
>
> I preempt leads to disaster than Linux can't do SMP. Are you saying that's
> the case?
There is a difference. Preempt has the same locking requirements as
SMP, but there are also _timing_ requirements.
> The preempt patch is really "SMP on UP". If pre-empt shows up a problem,
> then it's a problem SMP users will see too. If we can't take advantage of
> the existing SMP locking infrastructure to improve latency and interactive
> feel on UP machines, than SMP for linux DOES NOT WORK.
One example where preempt may break and SMP does not:
Consider driver code. Critical data structures are protected by
spinlocks, but some of the access to the hardware device itself is outside
those locks (I can prove that the other processors can't get there with
the driver in that state anyway).
Now, hardware access has timing requirements. That works on SMP because
you don't lose the CPU to anything but interrupts, and they are fast.
You get it back almost immediately. The device in question times out
after a much longer interval.
But preempt may decide to run a time-consuming higher-priority task in
the middle of device access, causing the hardware to time out and fail.
Hardware access isn't necessarily in an interrupt handler. It may be
done directly in a read/write/ioctl call if the device happens
to be available at the moment.
This is a case where SMP works even though preempt may fail. I don't
know if this is an issue for existing drivers, but it is possible.
Helge Hafting
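A concrete, made-up shape of the hazard Helge describes: two port writes with a timing requirement, done outside any lock because on SMP only an interrupt could get in between them (all names and the timing figure are invented for illustration):

#include <asm/io.h>

#define MYDEV_REG_CMD   0x00            /* hypothetical register/commands */
#define MYDEV_CMD_ARM   0x01
#define MYDEV_CMD_GO    0x02

struct mydev { unsigned int iobase; };  /* hypothetical device */

static void mydev_kick(struct mydev *dev)
{
    outb(MYDEV_CMD_ARM, dev->iobase + MYDEV_REG_CMD);
    /* No lock is held here, so a preemptible kernel may switch to a
     * long-running higher-priority task at this point ... */
    outb(MYDEV_CMD_GO, dev->iobase + MYDEV_REG_CMD);
    /* ... and if the second write arrives too late, the (hypothetical)
     * device has already timed out, even though the same code is safe
     * on SMP where only short interrupt handlers can intervene. */
}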
>
> > All the numbers I've seen show Morton's low latency just works better. Are
> > there other numbers I should look at.
>
> This approach is basically a collection of heuristics. The kernel has been
> profiled and everywhere a latency spike was found, a band-aid was put on it
> (an explicit scheduling point). This doesn't say there aren't other latency
> spikes, just that with the collection of hardware and software being
> benchmarked, the latency spikes that were found have each had a band-aid
> individually applied to them.
>
> This isn't a BAD thing. If the benchmarks used to find latency spikes are at
> all like real-world use, then it helps real-world applications. But of
> COURSE the benchmarks are going to look good, since tuning the kernel to
> those benchmarks is the way the patch was developed!
>
> The majority of the original low latency scheduling point work is handled
> automatically by the SMP on UP kernel. You don't NEED to insert scheduling
> points anywhere you aren't inside a spinlock. So the SMP on UP patch makes
> most of the explicit scheduling point patch go away, accomplishing the same
> thing in a less intrusive manner. (Yes, it makes all kernels act like SMP
> kernels for debugging purposes. But you can turn it off for debugging if you
> want to, that's just another toggle in the magic sysreq menu. And this isn't
> entirely a bad thing: applying the enormous UP userbase to the remaining SMP
> bugs is bound to squeeze out one or two more obscure ones, but those bugs DO
> exist already on SMP.)
>
> However, what's left of the explicit scheduling work is still very useful.
> When you ARE inside a spinlock, you can't just schedule, you have to save
> state, drop the lock(s), schedule, re-acquire the locks, and reload your
> state in case somebody else diddled with the structures you were using. This
> is a lot harder than just scheduling, but breaking up long-held locks like
> this helps SMP scalability, AND helps latency in the SMP-on-UP case.
>
> So the best approach is a combination of the two patches. SMP-on-UP for
> everything outside of spinlocks, and then manually yielding locks that cause
> problems. Both Robert Love and Andrew Morton have come out in favor of each
> other's patches on lkml just in the past few days. The patches work together
> quite well, and each wants to see the other's patch applied.
>
> Rob
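A minimal sketch of the drop-and-relock pattern Rob describes in the quoted
text above (save state, drop the lock, schedule, retake the lock,
revalidate); all names here are hypothetical and the list walk is
deliberately simplistic:

#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/list.h>

static spinlock_t widget_lock = SPIN_LOCK_UNLOCKED;
static LIST_HEAD(widget_list);			/* protected by widget_lock */

static void scan_widgets(void (*fn)(struct list_head *))
{
	struct list_head *p;

	spin_lock(&widget_lock);
restart:
	list_for_each(p, &widget_list) {
		fn(p);				/* must not remove entries */
		if (current->need_resched) {
			spin_unlock(&widget_lock);
			schedule();	/* conditional_schedule() in the ll patches */
			spin_lock(&widget_lock);
			/*
			 * The list may have changed while we slept, so the
			 * only safe "state" to reload here is "start over".
			 * Restarting from the head is also where the livelock
			 * worry raised later in the thread comes from.
			 */
			goto restart;
		}
	}
	spin_unlock(&widget_lock);
}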
> I'm really trying to avoid this, I'm more than happy to discuss
> theoretical or practical problems _if_ they are backed by arguments,
> latter are very thin with Victor. Making pointless claims only triggers
> above reaction. If I did really miss a major argument so far, I will
> publicly apologize.
You seem to be missing the fact that latency guarantees only work if you
can make progress. If a low priority process is pre-empted owning a
resource (quite likely) then you won't get your good latency. To
handle those cases you get into priority boosting, and all sorts of lock
complexity - so that the task that owns the resource temporarily can borrow
your priority in order that you can make progress at your needed speed.
That gets horrendously complex, and you get huge chains of priority
dependencies, including hardware-level ones.
The low latency patches don't make that problem go away, but they achieve
equivalent real world latencies up to at least the point you have to do
priority handling of that kind.
Alan
> > In eepro100.c, wait_for_cmd_done() can busywait for one millisecond
> > and is called multiple times under spinlock.
>
> Did I get that right: as long as we're spinlocked there's no sense in
> conditional_schedule()?
No conditional schedule, no pre-emption. You would need to rewrite that code
to do something like try for 100uS then queue a 1 tick timer to retry
asynchronously. That makes the code vastly more complex for an error case, and
for some drivers, where the irq mask is required during reset waits, it won't help.
Yet again there are basically 1mS limitations buried in the hardware.
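For illustration, a rough sketch (hypothetical driver, 2.4 timer API) of the
rewrite Alan is describing: poll for at most ~100us, and if the command still
hasn't completed, back off to a one-tick timer instead of busywaiting a full
millisecond under the lock. As Alan notes, this doesn't help drivers that must
keep interrupts masked across the wait:

#include <linux/timer.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/sched.h>

struct nicdev {
	spinlock_t lock;
	struct timer_list retry;		/* init_timer()'d at probe time */
	int (*cmd_done)(struct nicdev *);	/* polls the hardware */
	void (*issue_next)(struct nicdev *);	/* issues the queued command */
};

static void nic_cmd_retry(unsigned long data)
{
	struct nicdev *dev = (struct nicdev *)data;
	unsigned long flags;
	int i;

	spin_lock_irqsave(&dev->lock, flags);
	for (i = 0; i < 100; i++) {		/* at most ~100us of polling */
		if (dev->cmd_done(dev)) {
			dev->issue_next(dev);
			spin_unlock_irqrestore(&dev->lock, flags);
			return;
		}
		udelay(1);
	}
	spin_unlock_irqrestore(&dev->lock, flags);

	/* Still busy: try again in one tick rather than spinning on. */
	dev->retry.function = nic_cmd_retry;
	dev->retry.data = data;
	mod_timer(&dev->retry, jiffies + 1);
}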
>>>>> "yodaiken" == yodaiken <[email protected]> writes:
yodaiken> It's not even clear how preempt is supposed to interact with SCHED_FIFO.
How so? Is the POSIX specification not clear enough, or is it not to be followed?
Regards,
-velco
On Monday 14 January 2002 13:17, Momchil Velikov wrote:
> >>>>> "yodaiken" == yodaiken <[email protected]> writes:
>
> yodaiken> It's not even clear how preempt is supposed to interact with
> SCHED_FIFO.
>
> How so ? The POSIX specification is not clear enough or it is not to be
> followed ?
You can have an rt task block on a lock held by a normal task that was
preempted by an rt task of lower priority. The same problem as with the
sched_idle patches.
Regards
Oliver
On Mon, 2002-01-14 at 06:56, Andrea Arcangeli wrote:
> On Sun, Jan 13, 2002 at 03:04:35PM -0500, Robert Love wrote:
> > user system. But things like (ack!) dbench 16 show a marked
> > improvement.
>
> please try again on top of -aa, and I've to specify this : benchmarked
> in a way that can be trusted and compared, so we can make some use of
> this information. This mean with -18pre2aa2 alone and only -preempt on
> top of -18pre2aa2.
I realize the test isn't directly comparing what we want, so I asked him
for ll+O(1) benchmark, which he gave. Another set would be to do
preempt and ll alone.
Robert Love
Hi,
On Mon, 14 Jan 2002, Alan Cox wrote:
> You seem to be missing the fact that latency guarantees only work if you
> can make progress. If a low priority process is pre-empted owning a
> resource (quite likely) then you won't get your good latency. To
> handle those cases you get into priority boosting, and all sorts of lock
> complexity - so that the task that owns the resource temporarily can borrow
> your priority in order that you can make progress at your needed speed.
> That gets horrendously complex, and you get huge chains of priority
> dependancies including hardware level ones.
Any ll approach so far only addresses a single type of latency - the time
from waking up an important process until it really gets the cpu. What is
not handled by any patch are i/o latencies, meaning the average time to
get access to a specific resource. (To be exact, breaking up locks does of
course modify i/o latencies, but that's more of a side effect.)
I/O latencies are only relevant to this discussion insofar as we need to
verify they are not overly harmed by improving scheduling latencies.
Preempting does not modify the behaviour of the scheduler; all it does is
increase the scheduling frequency. This means it can happen that a low
priority task locks a resource for a longer time, because it's interrupted
by another task. Nevertheless the current scheduler guarantees every
process gets its share of the cpu time(*), so the low priority task will
continue and release the resource within a guaranteed amount of time.
So the worst behaviour I see is that on a loaded system, a low priority
task can hold up another task; if that task should be our interactive
task, the interactivity is of course gone. But this problem is not really
new, as we have no guarantees regarding i/o latencies. So everyone using
any patch should be aware that it's not a magical tool: to get better
scheduling latencies one has to trade something else, but so far I
haven't seen any evidence that it makes anything else much worse.
(*) This of course assumes accurate cpu time accounting, but I mentioned
this problem before. On the other hand it's also fixable; the tickless
patch looks most interesting in this regard.
bye, Roman
On Mon, Jan 14, 2002 at 12:14:47PM +0100, Roman Zippel wrote:
> Hi,
>
> On Sun, 13 Jan 2002 [email protected] wrote:
>
> > Nobody has answered my question about the conflict between SMP per-cpu caching
> > and preempt. Since NUMA is apparently the future of MP in the PC world and
> > the future of Linux servers, it's interesting to consider this tradeoff.
>
> Preempt is a UP feature so far.
I think this is a sufficient summary of your engineering approach.
...
> More of other FUD deleted, Victor, could you please stop this?
I guess that Andrew, Alan, Andrea and I all are raising objections that
you ignore because we have some kind of shared bias.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Mon, Jan 14, 2002 at 02:17:46PM +0200, Momchil Velikov wrote:
> >>>>> "yodaiken" == yodaiken <[email protected]> writes:
> yodaiken> It's not even clear how preempt is supposed to interact with SCHED_FIFO.
>
> How so ? The POSIX specification is not clear enough or it is not to be followed ?
POSIX makes no specification of how scheduling classes interact - unless something changed
in the new version.
But more than that, the problem of preemption is much more complex when you have
tasks that do not share the "goodness fade" with everything else. That is, given a
set of SCHED_OTHER processes at time T0, it is reasonable to design the scheduler so
that there is some D so that by time T0+D each process has become the highest priority
and has received cpu up to either a complete time slice or an I/O block. Linux kind of
has this property now, and I believe that making this more robust and easier to analyze
is going to be an enormously important issue. However, once you add SCHED_FIFO in the
current scheme, this becomes more complex. And with preempt, you cannot even offer the
assurance that once a process gets the cpu it will make _any_ advance at all.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
I forgot the line that says: "Oliver pointed out the immediate problem but .."
On Mon, Jan 14, 2002 at 06:45:48AM -0700, [email protected] wrote:
> On Mon, Jan 14, 2002 at 02:17:46PM +0200, Momchil Velikov wrote:
> > >>>>> "yodaiken" == yodaiken <[email protected]> writes:
> > yodaiken> It's not even clear how preempt is supposed to interact with SCHED_FIFO.
> >
> > How so ? The POSIX specification is not clear enough or it is not to be followed ?
>
> POSIX makes no specification of how scheduling classes interact - unless something changed
> in the new version.
>
> But more than that, the problem of preemption is much more complex when you have
> task that do not share the "goodness fade" with everything else. That is, given a
> set of SCHED_OTHER processes at time T0, it is reasonable to design the scheduler so
> that there is some D so that by time T0+D each process has become the highest priority
> and has received cpu up to either a complete time slice or a I/O block. Linux kind of
> has this property now, and I believe that making this more robust and easier to analyze
> is going to be an enormously important issue. However, once you add SCHED_FIFO in the
> current scheme, this becomes more complex. And with preempt, you cannot even offer the
> assurance that once a process gets the cpu it will make _any_ advance at all.
>
>
>
> --
> ---------------------------------------------------------
> Victor Yodaiken
> Finite State Machine Labs: The RTLinux Company.
> http://www.fsmlabs.com http://www.rtlinux.com
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Mon, Jan 14, 2002 at 12:14:47PM +0100, Roman Zippel wrote:
> On Sun, 13 Jan 2002 [email protected] wrote:
>
> > Nobody has answered the question about how to make sure all processes
> > make progress with preempt.
>
> The same way as without preempt.
>
> More of other FUD deleted, Victor, could you please stop this?
Roman, Victor asks meaningful questions.
On Mon, Jan 14, 2002 at 12:18:57PM +0100, Marian Jancar wrote:
> Daniel Phillips wrote:
>
> >I'd like to add my 'me too' to those who have requested a re-run of this test, building
> >the *identical* kernel tree every time, starting from the same initial conditions.
> >Maybe that's what you did, but it's not clear from your post.
> >
>
> It's obvious it's the same source code under different kernels. It would
> make no sense to do otherwise, and it would require more effort than just
> booting another image and running the same script.
It's a different tree each time. Same contents. Different
tree.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
> So the worst behaviour I see is that on a loaded system, a low priority
> task can hold up another task, if that task should be our interactive
> task, the interactivity is of course gone. But this problem is not really
> new, as we have no guarantees regarding i/o latencies. So everyone using
> any patch should be aware of that it's not a magical tool and for getting
> better scheduling latencies, one has to trade something else, but so far
> I haven't seen any evidence that it makes something else much worse.
The issue is that it doesn't make anything better. It's more complex than
ll but gains nothing.
Hi,
On Mon, 14 Jan 2002, Alan Cox wrote:
> It doesn't make anything better is the issue. Its more complex than ll but
> gains nothing
Please see my previous mail about maintenance costs.
bye, Roman
On Mon, 14 Jan 2002, Roman Zippel wrote:
> Any ll approach so far only addresses a single type of latency - the
> time from waking up an important process until it really gets the cpu.
> What is not handled by any patch are i/o latencies, that means the
> average time to get access to a specific resource.
OK, suppose you have three tasks.
A is a SCHED_FIFO task
B is a nice 0 SCHED_OTHER task
C is a nice +19 SCHED_OTHER task
Task B is your standard CPU hog, running all the time, task C has
grabbed an inode semaphore (no spinlock), task A wakes up, preempts
task C, tries to grab the inode semaphore and goes back to sleep.
Now task A has to wait for task B to give up the CPU before task C
can run again and release the semaphore.
Without preemption task C would not have been preempted and it would
have released the lock much sooner, meaning task A could have gotten
the resource earlier.
Using the low latency patch we'd insert some smart code into the
algorithm so task A also releases the lock before rescheduling.
Before you say this thing never happens in practice, I ran into
this thing in real life with the SCHED_IDLE patch. In fact, this
problem was so severe it convinced me to abandon SCHED_IDLE ;))
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
Hi,
On Mon, 14 Jan 2002 [email protected] wrote:
> > > Nobody has answered my question about the conflict between SMP per-cpu caching
> > > and preempt. Since NUMA is apparently the future of MP in the PC world and
> > > the future of Linux servers, it's interesting to consider this tradeoff.
> >
> > Preempt is a UP feature so far.
>
> I think this is a sufficient summary of your engineering approach.
Would you care to explain what the hell you want?
Preempt on SMP has more problems than you mention above, so the scope
of my arguments only included UP. Sorry if I missed something, but
preempt on SMP is an entirely different discussion.
> > More of other FUD deleted, Victor, could you please stop this?
>
> I guess that Andrew, Alan, Andrea and I all are raising objections that
> you ignore because we have some kind of shared bias.
No, your sparse use of arguments makes the difference.
bye, Roman
On 20020114 Stephan von Krawczynski wrote:
>
>Hm, obviously the ll-patches look simple, but their sheer number makes me
>think they are as stupid as they are simple. This whole story looks like
>making an old mac do real multitasking - just spread scheduling points around
Yup. That reminds me of...
Is there some kernel call every driver already makes where we could hide
a conditional_schedule(), so everyone gets it even without knowing about it?
Just like Apple put the SystemTask() inside GetNextEvent()...
--
J.A. Magallon # Let the source be with you...
mailto:[email protected]
Mandrake Linux release 8.2 (Cooker) for i586
Linux werewolf 2.4.18-pre3-beo #5 SMP Sun Jan 13 02:14:04 CET 2002 i686
Hi,
On Mon, 14 Jan 2002 [email protected] wrote:
> is going to be an enormously important issue. However, once you add SCHED_FIFO in the
> current scheme, this becomes more complex. And with preempt, you cannot even offer the
> assurance that once a process gets the cpu it will make _any_ advance at all.
I'm not sure if I understand you correctly, but how is this related to
preempt? A SCHED_FIFO task only delays SCHED_OTHER tasks, but it doesn't
consume their time slice, so the remaining tasks still get their
(previously assigned) time at the cpu, until all tasks have consumed
their share.
bye, Roman
"J.A. Magallon" wrote:
>
> On 20020114 Stephan von Krawczynski wrote:
> >
> >Hm, obviously the ll-patches look simple, but their pure required number makes
> >me think they are as well stupid as simple. This whole story looks like making
> >an old mac do real multitasking, just spread around scheduling points
>
> Yup. That remind me of...
> Would there be any kernel call every driver is doing just to hide there
> a conditional_schedule() so everyone does it even without knowledge of it ?
> Just like Apple put the SystemTask() inside GetNextEvent()...
Well the preempt patch sort of does this in every spin_unlock*() .....
This is getting silly ... feedback like "ll is better than PK", "feels
smooth", "is responsive", "my kernel compile is faster than yours", etc.
is not getting us any closer to the "how" of making a better kernel.
What's the goal? How should SMP and NUMA behave? How is success measured?
It would be good to be very clear on the ultimate purpose before making
radical changes. All of these changes are dancing around some vague
concept of responsiveness... so define it!
These comments seem to set a better tone for this thread; perhaps we can
concentrate on _useful_ debate around some well-defined goal.
[email protected] wrote:
> The key one is some idea of being able to assure processes
> of some rate of progress. This is not classical RT, but it is important to multimedia and
> databases and also to some applications we are interested in looking at.
Andrew Morton wrote:
> But we can **make** it useful. I believe that internal preemption is
> the foundation to improve 2.5 kernel latency. But first we need
> consensus that we **want** linux to be a low-latency kernel.
>
> Do we have that?
>
> If we do, then as I've said before, holding a lock for more than N milliseconds
> becomes a bug to be fixed. We can put tools in the hands of testers to
> locate those bugs. Easy.
>
On Mon, Jan 14, 2002 at 08:38:54AM -0500, Robert Love wrote:
> On Mon, 2002-01-14 at 06:56, Andrea Arcangeli wrote:
> > On Sun, Jan 13, 2002 at 03:04:35PM -0500, Robert Love wrote:
> > > user system. But things like (ack!) dbench 16 show a marked
> > > improvement.
> >
> > please try again on top of -aa, and I've to specify this : benchmarked
> > in a way that can be trusted and compared, so we can make some use of
> > this information. This mean with -18pre2aa2 alone and only -preempt on
> > top of -18pre2aa2.
>
> I realize the test isn't directly comparing what we want, so I asked him
> for ll+O(1) benchmark, which he gave. Another set would be to do
^^ actually mini-ll
right (I was still in the middle of the backlog of my emails, so I
didn't know he had just produced the mini-ll+O(1)). The mini-ll+O(1) shows
that -preempt is still a bit faster (as expected, not much faster
anymore). The reason it is faster is probably really the sum of the few
usec of userspace latency that you save. However, given the small
difference in numbers in this pathological case (-j1 obviously cannot
take advantage of the few usec of reduced latency), it still makes me
think it isn't worth the pain and the complexity, or at least somebody
should also prove that it doesn't visibly drop performance in a 100%
cpu-bound _system_ (not user) time load (a la the pagecache_lock
collision testcase with sendfile etc.), in general with a single thread
in the system.
Andrea
Hi,
On Mon, 14 Jan 2002, Rik van Riel wrote:
> Without preemption task C would not have been preempted and it would
> have released the lock much sooner, meaning task A could have gotten
> the resource earlier.
Define "much sooner", nobody disputes that low priority tasks can be
delayed, that's actually the purpose of both patches.
> Using the low latency patch we'd insert some smart code into the
> algorithm so task A also releases the lock before rescheduling.
Could you please show me that "smart code"?
> Before you say this thing never happens in practice, I ran into
> this thing in real life with the SCHED_IDLE patch. In fact, this
> problem was so severe it convinced me to abandon SCHED_IDLE ;))
SCHED_IDLE is something completely different from preempt. Rik, do I
really have to explain the difference?
bye, Roman
On Mon, Jan 14, 2002 at 03:56:05PM +0100, Roman Zippel wrote:
> Hi,
>
> On Mon, 14 Jan 2002 [email protected] wrote:
>
> > is going to be an enormously important issue. However, once you add SCHED_FIFO in the
> > current scheme, this becomes more complex. And with preempt, you cannot even offer the
> > assurance that once a process gets the cpu it will make _any_ advance at all.
>
> I'm not sure if I understand you correctly, but how is this related to
> preempt?
It's pretty subtle. If there is no preempt, processes don't get preempted.
If there is preempt, they can be preempted. Amazing isn't it?
> Re: [2.4.17/18pre] VM and swap - it's really unusable
>
>
> Ken,
>
> Attached is an update to my previous vmscan.patch.2.4.17.c
>
> Version "d" fixes a BUG due to a race in the old code _and_
> is much less aggressive at cache shrinkage, or conversely more
> willing to swap out, though not as much as the stock kernel.
>
> It continues to work well wrt to high vm pressure.
>
> Give it a whirl to see if it changes your "-j" symptoms.
>
> If you like you can change the one line in the patch
> from "DEF_PRIORITY" which is "6" to progressively smaller
> values to "tune" whatever kind of swap_out behaviour you
> like.
>
> Martin
>
Martin,
looking at the "d" version, I have one question on the piece that calls
swap_out:
@@ -521,6 +524,9 @@
}
spin_unlock(&pagemap_lru_lock);
+ if (max_mapped <= 0 && (nr_pages > 0 || priority < DEF_PRIORITY))
+ swap_out(priority, gfp_mask, classzone);
+
return nr_pages;
}
Curious about the conditions where swap_out is actually called, I added a
printk and actually found cases where you call swap_out when nr_pages is
already 0. What sense does that make? I would have thought that
shrink_cache had done its job in that case.
shrink_cache: 24 page-request, 0 pages-to swap, max_mapped=-1599, max_scan=4350, priority=5
shrink_cache: 24 page-request, 0 pages-to swap, max_mapped=-487, max_scan=4052, priority=5
shrink_cache: 29 page-request, 0 pages-to swap, max_mapped=-1076, max_scan=1655, priority=5
shrink_cache: 2 page-request, 0 pages-to swap, max_mapped=-859, max_scan=820, priority=5
Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759
>>>>> "Oliver" == Oliver Neukum <[email protected]> writes:
Oliver> On Monday 14 January 2002 13:17, Momchil Velikov wrote:
>> >>>>> "yodaiken" == yodaiken <[email protected]> writes:
>>
yodaiken> It's not even clear how preempt is supposed to interact with
>> SCHED_FIFO.
>>
>> How so ? The POSIX specification is not clear enough or it is not to be
>> followed ?
Oliver> You can have an rt task block on a lock held by a normal task that was
Oliver> preempted by a rt task of lower priority. The same problem as with the
Oliver> sched_idle patches.
This can happen with a non-preemptible kernel too. And it has nothing to
do with scheduling policy.
On January 14, 2002 01:46 am, Bill Davidsen wrote:
> Finally, I doubt that any of this will address my biggest problem with
> Linux, which is that as memory gets cheap a program doing significant disk
> writing can get buffers VERY full (perhaps a whole CD's worth) before the
> kernel decides to do the write, at which point the system becomes
> non-responsive for seconds at a time while the disk light comes on and
> stays on. That's another problem, and I did play with some patches this
> weekend without making myself really happy :-( Another topic,
> unfortunately.
Patience, the problem is understood and there will be a fix in the 2.5
timeframe.
--
Daniel
>>>>> "yodaiken" == yodaiken <[email protected]> writes:
yodaiken> current scheme, this becomes more complex. And with preempt, you cannot even offer the
yodaiken> assurance that once a process gets the cpu it will make _any_ advance at all.
So? Either it shouldn't have got the CPU anyway (maybe the CPU is
needed for other things) or the user's priority setup is seriously borked.
The scheduling policies, algorithms, mechanisms, whatever ... do not
guarantee schedulability by themselves.
On January 14, 2002 06:09 am, Andrew Morton wrote:
> Daniel Phillips wrote:
> I believe that internal preemption is
> the foundation to improve 2.5 kernel latency. But first we need
> consensus that we *want* linux to be a low-latency kernel.
>
> Do we have that?
You have it from me, for what it's worth ;-)
> If we do, then as I've said before, holding a lock for more than N
> milliseconds becomes a bug to be fixed. We can put tools in the hands of
> testers to locate those bugs. Easy.
Perhaps not a bug, but bad-acting. Just as putting a huge object on the
stack is not necessarily a bug, but deserves a quick larting nonetheless.
--
Daniel
FWIW, there appears to be a difference in throughput
and latency between 2.4.18pre3-low-latency and
2.4.18pre3-preempt+lockbreak.
2.4.18pre3pelb = preempt+lockbreak
2.4.18pre3ll = low-latency
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
On January 14, 2002 12:33 am, J Sloan wrote:
> Dieter Nützel wrote:
> >You told me that TUX show some problems with preempt before. What
> >problems? Are they TUX specific?
>
> On a kernel with both tux and preempt, upon
> access to the tux webserver the kernel oopses
> and tux dies...
Ah yes, I suppose this is because TUX uses per-cpu data as a replacement
for spinlocks. Patches that use per-cpu shared data have to be
preempt-aware. Ingo didn't know this when he wrote TUX since preempt didn't
exist at that time and didn't even appear to be on the horizon. He's
certainly aware of it now.
> OTOH the low latency patch plays quite well
> with tux. As said, I have no anti-preempt agenda,
> I just need for whatever solution I use to work,
> and not crash programs and services we use.
Right, and of course that requires testing - sometimes a lot of it. This one
is a 'duh' that escaped notice, temporarily. It probably would have been
caught sooner if we'd started serious testing/discussion sooner.
--
Daniel
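For anyone wondering what "preempt-aware" means for Daniel's per-cpu point
above, here is an illustrative sketch (hypothetical pool, not TUX's actual
code). Code that uses per-CPU data as a lock substitute assumes it cannot
lose the CPU between picking its slot and finishing with it, so under the
preempt patch that window has to be closed explicitly:

#include <linux/sched.h>
#include <linux/smp.h>
/* preempt_disable()/preempt_enable() are provided by the preempt patch;
 * which header they live in depends on the patch version. */

struct percpu_pool {
	void *freelist;			/* singly linked free objects */
} pools[NR_CPUS];

static void *pool_alloc(void)
{
	struct percpu_pool *p;
	void *obj;

	preempt_disable();		/* without this we could be preempted
					 * between smp_processor_id() and the
					 * update, and another task running on
					 * this CPU could race on its pool */
	p = &pools[smp_processor_id()];
	obj = p->freelist;
	if (obj)
		p->freelist = *(void **)obj;
	preempt_enable();
	return obj;
}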
On Mon, Jan 14, 2002 at 11:39, Andrea Arcangeli wrote:
> On Sun, Jan 13, 2002 at 01:22:57PM -0500, Robert Love wrote:
> > Again, preempt seems to reign supreme. Where is all the information
>
> those comparison are totally flawed. There's nothing to compare in
> there.
>
> minill misses the O(1) scheduler, and -aa has faster vm etc... there's
> absolutely nothing to compare in the above numbers, all variables
> changes at the same time.
>
> I'm amazed I've to say this, but in short:
>
> 1) to compare minill with preempt, apply both patches to 18-pre3, as the
> only patch applied (no O(1) in the way of preempt!!!!)
> 2) to compare -aa with preempt, apply -preempt on top of -aa and see
> what difference it makes
Oh Andrea,
I know your -aa VM is _GREAT_. I've used it all the time, whenever I have
the time to apply your vm-XX patch "by hand" to the "current" tree.
If you would only get around to _sending_ the requested patch set to Marcelo...
All (most) of my preempt tests were run with your -aa VM, and I saw the
speed-up with your VM _AND_ preempt, especially for latency (interactivity,
system start time and latencytest0.42-png). O(1) gave additional "smoothness".
What should I run for you?
Below are the dbench 32 (yes, I know...) numbers for 2.4.18-pre3-VM-22 and
2.4.18-pre3-VM-22-preempt+lock-break.
Sorry, both with O(1)...
2.4.18-pre3
sched-O1-2.4.17-H7.patch
10_vm-22
00_nanosleep-5
bootmem-2.4.17-pre6
read-latency.patch
waitq-2.4.17-mainline-1
plus
all 2.4.18-pre3.pending ReiserFS stuff
dbench/dbench> time ./dbench 32
32 clients started
..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+..........................................................................................................................................................................................................................................................++....................................................................................................++...................................+................................+...+...+........................................+..+......+................+...........++....................++++...++..++.....+...+....+........+...+++++********************************
Throughput 41.5565 MB/sec (NB=51.9456 MB/sec 415.565 MBit/sec)
14.860u 48.320s 1:41.66 62.1% 0+0k 0+0io 938pf+0w
preempt-kernel-rml-2.4.18-pre3-ingo-2.patch
lock-break-rml-2.4.18-pre1-1.patch
2.4.18-pre3
sched-O1-2.4.17-H7.patch
10_vm-22
00_nanosleep-5
bootmem-2.4.17-pre6
read-latency.patch
waitq-2.4.17-mainline-1
plus
all 2.4.18-pre3.pending ReiserFS stuff
dbench/dbench> time ./dbench 32
32 clients started
..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+...................................+.....................................................................+...............................................................................................................................+.....+........................................+........................+............................................................................+...........................................................+..............+...................+........+.......+...............+...............+.....+..................+..+......+...++.........+....+..+...+....+......+.....................................+.+..+.......++********************************
Throughput 47.0049 MB/sec (NB=58.7561 MB/sec 470.049 MBit/sec)
14.280u 49.370s 1:30.88 70.0% 0+0k 0+0io 939pf+0w
Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science
University of Hamburg
Department of Computer Science
@home: [email protected]
On Mon, Jan 14, 2002 at 05:53:15PM +0100, Dieter Nützel wrote:
> On Mon, Jan 14, 2002 at 11:39, Andrea Arcangeli wrote:
> > On Sun, Jan 13, 2002 at 01:22:57PM -0500, Robert Love wrote:
> > > Again, preempt seems to reign supreme. Where is all the information
> >
> > those comparison are totally flawed. There's nothing to compare in
> > there.
> >
> > minill misses the O(1) scheduler, and -aa has faster vm etc... there's
> > absolutely nothing to compare in the above numbers, all variables
> > changes at the same time.
> >
> > I'm amazed I've to say this, but in short:
> >
> > 1) to compare minill with preempt, apply both patches to 18-pre3, as the
> > only patch applied (no O(1) in the way of preempt!!!!)
> > 2) to compare -aa with preempt, apply -preempt on top of -aa and see
> > what difference it makes
>
> Oh Andrea,
>
> I know your -aa VM is _GREAT_. I've used it all the time when I have the muse
> to apply your vm-XX patch "by hand" to the "current" tree.
> If you only get to the point and _send_ the requested patch set to Marcelo...
I need to finish one thing in the next two days, so it won't be before
Thursday probably, sorry.
>
> All (most) of my preempt Test were running with your -aa VM and I saw the
> speed up with your VM _AND_ preempt especially for latency (interactivity,
> system start time and latencytest0.42-png). O(1) gave additional "smoothness"
>
> What should I run for you?
>
> Below are the dbench 32 (yes, I know...) numbers for 2.4.18-pre3-VM-22 and
> 2.4.18-pre3-VM-22-preempt+lock-break.
> Sorry, both with O(1)...
>
> 2.4.18-pre3
> sched-O1-2.4.17-H7.patch
> 10_vm-22
> 00_nanosleep-5
> bootmem-2.4.17-pre6
> read-latency.patch
> waitq-2.4.17-mainline-1
> plus
> all 2.4.18-pre3.pending ReiserFS stuff
>
> dbench/dbench> time ./dbench 32
> 32 clients started
> ..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+..........................................................................................................................................................................................................................................................++....................................................................................................++...................................+................................+...+...+........................................+..+......+................+...........++....................++++...++..++.....+...+....+........+...+++++********************************
> Throughput 41.5565 MB/sec (NB=51.9456 MB/sec 415.565 MBit/sec)
> 14.860u 48.320s 1:41.66 62.1% 0+0k 0+0io 938pf+0w
>
>
> preempt-kernel-rml-2.4.18-pre3-ingo-2.patch
> lock-break-rml-2.4.18-pre1-1.patch
> 2.4.18-pre3
> sched-O1-2.4.17-H7.patch
> 10_vm-22
> 00_nanosleep-5
> bootmem-2.4.17-pre6
> read-latency.patch
> waitq-2.4.17-mainline-1
> plus
> all 2.4.18-pre3.pending ReiserFS stuff
>
> dbench/dbench> time ./dbench 32
> 32 clients started
> ..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................+...................................+.....................................................................+...............................................................................................................................+.....+........................................+........................+............................................................................+...........................................................+..............+...................+........+.......+...............+...............+.....+..................+..+......+...++.........+....+..+...+....+......+.....................................+.+..+.......++********************************
> Throughput 47.0049 MB/sec (NB=58.7561 MB/sec 470.049 MBit/sec)
> 14.280u 49.370s 1:30.88 70.0% 0+0k 0+0io 939pf+0w
It would also be nice to see what changes by replacing lock-break and
preempt-kernel with 00_lowlatency-fixes-4. Also you should have a look
at lock-break and port the same breaking points on top of
lowlatency-fixes-4, but make sure lock-break doesn't introduce the
usual livelocks that I keep seeing over time again and again; I've had
to reject some of that lowlat stuff because of it. At the very least a
variable on the stack should be used so we keep going on the
second/third/whatever pass; lock-break seems just a big livelock thing.
Also a pass with only preempt would be interesting. You should also run
more than one pass for each kernel (I always suggest 3) to be sure there
are no spurious results.
thanks,
Andrea
> Oliver> You can have an rt task block on a lock held by a normal task that was
> Oliver> preempted by a rt task of lower priority. The same problem as with the
> Oliver> sched_idle patches.
>
> This can happen with a non-preemptible kernel too. And it has nothing to
> do with scheduling policy.
So why bother adding pre-emption? As you keep saying, it doesn't
gain anything.
Alan
> > stays on. That's another problem, and I did play with some patches this
> > weekend without making myself really happy :-( Another topic,
> > unfortunately.
>
> Patience, the problem is understood and there will be a fix in the 2.5
> timeframe.
Without a fix in the 2.4 timeframe everyone has to run 2.2. That strikes
me as decidedly non-optimal. If you are having VM problems, try both the
Andrea -aa and the Rik rmap-11b patches (*not together*) and report back.
On January 14, 2002 02:45 pm, [email protected] wrote:
> POSIX makes no specification of how scheduling classes interact - unless something changed
> in the new version.
>
> But more than that, the problem of preemption is much more complex when you have
> task that do not share the "goodness fade" with everything else. That is, given a
> set of SCHED_OTHER processes at time T0, it is reasonable to design the scheduler so
> that there is some D so that by time T0+D each process has become the highest priority
> and has received cpu up to either a complete time slice or a I/O block. Linux kind of
> has this property now, and I believe that making this more robust and easier to analyze
> is going to be an enormously important issue. However, once you add SCHED_FIFO in the
> current scheme, this becomes more complex. And with preempt, you cannot even offer the
> assurance that once a process gets the cpu it will make _any_ advance at all.
So the prediction here is that SCHED_FIFO + preempt can livelock some set of correctly
designed processes, is that it? I don't see exactly how that could happen, though that
may simply mean I didn't read closely enough.
--
Daniel
> >> How so ? The POSIX specification is not clear enough or it is not to be
> >> followed ?
>
> Oliver> You can have an rt task block on a lock held by a normal task that
> was Oliver> preempted by a rt task of lower priority. The same problem as
> with the Oliver> sched_idle patches.
>
> This can happen with a non-preemptible kernel too. And it has nothing to
> do with scheduling policy.
It can happen if you sleep with a lock held.
It cannot happen at random points in the code.
Thus there is a relation to preemption in kernel mode.
To cure that problem, tasks holding a lock would have to be given
the highest priority of all tasks blocking on that lock. The semaphore
code would get much more complex, even in the successful code path,
which would hurt a lot.
If, on the other hand, sleeping in kernel mode is explicit, you can simply
give any task being woken up a timeslice and the scheduling requirements
are met - if that should even be a problem.
Regards
Oliver
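A toy, runnable model of the bookkeeping Oliver is pointing at: a lock that
lends the waiter's priority to the holder and gives it back on release. It is
single-threaded and only tracks priorities; real code would have to block the
waiter and handle chains of inheritance and SMP, which is exactly the
complexity Alan and Oliver are warning about:

#include <stdio.h>

struct task { const char *name; int prio; int saved_prio; };
struct pi_lock { struct task *owner; };

static void pi_acquire(struct pi_lock *l, struct task *t)
{
	if (l->owner == NULL) {			/* uncontended fast path */
		l->owner = t;
		t->saved_prio = t->prio;
		return;
	}
	if (t->prio > l->owner->prio) {		/* lend the waiter's priority */
		printf("%s boosts %s: %d -> %d\n",
		       t->name, l->owner->name, l->owner->prio, t->prio);
		l->owner->prio = t->prio;
	}
	/* a real implementation would now put t to sleep on a wait queue */
}

static void pi_release(struct pi_lock *l)
{
	l->owner->prio = l->owner->saved_prio;	/* drop any borrowed priority */
	l->owner = NULL;
	/* ...and wake the highest-priority waiter here */
}

int main(void)
{
	struct task c = { "C (nice +19)", 1, 1 };
	struct task a = { "A (SCHED_FIFO)", 90, 90 };
	struct pi_lock sem = { 0 };

	pi_acquire(&sem, &c);	/* C takes the semaphore */
	pi_acquire(&sem, &a);	/* A contends; C inherits A's priority */
	pi_release(&sem);	/* C releases; its priority is restored */
	return 0;
}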
On Mon, Jan 14, 2002 at 07:43:59PM +0100, Daniel Phillips wrote:
> On January 14, 2002 10:09 am, [email protected] wrote:
> > UNIX generally tries to ensure liveness. So you know that
> > cat lkarchive | grep feel | wc
> > will complete and not just that, it will run pretty reasonably because
> > for UNIX _every_ process is important and gets cpu and IO time.
> > When you start trying to add special low latency tasks, you endanger
> > liveness. And preempt is especially corrosive because one of the
> > mechanisms UNIX uses to assure liveness is to make sure that once a
> > process starts it can do a significant chunk of work.
>
> You're claiming that preemption by nature is not Unix-like?
Kernel preemption is not traditionally part of UNIX.
>
> --
> Daniel
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On January 14, 2002 10:09 am, [email protected] wrote:
> UNIX generally tries to ensure liveness. So you know that
> cat lkarchive | grep feel | wc
> will complete and not just that, it will run pretty reasonably because
> for UNIX _every_ process is important and gets cpu and IO time.
> When you start trying to add special low latency tasks, you endanger
> liveness. And preempt is especially corrosive because one of the
> mechanisms UNIX uses to assure liveness is to make sure that once a
> process starts it can do a significant chunk of work.
You're claiming that preemption by nature is not Unix-like?
--
Daniel
Daniel Phillips wrote:
> On January 14, 2002 10:09 am, [email protected] wrote:
>
>>UNIX generally tries to ensure liveness. So you know that
>> cat lkarchive | grep feel | wc
>>will complete and not just that, it will run pretty reasonably because
>>for UNIX _every_ process is important and gets cpu and IO time.
>>When you start trying to add special low latency tasks, you endanger
>>liveness. And preempt is especially corrosive because one of the
>>mechanisms UNIX uses to assure liveness is to make sure that once a
>>process starts it can do a significant chunk of work.
>>
>
> You're claiming that preemption by nature is not Unix-like?
Unix started out life as a _time-sharing_ OS. It never claimed to
be preemptive or real time. For those, you waited a while, then
got to run MACH.
----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. [email protected] -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- Change is inevitable, except from a vending machine. -
----------------------------------------------------------------------
On Mon, 2002-01-14 at 10:02, J.A. Magallon wrote:
> Yup. That remind me of...
> Would there be any kernel call every driver is doing just to hide there
> a conditional_schedule() so everyone does it even without knowledge of it ?
> Just like Apple put the SystemTask() inside GetNextEvent()...
It's not nearly that easy. If it were, we would all certainly switch to
the preemptive kernel design, and preempt whenever and wherever we
needed.
Instead, we have to worry about reentrancy and thus can not preempt
inside critical regions (denoted by spinlocks). So we can't have
preempt there, and have more work to do -- thus this discussion.
Robert Love
Hi,
[email protected] wrote:
> > > is going to be an enormously important issue. However, once you add SCHED_FIFO in the
> > > current scheme, this becomes more complex. And with preempt, you cannot even offer the
> > > assurance that once a process gets the cpu it will make _any_ advance at all.
> >
> > I'm not sure if I understand you correctly, but how is this related to
> > preempt?
>
> It's pretty subtle. If there is no preempt, processes don't get preempted.
> If there is preempt, they can be preempted. Amazing isn't it?
I just can't win against such brilliant argumentation, I'm out.
bye, Roman
Robert Lowery wrote:
>
> >I question this because it is too risky to apply. There is no way any
> >distribution or production system could ever consider applying the
> >preempt kernel and ship it in its next kernel update 2.4. You never know
> >if a driver will deadlock because it is doing a test and set bit busy
> >loop by hand instead of using spin_lock and you cannot audit all the
> >device drivers out there.
>
> Quick question from a kernel newbie.
>
> Could this audit be partially automated by the Stanford Checker? Or would
> there be too many false positives from other similar looping code?
>
> -Robert
Sounds like a REALLY good thing (tm) to me. How do we get them
interested?
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Arjan van de Ven wrote:
>
> "J.A. Magallon" wrote:
> >
> > On 20020114 Stephan von Krawczynski wrote:
> > >
> > >Hm, obviously the ll-patches look simple, but their pure required number makes
> > >me think they are as well stupid as simple. This whole story looks like making
> > >an old mac do real multitasking, just spread around scheduling points
> >
> > Yup. That remind me of...
> > Would there be any kernel call every driver is doing just to hide there
> > a conditional_schedule() so everyone does it even without knowledge of it ?
> > Just like Apple put the SystemTask() inside GetNextEvent()...
>
> Well the preempt patch sort of does this in every spin_unlock*() .....
> -
Gosh, not really. The nature of the preempt patch is to allow
preemption on completion of the interrupt that put the contending task
back in the run list. This cannot be done if a spin lock is held, so
yes, there is a test on exit from the spin lock, but the point is that
this is only needed when the lock is released, not in unlocked code.
The utility of most of the ll patches is that they address the problem
within locked regions. This is why there is a lock-break patch that is
designed to augment the preempt patch. It picks up several of the
long-held spin locks and pops out of them early to allow preemption, and
then relocks and continues (after picking up the pieces, of course).
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
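A stand-alone toy model of the mechanism George describes: the preempt patch
makes the lock primitives maintain a per-task lock-depth counter, and a
pending reschedule is only acted on when that count drops back to zero.
Everything here is a stand-in for the real kernel code:

#include <stdio.h>

struct task { int preempt_count; int need_resched; };
static struct task current_task;	/* stand-in for the kernel's "current" */

static void preempt_schedule(void)
{
	printf("preempting now\n");
	current_task.need_resched = 0;
}

static void model_spin_lock(void)
{
	current_task.preempt_count++;	/* ...plus taking the real lock */
}

static void model_spin_unlock(void)
{
	/* ...release the real lock first */
	if (--current_task.preempt_count == 0 && current_task.need_resched)
		preempt_schedule();	/* first point where preemption is safe again */
}

int main(void)
{
	model_spin_lock();
	model_spin_lock();		/* nested lock: preemption still unsafe */
	current_task.need_resched = 1;	/* an interrupt woke a higher-priority task */
	model_spin_unlock();		/* count 2 -> 1: no preemption yet */
	model_spin_unlock();		/* count 1 -> 0: preempt here */
	return 0;
}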
On Mon, 2002-01-14 at 09:35, Rik van Riel wrote:
> OK, suppose you have three tasks.
>
> A is a SCHED_FIFO task
> B is a nice 0 SCHED_OTHER task
> C is a nice +19 SCHED_OTHER task
>
> Task B is your standard CPU hog, running all the time, task C has
> grabbed an inode semaphore (no spinlock), task A wakes up, preempts
> task C, tries to grab the inode semaphore and goes back to sleep.
>
> Now task A has to wait for task B to give up the CPU before task C
> can run again and release the semaphore.
>
> Without preemption task C would not have been preempted and it would
> have released the lock much sooner, meaning task A could have gotten
> the resource earlier.
>
> Using the low latency patch we'd insert some smart code into the
> algorithm so task A also releases the lock before rescheduling.
>
> Before you say this thing never happens in practice, I ran into
> this thing in real life with the SCHED_IDLE patch. In fact, this
> problem was so severe it convinced me to abandon SCHED_IDLE ;))
This isn't related. The problem you described can happen nearly as
easily on a non-preemptible system. We have plenty of semaphores held
across schedules and there is no reason to single out ones that acquire
and release the semaphore in short, non-preemptible, sequences. We
always have this "problem."
SCHED_IDLE is much different, as you know, because the SCHED_IDLE task
holding the lock can _never_ get scheduled if there is a CPU hog on the
system! With the preemptive case, we only worry about an increase in
this period, which is at the expense of fairness in running higher
priority tasks. But I think you know this ...
Robert Love
Alan Cox wrote:
>
> > I'm really trying to avoid this, I'm more than happy to discuss
> > theoretical or practical problems _if_ they are backed by arguments,
> > latter are very thin with Victor. Making pointless claims only triggers
> > above reaction. If I did really miss a major argument so far, I will
> > publicly apologize.
>
> You seem to be missing the fact that latency guarantees only work if you
> can make progress. If a low priority process is pre-empted owning a
> resource (quite likely) then you won't get your good latency. To
> handle those cases you get into priority boosting, and all sorts of lock
> complexity - so that the task that owns the resource temporarily can borrow
> your priority in order that you can make progress at your needed speed.
> That gets horrendously complex, and you get huge chains of priority
> dependancies including hardware level ones.
>
It would be useful to define the scope and design guidelines of a "real
time task". Obviously, if it tries to perform filesystem or network
I/O it can block for a long time. If it acquires VFS locks it can suffer
bad priority inversion.
I have all along assumed that a well-designed RT application would delegate
all these operations to SCHED_OTHER worker processes, probably via shared
memory/shared mappings. So in the simplest case, you'd have a SCHED_FIFO
task which talks to the hardware, and which has a helper task which reads
and writes stuff from and to disk. With sufficient buffering and readahead
to cover the worst case IO latencies.
If this is generally workable, then it means that the areas of possible
priority inversion are quite small - basically device driver read/write
functions. The main remaining area where priority inversion can
happen is in the page allocator. I'm experimenting/thinking about giving
non-SCHED_OTHER tasks a modified form of atomic allocation to defeat this.
-
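A minimal userspace sketch of the split Andrew describes: a SCHED_FIFO
thread that only talks to the (here simulated) hardware, and a plain
SCHED_OTHER helper that does all the blocking disk I/O. It uses threads and
a pipe for brevity rather than the shared-memory worker processes Andrew
mentions; read_hardware() and the output file name are placeholders,
SCHED_FIFO needs root, error handling is omitted, and you build with
-lpthread:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int pipefd[2];

static void read_hardware(char *buf, size_t n)	/* placeholder for the RT work */
{
	memset(buf, 0, n);
	usleep(1000);
}

static void *rt_capture(void *arg)		/* SCHED_FIFO: no VFS locks taken here */
{
	struct sched_param sp = { .sched_priority = 50 };
	char buf[4096];
	int i;

	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
	for (i = 0; i < 100; i++) {
		read_hardware(buf, sizeof buf);
		write(pipefd[1], buf, sizeof buf);	/* pipe buffer absorbs disk latency */
	}
	close(pipefd[1]);
	return NULL;
}

static void *disk_writer(void *arg)		/* stays SCHED_OTHER */
{
	char buf[4096];
	ssize_t n;
	int fd = open("capture.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	while ((n = read(pipefd[0], buf, sizeof buf)) > 0)
		write(fd, buf, n);		/* blocking I/O only happens here */
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pipe(pipefd);
	pthread_create(&b, NULL, disk_writer, NULL);
	pthread_create(&a, NULL, rt_capture, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

If the pipe fills because the disk is slower than the device, the RT thread
still blocks in write(), which is why Andrew's point about sizing the
buffering for worst-case I/O latency matters.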
Bill Davidsen wrote:
>
> On Wed, 9 Jan 2002, Kent Borg wrote:
>
> > How does all this fit into doing a tick-less kernel?
> >
> > There is something appealing about doing stuff only when there is
> > stuff to do, like: respond to input, handle some device that becomes
> > ready, or let another process run for a while. Didn't IBM do some
> > nice work on this for Linux? (*Was* it nice work?) I was under the
> > impression that the current kernel isn't that far from being tickless.
> >
> > A tickless kernel would be wonderful for battery powered devices that
> > could literally shut off when there be nothing to do, and it seems it
> > would (trivially?) help performance on high end power hogs too.
> >
> > Why do we have regular HZ ticks? (Other than I think I remember Linus
> > saying that he likes them.)
>
I put a patch on sourceforge as part of the high-res-timers
investigation that implemented a tickless kernel with instrumentation.
It turns out to be overload-prone, mostly due to the need to start and
stop a "slice" timer on each schedule() call. I, for one, think this
issue is dead, and rightly so. The patch is still there for those who
want to try it. See signature for URL.
> Feel free to quantify the savings over the current setup with max power
> saving enabled in the kernel. I just don't see how "wonderful" it would
> be, given that an idle system currently uses very little battery if you
> setup the options to save power.
>
> --
> bill davidsen <[email protected]>
> CTO, TMR Associates, Inc
> Doing interesting things with little computers since 1979.
>
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
On Mon, 2002-01-14 at 13:04, Oliver Neukum wrote:
> It can happen if you sleep with a lock held.
> It can not happen at random points in the code.
> Thus there is a relation to preemption in kernel mode.
>
> To cure that problem tasks holding a lock would have to be given
> the highest priority of all tasks blocking on that lock. The semaphore
> code would get much more complex, even in the succesful code path,
> which would hurt a lot.
No, this isn't needed. This same problem would occur without
preemption. Our semaphores now have locking rules such that we aren't
going to have blatant priority inversion like this (1 holds A needs B, 2
holds B needs A).
Priority inversion begins to become a problem if we intend to start
turning existing spinlocks into semaphores. There the locking rules are
weaker, and thus we would need to do priority inheritance. But that's
not now.
Robert Love
On Mon, 2002-01-14 at 13:39, [email protected] wrote:
> > You're claiming that preemption by nature is not Unix-like?
>
> Kernel preemption is not traditionally part of UNIX.
True, original AT&T UNIX was non-preemptible, but it also didn't
originally have paging. Today, Solaris, IRIX, the latest BSD (via BSDng),
etc. are all preemptible kernels.
Ask Core whether SMPng in FreeBSD 5.0 will include preempt, I think they
are still debating.
Robert Love
Daniel Phillips wrote:
>On January 14, 2002 12:33 am, J Sloan wrote:
>
>>Dieter Nützel wrote:
>>
>>>You told me that TUX show some problems with preempt before. What
>>>problems? Are they TUX specific?
>>>
>>On a kernel with both tux and preempt, upon
>>access to the tux webserver the kernel oopses
>>and tux dies...
>>
>
>Ah yes, I suppose this is because TUX uses per-cpu data as a replacement
>for spinlocks. Patches that use per-cpu shared data have to be
>preempt-aware. Ingo didn't know this when he wrote TUX since preempt didn't
>exist at that time and didn't even appear to be on the horizon. He's
>certainly aware of it now.
>
I am looking forward to testing out the new code
;-)
>>OTOH the low latency patch plays quite well
>>with tux. As said, I have no anti-preempt agenda,
>>I just need for whatever solution I use to work,
>>and not crash programs and services we use.
>>
>
>Right, and of course that requires testing - sometimes a lot of it. This one
>is a 'duh' that escaped notice. temporarily. It probably would have been
>caught sooner if we'd started serious testing/discussion sooner.
>
Well I'm glad to hear that - I had been doing a lot of
preempt testing on my boxes, up until the time I started
using tux widely. When I told Robert of the tux/preempt
incompatibilities, he mentioned the per-cpu shared data
and said something to the effect that the tux problems
did not surprise him. I didn't get the feeling that tux was
high on his list of priorities.
Hopefully that is not the case after all -
Regards,
jjs
On Monday 14 January 2002 21:09, Robert Love wrote:
> On Mon, 2002-01-14 at 13:04, Oliver Neukum wrote:
> > It can happen if you sleep with a lock held.
> > It can not happen at random points in the code.
> > Thus there is a relation to preemption in kernel mode.
> >
> > To cure that problem tasks holding a lock would have to be given
> > the highest priority of all tasks blocking on that lock. The semaphore
> > code would get much more complex, even in the succesful code path,
> > which would hurt a lot.
>
> No, this isn't needed. This same problem would occur without
> preemption. Our semaphores now have locking rules such that we aren't
> going to have blatant priority inversion like this (1 holds A needs B, 2
> holds B needs A).
No, this is a good old deadlock.
The problem with preemption and SCHED_FIFO is that, due to SCHED_FIFO,
you have no guarantee that any task will make any progress at all.
Thus a semaphore could basically be held forever.
Without preemption, that can only happen if you do something that
might block.
Regards
Oliver
Stephan von Krawczynski wrote:
>
> On Mon, 14 Jan 2002 02:04:56 -0800
> Andrew Morton <[email protected]> wrote:
>
> > Stephan von Krawczynski wrote:
> > >
> > > ...
> > > Unfortunately I have neither of those. This would mean I cannot benefit
> > > from _these_ patches, but instead would need _others_
> [...]
> >
> > In 3c59x.c, probably the biggest problem will be the call to issue_and_wait()
> > in boomerang_start_xmit(). On a LAN which is experiencing heavy collision
> > rates this can take as long as 2,000 PCI cycles (it's quite rare, and possibly
> > an erratum). It is called under at least two spinlocks.
> >
> > In via-rhine, wait_for_reset() can busywait for up to ten milliseconds.
> > via_rhine_tx_timeout() calls it from under a spinlock.
> >
> > In eepro100.c, wait_for_cmd_done() can busywait for one millisecond
> > and is called multiple times under spinlock.
>
> Did I get that right: as long as we're spinlocked there's no sense in
> conditional_schedule()?
>
> > Preemption will help _some_ of this, but by no means all, or enough.
>
> Maybe we should really try to shorten the lock-times _first_. You mentioned a
> way to find the bad guys?
Apply the preempt patch and then the preempt-stats patch. Follow
instructions that come with the stats patch. It will report on the
longest preempt disable times since the last report. You need to
provide a load that will exercise the bad code, but it will tell you
which, where, and how bad. Note: it measures preempt off time, NOT how
long it took to get to some task, i.e. it does not depend on requesting
preemption at the worst possible time.
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Andrea Arcangeli wrote:
>
> > --- linux-2.4.18-pre3/fs/buffer.c Fri Dec 21 11:19:14 2001
> > +++ linux-akpm/fs/buffer.c Sat Jan 12 12:22:29 2002
> > @@ -249,12 +249,19 @@ static int wait_for_buffers(kdev_t dev,
> > struct buffer_head * next;
> > int nr;
> >
> > - next = lru_list[index];
> > nr = nr_buffers_type[index];
> > +repeat:
> > + next = lru_list[index];
> > while (next && --nr >= 0) {
> > struct buffer_head *bh = next;
> > next = bh->b_next_free;
> >
> > + if (dev == NODEV && current->need_resched) {
> > + spin_unlock(&lru_list_lock);
> > + conditional_schedule();
> > + spin_lock(&lru_list_lock);
> > + goto repeat;
> > + }
> > if (!buffer_locked(bh)) {
>
> this introduces the possibility of looping indefinitely; this is why I
> rejected it while I merged the other mini-ll points into -aa. If you
> want to do anything like that, at the very least you should roll the head
> of the list as well, or something like that.
I ended up deciding that the `NODEV' check here avoids livelocks.
Unless, of course, the scheduling pressure is so high that we can't
even run a few statements. In which case the interrupt load will be so
high that the machine stops anyway. Possibly it needs to check `refile'
as well.
A technique I frequently use in the full-ll patch is to only drop out and
reschedule after we've executed the loop body (say) 16 times. This
assures that forward progress is made. There's a test mode in the full
ll patch - in this mode, it *always* assumes that need_resched is true.
If the patch runs OK in this mode without livelocking, we know that it
can't livelock.
Anyway, I'll revisit this. It is a "must fix". wait_for_buffers() is
possibly the worst cause of latency in the kernel. The usual scenario
is where kupdate has written 10,000 buffers and then sleeps. Next time
it wakes, it has 10,000 clean, unlocked buffers to move from BUF_LOCKED
onto BUF_CLEAN. It does this with lru_list_lock held.
-
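To make the batching idea concrete, a rough sketch (this is not the actual
full-ll code; it reuses the shape of the hunk quoted above and assumes
lru_list_lock is held on entry): need_resched is only consulted once per 16
buffers, so every trip around the loop makes at least a batch of forward
progress even if need_resched is permanently true, as in the test mode
described above.

            int batch = 16;

            nr = nr_buffers_type[index];
    repeat:
            next = lru_list[index];
            while (next && --nr >= 0) {
                    struct buffer_head *bh = next;
                    next = bh->b_next_free;

                    if (dev == NODEV && --batch <= 0) {
                            batch = 16;
                            if (current->need_resched) {
                                    spin_unlock(&lru_list_lock);
                                    conditional_schedule();
                                    spin_lock(&lru_list_lock);
                                    goto repeat;
                            }
                    }
                    /* ... the existing buffer_locked()/refile logic is unchanged ... */
            }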
On Mon, 2002-01-14 at 15:22, Oliver Neukum wrote:
> > No, this isn't needed. This same problem would occur without
> > preemption. Our semaphores now have locking rules such that we aren't
> > going to have blatant priority inversion like this (1 holds A needs B, 2
> > holds B needs A).
>
> No, this is a good old deadlock.
> The problem with preemption and SCHED_FIFO is that, due to SCHED_FIFO,
> you have no guarantee that any task will make any progress at all.
> Thus a semaphore could basically be held forever.
> That can happen without preemption only if you do something that
> might block.
Well, semaphores block. And we have these races right now with
SCHED_FIFO tasks. I still contend preempt does not change the nature of
the problem and it certainly doesn't introduce a new one.
Robert Love
> I have all along assumed that a well-designed RT application would delegate
> all these operations to SCHED_OTHER worker processes, probably via shared
> memory/shared mappings. So in the simplest case, you'd have a SCHED_FIFO
> task which talks to the hardware, and which has a helper task which reads
> and writes stuff from and to disk. With sufficient buffering and readahead
> to cover the worst case IO latencies.
A real RT task has hard guarantees and to all intents and purposes you may
deem the system failed if it ever misses one (arguably if you cannot verify
it will never miss one).
The stuff we care about is things like DVD players which tangle with
sockets, pipes, X11, memory allocation, and synchronization between multiple
hardware devices all running at slightly incorrect clocks.
Alan Cox wrote:
>
> > I have all along assumed that a well-designed RT application would delegate
> > all these operations to SCHED_OTHER worker processes, probably via shared
> > memory/shared mappings. So in the simplest case, you'd have a SCHED_FIFO
> > task which talks to the hardware, and which has a helper task which reads
> > and writes stuff from and to disk. With sufficient buffering and readahead
> > to cover the worst case IO latencies.
>
> A real RT task has hard guarantees and to all intents and purposes you may
> deem the system failed if it ever misses one (arguably if you cannot verify
> it will never miss one).
We know that :) Here, "RT" means "Linux-RT": something which is non-SCHED_OTHER,
and which we'd prefer didn't completely suck.
> The stuff we care about is things like DVD players which tangle with
> sockets, pipes, X11, memory allocation, and synchronization between multiple
> hardware devices all running at slightly incorrect clocks.
Well, that's my point. A well-designed DVD player would have two processes.
One which tangles with the sockets, pipes, disks, etc, and which feeds data into
and out of the SCHED_FIFO task via a shared, mlocked memory region.
What I'm trying to develop here is a set of guidelines which will allow
application developers to design these programs with a reasonable
degree of success.
-
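In code, the split being described looks roughly like the sketch below.
Everything here (names, priority, buffer size) is invented for illustration;
the point is only the structure: the SCHED_FIFO side touches nothing but
locked memory and the device, while the SCHED_OTHER side does every operation
that may block.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define RING_BYTES (4 * 1024 * 1024)    /* sized to ride out worst-case I/O stalls */

    int main(void)
    {
            struct sched_param sp;
            char *ring;

            /* Shared between parent and child, anonymous and zero-filled. */
            ring = mmap(NULL, RING_BYTES, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (ring == MAP_FAILED)
                    exit(1);
            ring[0] = 0;                    /* touch the mapping */

            if (fork() == 0) {
                    /* Feeder: stays SCHED_OTHER and does all the blocking work. */
                    for (;;) {
                            /* read()/write() disk and sockets, fill the ring,
                               block when it is full */
                            sleep(1);       /* placeholder */
                    }
            }

            /* Player: real-time, must never block on disk or take a page fault. */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                    perror("mlockall");
            sp.sched_priority = 50;
            if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                    perror("sched_setscheduler");   /* needs root */

            for (;;) {
                    /* consume from the ring, feed the audio/video device */
                    usleep(10000);          /* placeholder pacing */
            }
    }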
> Well, that's my point. A well-designed DVD player would have two processes.
> One which tangles with the sockets, pipes, disks, etc, and which feeds data into
> and out of the SCHED_FIFO task via a shared, mlocked memory region.
>
> What I'm trying to develop here is a set of guidelines which will allow
> application developers to design these programs with a reasonable
> degree of success.
What about the X server 8)
Given 1mS and vague fairness DVD is more than acceptable
Alan Cox wrote:
>>>stays on. That's another problem, and I did play with some patches this
>>>weekend without making myself really happy :-( Another topic,
>>>unfortunately.
>>>
>>Patience, the problem is understood and there will be a fix in the 2.5
>>timeframe.
>>
>
>Without a fix in the 2.4 timeframe everyone has to run 2.2. That strikes
>me as decidedly non optimal. If you are having VM problems try both the
>Andrea -aa and the Rik rmap-11b patches (*not together*) and report back
>
Easiest is to grab 2.4.17 and apply 2.4.18pre2 and 2.4.18pre2-aa2 -
pre2-aa2 has all the fixes and tweaks I had been doing by hand.
cu
jjs
>
Stephan von Krawczynski wrote:
>
> Just a short question: the last (add-on) patch to mini-ll I saw on the
It was for full-ll, not for mini-ll.
> patches: drivers/net/3c59x.c
> drivers/net/8139too.c
> drivers/net/eepro100.c
>
> Unfortunately I have neither of those. This would mean I cannot benefit
> from _these_ patches, but instead would need _others_ (like tulip or
> name-one-of-the-rest-of-the-drivers) to see _some_ effect you tell me I
> _should_ see (I currently see _none_). How do you argue then against the
> statement: we need patches for /drivers/net/*.c ?? I do not expect
> 3c59x.c to be particularly bad in comparison to tulip/*.c or lets say
> via-rhine.c, do you?
I also checked the tulip driver (which is the one I use at home) and didn't
find any need for "fixing" there. I will definitely take a closer look at that
driver in the future.
WLAN drivers seem to need some hacking, but I'm not very interested in that
area. I think WLAN is one big security hole that no one should be using...
- Jussi Laako
--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers
Alan Cox wrote:
>
> > > In eepro100.c, wait_for_cmd_done() can busywait for one millisecond
> > > and is called multiple times under spinlock.
> >
> > Did I get that right: as long as we hold a spinlock, there is no sense in
> > calling conditional_schedule()?
>
> No conditional schedule, no pre-emption. You would need to rewrite that
> code to do something like try for 100uS then queue a 1 tick timer to
> retry asynchronously. That makes the code vastly more complex for an
> error case, and for some drivers, where the irq mask is required during
> reset waits, it won't help.
That wait_for_cmd_done() and similar functions in other drivers are called,
say, 3 times from an interrupt handler or spinlocked routine and 20 times
from functions that are neither interrupt-disabled nor spinlocked.
Spinlocked regions are usually protected by spin_lock_irqsave().
So the code reads
    if (!spin_is_locked(sl))
            conditional_schedule();
This doesn't make the whole problem go away, but could it make the situation
a little bit better most of the time?
- Jussi Laako
--
PGP key fingerprint: 161D 6FED 6A92 39E2 EB5B 39DD A4DE 63EB C216 1E4B
Available at PGP keyservers
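For what it's worth, fleshed out that suggestion would look something like the
sketch below (the helper name is hypothetical, the real drivers' helpers
differ, and conditional_schedule() is the low-latency patch primitive). Note
that spin_is_locked() only says the lock is currently free, not that we are
allowed to sleep, so the in_interrupt() test is still needed:

    #include <linux/spinlock.h>
    #include <linux/delay.h>        /* udelay() */
    #include <asm/hardirq.h>        /* in_interrupt() */
    #include <asm/io.h>             /* inb() */

    /* Hypothetical variant of a driver's command-wait helper. */
    static inline void wait_for_cmd_done_polite(long cmd_ioaddr, spinlock_t *lock)
    {
            int wait = 1000;                        /* up to ~1 ms, as today */

            do {
                    if (inb(cmd_ioaddr) == 0)
                            return;                 /* chip accepted the command */
                    udelay(1);
                    /* Outside irq/spinlock context we may give the CPU away. */
                    if (!in_interrupt() && !spin_is_locked(lock))
                            conditional_schedule();
            } while (--wait >= 0);
    }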
>>>>> "Alan" == Alan Cox <[email protected]> writes:
Oliver> You can have an rt task block on a lock held by a normal task that was
Oliver> preempted by a rt task of lower priority. The same problem as with the
Oliver> sched_idle patches.
>>
>> This can happen with a non-preemptible kernel too. And it has nothing to
>> do with scheduling policy.
Alan> So why bother adding pre-emption? As you keep saying - it doesn't
Alan> gain anything
Nope. I don't. I said (at least in the above) it didn't hurt.
One can consider a non-preemptible kernel as a special kind of
priority inversion; a preemptible kernel will eliminate _that_ case of
priority inversion.
Regards,
-velco
> Well, semaphores block. And we have these races right now with
> SCHED_FIFO tasks. I still contend preempt does not change the nature of
> the problem and it certainly doesn't introduce a new one.
But it does:
down(&sem);
do_something_that_cannot_block();
up(&sem);
Will stop a SCHED_FIFO task only for a bounded amount of time - at worst
until the holder returns from the kernel to user space.
If do_something_that_cannot_block() can be preempted, a SCHED_FIFO
task can block indefinitely long on the semaphore, because you have
no guarantee that the scheduler will ever again select the preempted task.
In fact it must never again select the preempted task as long as there's
another runnable SCHED_FIFO task.
Regards
Oliver
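The scenario described here can be demonstrated from userspace with plain
pthreads (all names and priorities below are invented, and this is an
analogy, not kernel code). Built with -lpthread and run as root on a
single-CPU machine, the program hangs: "low" takes the lock, a middle-priority
SCHED_FIFO spinner then starves it, so the higher-priority SCHED_FIFO waiter
never gets the lock even though it outranks the spinner.

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void make_fifo(int prio)
    {
            struct sched_param sp;

            sp.sched_priority = prio;
            if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                    perror("sched_setscheduler");   /* needs root */
    }

    static void *low_prio(void *arg)        /* ordinary SCHED_OTHER task */
    {
            pthread_mutex_lock(&lock);
            sleep(3);                       /* "in the kernel", holding the lock */
            pthread_mutex_unlock(&lock);
            return NULL;
    }

    static void *spinner(void *arg)         /* SCHED_FIFO, middle priority */
    {
            make_fifo(10);
            for (;;)
                    ;                       /* hogs the CPU; low_prio never runs again */
    }

    static void *waiter(void *arg)          /* SCHED_FIFO, highest priority */
    {
            make_fifo(20);
            pthread_mutex_lock(&lock);      /* blocks behind low_prio... forever */
            printf("waiter got the lock\n");
            pthread_mutex_unlock(&lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2, t3;

            pthread_create(&t1, NULL, low_prio, NULL);
            sleep(1);                       /* let low_prio take the lock */
            pthread_create(&t3, NULL, waiter, NULL);
            sleep(1);                       /* let waiter block on the lock */
            pthread_create(&t2, NULL, spinner, NULL);
            pthread_join(t3, NULL);         /* never returns on one CPU */
            return 0;
    }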
On Tue, Jan 15, 2002 at 12:34:01AM +0200, Momchil Velikov wrote:
> One can consider a non-preemptible kernel as a special kind of
> priority inversion; a preemptible kernel will eliminate _that_ case of
> priority inversion.
The problem here is that priority means something very different in
a time-shared system than in a hard real-time system. And even in real-time
systems, as Walpole and colleagues have pointed out, priority doesn't
really capture much of what is needed for good scheduling.
In a general purpose system, priorities are dynamic and "fair".
The priority of even the lowliest process increases while it waits
for time. In a raw real-time system, the low priority process can sit
forever and should wait until no higher priority thread needs the
processor. So it's absurd to talk of priority inversion in a non RT
system. When a low priority process is delaying a higher priority task
for reasons of fairness, increased throughput, or any other valid
objective, that is not a scheduling error.
>
> Regards,
> -velco
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
Google is your friend ;)
I believe Dawson Engler [email protected] is the man to contact. More
info about their project can be found at http://hands.stanford.edu/linux/
You would probably need to give him an example of the bad code you are
trying to catch
-Robert
> -----Original Message-----
> From: george anzinger [mailto:[email protected]]
> Sent: Tuesday, January 15, 2002 6:52 AM
> To: Robert Lowery
> Cc: [email protected]; [email protected]
> Subject: Re: [2.4.17/18pre] VM and swap - it's really unusable
>
>
> Robert Lowery wrote:
> >
> > >I question this because it is too risky to apply. There is no way any
> > >distribution or production system could ever consider applying the
> > >preempt kernel and ship it in its next kernel update 2.4. You never know
> > >if a driver will deadlock because it is doing a test and set bit busy
> > >loop by hand instead of using spin_lock and you cannot audit all the
> > >device drivers out there.
> >
> > Quick question from a kernel newbie.
> >
> > Could this audit be partially automated by the Stanford Checker? Or would
> > there be too many false positives from other similar looping code?
> >
> > -Robert
> Sounds like a REALLY good thing (tm) to me. How do we get them
> interested?
> --
> George [email protected]
> High-res-timers: http://sourceforge.net/projects/high-res-timers/
> Real time sched: http://sourceforge.net/projects/rtsched/
>
[email protected] wrote:
>
> On Sat, Jan 12, 2002 at 02:00:17PM -0500, Ed Sweetman wrote:
> >
> >
> > > On Sat, Jan 12, 2002 at 09:52:09AM -0700, [email protected] wrote:
> > > > On Sat, Jan 12, 2002 at 04:07:14PM +0100, [email protected] wrote:
> > > > > I did my usual compile testings (untar kernel archive, apply patches,
> > > > > make -j<value> ...
> > > >
> > > > If I understand your test,
> > > > you are testing different loads - you are compiling kernels that may differ
> > > > in size and makefile organization, not to mention different layout on the
> > > > file system and disk.
> >
> > Can someone tell me why we're "testing" the preempt kernel by running
> > make -j on a build? What exactly is this going to show us? The only thing
> > i can think of is showing us that throughput is not damaged when you want to
> > run single apps by using preempt. You dont get to see the effects of the
> > kernel preemption because all the damn thing is doing is preempting itself.
> >
> > If you want to test the preempt kernel you're going to need something that
> > can find the mean latency or "time to action" for a particular program or
> > all programs being run at the time and then run multiple programs that you
> > would find on various peoples' systems. That is the "feel" people talk
> > about when they praise the preempt patch. make -j'ing something and not
> > testing anything else but that will show you nothing important except "does
> > throughput get screwed by the preempt patch." Perhaps checking the
> > latencies on a common program on people's systems like mozilla or konqueror
> > while doing a 'make -j N bzImage' would be a better idea.
>
> That's the second test I am normally running. Just running xmms while
> doing the kernel compile. I just wanted to check if the system slows
> down because of preemption but instead it compiled the kernel even
> faster :-)
This sort of thing is nice to hear, but, it does show up a problem in
the non-preempt kernel. That preemption improves compile performance
implies that the kernel is not doing the right thing during a normal
compile and that preemption, to some extent, corrects the problem. But
preemption adds the overhead of additional context switches. It would
be nice to know where the time is coming from. I.e., let's assume that
the actual compile takes about the same amount of execution time with or
without preemption. Then for the preemptable kernel to do the job
faster something else must go up, idle time perhaps. If this is the
case, then there is some place in the kernel that is wasting cpu time
and that is preemptable and the preemptable patch is moving this idle
time to the idle process.
Whatever the reason, while I do want to promote preemption, I think we
should look at this issue and, at the very least, explain it.
> But so far I was not able to test the latency and furthermore
> it is very difficult to "measure" skipping of xmms ...
>
> > > Ouch, I assumed this wasn't the case indeed.
>
> Sorry for not answering immediately, but I am compiling the same kernel
> source with the same .config and everything I could think of being the
> same! I even do a 'rm -rf linux' after every run and untar the same
> sources *every* time.
>
> Regards,
>
> Jogi
>
> --
>
> Well, yeah ... I suppose there's no point in getting greedy, is there?
>
> << Calvin & Hobbes >>
> -
--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
[email protected] wrote:
>
> On Sat, Jan 12, 2002 at 02:25:03PM +0100, Roman Zippel wrote:
> > Hi,
> >
> > [email protected] wrote:
> >
> > > > > SCHED_FIFO leads to
> > > > > niced app 1 in K mode gets Sem A
> > > > > SCHED_FIFO app preempts and blocks on Sem A
> > > > > whoops! app 2 in K mode preempts niced app 1
> > > >
> > > > Please explain what's different without the preempt patch.
> > >
> > > See that "preempt" in line 2. Linux does not
> > > preempt kernel mode processes otherwise. The beauty of the
> > > non-preemptive kernel is that "in K mode every process makes progress"
> > > and even the "niced app" will complete its use of SemA and
> > > release it in one run.
> >
> > The point of using semaphores is that one can sleep while holding them,
> > whether this is forced by preemption or voluntary makes no difference.
>
> No. The point of using semaphores is that one can sleep while
> _waiting_ for the resource. Sleeping while holding semaphores is
> a different kettle of lampreys entirely.
> And it makes a very big difference
> A:
> get sem on memory pool
> do something horrible to pool
> release sem on memory pool
>
> In a preemptive kernel this can cause a deadlock. In a non
> preemptive it cannot. You are correct in that
> B:
> get sem on memory pool
> do potentially blocking operations
> release sem
> is also dangerous - but I don't think that helps your case.
> To fix B, we can enforce a coding rule - one of the reasons why
> we have all those atomic ops in the kernel is to be able to
> avoid this problem.
> To fix A in a preemptive kernel we need to start messing about with
> priorities and that's a major error.
> "The current kernel has too many places where processes
> can sleep while holding semaphores so we should always have the
> potential of blocking with held semaphores" is, to me, a backwards
> argument.
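As a generic illustration of the atomic-ops point quoted above (not taken
from any real driver): where an atomic counter is enough, there is no
semaphore to hold at all, so neither pattern A nor pattern B can arise for
that piece of data.

    #include <asm/atomic.h>

    static atomic_t pool_users = ATOMIC_INIT(0);

    static void pool_get(void)
    {
            atomic_inc(&pool_users);        /* no lock held, nothing to sleep with */
    }

    static int pool_put(void)
    {
            return atomic_dec_and_test(&pool_users);   /* non-zero if we were last */
    }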
>
> > > If you have a reasonably fair scheduler you
> > > can make very useful analysis with Linux now of the form
> > >
> > > Under 50 active proceses in the system means that in every
> > > 2 second interval every process
> > > will get at least 10ms of time to run.
> > >
> > > That's a very valuable property and it goes away in a preemptive kernel
> > > to get you something vague.
> >
> > How is that changed? AFAIK inserting more schedule points does not
> > change the behaviour of the scheduler. The niced app will still get its
> > time.
>
> How many times can an app be preempted? In a non preempt kernel
> it can be preempted during user mode at timer frequency and no more
Uh, it can be and is preempted in user mode by ANY interrupt, be it
keyboard, serial, lan, disc, etc. The kernel looks for need_resched at
the end of ALL interrupts, not just the timer interrupt.
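Roughly, in pseudo-C (heavily simplified - the real code is assembly in
arch/i386/kernel/entry.S), every interrupt return does:

            if (user_mode(regs)) {
                    if (current->need_resched)
                            schedule();
                    if (current->sigpending)
                            do_signal(regs, NULL);
            }
            /* returning to kernel mode: the stock kernel never reschedules here */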
> and it cannot be preempted during kernel mode. So
> while(1){
> read mpeg data
> process
> write bitmap
> }
>
> Assuming Andrew does not get too ambitious about read/write granularity, once this
> process is scheduled on a non-preempt system it will always make progress. The non
> preempt kernel says, "your kernel request will complete - if we have resources".
> A preempt kernel says: "well, if nobody more important activates you get some time"
> Now you do the analysis based on the computation of "goodness" to show that there is
> a bound on preemption count during an execution of this process. I don't want to
> have to think that hard.
> Let's suppose the Gnome desktop constantly creates and
> destroys new fresh i/o bound tasks to do something. So with the old fashioned non
> preempt (ignoring Andrew) we get
> wait no more than 1 second
> I'm scheduled and start a read
> wait no more than one second
> I'm scheduled and in user mode for at least 10milliseconds
> wait no more than 1 second
> I'm scheduled and do my write
> ...
> with preempt we get
> wait no more than 1 second
> I'm scheduled and start a read
> I'm preempted
> read not done
> come back for 2 microseconds
> preempted again
> haven't issued the damn read request yet
> ok a miracle happens, I finish the read request
> go to usermode and an interrupt happens
> well it would be stupid to have a goodness
> function in a preempt kernel that lets a low
> priority task finish its time slice so preempt
> ...
>
> >
> > > So your argument is that I'm advocating Andrew Morton's patch which
> > > reduces latencies more than the preempt patch because I have a
> > > financial interest in not reducing latencies? Subtle.
> >
> > Andrew's patch requires constant auditing and Andrew can't a