2001-07-18 18:36:44

by Rik van Riel

Subject: [PATCH] swap usage of high memory (fwd)

Hi Alan, Linus,

Dave found a stupid bug in the swapin code, leading to
bad balancing problems in the VM.

I suspect marcelo's zone VM hack could even go away
with this patch applied ;)

Rik
---------- Forwarded message ----------
Date: Wed, 18 Jul 2001 13:15:07 -0500
From: Dave McCracken <[email protected]>
To: Rik van Riel <[email protected]>
Cc: [email protected]
Subject: Patch for swap usage of high memory


This patch fixes the problem that pages allocated for swap space reads
are never allocated from high memory.

Rik, could you please forward this to the kernel mailing list? I am
temporarily unable to reach it directly due to ECN problems.

Thanks,
Dave McCracken

--------

--- linux-2.4.6/mm/swap_state.c	Mon Jun 11 21:15:27 2001
+++ linux-2.4.6-mm/mm/swap_state.c	Wed Jul 18 12:56:01 2001
@@ -226,7 +226,7 @@
 	if (found_page)
 		goto out_free_swap;
 
-	new_page = alloc_page(GFP_USER);
+	new_page = alloc_page(GFP_HIGHUSER);
 	if (!new_page)
 		goto out_free_swap;	/* Out of memory */

--------
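
For context, the only difference between the two flags is the highmem
zone bit: GFP_HIGHUSER is GFP_USER plus permission to satisfy the
allocation from ZONE_HIGHMEM. A minimal sketch of the relationship in
2.4-era terms (the numeric values below are illustrative placeholders,
not the real kernel values):

#include <stdio.h>

#define __GFP_HIGHMEM 0x02  /* allocation may come from ZONE_HIGHMEM */
#define __GFP_WAIT    0x10  /* allocator may sleep */
#define __GFP_IO      0x40  /* allocator may start I/O to free pages */

#define GFP_USER      (__GFP_WAIT | __GFP_IO)
#define GFP_HIGHUSER  (GFP_USER | __GFP_HIGHMEM)

int main(void)
{
	printf("GFP_USER     may use highmem: %s\n",
	       (GFP_USER & __GFP_HIGHMEM) ? "yes" : "no");
	printf("GFP_HIGHUSER may use highmem: %s\n",
	       (GFP_HIGHUSER & __GFP_HIGHMEM) ? "yes" : "no");
	return 0;
}

Without the highmem bit, every swap-in page had to come from the
low-memory zones, which is exactly the skewed pressure the patch
removes.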

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059



2001-07-18 21:23:55

by Marcelo Tosatti

Subject: Re: [PATCH] swap usage of high memory (fwd)



On Wed, 18 Jul 2001, Rik van Riel wrote:

> Hi Alan, Linus,
>
> Dave found a stupid bug in the swapin code, leading to
> bad balancing problems in the VM.
>
> I suspect marcelo's zone VM hack could even go away
> with this patch applied ;)

Rik,

Could you please stop blaming my code for NO reason and understand that
there is a FUNDAMENTAL problem here?

I don't understand why you're doing this, really.



2001-07-19 01:10:54

by Marcelo Tosatti

Subject: Re: [PATCH] swap usage of high memory (fwd)



On Wed, 18 Jul 2001, Rik van Riel wrote:

> Hi Alan, Linus,
>
> Dave found a stupid bug in the swapin code, leading to
> bad balancing problems in the VM.
>
> I suspect marcelo's zone VM hack could even go away
> with this patch applied ;)

Rik,

Still able to trigger the problem with the GFP_HIGHUSER patch applied.



2001-07-19 11:12:33

by Matti Aarnio

Subject: Re: [PATCH] swap usage of high memory (fwd)

....
> Rik, could you please forward this to the kernel mailing list?
> I am temporarily unable to reach it directly due to ECN problems.
....

Dave (and others),

The "ECN problem" is unidirectional.

VGER can't reach you when VGER calls your email servers,
but your email servers can reach VGER! (Or vger's MX backup.)

That is, you CAN send to VGER even if VGER can't send to you!

This is due to the way the ECN handshake is done.
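
For the curious: an ECN-capable initiator sets the ECE and CWR bits in
its initial SYN (RFC 3168), and some firewalls of that era drop any SYN
carrying those formerly-reserved bits. A sketch of the check such a
broken middlebox effectively performs (flag values per RFC 3168; the
firewall function is hypothetical, for illustration only):

#include <stdio.h>

/* TCP header flag bits (RFC 793 plus RFC 3168) */
#define TH_SYN 0x02
#define TH_ECE 0x40
#define TH_CWR 0x80

/* A broken firewall that rejects "reserved" bits drops exactly the
 * ECN-setup SYNs.  The non-ECN SYN going the other way passes, which
 * is why the failure is one-directional. */
static int broken_firewall_drops(unsigned char th_flags)
{
	return (th_flags & TH_SYN) && (th_flags & (TH_ECE | TH_CWR));
}

int main(void)
{
	printf("ECN-setup SYN dropped: %d\n",
	       broken_firewall_drops(TH_SYN | TH_ECE | TH_CWR));
	printf("plain SYN dropped:     %d\n",
	       broken_firewall_drops(TH_SYN));
	return 0;
}

VGER initiates with an ECN-setup SYN, which your firewall drops; your
servers initiate with a plain SYN, which passes.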

> Thanks,
> Dave McCracken

/Matti Aarnio

2001-07-19 12:15:16

by Ryan Sweet

Subject: UP kernel stable, but SMP kernel randomly reboots - nfsroot - asus cur_dls


I posted previously about having problems with random reboots on nfsroot
nodes across kernels 2.2.18 - 2.4.6 (all kernels exhibit the same
problem - after X amount of time, where X is usually < 24 hours, the
system just reboots).

When I run the systems with uniprocessor kernels, the problem does not
occur.

When the smp kernel is booted with noapic, the apic errors go away. Other
posts I read about smp apic problems seemed to indicate that those
users received hundreds of messages in a short period of time - I was
getting maybe seven or eight over the course of several hours.

I cannot locate any references on the net to others having trouble with
SMP on asus cur_dls boards or with the ServerWorks chipset.

Is it possible that there is some interaction between smp and nfsroot and
cur_dls that is causing the problem (all of my other cur_dls boards are
using a local disk)? I've tried wrapping my head around the nfs code
to search for smp specific problems, and while I understand a lot more of
it now than I did before, it is still mostly beyond my immediate
comprehension.

Is it possible that this is a power/cpu voltage problem? If so, would a
ups be a solution?

Is it possible that the whole batch of 10 motherboards
is broken somehow (we have oodles of other asus cur_dls smp systems that
don't have problems, just this cluster)?

Are there any suggestions as to further troubleshooting options?

I am working on booting with a tftp downloaded ramdisk as the root, to
eliminate nfsroot from the equation, but I am skeptical as to whether this
will actually help anything.

regards,
-ryan

--
Ryan Sweet <[email protected]>
Atos Origin Engineering Services
http://www.aoes.nl

2001-07-19 14:11:23

by Rik van Riel

Subject: Re: [PATCH] swap usage of high memory (fwd)

On Wed, 18 Jul 2001, Marcelo Tosatti wrote:

> Still able to trigger the problem with the GFP_HIGHUSER patch applied.

Hrrm, maybe the fact that the free target in the DMA zone is
four times higher than in the other zones has something to do
with the imbalance...
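
(A rough sketch of where that factor of four comes from: in the 2.4
allocator the per-zone free targets are derived from the zone size
divided by a per-zone balance ratio, and the DMA ratio is much smaller.
The ratios and the min/low/high formula below are from memory, so treat
them as approximate:)

#include <stdio.h>

static int zone_balance_ratio[3] = { 32, 128, 128 }; /* DMA, Normal, High */
static const char *zone_name[3] = { "DMA", "Normal", "HighMem" };

int main(void)
{
	/* example zone sizes in pages: 16MB DMA, 880MB normal, 3GB high */
	unsigned long zone_pages[3] = { 4096, 225280, 786432 };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long mask = zone_pages[i] / zone_balance_ratio[i];
		printf("%-8s pages_min=%lu pages_low=%lu pages_high=%lu\n",
		       zone_name[i], mask, mask * 2, mask * 3);
	}
	return 0;
}

With a ratio of 32 instead of 128, the DMA zone keeps four times as
much of itself free, relative to its size, as the other zones do.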

I'll try to fix the cause of this out-of-balance zone pressure
thing when I get back to Curitiba after the weekend. Your new
code to deal with the problem when it happens looks nice, btw.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-19 16:48:05

by Linus Torvalds

Subject: Re: [PATCH] swap usage of high memory (fwd)


On Thu, 19 Jul 2001, Rik van Riel wrote:
>
> On Wed, 18 Jul 2001, Marcelo Tosatti wrote:
>
> > Still able to trigger the problem with the GFP_HIGHUSER patch applied.
>
> Hrrm, maybe the fact that the free target in the DMA zone is
> four times higher than in the other zones has something to do
> with the imbalance...

No, the free target is higher for the DMA zone just to make the small zone
not deplete as easily. It might make the problem slightly easier to
trigger, but I think the basic problem is real - some zones inherently
have higher pressure on them, and those zones do need to be aged faster.

Note that most people don't see this very much, because there are happily
not that many cases where the 16MB DMA limit matters any more. These days
you're more likely to start seeing the NORMAL vs HIGHMEM zone issues,
where the NORMAL zone just automatically has more pressure because a lot
of things like the icache/dcache can only be allocated from there.

Note that the unfair aging (apart from just being a natural requirement of
higher allocation pressure) actually has some other advantages too: it
ends up being a load balancing thing. Sure, it might throw out some things
that get "unfairly" treated, but once we bring them in again we have a
better chance of bringing them into a zone that _isn't_ under pressure.

So unfair eviction can actually end up being a natural solution to
different memory pressure too (it obviously only works if the memory
pressure isn't _too_ one-sided - if the great majority of allocations all
_have_ to be to the pressure zone, the other zones obviously have no way
to accept any of the extra pressure regardless of how hard they'd try).
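
(A toy illustration of the rebalancing effect, nothing like the real
code: pages evicted from the pressured zone refault into whichever zone
still has room, so the net effect is the same as the fallback below and
one-sided pressure spreads out by itself:)

#include <stdio.h>

#define NZONES   2
#define ZONE_CAP 100

static int used[NZONES];	/* pages in use per zone */

/* allocate one page, preferring zone 'pref' but falling back */
static int alloc_page_from(int pref)
{
	int i;

	for (i = 0; i < NZONES; i++) {
		int z = (pref + i) % NZONES;
		if (used[z] < ZONE_CAP) {
			used[z]++;
			return z;
		}
	}
	return -1;	/* everything full: now we really must evict */
}

int main(void)
{
	int n;

	/* zone 0 stands in for NORMAL, which icache/dcache/etc. all
	 * prefer; it fills first and the excess lands in zone 1 */
	for (n = 0; n < 150; n++)
		alloc_page_from(0);
	printf("zone0=%d zone1=%d\n", used[0], used[1]);
	return 0;
}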

Linus

2001-07-19 17:03:27

by Richard Gooch

Subject: Re: [PATCH] swap usage of high memory (fwd)

Linus Torvalds writes:
> Note that the unfair aging (apart from just being a natural
> requirement of higher allocation pressure) actually has some other
> advantages too: it ends up being a load balancing thing. Sure, it
> might throw out some things that get "unfairly" treated, but once we
> bring them in again we have a better chance of bringing them into a
> zone that _isn't_ under pressure.

What about moving data to zones with free pages? That would save I/O.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-07-19 17:24:38

by Rik van Riel

Subject: Re: [PATCH] swap usage of high memory (fwd)

On Thu, 19 Jul 2001, Linus Torvalds wrote:

> Note that the unfair aging (apart from just being a natural requirement of
> higher allocation pressure) actually has some other advantages too: it
> ends up being a load balancing thing. Sure, it might throw out some things
> that get "unfairly" treated, but once we bring them in again we have a
> better chance of bringing them into a zone that _isn't_ under pressure.
>
> So unfair eviction can actually end up being a natural solution to
> different memory pressure too

Note the difference between unfair aging and unfair eviction.

Unfair eviction is needed and is no problem because with
fair aging this will lead to a large surplus of inactive
pages in less loaded zones, increasing the chances that
future allocations will end up in those zones.

Unfair aging, OTOH, throws away that information, making
it harder for the system to get the pressure across the
zones equal again.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-07-19 17:25:48

by Linus Torvalds

Subject: Re: [PATCH] swap usage of high memory (fwd)


On Thu, 19 Jul 2001, Richard Gooch wrote:
> Linus Torvalds writes:
> > Note that the unfair aging (apart from just being a natural
> > requirement of higher allocation pressure) actually has some other
> > advantages too: it ends up being a load balancing thing. Sure, it
> > might throw out some things that get "unfairly" treated, but once we
> > bring them in again we have a better chance of bringing them into a
> > zone that _isn't_ under pressure.
>
> What about moving data to zones with free pages? That would save I/O.

Well, remember that we _are_ talking about pages that have been aged (just
a bit more aggressively than some other pages), and are not being used.
Dropping them may well be the right thing to do, and migrating them is
potentially very costly indeed (and can cause oscillating patterns etc
horror-scenarios).

Yes, true page migration might eventually be something we have to start
thinking about for NUMA machines, but I'd really really prefer just about
any alternative. Getting a good balance would be _much_ preferable to
having to take out the sledgehammer..

Linus

2001-07-19 17:41:19

by Richard Gooch

Subject: Re: [PATCH] swap usage of high memory (fwd)

Linus Torvalds writes:
>
> On Thu, 19 Jul 2001, Richard Gooch wrote:
> > Linus Torvalds writes:
> > > Note that the unfair aging (apart from just being a natural
> > > requirement of higher allocation pressure) actually has some other
> > > advantages too: it ends up being a load balancing thing. Sure, it
> > > might throw out some things that get "unfairly" treated, but once we
> > > bring them in again we have a better chance of bringing them into a
> > > zone that _isn't_ under pressure.
> >
> > What about moving data to zones with free pages? That would save I/O.
>
> Well, remember that we _are_ talking about pages that have been aged
> (just a bit more aggressively than some other pages), and are not
> being used.

Well, under memory pressure, those pages may be considered "old" but
in fact could be needed again soon.

> Dropping them may well be the right thing to do, and migrating them
> is potentially very costly indeed (and can cause oscillating
> patterns etc horror-scenarios).

If you move them, preserving the age and making sure not to evict
younger pages in the new zone, that should avoid the oscillations,
should it not?

Besides, I've seen plenty of oscillations when paging to/from the swap
device :-(

> Yes, true page migration might eventually be something we have to
> start thinking about for NUMA machines, but I'd really really prefer
> just about any alternative. Getting a good balance would be _much_
> preferable to having to take out the sledgehammer..

I agree that a good balancing algorithm is required. I'm just
suggesting that if you get to the point where you *have* to evict a
page from a zone, instead of just tossing it out, move it to another
zone if you can.

Note that I'm not necessarily suggesting writing dirty pages to
another zone instead of to swap. I was originally thinking of just
moving "clean" pages (i.e. those that can be freed without the need to
schedule I/O) so that potential subsequent I/O to pull them back in
may be avoided. Doing proper page migration is a more complex step
that needs further consideration.
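
(A schematic of what's being proposed here - not real kernel code, all
names are made up for illustration: on eviction of a clean page, first
try to copy it into a zone that still has free pages and repoint the
mapping, and only drop the page when no zone has room:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct page {
	char data[4096];
	int  zone;
	int  dirty;
};

/* hypothetical helper: returns a zone with free pages, or -1 */
static int zone_with_free_pages(int exclude)
{
	return exclude == 0 ? 1 : -1;	/* stand-in policy */
}

/* evict a page from its zone, migrating clean pages instead of
 * dropping them when another zone has room */
static void evict(struct page **slot)
{
	struct page *page = *slot;
	int target = zone_with_free_pages(page->zone);

	if (!page->dirty && target >= 0) {
		struct page *copy = malloc(sizeof(*copy));
		memcpy(copy->data, page->data, sizeof(copy->data));
		copy->zone = target;
		copy->dirty = 0;
		*slot = copy;	/* repoint the mapping: no I/O needed */
		free(page);
		printf("migrated clean page to zone %d\n", target);
	} else {
		*slot = NULL;	/* drop it; a refault will hit the disk */
		free(page);
		printf("evicted page, refault will do I/O\n");
	}
}

int main(void)
{
	struct page *p = calloc(1, sizeof(*p));
	p->zone = 0;
	evict(&p);
	free(p);
	return 0;
}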

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-07-20 18:51:53

by Dirk Wetter

Subject: Re: [PATCH] swap usage of high memory (fwd)


Hey Marcelo,

thx for your great work! our 4gb systems are working way better
now. i am running ac5 (without your inactive_plenty() patch
on top of that) on almost all (see below) of our big boxes.

also, it looks like the CPU affinity thing bought us a little
something too, as far as i was told, which is surprising to me, since
we normally run 2 jobs on the big 4GB SMP machines.

typically, top now looks like this on an ac5 kernel:

18:55pm up 1 day, 12:34, 3 users, load average: 2.08, 2.04, 2.00
64 processes: 60 sleeping, 3 running, 1 zombie, 0 stopped
CPU0 states: 87.0% user, 12.0% system, 87.1% nice, 0.0% idle
CPU1 states: 88.0% user, 11.2% system, 88.0% nice, 0.0% idle
Mem: 4057200K av, 3983816K used, 73384K free, 0K shrd, 2524K buff
Swap: 14337736K av, 1230256K used, 13107480K free 270296K cached

PID USER PRI NI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMAND
19038 usersid 15 4 2328M 470M 1.8G 214M 0M R N 89.4 46.8 463:07 ceqsim
19048 usersid 15 4 2328M 469M 1.8G 214M 0M R N 88.5 46.9 462:45 ceqsim
17925 usersid 9 0 824 40 784 592 49 S 11.9 0.0 30:31 top
24257 dirkw 14 0 1056 0 1056 828 57 R 11.9 0.0 0:01 top
1 root 8 0 76 12 64 64 4 S 0.0 0.0 0:22 init
2 root 8 0 0 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU0
4 root 19 19 0 0 0 0 0 SWN 0.0 0.0 0:00 ksoftirqd_CPU1
5 root 9 0 0 0 0 0 0 SW 0.0 0.0 59:12 kswapd
6 root 9 0 0 0 0 0 0 SW 0.0 0.0 14:10 kreclaimd
7 root 9 0 0 0 0 0 0 SW 0.0 0.0 0:12 bdflush
8 root 9 0 0 0 0 0 0 SW 0.0 0.0 1:16 kupdated

sar reports the system time - with some exceptions - to be better than
top does:

08:00:01 CPU %user %nice %system %idle
[..]
12:40:02 all 0.05 95.56 3.22 1.17
13:00:01 all 0.05 83.96 14.66 1.32
13:20:01 all 0.07 97.22 1.44 1.27
13:40:01 all 0.34 45.45 10.11 44.10
14:01:34 all 0.12 2.07 90.87 6.94
14:21:34 all 0.10 0.00 94.67 5.23
14:41:34 all 0.37 13.97 8.11 77.55
15:00:00 all 0.13 73.42 7.48 18.97
15:20:00 all 0.14 92.57 4.84 2.44
15:40:00 all 0.15 94.93 4.26 0.65
16:00:02 all 0.14 93.31 4.44 2.12
16:20:02 all 0.14 93.74 4.18 1.94
16:40:02 all 0.15 94.26 4.28 1.31
17:00:04 all 0.13 94.68 4.27 0.92
17:20:04 all 0.14 92.51 4.50 2.85
17:40:04 all 0.14 93.75 4.42 1.69
18:00:07 all 0.13 90.18 4.59 5.11
18:20:07 all 0.14 93.77 4.19 1.90
18:40:07 all 0.12 91.83 3.99 4.06

also i added the swap_state/GFP_HIGHUSER fix from Dave McCracken.
according to the poor statistics i have - two overnight jobs
only - my impression is that this helped, too (i think those were
exactly the same jobs as above):

[..]
03:20:01 all 0.05 93.49 2.22 4.23
03:40:01 all 0.06 96.98 1.72 1.24
04:00:01 all 0.06 95.08 1.79 3.07
04:20:01 all 0.05 96.95 1.22 1.78
04:40:01 all 0.06 94.59 1.56 3.79
05:00:01 all 0.06 94.37 1.86 3.71
05:20:01 all 0.06 96.32 1.32 2.30
05:40:01 all 0.06 94.62 1.84 3.48
06:00:02 all 0.07 96.17 1.42 2.34
06:20:02 all 0.06 94.17 1.61 4.15
06:40:02 all 0.06 96.46 1.92 1.56
07:00:01 all 0.05 92.75 1.53 5.67
07:20:01 all 0.05 95.25 1.67 3.03
07:40:01 all 0.05 94.34 1.97 3.65
08:00:00 all 0.05 94.93 1.20 3.81

08:00:00 CPU %user %nice %system %idle
08:20:00 all 0.06 93.97 1.83 4.14
08:40:00 all 0.16 96.57 1.67 1.60
09:00:01 all 0.06 94.15 1.49 4.30
09:20:01 all 0.06 95.71 1.67 2.56
09:40:01 all 0.07 95.04 1.89 3.00


i haven't got your patched vmstat running yet. i guess i'll do that
later and send in the logs. also the rsync/[di]cache issue i saw
previously needs some attention and log collection on my side.

forget about the pm i sent you yesterday, i found out myself ;) that
2.4.7-pre8 doesn't include any of your recent vm patches :-( (i am not
subscribed to lkml). the zoned patch however applies cleanly also to
the 2.4.7-pre8-xfs kernel i am using on one particular machine.

i am leaving in a few hours, catching my plane back home :)
given the major kernel improvements achieved for us during the last
1.5 weeks, and extrapolating a bit, i expect to read something next
week about Linux' world domination in every newspaper ;-) ok, i won't
be disappointed if that doesn't happen, i can still console myself
with good german beer. :))


cheers & thx guys!

~dirkw



On Wed, 18 Jul 2001, Marcelo Tosatti wrote:

>
>
> On Wed, 18 Jul 2001, Rik van Riel wrote:
>
> > Hi Alan, Linus,
> >
> > Dave found a stupid bug in the swapin code, leading to
> > bad balancing problems in the VM.
> >
> > I suspect marcelo's zone VM hack could even go away
> > with this patch applied ;)
>
> Rik,
>
> Still able to trigger the problem with the GFP_HIGHUSER patch applied.
>
>
>