2001-10-09 14:06:36

by Marcelo Tosatti

Subject: pre6 VM issues


Hi,

I've been testing pre6 (actually it's pre5 plus a patch which Linus sent
me, named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've
found some problems. First of all, we need to throttle normal allocators
more often and/or update the low memory limits for normal allocators to a
saner value. I already said I think allowing everybody to eat down to
"freepages.min" is too low a default.

I've got atomic memory allocation failures with _22GB_ of swap free (32GB total):

eth0: can't fill rx buffer (force 0)!

Another issue is the damn fork() special case. It's failing in practice:

bash: fork: Cannot allocate memory

Also with _LOTS_ of swap free (gigabytes of it).

Linus, we can introduce a "__GFP_FAIL" flag to be used by _everyone_ who
wants to do higher-order allocations as an optimization (e.g. allocating
big scatter-gather tables or whatever). Or do you prefer to make the
fork() allocation a separate case?
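
Just to make the intended semantics concrete, something like the sketch
below (illustrative only, with made-up helper names; not meant to be the
actual __alloc_pages code):

#define __GFP_WAIT  0x01        /* allocator may sleep                    */
#define __GFP_FAIL  0x02        /* proposed: opportunistic, fail fast     */

struct page { int dummy; };

/* Stand-ins for the real allocator internals. */
static struct page *rmqueue(int order) { (void)order; return 0; }
static void reclaim(unsigned int gfp_mask) { (void)gfp_mask; }

static struct page *alloc_pages_sketch(unsigned int gfp_mask, int order)
{
        for (;;) {
                struct page *page = rmqueue(order);

                if (page)
                        return page;
                if (gfp_mask & __GFP_FAIL)
                        return 0;       /* optimization only, caller copes */
                if (!(gfp_mask & __GFP_WAIT))
                        return 0;       /* atomic, cannot block: give up   */
                reclaim(gfp_mask);      /* i.e. go do try_to_free_pages()  */
        }
}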

I'll take a closer look at the code now and change the throttling/limits
to what I think is a saner default.



2001-10-09 14:10:46

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, Marcelo Tosatti wrote:

>
> Hi,
>
> I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
> named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found

I haven't woken up properly yet, I guess.

I mean it's pre6 with a patch named "p5p6" which Linus sent me.

2001-10-09 14:17:06

by BALBIR SINGH

Subject: Re: pre6 VM issues

Most of the traditional Unices maintained a pool for each subsystem
(this is really useful when you have the memory to spare), so no matter
what, they use memory only from their own pool (and peek outside it if
needed), but nobody else uses memory from that pool.

I have seen cases where I have run out of physical memory on my system
and I try to log in using the serial console, but the serial driver does
a get_free_page() which most likely fails, and the driver complains.
So I suggested a while back that important subsystems should maintain
their own pools (it will take a new thread to discuss the right size of
each pool).

Why can't Linux follow the same approach, especially on systems with a
lot of memory?
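
Something along these lines, as a user-space illustration only (the names
and sizes here are made up; this is not an existing kernel interface):
each subsystem fills its reserve up front, allocates from the reserve
first, and only falls back to the shared allocator when the reserve is
empty.

#include <stdlib.h>

#define POOL_RESERVE 32                 /* buffers reserved per subsystem */

struct mem_pool {
        void  *slot[POOL_RESERVE];
        int    nr_free;                 /* reserved buffers still unused  */
        size_t bufsize;
};

/* Fill the reserve up front, while memory is still plentiful. */
static int pool_init(struct mem_pool *p, size_t bufsize)
{
        p->bufsize = bufsize;
        for (p->nr_free = 0; p->nr_free < POOL_RESERVE; p->nr_free++) {
                p->slot[p->nr_free] = malloc(bufsize);
                if (!p->slot[p->nr_free])
                        return -1;      /* could not reserve everything   */
        }
        return 0;
}

/* Use the reserve first; "peek outside" only when it is dry. */
static void *pool_alloc(struct mem_pool *p)
{
        if (p->nr_free > 0)
                return p->slot[--p->nr_free];
        return malloc(p->bufsize);      /* fall back to the shared pool   */
}

/* Refill the reserve before giving anything back to the shared pool. */
static void pool_free(struct mem_pool *p, void *buf)
{
        if (p->nr_free < POOL_RESERVE)
                p->slot[p->nr_free++] = buf;
        else
                free(buf);
}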


Balbir

Marcelo Tosatti wrote:

>Hi,
>
>I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
>named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found
>out some problems. First of all, we need to throttle normal allocators
>more often and/or update the low memory limits for normal allocators to a
>saner value. I already said I think allowing everybody to eat up to
>"freepages.min" is too low for a default.
>
>I've got atomic memory failures with _22GB_ of swap free (32GB total):
>
> eth0: can't fill rx buffer (force 0)!
>
>Another issue is the damn fork() special case. Its failing in practice:
>
>bash: fork: Cannot allocate memory
>
>Also with _LOTS_ of swap free. (gigs of them)
>
>Linus, we can introduce a "__GFP_FAIL" flag to be used by _everyone_ which
>wants to do higher order allocations as an optimization (eg allocate big
>scatter-gather tables or whatever). Or do you prefer to make the fork()
>allocation a separate case ?
>
>I'll take a closer look at the code now and make the throttling/limits to
>what I think is saner for a default.
>





2001-10-09 14:23:26

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, BALBIR SINGH wrote:

> Most of the traditional unices maintained a pool for each subsystem
> (this is really useful when u have the memory to spare), so not matter
> what they use memory only from their pool (and if needed peek outside),
> but nobody else used the memory from the pool.
>
> I have seen cases where, I have run out of physical memory on my system,
> so I try to log in using the serial console, but since the serial driver
> does get_free_page (this most likely fails) and the driver complains back.
> So, I had suggested a while back that important subsystems should maintain
> their own pool (it will take a new thread to discuss the right size of
> each pool).
>
> Why can't Linux follow the same approach? especially on systems with a lot
> of memory.

There is nothing which prevents us from doing that (there is one reserved
pool I remember right now: the highmem bounce buffering pool, but that one
is a special case due to the way Linux does IO in high memory, and it's
only needed in _real_ emergencies --- it will be removed in 2.5, I hope).

In general, it's a better approach to share the memory and have a unified
pool. If a given subsystem is not using its own "reserved" memory, other
subsystems can use it.

The problem we are seeing now can be fixed even without the reserved
pools.


2001-10-09 14:32:06

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 10:44:37AM -0200, Marcelo Tosatti wrote:
>
> Hi,
>
> I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
> named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found
> out some problems. First of all, we need to throttle normal allocators
> more often and/or update the low memory limits for normal allocators to a
> saner value. I already said I think allowing everybody to eat up to
> "freepages.min" is too low for a default.
>
> I've got atomic memory failures with _22GB_ of swap free (32GB total):
>
> eth0: can't fill rx buffer (force 0)!
>
> Another issue is the damn fork() special case. Its failing in practice:
>
> bash: fork: Cannot allocate memory
>
> Also with _LOTS_ of swap free. (gigs of them)
>
> Linus, we can introduce a "__GFP_FAIL" flag to be used by _everyone_ which
> wants to do higher order allocations as an optimization (eg allocate big
> scatter-gather tables or whatever). Or do you prefer to make the fork()
> allocation a separate case ?
>
> I'll take a closer look at the code now and make the throttling/limits to
> what I think is saner for a default.

Last night I also finished fixing all the highmem troubles that I could
reproduce on 128MB with highmem emulation. I'm confident it will work
fine on real highmem too now; I hope to get access to a highmem machine
soon to test it there.

I guess you're not interested in testing my patches since they're not in
the mainline direction, though.

Andrea

2001-10-09 14:35:26

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, Andrea Arcangeli wrote:

> I guess you're not interested to test my patches since they're not in
> the mainline direction though.

Why are they not in the mainline direction?

Are they hackish?

2001-10-09 14:36:56

by BALBIR SINGH

Subject: Re: pre6 VM issues

Marcelo Tosatti wrote:

>
>On Tue, 9 Oct 2001, BALBIR SINGH wrote:
>
>>Most of the traditional unices maintained a pool for each subsystem
>>(this is really useful when u have the memory to spare), so not matter
>>what they use memory only from their pool (and if needed peek outside),
>>but nobody else used the memory from the pool.
>>
>>I have seen cases where, I have run out of physical memory on my system,
>>so I try to log in using the serial console, but since the serial driver
>>does get_free_page (this most likely fails) and the driver complains back.
>>So, I had suggested a while back that important subsystems should maintain
>>their own pool (it will take a new thread to discuss the right size of
>>each pool).
>>
>>Why can't Linux follow the same approach? especially on systems with a lot
>>of memory.
>>
>
>There is nothing which avoids us from doing that (there is one reserved
>pool I remeber right now: the highmem bounce buffering pool, but that one
>is a special case due to the way Linux does IO in high memory and its only
>needed on _real_ emergencies --- it will be removed in 2.5, I hope).
>
>In general, its a better approach to share the memory and have a unified
>pool. If a given subsystem is not using its own "reversed" memory, another
>subsystems can use it.
>
>The problem we are seeing now can be fixed even without the reserved
>pools.
>
I agree that is the fair and nice thing to do, but I was talking about
reserving memory for a device versus sharing it with a user process. User
processes can wait, and their pages can even be swapped out if needed.
But for a device that is not willing to wait (GFP_ATOMIC), say in
interrupt context, this might be an issue.


Anyway, how do you plan to solve this?
Balbir






2001-10-09 14:43:26

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 11:13:07AM -0200, Marcelo Tosatti wrote:
>
>
> On Tue, 9 Oct 2001, Andrea Arcangeli wrote:
>
> > I guess you're not interested to test my patches since they're not in
> > the mainline direction though.
>
> Why they are not in the mainline direction ?
>
> Are they hackish ?

IMHO it's the other way around. First of all, I'm not using the infinite
loop, and I dropped a few bits ready for doing a few different things in
the next few days, like selecting the process to kill as a function of
the allocation rate and collecting away exclusive pages in get_swap_page,
etc.

I'll release the stuff soon, as usual in separate patches that are easily
readable and mergeable, because as said I cannot find anything wrong
anymore in the allocator with my testing resources. Of course I'd really
like it if you could test it on the 16G box, but as said it won't test
the approach to the allocator failure fixes that has been implemented in
mainline, which I understood you're working on at the moment.

Andrea

2001-10-09 14:43:16

by BALBIR SINGH

Subject: Re: pre6 VM issues

BALBIR SINGH wrote:

>>
>> There is nothing which avoids us from doing that (there is one reserved
>> pool I remeber right now: the highmem bounce buffering pool, but that
>> one
>> is a special case due to the way Linux does IO in high memory and its
>> only
>> needed on _real_ emergencies --- it will be removed in 2.5, I hope).
>>
>> In general, its a better approach to share the memory and have a unified
>> pool. If a given subsystem is not using its own "reversed" memory,
>> another
>> subsystems can use it.
>>
>> The problem we are seeing now can be fixed even without the reserved
>> pools.
>>
> I agree that is the fair and nice thing to do, but I was talking about
> reserving
> memory for device vs sharing it with a user process, user processes
> can wait,
> their pages can even be swapped out if needed. But for a device that
> is not willing
> to wait (GFP_ATOMIC) say in an interrupt context, this might be a issue.


>
>
> Anyway, how do you plan to solve this ?
> Balbir


I did not realize that highmem was causing this problem you were facing;
anyway, my argument about the pools still holds.

Balbir






2001-10-09 14:45:16

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 08:07:19PM +0530, BALBIR SINGH wrote:
> their pages can even be swapped out if needed. But for a device that is not willing
> to wait (GFP_ATOMIC) say in an interrupt context, this might be a issue.

There is in fact already a reserved pool for atomic allocations. See the
__GFP_WAIT check in __alloc_pages.
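
Very roughly, the idea is this (a simplified sketch with placeholder
names, not the literal __alloc_pages code): callers that may sleep have
to respect the low watermark, while atomic callers are allowed to dip
part of the way below it, and that gap is in effect the reserved pool.

#define __GFP_WAIT  0x01                /* caller is allowed to sleep     */

struct zone_sketch {
        long free_pages;
        long pages_min;                 /* roughly "freepages.min"        */
};

static int allocation_allowed(const struct zone_sketch *z,
                              unsigned int gfp_mask)
{
        long min = z->pages_min;

        /*
         * Atomic callers (no __GFP_WAIT) may eat into part of the low
         * watermark; sleeping callers must stop above it and go do
         * reclaim instead.  The gap between the two limits is, in
         * effect, the pool reserved for atomic allocations.
         */
        if (!(gfp_mask & __GFP_WAIT))
                min /= 2;

        return z->free_pages > min;
}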

Andrea

2001-10-09 14:45:16

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, Andrea Arcangeli wrote:

> On Tue, Oct 09, 2001 at 10:44:37AM -0200, Marcelo Tosatti wrote:
> >
> > Hi,
> >
> > I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
> > named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found
> > out some problems. First of all, we need to throttle normal allocators
> > more often and/or update the low memory limits for normal allocators to a
> > saner value. I already said I think allowing everybody to eat up to
> > "freepages.min" is too low for a default.
> >
> > I've got atomic memory failures with _22GB_ of swap free (32GB total):
> >
> > eth0: can't fill rx buffer (force 0)!
> >
> > Another issue is the damn fork() special case. Its failing in practice:
> >
> > bash: fork: Cannot allocate memory
> >
> > Also with _LOTS_ of swap free. (gigs of them)
> >
> > Linus, we can introduce a "__GFP_FAIL" flag to be used by _everyone_ which
> > wants to do higher order allocations as an optimization (eg allocate big
> > scatter-gather tables or whatever). Or do you prefer to make the fork()
> > allocation a separate case ?
> >
> > I'll take a closer look at the code now and make the throttling/limits to
> > what I think is saner for a default.
>
> I've also finished last night to fix all highmem troubles that I could
> reproduce on 128mbyte with highmem emulation, I'm confidetn it will work
> fine on real highmem too now, I hope to get access soon to some highmem
> machine too to test it.
>
> I guess you're not interested to test my patches since they're not in
> the mainline direction though.

Ah, I forgot something: even if I'm not interested in the patches, the
16GB machine is available to the community. If you (or any other VM
people who need the machine) want access, just tell me.

2001-10-09 14:45:16

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, BALBIR SINGH wrote:

> Marcelo Tosatti wrote:
>
> >
> >On Tue, 9 Oct 2001, BALBIR SINGH wrote:
> >
> >>Most of the traditional unices maintained a pool for each subsystem
> >>(this is really useful when u have the memory to spare), so not matter
> >>what they use memory only from their pool (and if needed peek outside),
> >>but nobody else used the memory from the pool.
> >>
> >>I have seen cases where, I have run out of physical memory on my system,
> >>so I try to log in using the serial console, but since the serial driver
> >>does get_free_page (this most likely fails) and the driver complains back.
> >>So, I had suggested a while back that important subsystems should maintain
> >>their own pool (it will take a new thread to discuss the right size of
> >>each pool).
> >>
> >>Why can't Linux follow the same approach? especially on systems with a lot
> >>of memory.
> >>
> >
> >There is nothing which avoids us from doing that (there is one reserved
> >pool I remeber right now: the highmem bounce buffering pool, but that one
> >is a special case due to the way Linux does IO in high memory and its only
> >needed on _real_ emergencies --- it will be removed in 2.5, I hope).
> >
> >In general, its a better approach to share the memory and have a unified
> >pool. If a given subsystem is not using its own "reversed" memory, another
> >subsystems can use it.
> >
> >The problem we are seeing now can be fixed even without the reserved
> >pools.
> >
> I agree that is the fair and nice thing to do, but I was talking about reserving
> memory for device vs sharing it with a user process, user processes can wait,
> their pages can even be swapped out if needed. But for a device that is not willing
> to wait (GFP_ATOMIC) say in an interrupt context, this might be a issue.
>
>
> Anyway, how do you plan to solve this ?

I plan to have saner limits for atomic allocations in 2.4. For the corner
cases, we can then make those limits tunable.

For 2.5, I guess we'll need some scheme for those corner cases, since they
will probably become more common (think about gigabit ethernet, etc).

I'm not sure yet which one will be used. Ben ([email protected]) has a nice
scheme ready for reservations. But that's 2.5 only anyway.

2001-10-09 14:50:26

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 10:44:37AM -0200, Marcelo Tosatti wrote:
>
> Hi,
>
> I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
> named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found
> out some problems. First of all, we need to throttle normal allocators
> more often and/or update the low memory limits for normal allocators to a
> saner value. I already said I think allowing everybody to eat up to
> "freepages.min" is too low for a default.
>
> I've got atomic memory failures with _22GB_ of swap free (32GB total):
>
> eth0: can't fill rx buffer (force 0)!
>
> Another issue is the damn fork() special case. Its failing in practice:
>
> bash: fork: Cannot allocate memory
>
> Also with _LOTS_ of swap free. (gigs of them)

It could be just fragmentation, but the fact that it doesn't happen on
non-highmem pretty much shows that the memory balancing isn't doing the
right thing. You hide the problem with the infinite loop for non-atomic
order-0 allocations, and that's just broken; at best it will be slower in
collecting the right pages away.

My approach shouldn't fail so easily in fork even though I'm not looping
in fork either, because I'm trying to make better decisions in the memory
balancing in the first place; I don't wait for the infinite loop to
eventually collect away the right pages.

Andrea

2001-10-09 14:53:56

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 11:23:24AM -0200, Marcelo Tosatti wrote:
> machine is available to the community. If you (or any other VM people who
> need the machine) want access, just tell me.

I'd like to get a login. I think my project has been approved and we'll
soon get an additional machine to test on (that doesn't hurt), but in the
meantime I'd just be interested in running some tests on real highmem, of
course.

Andrea

2001-10-09 14:56:07

by BALBIR SINGH

Subject: Re: pre6 VM issues

Andrea Arcangeli wrote:

>On Tue, Oct 09, 2001 at 08:07:19PM +0530, BALBIR SINGH wrote:
>
>>their pages can even be swapped out if needed. But for a device that is not willing
>>to wait (GFP_ATOMIC) say in an interrupt context, this might be a issue.
>>
>
>There's just a reserved pool for atomic allocations. See the __GFP_WAIT
>check in __alloc_pages.
>
I apologize for my ignorance on this
Balbir

>
>Andrea
>





2001-10-09 14:57:16

by Marcelo Tosatti

Subject: Re: pre6 VM issues


On Tue, 9 Oct 2001, Andrea Arcangeli wrote:

> On Tue, Oct 09, 2001 at 10:44:37AM -0200, Marcelo Tosatti wrote:
> >
> > Hi,
> >
> > I've been testing pre6 (actually it's pre5 plus a patch which Linus sent me,
> > named "p5p6") with 16GB of RAM (thanks to OSDLabs for that), and I've found
> > out some problems. First of all, we need to throttle normal allocators
> > more often and/or update the low memory limits for normal allocators to a
> > saner value. I already said I think allowing everybody to eat up to
> > "freepages.min" is too low for a default.
> >
> > I've got atomic memory failures with _22GB_ of swap free (32GB total):
> >
> > eth0: can't fill rx buffer (force 0)!
> >
> > Another issue is the damn fork() special case. Its failing in practice:
> >
> > bash: fork: Cannot allocate memory
> >
> > Also with _LOTS_ of swap free. (gigs of them)
>
> It could be just fragmentation but the fact it doesn't happen in
> non-highmem pretty much shows that shows the memory balancing isn't
> doing the right thing, you hide the problem with the infinite loop for
> non atomic order 0 allocations and that's just broken, as best it will
> be slower in collecting the right pages away.
>
> My approch shouldn't fail so easily in fork despite I'm not looping in
> fork either, because I'm trying to do better decisions since the first
> place in the memory balancing, I don't wait the infinite loop to
> eventually collect away the right pages.

The problem may well be in the memory balancing, Andrea, but I'm not
trying to hide it with the infinite loop.

The infinite loop is just a guarantee that we'll have a reliable way of
throttling the allocators which can block. Not doing the infinite loop is
just way too fragile IMO, and it is _prone_ to fail under intensive
loads.

If the problem is the highmem balancing, I'd love to get your fixes and
integrate them with the infinite loop logic, which is a separate
(related, yes, but separate) thing.

2001-10-09 15:40:09

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 11:34:47AM -0200, Marcelo Tosatti wrote:
> The problem may well be in the memory balancing Andrea, but I'm not trying
> to hide it with the infinite loop.

I assumed fixing the oom failures with highmem was the main reason for
the infinite loop.

> The infinite loop is just a guarantee that we'll have a reliable way of
> throttling the allocators which can block. Not doing the infinite loop is

Throttling has nothing to do with the infinite loop.

> just way too fragile IMO and it is _prone_ to fail in intensive
> loads.

It is too fragile if the VM is taking the wrong actions and so we must
loop over and over again before it finally does the right thing.

If an allocation fails, that's nice feedback that tells us "the memory
balancing is at least inefficient in doing the right thing; looping would
only waste more cache and more time for the allocation".

Think of a list where pages can only be freeable or unfreeable. Now scan
_all_ the pages and free all the freeable ones. Finished. If it failed
and couldn't free anything, it means there was nothing to free, so we're
oom. How can that be "fragile"?
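
Spelled out as a toy model (the types and names here are invented purely
for illustration, this is not kernel code):

struct toy_page {
        struct toy_page *next;
        int freeable;
};

/*
 * One full scan with the "lock" held: unlink everything freeable.
 * Returns the number of pages freed; 0 after a complete scan means
 * there really was nothing to free, i.e. we are oom.
 */
static int scan_once(struct toy_page **list)
{
        struct toy_page **pp = list;
        int freed = 0;

        while (*pp) {
                struct toy_page *p = *pp;

                if (p->freeable) {
                        *pp = p->next;          /* "free" the page */
                        freed++;
                } else {
                        pp = &p->next;
                }
        }
        return freed;
}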

In real life it isn't as simple as that: there's some "race" effect
coming from the schedules in between, there are multiple lists, there's
swapout, etc., so it's a little more complex than just "freeable" and
"unfreeable" and a single list, but it can be done; 2.2 does that too. If
we loop over and over again and make no progress in the right direction,
I prefer to know about that via an allocation failure rather than by just
getting sucky performance. Also, an allocation failure is a minor problem
compared to a deadlock, which the infinite loop cannot prevent.

> If the problem is the highmem balancing, I'll love to get your fixes and
> integrate with the infinite loop logic, which is a separated (related,
> yes, but separate) thing.

The infinite loop shouldn't do anything except introduce the deadlock
after that (otherwise it means I failed :), but you're free to go in your
direction if you think it's the right one, of course (just as I'm free to
go in my direction since I think it's the right one).

Andrea

2001-10-09 16:30:12

by Marcelo Tosatti

Subject: Re: pre6 VM issues



On Tue, 9 Oct 2001, Andrea Arcangeli wrote:

> On Tue, Oct 09, 2001 at 11:34:47AM -0200, Marcelo Tosatti wrote:
> > The problem may well be in the memory balancing Andrea, but I'm not trying
> > to hide it with the infinite loop.
>
> I assumed fixing the oom faliures with highmem was the main reason of
> the infinite loop.
>
> > The infinite loop is just a guarantee that we'll have a reliable way of
> > throttling the allocators which can block. Not doing the infinite loop is
>
> Throttling have nothing to do with the infinite loop.

Sorry, but the infinite loop does throttle: it keeps the allocating
process doing page reclamation until there is enough memory for it to go
on.

> > just way too fragile IMO and it is _prone_ to fail in intensive
> > loads.
>
> It is too fragile if the vm is doing the wrong actions and so we must
> loop over and over again before it finally does the right thing.
>
> If allocation fails that's a nice feedback that tell us "the memory
> balancing is at least inefficient in doing the right thing, looping
> would only waste more cache and more time for the allocation".
>
> Think a list where pages can be only freeable or unfreeable. Now scan
> _all_ the pages and free all the freeable ones. Finished. If it failed
> and it couldn't free anything it means there was nothing to free so
> we're oom. How can that be "fragile"?

That is fragile IMHO, Andrea.

The infinite loop is simple, reliable logic which has been shown to work
(as long as the OOM killer is working correctly).

> In real life it isn't as simple as that, there's some "race" effect
> caming from the schedules in between, there are multiple lists, there's
> swapout etc... so it's a little more complex than just "freeable" and
> "unfreeable" and a single list, but it can be done, 2.2 does that too,
> if we loop over and over again and we do no progress in the right
> direction I prefer to know about that via an allocation faliure rather
> than by just getting sucking performance. Also an allocation faliure is
> a minor problem compared to a deadlock that the infinite loop cannot
> prevent.

If the OOM killer is doing its job correctly, a deadlock will not happen.

> > If the problem is the highmem balancing, I'll love to get your fixes and
> > integrate with the infinite loop logic, which is a separated (related,
> > yes, but separate) thing.
>
> The infinite loop shouldn't do anything except introducing the deadlock
> after that (otherwise it means I failed :), but you're free to go in
> your direction if you think it's the right one of course (like I'm free
> to go in my direction since I think it's the right one).

Sure. :)

2001-10-09 16:50:04

by Andrea Arcangeli

Subject: Re: pre6 VM issues

On Tue, Oct 09, 2001 at 01:08:14PM -0200, Marcelo Tosatti wrote:
> Sorry but the infinite loop does throttles page reclamation until there
> is enough memory for the process allocating memory to go on.

If you think we're missing throttling and you add the infinite loop,
then yes, you'll hide the lack of throttling by looping at full CPU speed
rather than using the CPU for more useful things, but that doesn't mean
the looping in itself adds throttling; a loop can't add throttling.

> > > just way too fragile IMO and it is _prone_ to fail in intensive
> > > loads.
> >
> > It is too fragile if the vm is doing the wrong actions and so we must
> > loop over and over again before it finally does the right thing.
> >
> > If allocation fails that's a nice feedback that tell us "the memory
> > balancing is at least inefficient in doing the right thing, looping
> > would only waste more cache and more time for the allocation".
> >
> > Think a list where pages can be only freeable or unfreeable. Now scan
> > _all_ the pages and free all the freeable ones. Finished. If it failed
> > and it couldn't free anything it means there was nothing to free so
> > we're oom. How can that be "fragile"?
>
> That is fragile IMHO, Andrea.

Care to explain "why"? Of course you can't, because it isn't fragile,
period.

If you have a list whose elements can be freeable or unfreeable, and you
scan the whole thing with all the locks held and free everything freeable
that you find on your way, you know that if you didn't free anything
after the scan completed, you're oom.

As said, the real world is more complex, but the example above is really
obvious.

> The infinite loop is simple, reliable logic which shows to works (as long
> as the OOM killer is working correctly).

The infinite loop adds oom deadlocks and hides the real problems in the
memory balancing.

> If the OOM killer is doing its job correctly, a deadlock will not happen.

I quote my first email about pre4 (I think I CC'ed you too):

".. think if the oom-selected task is looping trying to free memory, it
won't care about the signal you sent to it .."

And that was just a simple case; there are more problems. The above one
can be easily fixed with a simple check for a pending signal within the
loop, which is currently still missing and which you seem not to care to
add even after I mentioned this exact problem as soon as pre4 was
released (I didn't fix it because I'm not using the loop, because there
would be other problems, and because I don't need the loop just to detect
oom).

I think it's useless to keep discussing this; no matter what I say and
what problems I raise, you will keep thinking the loop is the right way,
as far as I can see.

Andrea

2001-10-09 17:08:25

by Linus Torvalds

Subject: Re: pre6 VM issues


On Tue, 9 Oct 2001, Andrea Arcangeli wrote:
>
> If you think we're missing throttling and you add the infinite loop,
> yes, you'll hide the lack of throttling by looping at full cpu speed
> rather than using the cpu for more useful things, but that doesn't mean
> that the looping in itself is adding throttling, a loop can't add
> throttling.

The loop means that the return value of "try_to_free_pages()" basically
becomes meaningless _except_ as a way of telling kswapd that "we are now
having trouble freeing pages, maybe you should check if something should
be killed".

Which means that "try_to_free_pages()" has more freedom in doing whatever
it is it wants to do - it knows that real allocations will call it again
(after having checked whether the process can die).
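
The control flow is roughly the following (a sketch with placeholder
names, not the literal pre6 source): the allocation path owns the retry
loop, and every pass around it is a natural place to check whether the
task has been chosen to die.

struct page;                            /* opaque here                    */

/* Placeholders standing in for the real internals. */
extern struct page *rmqueue(int order);
extern int try_to_free_pages(unsigned int gfp_mask);
extern int picked_by_oom_killer(void);  /* hypothetical helper            */

static struct page *allocate_sketch(unsigned int gfp_mask, int order)
{
        for (;;) {
                struct page *page = rmqueue(order);

                if (page)
                        return page;

                /* A task the OOM killer picked must leave the loop
                 * and die instead of spinning here forever.          */
                if (picked_by_oom_killer())
                        return 0;

                /* The return value is only a hint that freeing pages
                 * is getting hard; the caller just tries again.      */
                try_to_free_pages(gfp_mask);
        }
}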

> > > Think a list where pages can be only freeable or unfreeable. Now scan
> > > _all_ the pages and free all the freeable ones. Finished. If it failed
> > > and it couldn't free anything it means there was nothing to free so
> > > we're oom. How can that be "fragile"?
> >
> > That is fragile IMHO, Andrea.
>
> Mind to explain "why"? Of course you can't because it isn't fragile,
> period.

It's fragile because it means that, for this to be true, the
try_to_free_pages() logic _has_to_guarantee_ that it looked at every
single page - going through every list that it ages _twice_, to get rid
of potential accessed bits.

For example, it means that if there are lots of pages that just happen to
be locked due to having pending write-outs on them, you will return OOM.
Even if the system isn't out of memory - it's only temporarily locked, and
what try_to_free_pages() should have done is probably to wait on a page.

HOWEVER, you cannot afford to wait on a single page with your approach,
because if you wait for pages that you notice are locked, _together_ with
the requirement that you have to go through every single list twice, you'd
be totally screwed, and people might wait for a really long time.

So what do you do? You never wait at all, and just skip locked pages.
Which means that your loop can never throttle, and because you refuse to
see the light about the "endless loop", you can never really even _start_
throttling on IO without adding more and more special cases.

> The infinite loop adds oom deadlocks and hides the real problems in the
> memory balancing.

You've not shown that to be true. Look at the code, tell us how it
deadlocks.

> I quote my first email about pre4 (I think I CC'ed you too):
>
> ".. think if the oom-selected task is looping trying to free memory, it
> won't care about the signal you sent to it .."

Look again, and read the emails we've sent you.

You refuse to listen, and that's the problem. Check the PF_MEMALLOC logic,
and stop blathering about things that you do not understand.
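
Roughly (simplified, with made-up names rather than the actual source):
a task that is already inside the reclaim path, and likewise a task the
OOM killer has picked, runs with PF_MEMALLOC set, and the allocator
serves such a task straight from the reserves instead of sending it back
into the loop, so it can make progress and exit.

struct task_sketch {
        unsigned long flags;            /* stands in for current->flags   */
};

#define PF_MEMALLOC_BIT 0x1             /* illustrative value only        */

/*
 * Simplified policy: a PF_MEMALLOC task bypasses the watermarks and the
 * retry loop and takes from the reserves, so the reclaim path (or a task
 * trying to die) never blocks on, or loops inside, its own allocations.
 */
static int may_use_reserves(const struct task_sketch *tsk)
{
        return (tsk->flags & PF_MEMALLOC_BIT) != 0;
}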

Linus