This set of patches provides a TCP/IP emergency communication mechanism that
can be used to guarantee that high-priority communications over a critical socket
succeed even under very low memory conditions that last for a couple of
minutes. It uses the critical page pool facility provided by Matt's patches
that he posted recently on lkml.
http://lkml.org/lkml/2005/12/14/34/index.html
This mechanism provides a new socket option SO_CRITICAL that can be used to
mark a socket as critical. A critical connection used for emergency
communications has to be established and marked as critical before we enter
the emergency condition.
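As a rough userspace sketch (assuming the proposed SO_CRITICAL option; the
numeric value below is a placeholder, since the option is not in mainline),
marking a socket critical would look something like this:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_CRITICAL
#define SO_CRITICAL 38			/* hypothetical value, illustration only */
#endif

int main(void)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* mark the connection as critical before the emergency begins */
	if (setsockopt(fd, SOL_SOCKET, SO_CRITICAL, &one, sizeof(one)) < 0)
		perror("setsockopt(SO_CRITICAL)");

	/* ... connect() and use the socket as usual ... */
	return 0;
}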
It uses the __GFP_CRITICAL flag introduced in the critical page pool patches
to indicate that an allocation request is critical and should be satisfied from the
critical page pool if required. In the send path, this flag is passed with all
allocation requests that are made for a critical socket. But in the receive
path we do not know whether a packet is critical until we receive it and
find the socket it is destined for. So we treat all the allocation
requests in the receive path as critical.
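A minimal kernel-tree sketch of that send-path flag selection (not standalone
code: __GFP_CRITICAL comes from the critical page pool patches, sk_critical is
an assumed per-socket flag, and alloc_skb()/GFP_ATOMIC are existing interfaces):

static struct sk_buff *critical_alloc_skb(struct sock *sk, unsigned int size)
{
	gfp_t gfp = GFP_ATOMIC;

	if (sk->sk_critical)		/* hypothetical per-socket flag */
		gfp |= __GFP_CRITICAL;	/* may be satisfied from the critical pool */

	return alloc_skb(size, gfp);
}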
The critical page pool patches also introduce a global flag
'system_in_emergency' that is used to indicate an emergency situation (which could be
a low memory condition). When this flag is set, any incoming packets that belong
to non-critical sockets are dropped as soon as possible in the receive path.
This is necessary to prevent incoming non-critical packets from consuming memory
from the critical page pool.
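A minimal sketch of that receive-side drop, again a kernel-tree sketch only
(system_in_emergency is the flag these patches introduce, sk_critical is an
assumed per-socket flag; kfree_skb() is the usual way to drop an skb):

static inline int emergency_drop(struct sock *sk, struct sk_buff *skb)
{
	if (system_in_emergency && !sk->sk_critical) {
		kfree_skb(skb);		/* give the memory back as soon as possible */
		return 1;		/* caller stops processing this packet */
	}
	return 0;
}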
I would appreciate any feedback or comments on this approach.
Thanks
Sridhar
> I would appreciate any feedback or comments on this approach.
Maybe I'm missing something, but wouldn't you need a separate critical
pool (or at least a reservation) for each socket to be safe against deadlocks?
Otherwise, if a critical socket needs e.g. 2 pages to finish something
and 2 critical sockets are active, they can each steal the last pages
from each other and deadlock.
-Andi
On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
> > I would appreciate any feedback or comments on this approach.
>
> Maybe I'm missing something, but wouldn't you need a separate critical
> pool (or at least a reservation) for each socket to be safe against deadlocks?
>
> Otherwise, if a critical socket needs e.g. 2 pages to finish something
> and 2 critical sockets are active, they can each steal the last pages
> from each other and deadlock.
Here we are assuming that the pre-allocated critical page pool is big enough
to satisfy the requirements of all the critical sockets.
In the current critical page pool implementation, there is also a limitation
that only order-0 allocations (single page) are supported. I think in the
networking send/receive path, the only place where multi-page allocs are
requested is in the drivers if the MTU > PAGESIZE. But I guess the drivers
are getting updated to avoid > order-0 allocations.
Also during the emergency, we free the memory allocated for non-critical
packets as quickly as possible so that it can be re-used for critical
allocations.
Thanks
Sridhar
> Here we are assuming that the pre-allocated critical page pool is big enough
> to satisfy the requirements of all the critical sockets.
That seems like a lot of assumptions. Is it really better than the
existing GFP_ATOMIC, which works basically the same? It has a lot
more users that compete, true, but likely the set of GFP_CRITICAL users
would grow over time too and it would develop the same problem.
I think if you really want to attack this problem and improve
over the GFP_ATOMIC "best effort in smaller pool" approach you should
probably add real reservations. And then really do a lot of testing
to see if it actually helps.
-Andi
> It has a lot
> more users that compete, true, but likely the set of GFP_CRITICAL users
> would grow over time too and it would develop the same problem.
No, because the critical set is determined by the user (by setting
the socket flag).
The receive side has some things marked as "critical" until we
have processed enough to check the socket flag, but then they should
be released. Those short-lived allocations and frees are more or less
net zero towards the pool.
Certainly, it wouldn't work very well if every socket is
marked as "critical", but with an adequate pool for the workload, I
expect it'll work as advertised (esp. since it'll usually be only one
socket associated with swap management that'll be critical).
+-DLS
On 12/14/05, Sridhar Samudrala <[email protected]> wrote:
>
> This set of patches provides a TCP/IP emergency communication mechanism that
> can be used to guarantee that high-priority communications over a critical socket
> succeed even under very low memory conditions that last for a couple of
> minutes. It uses the critical page pool facility provided by Matt's patches
> that he posted recently on lkml.
> http://lkml.org/lkml/2005/12/14/34/index.html
>
> This mechanism provides a new socket option SO_CRITICAL that can be used to
> mark a socket as critical. A critical connection used for emergency
So now everyone writing commercial apps for Linux is going to set
SO_CRITICAL on sockets in their apps so their apps can "survive better
under pressure than the competitors' apps", and clueless programmers all
over are going to think "cool, with this I can make my app more
important than everyone else's, I'm going to use this". When everyone
and his dog starts to set this, what's the point?
> communications has to be established and marked as critical before we enter
> the emergency condition.
>
> It uses the __GFP_CRITICAL flag introduced in the critical page pool patches
> to indicate that an allocation request is critical and should be satisfied from the
> critical page pool if required. In the send path, this flag is passed with all
> allocation requests that are made for a critical socket. But in the receive
> path we do not know whether a packet is critical until we receive it and
> find the socket it is destined for. So we treat all the allocation
> requests in the receive path as critical.
>
> The critical page pool patches also introduce a global flag
> 'system_in_emergency' that is used to indicate an emergency situation (which could be
> a low memory condition). When this flag is set, any incoming packets that belong
> to non-critical sockets are dropped as soon as possible in the receive path.
Hmm, so if I fire up an app that has SO_CRITICAL set on a socket and
can then somehow put a lot of memory pressure on the machine, I can
cause traffic on other sockets to be dropped.. hmmm.. sounds like
something to play with to create new and interesting DoS attacks...
> This is necessary to prevent incoming non-critical packets from consuming memory
> from the critical page pool.
>
> I would appreciate any feedback or comments on this approach.
>
To be a little serious, it sounds like something that could be used to
cause trouble and something that will lose its usefulness once enough
people start using it (for valid or invalid reasons), so what's the
point...
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
Jesper Juhl wrote:
> To be a little serious, it sounds like something that could be used to
> cause trouble and something that will lose its usefulness once enough
> people start using it (for valid or invalid reasons), so what's the
> point...
It could easily be a user-configurable option in an application. If
DOS is a real concern, only let this work for root users...
Ben
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
Jesper Juhl wrote:
> On 12/14/05, Sridhar Samudrala <[email protected]> wrote:
>
>>These set of patches provide a TCP/IP emergency communication mechanism that
>>could be used to guarantee high priority communications over a critical socket
>>to succeed even under very low memory conditions that last for a couple of
>>minutes. It uses the critical page pool facility provided by Matt's patches
>>that he posted recently on lkml.
>> http://lkml.org/lkml/2005/12/14/34/index.html
>>
>>This mechanism provides a new socket option SO_CRITICAL that can be used to
>>mark a socket as critical. A critical connection used for emergency
>
>
> So now everyone writing commercial apps for Linux is going to set
> SO_CRITICAL on sockets in their apps so their apps can "survive better
> under pressure than the competitors' apps", and clueless programmers all
> over are going to think "cool, with this I can make my app more
> important than everyone else's, I'm going to use this". When everyone
> and his dog starts to set this, what's the point?
>
>
I don't think the initial patches that Matt did were intended for what
you are describing.
When I had the conversation with Matt at KS, the problem we were trying
to solve was "Memory pressure with network attached swap space".
I came up with the idea that I think Matt has implemented.
Letting the OS choose which are "critical" TCP/IP sessions is fine. But
letting an application choose is a recipe for disaster.
James
On Wed, 2005-12-14 at 20:49 +0000, James Courtier-Dutton wrote:
> Jesper Juhl wrote:
> > On 12/14/05, Sridhar Samudrala <[email protected]> wrote:
> >
> >>These set of patches provide a TCP/IP emergency communication mechanism that
> >>could be used to guarantee high priority communications over a critical socket
> >>to succeed even under very low memory conditions that last for a couple of
> >>minutes. It uses the critical page pool facility provided by Matt's patches
> >>that he posted recently on lkml.
> >> http://lkml.org/lkml/2005/12/14/34/index.html
> >>
> >>This mechanism provides a new socket option SO_CRITICAL that can be used to
> >>mark a socket as critical. A critical connection used for emergency
> >
> >
> > So now everyone writing commercial apps for Linux are going to set
> > SO_CRITICAL on sockets in their apps so their apps can "survive better
> > under pressure than the competitors aps" and clueless programmers all
> > over are going to think "cool, with this I can make my app more
> > important than everyone elses, I'm going to use this". When everyone
> > and his dog starts to set this, what's the point?
> >
> >
>
> I don't think the initial patches that Matt did were intended for what
> you are describing.
> When I had the conversation with Matt at KS, the problem we were trying
> to solve was "Memory pressure with network attached swap space".
> I came up with the idea that I think Matt has implemented.
> Letting the OS choose which are "critical" TCP/IP sessions is fine. But
> letting an application choose is a recipe for disaster.
We could easily add a capable(CAP_NET_ADMIN) check to allow this option to
be set only by privileged users.
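A minimal sketch of that check, as it might sit behind the SO_CRITICAL case
in sock_setsockopt() (capable() and CAP_NET_ADMIN are existing kernel
interfaces; SO_CRITICAL and the sk_critical field are assumptions from these
patches):

static int sock_set_critical(struct sock *sk, int val)
{
	if (!capable(CAP_NET_ADMIN))
		return -EPERM;		/* unprivileged users may not mark sockets critical */

	sk->sk_critical = !!val;	/* hypothetical per-socket flag */
	return 0;
}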
Thanks
Sridhar
Sridhar Samudrala wrote:
> On Wed, 2005-12-14 at 20:49 +0000, James Courtier-Dutton wrote:
>
>>Jesper Juhl wrote:
>>
>>>On 12/14/05, Sridhar Samudrala <[email protected]> wrote:
>>>
>>>
>>>>These set of patches provide a TCP/IP emergency communication mechanism that
>>>>could be used to guarantee high priority communications over a critical socket
>>>>to succeed even under very low memory conditions that last for a couple of
>>>>minutes. It uses the critical page pool facility provided by Matt's patches
>>>>that he posted recently on lkml.
>>>> http://lkml.org/lkml/2005/12/14/34/index.html
>>>>
>>>>This mechanism provides a new socket option SO_CRITICAL that can be used to
>>>>mark a socket as critical. A critical connection used for emergency
>>>
>>>
>>>So now everyone writing commercial apps for Linux are going to set
>>>SO_CRITICAL on sockets in their apps so their apps can "survive better
>>>under pressure than the competitors aps" and clueless programmers all
>>>over are going to think "cool, with this I can make my app more
>>>important than everyone elses, I'm going to use this". When everyone
>>>and his dog starts to set this, what's the point?
>>>
>>>
>>
>>I don't think the initial patches that Matt did were intended for what
>>you are describing.
>>When I had the conversation with Matt at KS, the problem we were trying
>>to solve was "Memory pressure with network attached swap space".
>>I came up with the idea that I think Matt has implemented.
>>Letting the OS choose which are "critical" TCP/IP sessions is fine. But
>>letting an application choose is a recipe for disaster.
>
>
> We could easily add a capable(CAP_NET_ADMIN) check to allow this option to
> be set only by privileged users.
>
> Thanks
> Sridhar
>
Sridhar,
Have you actually thought about what would happen in a real world scenario?
There is no real world requirement for this sort of user land feature.
In memory pressure mode, you don't care about user applications. In
fact, under memory pressure no user applications are getting scheduled.
All you care about is swapping out memory to achieve a net gain in free
memory, so that the applications can then run ok again.
James
James Courtier-Dutton wrote:
> Have you actually thought about what would happen in a real world scenario?
> There is no real world requirement for this sort of user land feature.
> In memory pressure mode, you don't care about user applications. In
> fact, under memory pressure no user applications are getting scheduled.
> All you care about is swapping out memory to achieve a net gain in free
> memory, so that the applications can then run ok again.
Low 'ATOMIC' memory is different from the memory that user space typically
uses, so just because you can't allocate an SKB does not mean you are swapping
out user-space apps.
I have an app that can have 2000+ sockets open. I would definitely like to make
the management and other important sockets have priority over others in my app...
Ben
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
On Wed, 2005-12-14 at 14:39 -0800, Ben Greear wrote:
> James Courtier-Dutton wrote:
>
> > Have you actually thought about what would happen in a real world senario?
> > There is no real world requirement for this sort of user land feature.
> > In memory pressure mode, you don't care about user applications. In
> > fact, under memory pressure no user applications are getting scheduled.
> > All you care about is swapping out memory to achieve a net gain in free
> > memory, so that the applications can then run ok again.
>
> Low 'ATOMIC' memory is different from the memory that user space typically
> uses, so just because you can't allocate an SKB does not mean you are swapping
> out user-space apps.
>
> I have an app that can have 2000+ sockets open. I would definitely like to make
> the management and other important sockets have priority over others in my app...
The scenario we are trying to address is also a management connection between the
nodes of a cluster and a server that manages the swap devices accessible by all the
nodes of the cluster. The critical connection is supposed to be used to exchange
status notifications of the swap devices so that failover can happen and be propagated
to all the nodes as quickly as possible. The management apps will be pinned into
memory so that they are not swapped out.
As such the traffic that flows over the critical sockets is not high but should
not stall even if we run into a memory constrained situation. That is the reason
why we would like to have a pre-allocated critical page pool which could be used
when we run out of ATOMIC memory.
Thanks
Sridhar
James Courtier-Dutton wrote:
> When I had the conversation with Matt at KS, the problem we were trying
> to solve was "Memory pressure with network attached swap space".
s/swap space/writable filesystems/
You can hit these problems even if you have no swap. Too much of the
memory becomes filled with dirty pages needing writeback -- then you lose
your NFS server's ARP entry at the wrong moment. If you have a local disk
to swap to, the machine will recover after a little bit of grinding; otherwise
it's all pretty much over.
The big problem is that as long as there's network I/O coming in it's
likely that pages you free (as the VM gets more and more desperate about
dropping the few remaining non-dirty pages) will get used for sockets
that AREN'T helping you recover RAM. You really need to be able to tell
the whole network stack "we're in really rough shape here; ignore all RX
work unless it's going to help me get write ACKs back from my {NFS,iSCSI}
server" My understanding is that is what this patchset is trying to
accomplish.
-Mitch
On Wed, Dec 14, 2005 at 09:55:45AM -0800, Sridhar Samudrala wrote:
> On Wed, 2005-12-14 at 10:22 +0100, Andi Kleen wrote:
> > > I would appreciate any feedback or comments on this approach.
> >
> > Maybe I'm missing something, but wouldn't you need a separate critical
> > pool (or at least a reservation) for each socket to be safe against deadlocks?
> >
> > Otherwise, if a critical socket needs e.g. 2 pages to finish something
> > and 2 critical sockets are active, they can each steal the last pages
> > from each other and deadlock.
>
> Here we are assuming that the pre-allocated critical page pool is big enough
> to satisfy the requirements of all the critical sockets.
Not a good assumption. A system can have between 1-1000 iSCSI
connections open and we certainly don't want to preallocate enough
room for 1000 connections to make progress when we might only have one
in use.
I think we need a global receive pool and per-socket send pools.
--
Mathematics is the supreme nostalgia of our time.
From: Matt Mackall <[email protected]>
Date: Wed, 14 Dec 2005 19:39:37 -0800
> I think we need a global receive pool and per-socket send pools.
Mind telling everyone how you plan to make use of the global receive
pool when the allocation happens in the device driver and we have no
idea which socket the packet is destined for? What should be done for
non-local packets being routed? The device drivers allocate packets
for the entire system, long before we know who the eventually received
packets are for. It is fully anonymous memory, and it's easy to
design cases where the whole pool can be eaten up by non-local
forwarded packets.
I truly dislike these patches being discussed because they are a
complete hack, and admittedly don't even solve the problem fully. I
don't have any concrete better ideas but that doesn't mean this stuff
should go into the tree.
I think GFP_ATOMIC memory pools are more powerful than they are given
credit for. There is nothing preventing the implementation of dynamic
GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in"
in response to hitting those water marks.
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for? What should be done for
> non-local packets being routed? The device drivers allocate packets
> for the entire system, long before we know who the eventually received
> packets are for. It is fully anonymous memory, and it's easy to
> design cases where the whole pool can be eaten up by non-local
> forwarded packets.
There needs to be two rules:
iff global memory critical flag is set
- allocate from the global critical receive pool on receive
- return packet to global pool if not destined for a socket with an
attached send mempool
I think this will provide the desired behavior, though only
probabilistically. That is, we can fill the global receive pool with
uninteresting packets such that we're forced to drop critical ACKs,
but the boring packets will eventually be discarded as we walk up the
stack and we'll eventually have room to receive retried ACKs.
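A rough sketch of the per-socket send pool half, using the existing mempool
API only to suggest the shape of such a reservation (sk_send_pool and the
reserve size are assumptions; skbuff_head_cache is the skb head slab in
net/core/skbuff.c, and a real reservation would also have to cover packet
data, not just skb heads):

#define CRIT_SKB_RESERVE 32	/* assumed worst case for one critical socket */

static int sock_attach_send_pool(struct sock *sk)
{
	sk->sk_send_pool = mempool_create(CRIT_SKB_RESERVE,
					  mempool_alloc_slab,
					  mempool_free_slab,
					  skbuff_head_cache);
	return sk->sk_send_pool ? 0 : -ENOMEM;
}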
> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully. I
> don't have any concrete better ideas but that doesn't mean this stuff
> should go into the tree.
Agreed. I'm fairly convinced a full fix is doable, if you make a
couple assumptions (limited fragmentation), but will unavoidably be
less than pretty as it needs to cross some layers.
> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for. There is nothing preventing the implementation of dynamic
> GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in"
> in response to hitting those water marks.
There are two problems with GFP_ATOMIC. The first is that its users
don't pre-state their worst-case usage, which means sizing the pool to
reliably avoid deadlocks is impossible. The second is that there
aren't any guarantees that GFP_ATOMIC allocations are actually
critical in the needed-to-make-forward-VM-progress sense or will be
returned to the pool in a timely fashion.
So I do think we need a distinct pool if we want to tackle this
problem. Though it's probably worth mentioning that Linus was rather
adamantly against even trying at KS.
--
Mathematics is the supreme nostalgia of our time.
From: Matt Mackall <[email protected]>
Date: Wed, 14 Dec 2005 21:02:50 -0800
> There needs to be two rules:
>
> iff global memory critical flag is set
> - allocate from the global critical receive pool on receive
> - return packet to global pool if not destined for a socket with an
> attached send mempool
This shuts off a router and/or firewall just because iSCSI or NFS peed
in its pants. Not really acceptable.
> I think this will provide the desired behavior
It's not desirable.
What if iSCSI is protected by IPSEC, and the key management daemon has
to process a security association expiration and negotiate a new one
in order for iSCSI to further communicate with its peer when this
memory shortage occurs? It needs to send packets back and forth with
the remote key management daemon in order to do this, but since you
cut it off with this critical receive pool, the negotiation will never
succeed.
This stuff won't work. It's not a generic solution and that's
why it has more holes than swiss cheese. :-)
On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for? What should be done for
In theory one could use multiple receive queues on an intelligent enough
NIC, with the NIC distinguishing the sockets.
But that would still be a nasty "you need advanced hardware FOO to avoid
subtle problem Y" case. Also it would require lots of driver hacking.
And most NICs seem to have limits on the size of the socket tables for this, which
means you would end up in a "only N sockets supported safely" situation,
with N likely being quite small on common hardware.
I think the idea of the original poster was that just freeing non critical packets
after a short time again would be good enough, but I'm a bit sceptical
on that.
> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully. I
I agree.
> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for. There is nothing preventing the implementation of dynamic
Their main problem is that they are used too widely and in a lot
of situations that aren't really critical.
-Andi
David S. Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 21:02:50 -0800
>
>
>>There needs to be two rules:
>>
>>iff global memory critical flag is set
>>- allocate from the global critical receive pool on receive
>>- return packet to global pool if not destined for a socket with an
>> attached send mempool
>
>
> This shuts off a router and/or firewall just because iSCSI or NFS peed
> in its pants. Not really acceptable.
>
But that should only happen (shut off a router and/or firewall) in cases
where we now completely deadlock and never recover, including shutting off
the router and firewall, because they don't have enough memory to recv
packets either.
>
>>I think this will provide the desired behavior
>
>
> It's not desirable.
>
> What if iSCSI is protected by IPSEC, and the key management daemon has
> to process a security association expiration and negotiate a new one
> in order for iSCSI to further communicate with its peer when this
> memory shortage occurs? It needs to send packets back and forth with
> the remote key management daemon in order to do this, but since you
> cut it off with this critical receive pool, the negotiation will never
> succeed.
>
I guess IPSEC would be a critical socket too, in that case. Sure
there is nothing we can do if the daemon insists on allocating lots
of memory...
> This stuff won't work. It's not a generic solution and that's
> why it has more holes than swiss cheese. :-)
True it will have holes. I think something that is complementary and
would be desirable is to simply limit the amount of in-flight writeout
that things like NFS allows (or used to allow, haven't checked for a
while and there were noises about it getting better).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Wed, 14 Dec 2005 21:23:09 -0800 (PST)
"David S. Miller" <[email protected]> wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 21:02:50 -0800
>
> > There needs to be two rules:
> >
> > iff global memory critical flag is set
> > - allocate from the global critical receive pool on receive
> > - return packet to global pool if not destined for a socket with an
> > attached send mempool
>
> This shuts off a router and/or firewall just because iSCSI or NFS peed
> in it's pants. Not really acceptable.
>
> > I think this will provide the desired behavior
>
> It's not desirable.
>
> What if iSCSI is protected by IPSEC, and the key management daemon has
> to process a security assosciation expiration and negotiate a new one
> in order for iSCSI to further communicate with it's peer when this
> memory shortage occurs? It needs to send packets back and forth with
> the remove key management daemon in order to do this, but since you
> cut it off with this critical receive pool, the negotiation will never
> succeed.
>
> This stuff won't work. It's not a generic solution and that's
> why it has more holes than swiss cheese. :-)
Also, all this stuff is just a band aid because linux OOM behavior is so
fucked up. The VM system just lets the user dig themselves into a huge
over commit, then we get into trying to change every other system to
compensate. How about cutting things off earlier, and not falling
off the cliff? How about pushing out pages to swap earlier when memory
pressure starts to get noticed. Then you can free those non-dirty pages
to make progress. Too many of the VM decisions seem to be made in favor
of keep-it-in-memory benchmark situations.
On Wed, Dec 14, 2005 at 09:23:09PM -0800, David S. Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 21:02:50 -0800
>
> > There needs to be two rules:
> >
> > iff global memory critical flag is set
> > - allocate from the global critical receive pool on receive
> > - return packet to global pool if not destined for a socket with an
> > attached send mempool
>
> This shuts off a router and/or firewall just because iSCSI or NFS peed
> in it's pants. Not really acceptable.
That'll happen now anyway.
> > I think this will provide the desired behavior
>
> It's not desirable.
>
> What if iSCSI is protected by IPSEC, and the key management daemon has
> to process a security assosciation expiration and negotiate a new one
> in order for iSCSI to further communicate with it's peer when this
> memory shortage occurs? It needs to send packets back and forth with
> the remove key management daemon in order to do this, but since you
> cut it off with this critical receive pool, the negotiation will never
> succeed.
Ok, encapsulation completely ruins the idea.
--
Mathematics is the supreme nostalgia of our time.
On Thu, 15 Dec 2005 06:42:45 +0100
Andi Kleen <[email protected]> wrote:
> On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> > From: Matt Mackall <[email protected]>
> > Date: Wed, 14 Dec 2005 19:39:37 -0800
> >
> > > I think we need a global receive pool and per-socket send pools.
> >
> > Mind telling everyone how you plan to make use of the global receive
> > pool when the allocation happens in the device driver and we have no
> > idea which socket the packet is destined for? What should be done for
>
> In theory one could use multiple receive queues on an intelligent enough
> NIC, with the NIC distinguishing the sockets.
>
> But that would still be a nasty "you need advanced hardware FOO to avoid
> subtle problem Y" case. Also it would require lots of driver hacking.
>
> And most NICs seem to have limits on the size of the socket tables for this, which
> means you would end up in a "only N sockets supported safely" situation,
> with N likely being quite small on common hardware.
>
> I think the idea of the original poster was that just freeing non critical packets
> after a short time again would be good enough, but I'm a bit sceptical
> on that.
>
> > I truly dislike these patches being discussed because they are a
> > complete hack, and admittedly don't even solve the problem fully. I
>
> I agree.
>
> > I think GFP_ATOMIC memory pools are more powerful than they are given
> > credit for. There is nothing preventing the implementation of dynamic
>
> Their main problem is that they are used too widely and in a lot
> of situations that aren't really critical.
Most of the use of GFP_ATOMIC is by stuff that could fail but can't
sleep waiting for memory. How about adding a GFP_NORMAL for allocations
while holding a lock?
#define GFP_NORMAL (__GFP_NOMEMALLOC)
Then get people to change the unneeded GFP_ATOMICs to GFP_NORMAL in
places where the error paths are reasonable.
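The conversion itself would be mechanical in callers that already cope with
failure; a sketch (GFP_NORMAL is only the define proposed above, kmalloc()
and __GFP_NOMEMALLOC are real):

	/* before: cannot sleep, so it also competes for the emergency reserves */
	item = kmalloc(sizeof(*item), GFP_ATOMIC);

	/* after: still cannot sleep, but stays out of the reserves; it may
	 * fail a little earlier, and the caller already handles failure */
	item = kmalloc(sizeof(*item), GFP_NORMAL);
	if (!item)
		goto out_unlock;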
On Wed, 14 Dec 2005, David S. Miller wrote:
> From: Matt Mackall <[email protected]>
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for? What should be done for
> non-local packets being routed? The device drivers allocate packets
> for the entire system, long before we know who the eventually received
> packets are for. It is fully anonymous memory, and it's easy to
> design cases where the whole pool can be eaten up by non-local
> forwarded packets.
>
> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully. I
> don't have any concrete better ideas but that doesn't mean this stuff
> should go into the tree.
>
> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for. There is nothing preventing the implementation of dynamic
> GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in"
> in response to hitting those water marks.
Does this mean that you are OK with having a mechanism to mark the
sockets as critical and dropping the non-critical packets under
emergency, but you do not like having a separate critical page pool?
Instead, you seem to be suggesting in_emergency to be set dynamically
when we are about to run out of ATOMIC memory. Is this right?
Thanks
Sridhar
From: Sridhar Samudrala <[email protected]>
Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
> Instead, you seem to be suggesting in_emergency to be set dynamically
> when we are about to run out of ATOMIC memory. Is this right?
Not when we run out, but rather when we reach some low water mark, the
"critical sockets" would still use GFP_ATOMIC memory but only
"critical sockets" would be allowed to do so.
But even this has faults, consider the IPSEC scenario I mentioned, and
this applies to any kind of encapsulation actually, even simple
tunneling examples can be concocted which make the "critical socket"
idea fail.
The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
tunneling allocations critical, and... and..." well you have
GFP_ATOMIC then my friend.
In short, these "separate page pool" and "critical socket" ideas do
not work and we need a different solution, I'm sorry folks spent so
much time on them, but they are heavily flawed.
On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> From: Sridhar Samudrala <[email protected]>
> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
>
> > Instead, you seem to be suggesting in_emergency to be set dynamically
> > when we are about to run out of ATOMIC memory. Is this right?
>
> Not when we run out, but rather when we reach some low water mark, the
> "critical sockets" would still use GFP_ATOMIC memory but only
> "critical sockets" would be allowed to do so.
>
> But even this has faults, consider the IPSEC scenario I mentioned, and
> this applies to any kind of encapsulation actually, even simple
> tunneling examples can be concocted which make the "critical socket"
> idea fail.
>
> The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
> tunneling allocations critical, and... and..." well you have
> GFP_ATOMIC then my friend.
>
> In short, these "separate page pool" and "critical socket" ideas do
> not work and we need a different solution, I'm sorry folks spent so
> much time on them, but they are heavily flawed.
maybe it should be approached from the other side; having a way to mark
connections as low priority (say incoming http connections to your
webserver) or as non-critical/expendable would give the "normal"
GFP_ATOMIC ones a better chance in case of overload/DDOS etc. It's not
going to solve the VM deadlock issue wrt iscsi/nfs; however it might be
useful in the "survive slashdot" sense...
> Also, all this stuff is just a band aid because linux OOM behavior is so
> fucked up.
In our internal discussions, characterizing this as "OOM" came
up a lot, and I don't think of it as that at all. OOM is exactly what the
scheme is trying to avoid!
The actual situation we have in mind is a swap device management system
in a cluster where a remote system tells you (via socket communication to
a user-land management app) that a swap device is going to fail over and
it'd be a good idea not to do anything that requires paging out or
swapping for a short period of time. The socket communication must work,
but the system is not at all out of memory, and the important point is
that it never will be if you limit allocations to those things that are
required for the critical socket to work (and nothing/little else).
Receiver side allocations are unavoidable, because you don't know
if you can drop the packet or not until you look at it. Some
infrastructure must work. But everything else can fail or succeed based on
ordinary churn in ordinary memory pools, until the "in_emergency" condition
has passed.
The critical socket(s) simply have to be out of the zero-sum game
for the rest of the allocations, because those are the (only) path to
getting a working swap device again.
If you're out of memory without a network mechanism to get you more,
this doesn't do anything for you (and it isn't intended to). And if you
mark any socket that isn't going to get you failed over or otherwise
get you more swap, it isn't going to help you, either. It isn't a priority
scheme for low-memory, it's a failover mechanism that relies on
networking.
There are exactly 2 priorities: critical (as in "you might as well crash
if these aren't satisfied") and everything else.
Doing other, more general things that handle low memory, or OOM, or
identified priorities is great, but the problem we're interested in solving
here is really just about making socket communication work when the
alternative is a completely dead system. I think these patches do that in a
reasonable way.
A better solution would be great, too, if there is one. :-)
+-DLS
On Dec 15, 2005, at 03:21, David S. Miller wrote:
> Not when we run out, but rather when we reach some low water mark,
> the "critical sockets" would still use GFP_ATOMIC memory but only
> "critical sockets" would be allowed to do so.
>
> But even this has faults, consider the IPSEC scenario I mentioned,
> and this applies to any kind of encapsulation actually, even simple
> tunneling examples can be concocted which make the "critical
> socket" idea fail.
>
> The knee jerk reaction is "mark IPSEC's sockets critical, and mark
> the tunneling allocations critical, and... and..." well you have
> GFP_ATOMIC then my friend.
>
> In short, these "separate page pool" and "critical socket" ideas do
> not work and we need a different solution, I'm sorry folks spent so
> much time on them, but they are heavily flawed.
What we really need in the kernel is a more fine-grained memory
priority system with PI, similar in concept to what's being done to
the scheduler in some of the RT patchsets. Currently we have a very
black-and-white memory subsystem; when we go OOM, we just start
killing processes until we are no longer OOM. Perhaps we should have
some way to pass memory allocation priorities throughout the kernel,
including a "this request has X priority", "this request will help
free up X pages of RAM", and "drop while dirty under certain OOM to
free X memory using this method".
The initial benefit would be that OOM handling would become more
reliable and less of a special case. When we start to run low on
free pages, it might be OK to kill the SETI@home process long before
we OOM if such action might prevent the OOM. Likewise, you might be
able to flag certain file pages as being "less critical", such that
the kernel can kill a process and drop its dirty pages for files in
/tmp. Or the kernel might do a variety of other things just by
failing new allocations with low priority and forcing existing
allocations with low priority to go away using preregistered handlers.
When processes request memory through any subsystem, their memory
priority would be passed through the kernel layers to the allocator,
along with any associated information about how to free the memory in
a low-memory condition. As a result, I could configure my database
to have a much higher priority than SETI@home (or boinc or whatever),
so that when the database server wants to fill memory with clean DB
cache pages, the kernel will kill SETI@home for its memory, even if
we could just leave some DB cache pages unfaulted.
Questions? Comments? "This is a terrible idea that should never have
seen the light of day"? Both constructive and destructive criticism
welcomed! (Just please keep the language clean! :-D)
Cheers,
Kyle Moffett
--
Q: Why do programmers confuse Halloween and Christmas?
A: Because OCT 31 == DEC 25.
From: David Stevens <[email protected]>
Date: Thu, 15 Dec 2005 00:44:52 -0800
> In our internal discussions
I really wish this hadn't been discussed internally before being
implemented. Any such internal discussions are lost completely upon
the community that ends up reviewing such a core and invasive patch
such as this one.
> The critical socket(s) simply have to be out of the zero-sum game
> for the rest of the allocations, because those are the (only) path to
> getting a working swap device again.
The core fault of the critical socket idea is that it is painfully
simple to create a tree of dependant allocations that makes the
critical pool useless. IPSEC and tunnels are simple examples.
The idea to mark, for example, IPSEC key management daemon's sockets
as critical is flawed, because the key management daemon could hit a
swap page over the iSCSI device. Don't even start with the idea to
lock the IPSEC key management daemon into ram with mlock().
Tunnels are similar, and realistic nesting cases can be shown that
makes sizing via a special pool simply unfeasible, and whats more
there are no sockets involved.
Sockets do not exist in an allocation vacuum, they need to talk over
routes, and there are therefore many types of auxiliary data
associated with sending a packet besides the packet itself. All you
need is a routing change of some type and you're going to start
burning GFP_ATOMIC allocations on the next packet send.
I think making GFP_ATOMIC better would be wise. Alan's ideas harping
from the old 2.0.x/2.2.x NFS days could use some consideration as well.
> When processes request memory through any subsystem, their memory
> priority would be passed through the kernel layers to the allocator,
> along with any associated information about how to free the memory in
> a low-memory condition. As a result, I could configure my database
> to have a much higher priority than SETI@home (or boinc or whatever),
> so that when the database server wants to fill memory with clean DB
> cache pages, the kernel will kill SETI@home for its memory, even if
> we could just leave some DB cache pages unfaulted.
Iirc most of the freeing happens in process context anyways,
so process priority information is already available. At least
for CPU cost it might even be taken into account during schedules
(Freeing can take up quite a lot of CPU time)
The problem with GFP_ATOMIC is though that someone else needs
to free the memory in advance for you because you cannot
do it yourself.
(you could call it a kind of "parasite" in the normally
very cooperative society of memory allocators ...)
That would mess up your scheme too. The priority
cannot be expressed because it's more a case of
"somewhen someone in the future might need it"
>
> Questions? Comments? "This is a terrible idea that should never have
> seen the light of day"? Both constructive and destructive criticism
> welcomed! (Just please keep the language clean! :-D)
This won't help for this problem here - even with perfect
priorities you could still get into situations where you
can't make any progress if progress needs more memory.
Only preallocating or prereservation can help you out of
that trap.
-Andi
"David S. Miller" <[email protected]> wrote on 12/15/2005 12:58:05 AM:
> From: David Stevens <[email protected]>
> Date: Thu, 15 Dec 2005 00:44:52 -0800
>
> > In our internal discussions
>
> I really wish this hadn't been discussed internally before being
> implemented. Any such internal discussions are lost completely upon
> the community that ends up reviewing such a core and invasive patch
> such as this one.
I think those were more informal and less extensive than the
impression I gave you. I mean simply bouncing around incomplete
ideas and discussing some of the potential issues before coming
up with a prototype solution, which is intended to be the starting
point for community discussions (and the KS discussions, too). "OOM"
came up immediately (even when naming the problem), and it isn't how
I ever saw it.
The patches, of course, are intended to NOT be invasive, or any
more than they need to be, and they are not "the" solution, but
"a" solution. A completely different one that solves the problem
is just as good to me.
+-DLS
Mitchell Blank Jr wrote:
> James Courtier-Dutton wrote:
>
>>When I had the conversation with Matt at KS, the problem we were trying
>>to solve was "Memory pressure with network attached swap space".
>
>
> s/swap space/writable filesystems/
>
> You can hit these problems even if you have no swap. Too much of the
> memory becomes filled with dirty pages needing writeback -- then you lose
> your NFS server's ARP entry at the wrong moment. If you have a local disk
> to swap to the machine will recover after a little bit of grinding, otherwise
> it's all pretty much over.
>
> The big problem is that as long as there's network I/O coming in it's
> likely that pages you free (as the VM gets more and more desperate about
> dropping the few remaining non-dirty pages) will get used for sockets
> that AREN'T helping you recover RAM. You really need to be able to tell
> the whole network stack "we're in really rough shape here; ignore all RX
> work unless it's going to help me get write ACKs back from my {NFS,iSCSI}
> server" My understanding is that is what this patchset is trying to
> accomplish.
>
> -Mitch
>
>
You are using the wrong hammer to crack your nut.
You should instead approach your problem of why the ARP entry gets lost.
For example, you could give a critical priority to your TCP session,
but that still won't cure your ARP problem.
I would suggest that the best way to cure your ARP problem is to
increase the time between ARP cache refreshes.
James
>
> You are using the wrong hammer to crack your nut.
> You should instead approach your problem of why the ARP entry gets lost.
> For example, you could give a critical priority to your TCP session,
> but that still won't cure your ARP problem.
> I would suggest that the best way to cure your ARP problem is to
> increase the time between ARP cache refreshes.
or turn it around entirely: all traffic is considered important
unless... and have a bunch of non-critical sockets (like http requests)
be marked non-critical.
On Thursday 15 December 2005 19:55, Kyle Moffett wrote:
> On Dec 15, 2005, at 03:21, David S. Miller wrote:
> > Not when we run out, but rather when we reach some low water mark,
> > the "critical sockets" would still use GFP_ATOMIC memory but only
> > "critical sockets" would be allowed to do so.
> >
> > But even this has faults, consider the IPSEC scenario I mentioned,
> > and this applies to any kind of encapsulation actually, even simple
> > tunneling examples can be concocted which make the "critical
> > socket" idea fail.
> >
> > The knee jerk reaction is "mark IPSEC's sockets critical, and mark
> > the tunneling allocations critical, and... and..." well you have
> > GFP_ATOMIC then my friend.
> >
> > In short, these "separate page pool" and "critical socket" ideas do
> > not work and we need a different solution, I'm sorry folks spent so
> > much time on them, but they are heavily flawed.
>
> What we really need in the kernel is a more fine-grained memory
> priority system with PI, similar in concept to what's being done to
> the scheduler in some of the RT patchsets. Currently we have a very
> black-and-white memory subsystem; when we go OOM, we just start
> killing processes until we are no longer OOM. Perhaps we should have
> some way to pass memory allocation priorities throughout the kernel,
> including a "this request has X priority", "this request will help
> free up X pages of RAM", and "drop while dirty under certain OOM to
> free X memory using this method".
>
> The initial benefit would be that OOM handling would become more
> reliable and less of a special case. When we start to run low on
> free pages, it might be OK to kill the SETI@home process long before
> we OOM if such action might prevent the OOM. Likewise, you might be
> able to flag certain file pages as being "less critical", such that
> the kernel can kill a process and drop its dirty pages for files in
> /tmp. Or the kernel might do a variety of other things just by
> failing new allocations with low priority and forcing existing
> allocations with low priority to go away using preregistered handlers.
>
> When processes request memory through any subsystem, their memory
> priority would be passed through the kernel layers to the allocator,
> along with any associated information about how to free the memory in
> a low-memory condition. As a result, I could configure my database
> to have a much higher priority than SETI@home (or boinc or whatever),
> so that when the database server wants to fill memory with clean DB
> cache pages, the kernel will kill SETI@home for it's memory, even if
> we could just leave some DB cache pages unfaulted.
>
> Questions? Comments? "This is a terrible idea that should never have
> seen the light of day"? Both constructive and destructive criticism
> welcomed! (Just please keep the language clean! :-D)
I have some basic process-that-called-the-memory-allocator linkage in the -ck
tree already which alters how aggressively memory is reclaimed according to
priority. It does not affect out-of-memory management, but that could be added
to said algorithm; however I don't see much point at the moment since oom is
still an uncommon condition while regular memory allocation is routine.
Cheers,
Con
On Dec 15, 2005, at 04:04, Andi Kleen wrote:
>> When processes request memory through any subsystem, their memory
>> priority would be passed through the kernel layers to the
>> allocator, along with any associated information about how to free
>> the memory in a low-memory condition. As a result, I could
>> configure my database to have a much higher priority than
>> SETI@home (or boinc or whatever), so that when the database server
>> wants to fill memory with clean DB cache pages, the kernel will
>> kill SETI@home for it's memory, even if we could just leave some
>> DB cache pages unfaulted.
>
> Iirc most of the freeing happens in process context anyways, so
> process priority information is already available. At least for CPU
> cost it might even be taken into account during schedules (Freeing
> can take up quite a lot of CPU time)
>
> The problem with GFP_ATOMIC is though that someone else needs to
> free the memory in advance for you because you cannot do it yourself.
>
> (you could call it a kind of "parasite" in the normally very
> cooperative society of memory allocators ...)
>
> That would mess up your scheme too. The priority cannot be
> expressed because it's more a case of
> "somewhen someone in the future might need it"
Well, that's currently expressed as a reserved pool with watermarks,
so with a PI system you would have a single pool with some collection
of reservation watermarks with various priorities. I'm not sure what
the best data-structure would be, probably some sort of ordered
priority tree. When allocating or freeing memory, the code would
check the watermark data (which has some summary statistics so you
don't need to check the whole tree each time); if any of the
watermarks are too low with relative priority taken into account, you
fail the allocation or move pages into the pool.
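A toy sketch of such a per-priority watermark check, matching the vaporware
status of the idea (none of these names exist in the kernel):

struct prio_watermark {
	int		prio;		/* priority this reservation protects */
	unsigned long	min_pages;	/* pages that must stay free for it */
};

/* Refuse a request if granting it would dip into a reserve held for a
 * strictly higher priority. */
static int alloc_allowed(const struct prio_watermark *wm, int n,
			 unsigned long free_pages, int req_prio)
{
	int i;

	for (i = 0; i < n; i++) {
		if (wm[i].prio > req_prio && free_pages <= wm[i].min_pages)
			return 0;
	}
	return 1;
}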
>> Questions? Comments? "This is a terrible idea that should never
>> have seen the light of day"? Both constructive and destructive
>> criticism welcomed! (Just please keep the language clean! :-D)
>
> This won't help for this problem here - even with perfect
> priorities you could still get into situations where you can't make
> any progress if progress needs more memory.
Well the point would be that the priorities could force a more-
extreme and selective OOM (maybe even dropping dirty pages for
noncritical filesystems if necessary!), or handle the situation
described with the IPSec daemon and IPSec network traffic (IPSec
would inherit the increased memory priority, and when it tries to do
networking, its send path and the global receive path would inherit
that increased priority as well).
Naturally this is all still in the vaporware stage, but I think that
if implemented the concept might at least improve the OOM/low-memory
situation considerably. Starting to fail allocations for the cluster
programs (including their kernel allocations) well before failing
them for the swap-fallback tool would help the original poster, and I
imagine various tweaked priorities would make true OOM-deadlock far
less likely.
Cheers,
Kyle Moffett
--
When you go into court you either want a very, very, very bright line
or you want the stomach to outlast the other guy in trench warfare.
If both sides are reasonable, you try to stay _out_ of court in the
first place.
-- Rob Landley
On Dec 15, 2005, at 07:45, Con Kolivas wrote:
> I have some basic process-that-called the memory allocator link in
> the -ck tree already which alters how aggressively memory is
> reclaimed according to priority. It does not affect out of memory
> management but that could be added to said algorithm; however I
> don't see much point at the moment since oom is still an uncommon
> condition but regular memory allocation is routine.
My thought would be to generalize the two special cases of writeback
of dirty pages or dropping of clean pages under memory pressure and
OOM to be the same general case. When you are trying to free up
pages, it may be permissible to drop dirty mbox pages and kill the
postfix process writing them in order to satisfy allocations for the
mission-critical database server. (Or maybe it's the other way
around). If a large chunk of the allocated pages have priorities and
lossless/lossy free functions, then the kernel can be much more
flexible and configurable about what to do when running low on RAM.
Cheers,
Kyle Moffett
--
I lost interest in "blade servers" when I found they didn't throw
knives at people who weren't supposed to be in your machine room.
-- Anthony de Boer
On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
> >
> > You are using the wrong hammer to crack your nut.
> > You should instead approach your problem of why the ARP entry gets lost.
> > For example, you could give as critical priority to your TCP session,
> > but that still won't cure your ARP problem.
> > I would suggest that the best way to cure your arp problem, is to
> > increase the time between arp cache refreshes.
>
> or turn it around entirely: all traffic is considered important
> unless... and have a bunch of non-critical sockets (like http requests)
> be marked non-critical.
The big hole punched by DaveM is that of dependencies: a http tcp
connection is tied to ICMP or the IPSEC example given; so you need a lot
more intelligence than just what your app is knowledgeable about at its
level.
You can't really do this shit at the socket level. You need to do it much
earlier.
At runtime, when lower memory thresholds get crossed, you kick in
classification of what packets need to be dropped using something along
the lines of stateful/connection tracking. When things get better you
undo it.
cheers,
jamal
On Thursday 15 December 2005 23:58, Kyle Moffett wrote:
> On Dec 15, 2005, at 07:45, Con Kolivas wrote:
> > I have some basic process-that-called the memory allocator link in
> > the -ck tree already which alters how aggressively memory is
> > reclaimed according to priority. It does not affect out of memory
> > management but that could be added to said algorithm; however I
> > don't see much point at the moment since oom is still an uncommon
> > condition but regular memory allocation is routine.
>
> My thought would be to generalize the two special cases of writeback
> of dirty pages or dropping of clean pages under memory pressure and
> OOM to be the same general case. When you are trying to free up
> pages, it may be permissible to drop dirty mbox pages and kill the
> postfix process writing them in order to satisfy allocations for the
> mission-critical database server. (Or maybe it's the other way
> around). If a large chunk of the allocated pages have priorities and
> lossless/lossy free functions, then the kernel can be much more
> flexible and configurable about what to do when running low on RAM.
Indeed the implementation I currently have is lightweight to say the least but
I really didn't think bloating struct page was worth it since the memory cost
would be prohibitive, but would allow all sorts of priority effects and vm
scheduling to be possible. That is, struct page could have an extra entry
keeping track of the highest priority of the process that used it and use
that to determine further eviction etc.
Cheers,
Con
On Thu, 2005-12-15 at 08:00 -0500, jamal wrote:
> On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
> > >
> > > You are using the wrong hammer to crack your nut.
> > > You should instead approach your problem of why the ARP entry gets lost.
> > > For example, you could give as critical priority to your TCP session,
> > > but that still won't cure your ARP problem.
> > > I would suggest that the best way to cure your arp problem, is to
> > > increase the time between arp cache refreshes.
> >
> > or turn it around entirely: all traffic is considered important
> > unless... and have a bunch of non-critical sockets (like http requests)
> > be marked non-critical.
>
> The big hole punched by DaveM is that of dependencies: a http tcp
> connection is tied to ICMP or the IPSEC example given; so you need a lot
> more intelligence than just what your app is knowledgeable about at its
> level.
yeah, well, sort of. You're right of course, but that also doesn't mean
you can't give hints from the other side, like "data for this socket is
NOT critically important". It gets tricky if you only do it for OOM
stuff, because then that one ACK packet could cause a LOT of memory to
be freed, and as such can be important for the system even if the socket
isn't.
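As a sketch of how small such a hint could be: SO_NONCRITICAL below is
hypothetical -- no such option exists -- and simply mirrors the proposed
critical-socket marking in the opposite direction.

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_NONCRITICAL
#define SO_NONCRITICAL 100	/* made-up option number, for illustration */
#endif

static int mark_noncritical(int fd)
{
	int one = 1;

	/* Tell the stack this socket's data may be shed first under pressure. */
	if (setsockopt(fd, SOL_SOCKET, SO_NONCRITICAL, &one, sizeof(one)) < 0) {
		perror("setsockopt(SO_NONCRITICAL)");
		return -1;
	}
	return 0;
}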
> Naturally this is all still in the vaporware stage, but I think that
> if implemented the concept might at least improve the OOM/low-memory
> situation considerably. Starting to fail allocations for the cluster
> programs (including their kernel allocations) well before failing
> them for the swap-fallback tool would help the original poster, and I
> imagine various tweaked priorities would make true OOM-deadlock far
> less likely.
The problem is that deadlocks can happen even without anybody
running out of virtual memory. The deadlocks GFP_CRITICAL
was supposed to handle are deadlocks while swapping out data,
because swapping on some devices needs more memory itself.
This happens long before anything runs into a true OOM;
it's just that the memory-cleaning stage cannot make progress
anymore.
Your proposal isn't addressing this problem at all, I think.
Handling true OOM is quite a different issue.
-Andi
On Thu, 2005-15-12 at 14:07 +0100, Arjan van de Ven wrote:
> On Thu, 2005-12-15 at 08:00 -0500, jamal wrote:
> > The big hole punched by DaveM is that of dependencies: a http tcp
> > connection is tied to ICMP or the IPSEC example given; so you need a lot
> > more intelligence than just what your app is knowledgeable about at its
> > level.
>
> yeah well sort of. You're right of course, but that also doesn't mean
> you can't give hints from the other side. Like "data for this socket is
> NOT critically important". It gets tricky if you only do it for OOM stuff;
> because then that one ACK packet could cause a LOT of memory to be
> freed, and as such can be important for the system even if the socket
> isn't.
>
true - but that's _just one input_ into a complex policy decision
process. The other is clearly the VM realizing some type of threshold
has been crossed. The output is a policy decision of what to drop -
which gets very interesting if one looks at it being as fine-grained as
"drop ACKs".
The fallacy in the proposed solution is that it simplistically ties the
decision to VM input and the network-level input to sockets, as in the
example of sockets doing http requests.
Methinks what is needed is something which keeps state, takes input
from the sockets and the VM, and then runs some algorithm to decide the
final policy that gets installed low in the kernel (tc classifier level
or hardware). Sockets provide hints that they are critical. The box
admin could override what is important.
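As a toy illustration of that decision step (every name below is
invented; the real work is in gathering the state and installing the
resulting filters):

#include <stdbool.h>

enum admin_class { ADMIN_FOLLOW_HINTS, ADMIN_FORCE_KEEP, ADMIN_FORCE_DROP };
enum verdict { KEEP, DROP };

struct policy_inputs {
	bool vm_under_pressure;		/* input from the VM side     */
	bool socket_says_critical;	/* hint from the socket owner */
	enum admin_class admin;		/* the box admin's override   */
};

static enum verdict decide(const struct policy_inputs *in)
{
	if (in->admin == ADMIN_FORCE_KEEP)
		return KEEP;
	if (in->admin == ADMIN_FORCE_DROP)
		return DROP;
	if (!in->vm_under_pressure)
		return KEEP;		/* no emergency, no policy */
	return in->socket_says_critical ? KEEP : DROP;
}

/* The resulting verdicts would be installed low in the stack (tc
 * classifier rules or hardware filters), not evaluated per allocation. */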
cheers,
jamal
On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> From: Sridhar Samudrala <[email protected]>
> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
>
> > Instead, you seem to be suggesting in_emergency to be set dynamically
> > when we are about to run out of ATOMIC memory. Is this right?
>
> Not when we run out, but rather when we reach some low water mark, the
> "critical sockets" would still use GFP_ATOMIC memory but only
> "critical sockets" would be allowed to do so.
>
> > But even this has faults, consider the IPSEC scenario I mentioned, and
> this applies to any kind of encapsulation actually, even simple
> tunneling examples can be concocted which make the "critical socket"
> idea fail.
>
> The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
> tunneling allocations critical, and... and..." well you have
> GFP_ATOMIC then my friend.
I would like to mention another reason why we need a new GFP_CRITICAL
flag for an allocation request. When we are in an emergency, even the
GFP_KERNEL allocations for a critical socket should not sleep. This is
because the swap device may have failed, and we would like to
communicate this event to a management server over the critical socket
so that it can initiate the failover.
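Conceptually the adjustment is tiny. The sketch below is not a hunk
from the patch: sock_is_critical() is a made-up helper, while
system_in_emergency and __GFP_CRITICAL come from the critical page pool
patches.

#include <linux/gfp.h>
#include <net/sock.h>

static inline gfp_t critical_sk_gfp(struct sock *sk, gfp_t gfp)
{
	if (system_in_emergency && sock_is_critical(sk)) {
		gfp &= ~__GFP_WAIT;	/* must not sleep: swap may be gone  */
		gfp |= __GFP_CRITICAL;	/* may fall back to the reserve pool */
	}
	return gfp;
}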
We are not trying to solve the swapping-over-network problem. This is
much simpler: the critical sockets are to be used only to send/receive
a few critical messages reliably during a short period of emergency.
Thanks
Sridhar
David S. Miller <[email protected]> wrote:
> The idea to mark, for example, IPSEC key management daemon's sockets
> as critical is flawed, because the key management daemon could hit a
> swap page over the iSCSI device. Don't even start with the idea to
> lock the IPSEC key management daemon into ram with mlock().
How are you going to swap in the key manager if you need the key manager
for doing this?
However, I'd prefer a system where you can't dirty more than (e.g.) 80%
of RAM unless you need to in order to maintain vital system activity,
and not more than 95% unless it will help to get more clean RAM. (Like
the priority inheritance suggestion from this thread.) I suppose this
would at least significantly reduce thrashing and give a very good
chance of recovering from memory pressure. Of course the implementation
won't be easy, especially if userspace applications need to inherit
priority from different code paths, but in theory it can be done.
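In other words, something like the check below (purely illustrative:
the thresholds, the names and the idea of classifying why a page is
being dirtied are all hypothetical):

#include <stdbool.h>

enum dirty_reason { DIRTY_NORMAL, DIRTY_VITAL, DIRTY_HELPS_CLEANING };

/* May this writer dirty another page, given how much is dirty already?
 * Assumes total_pages > 0. */
static bool may_dirty(unsigned long dirty_pages, unsigned long total_pages,
		      enum dirty_reason why)
{
	unsigned long pct = dirty_pages * 100 / total_pages;

	if (pct < 80)
		return true;			/* anyone may dirty            */
	if (pct < 95)
		return why != DIRTY_NORMAL;	/* vital or cleaning work only */
	return why == DIRTY_HELPS_CLEANING;	/* last 5%: only work that     */
						/* yields more clean RAM       */
}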
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
On Thu, 15 Dec 2005 18:09:22 -0800
Sridhar Samudrala <[email protected]> wrote:
> On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> > From: Sridhar Samudrala <[email protected]>
> > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
> >
> > > Instead, you seem to be suggesting in_emergency to be set dynamically
> > > when we are about to run out of ATOMIC memory. Is this right?
> >
> > Not when we run out, but rather when we reach some low water mark, the
> > "critical sockets" would still use GFP_ATOMIC memory but only
> > "critical sockets" would be allowed to do so.
> >
> > > But even this has faults, consider the IPSEC scenario I mentioned, and
> > this applies to any kind of encapsulation actually, even simple
> > tunneling examples can be concocted which make the "critical socket"
> > idea fail.
> >
> > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
> > tunneling allocations critical, and... and..." well you have
> > GFP_ATOMIC then my friend.
>
> I would like to mention another reason why we need to have a new
> GFP_CRITICAL flag for an allocation request. When we are in emergency,
> even the GFP_KERNEL allocations for a critical socket should not
> sleep. This is because the swap device may have failed and we would
> like to communicate this event to a management server over the
> critical socket so that it can initiate the failover.
>
> We are not trying to solve swapping over network problem. It is much
> simpler. The critical sockets are to be used only to send/receive
> a few critical messages reliably during a short period of emergency.
>
If it is only one place, why not pre-allocate one "I'm sick now"
skb and hold onto it? Any bigger solution seems to snowball into
a huge mess.
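Something along these lines, say (illustrative only; the reply below
explains why a single pre-allocated buffer does not cover the whole
path):

#include <linux/skbuff.h>
#include <linux/errno.h>

static struct sk_buff *emergency_skb;

/* Reserve the buffer at setup time, long before any memory pressure. */
static int reserve_emergency_skb(unsigned int len)
{
	emergency_skb = alloc_skb(len, GFP_KERNEL);
	return emergency_skb ? 0 : -ENOMEM;
}

/* Hand it out when the "I'm sick now" message has to go out. */
static struct sk_buff *take_emergency_skb(void)
{
	struct sk_buff *skb = emergency_skb;

	emergency_skb = NULL;	/* single use; caller fills and sends it */
	return skb;
}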
--
Stephen Hemminger <[email protected]>
OSDL http://developer.osdl.org/~shemminger
On Fri, 2005-12-16 at 09:48 -0800, Stephen Hemminger wrote:
> On Thu, 15 Dec 2005 18:09:22 -0800
> Sridhar Samudrala <[email protected]> wrote:
>
> > On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> > > From: Sridhar Samudrala <[email protected]>
> > > Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
> > >
> > > > Instead, you seem to be suggesting in_emergency to be set dynamically
> > > > when we are about to run out of ATOMIC memory. Is this right?
> > >
> > > Not when we run out, but rather when we reach some low water mark, the
> > > "critical sockets" would still use GFP_ATOMIC memory but only
> > > "critical sockets" would be allowed to do so.
> > >
> > > But even this has faults, consider the IPSEC scenario I mentioned, and
> > > this applies to any kind of encapsulation actually, even simple
> > > tunneling examples can be concocted which make the "critical socket"
> > > idea fail.
> > >
> > > The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
> > > tunneling allocations critical, and... and..." well you have
> > > GFP_ATOMIC then my friend.
> >
> > I would like to mention another reason why we need to have a new
> > GFP_CRITICAL flag for an allocation request. When we are in emergency,
> > even the GFP_KERNEL allocations for a critical socket should not
> > sleep. This is because the swap device may have failed and we would
> > like to communicate this event to a management server over the
> > critical socket so that it can initiate the failover.
> >
> > We are not trying to solve swapping over network problem. It is much
> > simpler. The critical sockets are to be used only to send/receive
> > a few critical messages reliably during a short period of emergency.
> >
>
> If it is only one place, why not pre-allocate one "I'm sick now"
> skb and hold onto it. Any bigger solution seems to snowball into
> a huge mess.
But the problem is that even sending/receiving a single packet can cause
multiple dynamic allocations in the networking path, all the way from
the socket layer -> transport -> IP -> driver.
To successfully send a packet, we may have to do ARP, send ACKs, create
cached routes etc. So my patch tried to identify the allocations that
are needed to successfully send/receive packets over a pre-established
socket and adds the new GFP_CRITICAL flag to those calls.
This doesn't make any difference when we are not in an emergency. But
when we go into emergency, the VM will try to satisfy these allocations
from a critical pool if the normal path leads to failure.
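Conceptually the allocator-side fallback is just the following (only
__GFP_CRITICAL and system_in_emergency come from the patches under
discussion; normal_alloc() and take_from_critical_pool() are invented
stand-ins, not the actual critical page pool code):

void *alloc_with_fallback(gfp_t gfp)
{
	void *p = normal_alloc(gfp);	/* try the usual path first */

	/* Only critical requests may dip into the reserve, and only
	 * while the emergency flag is raised. */
	if (!p && (gfp & __GFP_CRITICAL) && system_in_emergency)
		p = take_from_critical_pool();
	return p;
}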
We go into emergency when some management app detects that a swap
device is about to fail (we are not yet in OOM, but will enter OOM
soon). In order to avoid entering OOM, we need to send a message over a
critical socket to a remote server that can initiate failover and
switch to a different swap device. The switchover will happen within 2
minutes after it is initiated.
In a cluster environment, the remote server also sends a message to the
other nodes running the management app so that they also enter
emergency. Once we successfully switch to a different swap device, the
remote server sends a message to all the nodes and they come out of
emergency.
During the period of emergency, all other communications can block. But
guaranteeing the successful delivery of the critical messages will help
make sure that we do not enter an OOM situation.
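On each node the management app's view of this is tiny; a sketch with
invented message names (the transport over the critical socket is
elided):

enum node_state { NODE_NORMAL, NODE_EMERGENCY };
enum mgmt_msg { MSG_ENTER_EMERGENCY, MSG_SWITCHOVER_DONE };

static enum node_state state = NODE_NORMAL;

/* Driven by messages arriving on the critical socket. */
static void handle_mgmt_message(enum mgmt_msg msg)
{
	switch (msg) {
	case MSG_ENTER_EMERGENCY:	/* a swap device is about to fail    */
		state = NODE_EMERGENCY;	/* critical traffic may use reserves */
		break;
	case MSG_SWITCHOVER_DONE:	/* the new swap device is in service */
		state = NODE_NORMAL;
		break;
	}
}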
Thanks
Sridhar
Hi!
> > If it is only one place, why not pre-allocate one "I'm sick now"
> > skb and hold onto it. Any bigger solution seems to snowball into
> > a huge mess.
>
> But the problem is even sending/receiving a single packet can cause
> multiple dynamic allocations in the networking path all the way from
> the sockets layer->transport->ip->driver.
> To successfully send a packet, we may have to do arp, send acks and
> create cached routes etc. So my patch tried to identify the allocations
> that are needed to successfully send/receive packets over a pre-established
> socket and adds a new flag GFP_CRITICAL to those calls.
> This doesn't make any difference when we are not in emergency. But when
> we go into emergency, VM will try to satisfy these allocations from a
> critical pool if the normal path leads to failure.
>
> We go into emergency when some management app detects that a swap device
> is about to fail(we are not yet in OOM, but will enter OOM soon). In order
> to avoid entering OOM, we need to send a message over a critical socket to
> a remote server that can initiate failover and switch to a different swap
> device. The switchover will happen within 2 minutes after it is initiated.
> In a cluster environment, the remote server also sends a message to other
> nodes which are also running the management app so that they also enter
> emergency. Once we successfully switch to a different swap device, the remote
> server sends a message to all the nodes and they come out of emergency.
>
> During the period of emergency, all other communications can block. But
> guaranteeing the successful delivery of the critical messages will help
> in making sure that we do not enter OOM situation.
Why not do it the other way around? "If you don't hear from me for 2
minutes, do a switchover." Then all you have to do is _not_ send a
packet -- easier to do.
Anything else seems overkill.
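As a sketch, the remote side of that dead-man's switch is nothing more
than the following (names invented; the 2-minute figure comes from the
scenario above, and the heartbeat transport is elided):

#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_TIMEOUT 120	/* seconds of silence before failover */

static time_t last_heartbeat;	/* set once at startup, then per packet */

/* Called whenever any packet arrives from the monitored node. */
void heartbeat_received(void)
{
	last_heartbeat = time(NULL);
}

/* Polled periodically; the sick node triggers failover simply by
 * staying silent, so it needs no memory at all when things go wrong. */
bool should_fail_over(void)
{
	return time(NULL) - last_heartbeat > HEARTBEAT_TIMEOUT;
}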
Pavel
--
Thanks, Sharp!
> Why not do it the other way? "If you don't hear from me for 2 minutes,
> do a switchover". Then all you have to do is _not_ to send a packet --
> easier to do.
>
> Anything else seems overkill.
> Pavel
Because in some of the scenarios, including ours, it isn't a
simple failover to a known alternate device or configuration --
it is reconfiguring dynamically with information received on a
socket from a remote machine (while the swap device is unavailable).
Limited socket communication without allocating new memory
that may not be available is the problem definition. Avoiding the
problem in the first place (your solution) is effective if you
can do it, of course. The trick is to solve the problem when you
can't avoid it. :-)
+-DLS