2008-02-11 16:24:47

by Jan-Bernd Themann

Subject: [PATCH] drivers/base: export gpl (un)register_memory_notifier

Drivers like eHEA need memory notifiers in order to
update their internal DMA memory map when memory is added
to or removed from the system.

Patch for eHEA memory hotplug support that uses these functions:
http://www.spinics.net/lists/netdev/msg54484.html

Signed-off-by: Jan-Bernd Themann <[email protected]>


---
Hi,

the eHEA patch belongs to a patchset that is usually
picked up by Jeff Garzik once this dependency (the exports)
is resolved.

Regards,
Jan-Bernd


drivers/base/memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 7ae413f..f5a0bf1 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -52,11 +52,13 @@ int register_memory_notifier(struct notifier_block *nb)
{
return blocking_notifier_chain_register(&memory_chain, nb);
}
+EXPORT_SYMBOL_GPL(register_memory_notifier);

void unregister_memory_notifier(struct notifier_block *nb)
{
blocking_notifier_chain_unregister(&memory_chain, nb);
}
+EXPORT_SYMBOL_GPL(unregister_memory_notifier);

/*
* register_memory - Setup a sysfs device for a memory block
--
1.5.2


2008-02-11 16:47:36

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Mon, 2008-02-11 at 17:24 +0100, Jan-Bernd Themann wrote:
> the eHEA patch belongs to a patchset that is usually
> added by Jeff Garzik once this dependency (EXPORTS)
> is resolved.

I know that's already in mainline but, man, that code is nasty. It has
stuff indented 7 levels or so and is virtually impossible to read.

This:

> int ehea_create_busmap(void)
> {
> u64 vaddr = EHEA_BUSMAP_START;
> unsigned long high_section_index = 0;
> int i;
>
> /*
> * Sections are not in ascending order -> Loop over all sections and
> * find the highest PFN to compute the required map size.
> */
> ehea_bmap.valid_sections = 0;
>
> for (i = 0; i < NR_MEM_SECTIONS; i++)
> if (valid_section_nr(i))
> high_section_index = i;

is also completely bogus for arch-independent code. If someone ever
wants to do ppc without sparsemem (or redoes internal sparsemem
interfaces), this code will break.

Also, keeping your own mapping of all of memory is really nasty. With
SPARSEMEM_EXTREME/VMEMMAP you can have extremely sparse physical memory,
and this:

> ehea_bmap.entries = high_section_index + 1;
> ehea_bmap.vaddr = vmalloc(ehea_bmap.entries * sizeof(*ehea_bmap.vaddr));

could theoretically consume all of memory if the sections are very sparse.

Could you please try to explain what the heck this driver is doing?
We'll try to fix up the generic interfaces so that you can deal with
things in a more generic fashion, but it can't stay this way.

Also, just ripping down and completely re-doing the entire mass of cards
every time a 16MB area of memory is added or removed seems like an
awfully big sledgehammer to me. I would *HATE* to see anybody else
using this driver as an example to work off of. Can't you just keep
track of which areas the driver is actually *USING* and only worry about
changing mappings if that intersects with an area having hotplug done on
it?

Can you please give it some CodingStyle love and make it so that it
doesn't indent so much? Something like this:

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index c051c7e..72c5652 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -2638,38 +2638,40 @@ static void ehea_rereg_mrs(struct work_struct *work)
down(&dlpar_mem_lock);
ehea_info("LPAR memory enlarged - re-initializing driver");

- list_for_each_entry(adapter, &adapter_list, list)
- if (adapter->active_ports) {
- /* Shutdown all ports */
- for (i = 0; i < EHEA_MAX_PORTS; i++) {
- struct ehea_port *port = adapter->port[i];
-
- if (port) {
- struct net_device *dev = port->netdev;
-
- if (dev->flags & IFF_UP) {
- down(&port->port_lock);
- netif_stop_queue(dev);
- ret = ehea_stop_qps(dev);
- if (ret) {
- up(&port->port_lock);
- goto out;
- }
- port_napi_disable(port);
- up(&port->port_lock);
- }
- }
- }
-
- /* Unregister old memory region */
- ret = ehea_rem_mr(&adapter->mr);
+ list_for_each_entry(adapter, &adapter_list, list) {
+ if (!adapter->active_ports)
+ continue;
+ /* Shutdown all ports */
+ for (i = 0; i < EHEA_MAX_PORTS; i++) {
+ struct ehea_port *port = adapter->port[i];
+ struct net_device *dev;
+ if (!port)
+ continue;
+
+ dev = port->netdev;
+ if (!(dev->flags & IFF_UP))
+ continue;
+
+ down(&port->port_lock);
+ netif_stop_queue(dev);
+ ret = ehea_stop_qps(dev);
if (ret) {
- ehea_error("unregister MR failed - driver"
- " inoperable!");
+ up(&port->port_lock);
goto out;
}
+ port_napi_disable(port);
+ up(&port->port_lock);
}

+ /* Unregister old memory region */
+ ret = ehea_rem_mr(&adapter->mr);
+ if (ret) {
+ ehea_error("unregister MR failed - driver"
+ " inoperable!");
+ goto out;
+ }
+ }
+
ehea_destroy_busmap();
ret = ehea_create_busmap();
if (ret) {


-- Dave

2008-02-11 16:50:38

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Mon, 2008-02-11 at 17:24 +0100, Jan-Bernd Themann wrote:
> Drivers like eHEA need memory notifiers in order to
> update their internal DMA memory map when memory is added
> to or removed from the system.
>
> Patch for eHEA memory hotplug support that uses these functions:
> http://www.spinics.net/lists/netdev/msg54484.html
>
> Signed-off-by: Jan-Bernd Themann <[email protected]>

How about you fix up the driver, first? Then, this can go in.

-- Dave

2008-02-12 18:04:28

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Mon, 2008-02-11 at 17:24 +0100, Jan-Bernd Themann wrote:
> Drivers like eHEA need memory notifiers in order to
> update their internal DMA memory map when memory is added
> to or removed from the system.
>
> Patch for eHEA memory hotplug support that uses these functions:
> http://www.spinics.net/lists/netdev/msg54484.html

This driver is broken pretty horribly. It won't even compile for a
plain ppc64 kernel:

http://sr71.net/~dave/linux/ehea-is-broken.config

I know it's used for very specific hardware, but this is the symptom of
it not using the proper abstracted interfaces to the VM.

In file included from /home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_main.c:42:
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:44:14: warning: "SECTION_SIZE_BITS" is not defined
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:45:2: error: #error eHEA module cannot work if kernel sectionsize < ehea sectionsize
CC drivers/net/mii.o
make[4]: *** [drivers/net/ehea/ehea_main.o] Error 1
make[4]: *** Waiting for unfinished jobs....
CC drivers/net/ixgb/ixgb_param.o
In file included from /home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:32:
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:44:14: warning: "SECTION_SIZE_BITS" is not defined
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.h:45:2: error: #error eHEA module cannot work if kernel sectionsize < ehea sectionsize
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c: In function 'ehea_create_busmap':
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: 'NR_MEM_SECTIONS' undeclared (first use in this function)
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: (Each undeclared identifier is reported only once
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:574: error: for each function it appears in.)
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:575: error: implicit declaration of function 'valid_section_nr'
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c: In function 'ehea_map_vaddr':
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:606: error: 'SECTION_SIZE_BITS' undeclared (first use in this function)
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c: In function 'ehea_reg_kernel_mr':
/home/dave/work/linux/2.6/23/linux-2.6.git/drivers/net/ehea/ehea_qmr.c:655: error: 'SECTION_SIZE_BITS' undeclared (first use in this function)

-- Dave

2008-02-13 15:18:25

by Jan-Bernd Themann

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

Hi Dave,

On Monday 11 February 2008 17:47, Dave Hansen wrote:
> Also, just ripping down and completely re-doing the entire mass of cards
> every time a 16MB area of memory is added or removed seems like an
> awfully big sledgehammer to me. I would *HATE* to see anybody else
> using this driver as an example to work off of? Can't you just keep
> track of which areas the driver is actually *USING* and only worry about
> changing mappings if that intersects with an area having hotplug done on
> it?


to form a base for the eHEA memory add / remove concept discussion:

Explanation of the current eHEA memory add / remove concept:

Constraints imposed by HW / FW:
- eHEA has own MMU
- eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
  addresses to absolute addresses (like DMA mapped memory on a PCI bus)
- The number of MRs is limited (not enough to have one MR per packet)
- Registration of MRs is comparatively slow, as it is done via a firmware
  call (H_CALL)
- MRs can have a maximum size of the memory available under Linux
- MRs cover a contiguous virtual memory block (no holes)

Because of this there is just one big MR that covers the entire kernel memory.
We also need a mapping table from kernel addresses to this
contiguous "virtual memory IO space" (here called ehea_bmap).

- When memory is added to or removed from the LPAR (and Linux), the MR has to
  be updated. This can only be done by destroying and recreating the MR; there
  is no H_CALL to modify the MR size. To find holes in the Linux kernel memory
  layout, we have to iterate over the memory sections when recreating the
  ehea_bmap (otherwise the MR would be bigger than the available memory,
  causing the registration to fail).

- DLPAR userspace tools, kernel, driver, firmware and HMC are involved in that
  process on System p

Memory add: version without an external memory notifier call
- New memory used in a transfer_xmit will result in an "ehea_bmap translation
  miss", which triggers a rebuild and re-registration of the ehea_bmap based
  on the current kernel memory setup (sketched below).
- Advantage: the number of MR rebuilds is reduced significantly compared to
  a rebuild for each 16MB chunk of memory added.
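
For illustration, the miss-triggered rebuild would look roughly like this
(ehea_map_vaddr() and ehea_rebuild_busmap() are stand-ins for the driver's
real helpers, and EHEA_INVALID_ADDR is a made-up error value):

/* Illustrative only: translate a buffer address for transmit, rebuilding
 * the busmap once on a miss, i.e. when newly added memory is used for the
 * first time. */
static u64 ehea_xmit_translate(void *kaddr)
{
	u64 busaddr = ehea_map_vaddr(kaddr);

	if (busaddr == EHEA_INVALID_ADDR) {
		/* translation miss: rebuild the map and re-register the MR
		 * against the current memory layout, then retry once */
		ehea_rebuild_busmap();
		busaddr = ehea_map_vaddr(kaddr);
	}
	return busaddr;
}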

Memory add: version with external notifier call:
- We still need an ehea_bmap (whatever structure it has)

Memory remove with notifier:
- We have to rebuild the ehea_bmap immediately to remove the pages that are
  no longer available. Without doing that, the firmware (pHYP) cannot remove
  that memory from the LPAR. As we don't know if or how many additional
  sections are to be removed before the DLPAR user space tool tells the
  firmware to remove the memory, we cannot delay the rebuild.


Our current understanding of the Memory Hotplug System is
(please correct me if I'm wrong):

- depends on sparse mem
- only whole memory sections are added / removed
- for each section a memory resource is registered


From the driver side we need:
- some kind of memory notification mechanism.
  For memory add we can live without any external memory notification
  event. For memory remove we do need an external trigger (see explanation
  above).
- a way to iterate over all kernel pages and a way to detect holes in the
  kernel memory layout in order to build up our own ehea_bmap.


Memory notification trigger:
- These triggers exist; an exported "register_memory_notifier" /
  "unregister_memory_notifier" would work in this scheme.

Functions to use while building ehea_bmap + MRs:
- Use either the functions that are used by the memory hotplug system as
  well, that means using the section defines + functions (section_nr_to_pfn,
  pfn_valid)
- Or use other, currently not exported functions in kernel/resource.c, like
  walk_memory_resource (where we would still need the maximum possible number
  of pages, NR_MEM_SECTIONS)
- Maybe some kind of new interface?

What would you suggest?

Regards,
Jan-Bernd & Christoph

2008-02-13 17:06:50

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
> Constraints imposed by HW / FW:
> - eHEA has own MMU
> - eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
> addresses to absolute addresses (like DMA mapped memory on a PCI bus)
> - The number of MRs is limited (not enough to have one MR per packet)

Are there enough to have one per 16MB section?

> Our current understanding about the current Memory Hotplug System are
> (please correct me if I'm wrong):
>
> - depends on sparse mem

You're wrong ;). In mainline, this is true. There was a version of the
SUSE kernel that did otherwise. But you cannot and should not depend
on this never changing. But, someone is perfectly free to go out and
implement something better than sparsemem for memory hotplug. If they
go and do this, your driver may get left behind.

> - only whole memory sections are added / removed
> - for each section a memory resource is registered

True, and true. (There might be exceptions to the whole sections one,
but that's blatant abuse and should be fixed. :)

> From the driver side we need:
> - some kind of memory notification mechanism.
> For memory add we can live without any external memory notification
> event. For memory remove we do need an external trigger (see explanation
> above).

You can export and use (un)register_memory_notifier. You just need to
do it in a reasonable way that compiles for randconfig on your
architecture. Believe me, we don't want to start teaching drivers about
sparsemem.
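
For reference, a minimal consumer of those exports looks something like the
sketch below (the rebuild helper is a placeholder, not your driver's actual
code, and error handling is omitted):

#include <linux/memory.h>
#include <linux/notifier.h>

static int ehea_mem_notifier(struct notifier_block *nb,
			     unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:
		/* a section became usable: rebuild the busmap and
		 * re-register the MR against the new layout */
		ehea_rebuild_busmap();
		break;
	case MEM_GOING_OFFLINE:
		/* a section is about to go away: drop it from the MR now
		 * so the hypervisor can actually remove the memory */
		ehea_rebuild_busmap();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block ehea_mem_nb = {
	.notifier_call = ehea_mem_notifier,
};

/* module init/exit would then do:
 *	register_memory_notifier(&ehea_mem_nb);
 *	unregister_memory_notifier(&ehea_mem_nb);
 */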

> - a way to iterate over all kernel pages and a way to detect holes in the
> kernel memory layout in order to build up our own ehea_bmap.

Look at kernel/resource.c

But, I'm really not convinced that you can actually keep this map
yourselves. It's not as simple as you think. What happens if you get
on an LPAR with two sections, one 256MB@0x0 and another
16MB@0x1000000000000000. That's quite possible. I think your vmalloc'd
array will eat all of memory.

That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
in the core, so that we can deal with these kinds of problems, once and
*NOT* in every single little driver out there.

> Functions to use while building ehea_bmap + MRs:
> - Use either the functions that are used by the memory hotplug system as
> well, that means using the section defines + functions (section_nr_to_pfn,
> pfn_valid)

Basically, you can't use anything related to sections outside of the
core code. You can use things like pfn_valid(), or you can create new
interfaces that are properly abstracted.

> - Use currently other not exported functions in kernel/resource.c, like
> walk_memory_resource (where we would still need the maximum possible number
> of pages NR_MEM_SECTIONS)

It isn't the act of exporting that's the problem. It's making sure that
the exports won't be prone to abuse and that people are using them
properly. You should assume that you can export and use
walk_memory_resource().
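
To be concrete, using it from a driver would look roughly like this sketch
(the callback only counts pages here; the argument order follows the current
kernel/resource.c declaration, but double-check it against your tree):

#include <linux/memory_hotplug.h>
#include <linux/bootmem.h>	/* max_pfn */

/* called once per present RAM range; a real driver would register the
 * range with its busmap / MR here instead of just counting pages */
static int ehea_ram_range(unsigned long start_pfn, unsigned long nr_pages,
			  void *arg)
{
	unsigned long *total = arg;

	*total += nr_pages;
	return 0;		/* non-zero aborts the walk */
}

static unsigned long ehea_count_present_pages(void)
{
	unsigned long total = 0;

	walk_memory_resource(0, max_pfn, &total, ehea_ram_range);
	return total;
}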

Do you know what other operating systems do with this hardware?

In the future, please make an effort to get review from knowledgeable
people about these kinds of things before using them in your driver.
Your company has many, many resources available, and all you need to do
is ask. All that you have to do is look to the tops of the files of the
functions you are calling.

-- Dave

2008-02-14 08:46:30

by Christoph Raisch

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier


Dave Hansen <[email protected]> wrote on 13.02.2008 18:05:00:

> On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
> > Constraints imposed by HW / FW:
> > - eHEA has own MMU
> > - eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
> > addresses to absolute addresses (like DMA mapped memory on a PCI bus)
> > - The number of MRs is limited (not enough to have one MR per packet)
>
> Are there enough to have one per 16MB section?

Unfortunately this won't work. This was one of our first ideas we tossed
out, but the number of MRs will not be sufficient.

>
> > Our current understanding about the current Memory Hotplug System are
> > (please correct me if I'm wrong):
> >
> > - depends on sparse mem
>
> You're wrong ;). In mainline, this is true. There was a version of the
> SUSE kernel that did otherwise. But you can not and should not depend
> on this never changing. But, someone is perfectly free to go out an
> implement something better than sparsemem for memory hotplug. If they
> go and do this, your driver may get left behind.

We understand that the add/remove area is not as settled in the kernel
as, for example, f_ops ;-)
Are there already basic working assumptions which are very unlikely to
change?

>
> > - only whole memory sections are added / removed
> > - for each section a memory resource is registered
>
> True, and true. (There might be exceptions to the whole sections one,
> but that's blatant abuse and should be fixed. :)
>
> > From the driver side we need:
> > - some kind of memory notification mechanism.
> > For memory add we can live without any external memory notification
> > event. For memory remove we do need an external trigger (see explanation
> > above).
>
> You can export and use (un)register_memory_notifier. You just need to
> do it in a reasonable way that compiles for randconfig on your
> architecture. Believe me, we don't want to start teaching drivers about
> sparsemem.

I'm a little confused here...
The existing add/remove code depends on sparsemem.
Other pieces on the POWER6 version of the architecture do as well.
So we could either choose to disable add/remove if sparsemem is not there,
or disable the driver via Kconfig in this case.


> > - a way to iterate over all kernel pages and a way to detect holes in the
> > kernel memory layout in order to build up our own ehea_bmap.
>
> Look at kernel/resource.c
>
> But, I'm really not convinced that you can actually keep this map
> yourselves. It's not as simple as you think. What happens if you get
> on an LPAR with two sections, one 256MB@0x0 and another
> 16MB@0x1000000000000000. That's quite possible. I think your vmalloc'd
> array will eat all of memory.
I'm glad you mention this part. There are many algorithms out there to
handle this problem, hashes/trees/...; all of these trade speed for a
smaller memory footprint. We based the table decision on the existing
implementations of the architecture.
Do you see such a case coming along for the next generation POWER systems?
I would guess such drastic changes would also require changes in the base
kernel.

Will you provide a generic mapping system with a contiguous virtual address
space like the ehea_bmap we can query? This would need to be a "stable" part
of the implementation, including translation functions from kernel to
nextgen_ehea_generic_bmap like virt_to_abs.

>
> That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
> in the core, so that we can deal with these kinds of problems, once and
> *NOT* in every single little driver out there.
>
> > Functions to use while building ehea_bmap + MRs:
> > > - Use either the functions that are used by the memory hotplug system as
> > > well, that means using the section defines + functions (section_nr_to_pfn,
> > > pfn_valid)
>
> Basically, you can't use anything related to sections outside of the
> core code. You can use things like pfn_valid(), or you can create new
> interfaces that are properly abstracted.

We picked sections instead of PFNs because this keeps the ehea_bmap in a
reasonable range on the existing systems.
But if you provide an abstract method handling exactly the problem we
mention, we'll be happy to use that and dump our private implementation.

>
> > - Use currently other not exported functions in kernel/resource.c, like
> > walk_memory_resource (where we would still need the maximum possible number
> > of pages NR_MEM_SECTIONS)
>
> It isn't the act of exporting that's the problem. It's making sure that
> the exports won't be prone to abuse and that people are using them
> properly. You should assume that you can export and use
> walk_memory_resource().

So this seems to come down to a basic question:
New hardware seems to have a tendency to get "private MMUs",
which need private mappings from the kernel address space into a
"HW defined address space with potentially unique characteristics"
RDMA in Openfabrics with global MR is the most prominent example heading
there


>
> Do you know what other operating systems do with this hardware?

We're not aware of another open source operating system trying to address
this topic.

>
> In the future, please make an effort to get review from knowledgeable
> people about these kinds of things before using them in your driver.
> Your company has many, many resources available, and all you need to do
> is ask. All that you have to do is look to the tops of the files of the
> functions you are calling.
>

So we're glad we finally found the right person who takes responsibility
for this topic!


> -- Dave
>
Gruss / Regards
Christoph Raisch + Jan-Bernd Themann

2008-02-14 17:13:14

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Thu, 2008-02-14 at 09:46 +0100, Christoph Raisch wrote:
> Dave Hansen <[email protected]> wrote on 13.02.2008 18:05:00:
> > On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
> > > Constraints imposed by HW / FW:
> > > - eHEA has own MMU
> > > - eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
> > > addresses to absolute addresses (like DMA mapped memory on a PCI bus)
> > > - The number of MRs is limited (not enough to have one MR per packet)
> >
> > Are there enough to have one per 16MB section?
>
> Unfortunately this won't work. This was one of our first ideas we tossed
> out, but the number of MRs will not be sufficient.

Can you give a ballpark of how many there are to work with? 10? 100?
1000?

> We understand that the add/remove area is not as
> settled in the kernel like for example f_ops ;-)
> Are there already base working assumptions which are very unlikely to
> change?

If you use good interfaces, and someone changes them, they'll likely
also fix your driver.

If you use bad interfaces, people may not even notice when they break.
As I showed you with those compile failures, you're using bad interfaces
that don't even compile on some .configs.

> I'm a little confused here....
> ...the existing add/remove code depends on sparse mem.
> Other pieces on the POWER6 version of the architecture do as well.
> So we could either chose to disable add/remove if sparsemem is not there,
> or disable the driver by Kconfig in this case.

Technically, you can do this. But, it's not a sign of a professionally
written driver that is going to get its patches accepted into mainline.
Technically, you can also use lots of magic numbers and not obey
CodingStyle. But, you'll probably get review comments asking you to
change it.

> > > - a way to iterate over all kernel pages and a way to detect holes in the
> > > kernel memory layout in order to build up our own ehea_bmap.
> >
> > Look at kernel/resource.c
> >
> > But, I'm really not convinced that you can actually keep this map
> > yourselves. It's not as simple as you think. What happens if you get
> > on an LPAR with two sections, one 256MB@0x0 and another
> > 16MB@0x1000000000000000. That's quite possible. I think your vmalloc'd
> > array will eat all of memory.
> I'm glad you mention this part. There are many algorithms out there to
> handle this problem,
> hashes/trees/... all of these trade speed for smaller memory footprint.
> We based the table decission on the existing implementations of the
> architecture.
> Do you see such a case coming along for the next generation POWER systems?

Dude. It exists *TODAY*. Go take a machine, add tens of gigabytes of
memory to it. Then, remove all of the sections of memory in the middle.
You'll be left with a very sparse memory configuration that we *DO*
handle today in the core VM. We handle it quite well, actually.

The hypervisor does not shrink memory from the top down. It pulls
things out of the middle and shuffles things around. In fact, a NUMA
node's memory isn't even contiguous.

Your code will OOM the machine in this case. I consider the ehea driver
buggy in this regard.

> I would guess these drastic changes would also require changes in base
> kernel.

No, we actually solved those a couple years ago.

> Will you provide a generic mapping system with a contiguous virtual address
> space
> like the ehea_bmap we can query? This would need to be a "stable" part of
> the implementation,
> including translation functions from kernel to nextgen_ehea_generic_bmap
> like virt_to_abs.

Yes, that's a real possibility, especially if some other users for it
come forward. We could definitely add something like that to the
generic code. But, you'll have to be convincing that what we have now
is insufficient.

Does this requirement:
"- MRs cover a contiguous virtual memory block (no holes)"
come from the hardware?

Is that *EACH* MR? OR all MRs?

Where does EHEA_BUSMAP_START come from? Is that defined in the
hardware? Have you checked to ensure that no other users might want a
chunk of memory in that area?

Can you query the existing MRs? Not change them in place, but can you
query their contents?

> > That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
> > in the core, so that we can deal with these kinds of problems, once and
> > *NOT* in every single little driver out there.
> >
> > > Functions to use while building ehea_bmap + MRs:
> > > - Use either the functions that are used by the memory hotplug system as
> > > well, that means using the section defines + functions (section_nr_to_pfn,
> > > pfn_valid)
> >
> > Basically, you can't use anything related to sections outside of the
> > core code. You can use things like pfn_valid(), or you can create new
> > interfaces that are properly abstracted.
>
> We picked sections instead of PFNs because this keeps the ehea_bmap in a
> reasonable range on the existing systems.
> But if you provide a abstract method handling exactly the problem we
> mention we'll be happy to use that and dump our private implementation.

One thing you can guarantee today is that things are contiguous up to
MAX_ORDER_NR_PAGES. That's a symbol that is unlikely to change and is
much more appropriate than using sparsemem. We could also give you a
nice new #define like MINIMUM_CONTIGUOUS_PAGES or something. I think
that's what you really want.
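
Something like this sketch, assuming the contiguity guarantee described
above (the helper name is made up, and a real driver would obviously do
more than call a function pointer per chunk):

#include <linux/mm.h>
#include <linux/bootmem.h>	/* max_pfn */

/* Walk memory at MAX_ORDER_NR_PAGES granularity instead of at sparsemem
 * section granularity: one pfn_valid() check per chunk is enough if
 * presence never changes inside a MAX_ORDER block. */
static void ehea_for_each_chunk(void (*map_chunk)(unsigned long start_pfn,
						  unsigned long nr_pages))
{
	unsigned long pfn;

	for (pfn = 0; pfn < max_pfn; pfn += MAX_ORDER_NR_PAGES)
		if (pfn_valid(pfn))
			map_chunk(pfn, MAX_ORDER_NR_PAGES);
}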

> > > - Use currently other not exported functions in kernel/resource.c, like
> > > walk_memory_resource (where we would still need the maximum possible number
> > > of pages NR_MEM_SECTIONS)
> >
> > It isn't the act of exporting that's the problem. It's making sure that
> > the exports won't be prone to abuse and that people are using them
> > properly. You should assume that you can export and use
> > walk_memory_resource().
>
> So this seems to come down to a basic question:
> New hardware seems to have a tendency to get "private MMUs",
> which need private mappings from the kernel address space into a
> "HW defined address space with potentially unique characteristics"
> RDMA in Openfabrics with global MR is the most prominent example heading
> there

That's not a question. ;)

Please explain to me why walk_memory_resource() is insufficient for your
needs. I've now pointed it out to you at least 3 times.

> > Do you know what other operating systems do with this hardware?
>
> We're not aware of another open source Operating system trying to address
> this topic.

What about AIX? Do you know who wrote its driver? Perhaps you should
go ask them.

-- Dave

2008-02-14 17:34:00

by Badari Pulavarty

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Thu, 2008-02-14 at 09:12 -0800, Dave Hansen wrote:
..
> > > > - Use currently other not exported functions in kernel/resource.c, like
> > > > walk_memory_resource (where we would still need the maximum possible number
> > > > of pages NR_MEM_SECTIONS)
> > >
> > > It isn't the act of exporting that's the problem. It's making sure that
> > > the exports won't be prone to abuse and that people are using them
> > > properly. You should assume that you can export and use
> > > walk_memory_resource().
> >
> > So this seems to come down to a basic question:
> > New hardware seems to have a tendency to get "private MMUs",
> > which need private mappings from the kernel address space into a
> > "HW defined address space with potentially unique characteristics"
> > RDMA in Openfabrics with global MR is the most prominent example heading
> > there
>
> That's not a question. ;)
>
> Please explain to me why walk_memory_resource() is insufficient for your
> needs. I've now pointed it out to you at least 3 times.

I am not sure what you are trying to do with walk_memory_resource(). The
behavior is different on ppc64. Hotplug memory usage assumes that all
the memory resources (all system memory, not just IOMEM) are represented
in /proc/iomem. That's the case with i386 and ia64. But on ppc64 it
contains ONLY iomem-related entries. Paulus didn't want to export all the
system memory into /proc/iomem on ppc64. So I had to work around this by
providing an arch-specific walk_memory_resource() function for ppc64.

Thanks,
Badari

2008-02-14 17:38:58

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier


On Thu, 2008-02-14 at 09:36 -0800, Badari Pulavarty wrote:
>
> I am not sure what you are trying to do with walk_memory_resource(). The
> behavior is different on ppc64. Hotplug memory usage assumes that all
> the memory resources (all system memory, not just IOMEM) are represented
> in /proc/iomem. Its the case with i386 and ia64. But on ppc64 is
> contains ONLY iomem related. Paulus didn't want to export all the system
> memory into /proc/iomem on ppc64. So I had to workaround by providing
> arch-specific walk_memory_resource() function for ppc64.

OK, let's use that one.

-- Dave

2008-02-15 15:19:23

by Christoph Raisch

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier


Dave Hansen <[email protected]> wrote on 14.02.2008 18:12:43:

> On Thu, 2008-02-14 at 09:46 +0100, Christoph Raisch wrote:
> > Dave Hansen <[email protected]> wrote on 13.02.2008 18:05:00:
> > > On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
> > > > Constraints imposed by HW / FW:
> > > > - eHEA has own MMU
> > > > - eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
> > > > addresses to absolute addresses (like DMA mapped memory on a PCI bus)
> > > > - The number of MRs is limited (not enough to have one MR per packet)
> > >
> > > Are there enough to have one per 16MB section?
> >
> > Unfortunately this won't work. This was one of our first ideas we tossed
> > out, but the number of MRs will not be sufficient.
>
> Can you give a ballpark of how many there are to work with? 10? 100?
> 1000?
>
It depends on the HMC configuration, but in the worst case the upper limit
is in the two-digit range.

> > > But, I'm really not convinced that you can actually keep this map
> > > yourselves. It's not as simple as you think. What happens if you get
> > > on an LPAR with two sections, one 256MB@0x0 and another
> > > 16MB@0x1000000000000000. That's quite possible. I think your vmalloc'd
> > > array will eat all of memory.
> > I'm glad you mention this part. There are many algorithms out there to
> > handle this problem,
> > hashes/trees/... all of these trade speed for smaller memory footprint.
> > We based the table decission on the existing implementations of the
> > architecture.
> > Do you see such a case coming along for the next generation POWER systems?
>
> Dude. It exists *TODAY*. Go take a machine, add tens of gigabytes of
> memory to it. Then, remove all of the sections of memory in the middle.
> You'll be left with a very sparse memory configuration that we *DO*
> handle today in the core VM. We handle it quite well, actually.
>
> The hypervisor does not shrink memory from the top down. It pulls
> things out of the middle and shuffles things around. In fact, a NUMA
> node's memory isn't even contiguous.
>
> Your code will OOM the machine in this case. I consider the ehea driver
> buggy in this regard.

Your comment indicates that the upper memory limit that can be set on the HMC
does not bound the upper limit of the partition's physical address space.
So the base assumption we discussed internally is wrong here.
(see conclusion below)
>
> > I would guess these drastic changes would also require changes in base
> > kernel.
>
> No, we actually solved those a couple years ago.
>
> > Will you provide a generic mapping system with a contiguous virtual address
> > space like the ehea_bmap we can query? This would need to be a "stable" part
> > of the implementation, including translation functions from kernel to
> > nextgen_ehea_generic_bmap like virt_to_abs.
>
> Yes, that's a real possibility, especially if some other users for it
> come forward. We could definitely add something like that to the
> generic code. But, you'll have to be convincing that what we have now
> is insufficient.
>
> Does this requirement:
> "- MRs cover a contiguous virtual memory block (no holes)"
> come from the hardware?
>
yes
> Is that *EACH* MR? OR all MRs?
>
each
> Where does EHEA_BUSMAP_START come from? Is that defined in the
> hardware? Have you checked to ensure that no other users might want a
> chunk of memory in that area?
>
EHEA_BUSMAP_START is a value which has to match between the WQE
virtual addresses and the MR used in them.
Fortunately there's a simple answer on that one. Each MR has its own address
space, so there's no need to check.
A HEA MR actually has exactly the same attributes as an InfiniBand MR with
this hardware.
Send/receive processing is pretty much comparable to an InfiniBand UD queue.

> Can you query the existing MRs?
no
> Not change them in place, but can you
> query their contents?
no
>
> > > That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
> > > in the core, so that we can deal with these kinds of problems, once and
> > > *NOT* in every single little driver out there.
> > >
> > > > Functions to use while building ehea_bmap + MRs:
> > > > - Use either the functions that are used by the memory hotplug system as
> > > > well, that means using the section defines + functions (section_nr_to_pfn,
> > > > pfn_valid)
> > >
> > > Basically, you can't use anything related to sections outside of the
> > > core code. You can use things like pfn_valid(), or you can create new
> > > interfaces that are properly abstracted.
> >
> > We picked sections instead of PFNs because this keeps the ehea_bmap in a
> > reasonable range on the existing systems.
> > But if you provide a abstract method handling exactly the problem we
> > mention we'll be happy to use that and dump our private implementation.
>
> One thing you can guarantee today is that things are contiguous up to
> MAX_ORDER_NR_PAGES. That's a symbol that is unlikely to change and is
> much more appropriate than using sparsemem. We could also give you a
> nice new #define like MINIMUM_CONTIGUOUS_PAGES or something. I think
> that's what you really want.

That's definitely the right direction.

From this mail thread I would conclude: memory space can have holes, and
drivers shouldn't make any assumptions about when, where and how.

A translation from kernel to ehea_bmap space should be fast and predictable
(ruling out hashes).
If a driver doesn't know anything else about the mapping structure,
the normal solution in the kernel for this type of problem is a multi-level
lookup table like pgd->pud->pmd->pte.
This doesn't sound like something that should be implemented in a device driver.

We didn't see from the existing code that such a mapping to a contiguous
space already exists.
Maybe we've missed it.

If the mapping is less random, the translation gets much simpler.
MAX_ORDER_NR_PAGES helps here; is there more like that?


Gruss / Regards
Christoph Raisch + Jan-Bernd Themann

2008-02-15 16:56:10

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Fri, 2008-02-15 at 14:22 +0100, Christoph Raisch wrote:
> A translation from kernel to ehea_bmap space should be fast and predictable
> (ruling out hashes).
> If a driver doesn't know anything else about the mapping structure,
> the normal solution in kernel for this type of problem is a multi level
> look up table like pgd->pud->pmd->pte
> This doesn't sound right to be implemented in a device driver.
>
> We didn't see from the existing code that such a mapping to a contiguous
> space already exists.
> Maybe we've missed it.

I've been thinking about that, and I don't think you really *need* to
keep a comprehensive map like that.

When the memory is in a particular configuration (range of memory
present along with unique set of holes) you get a unique ehea_bmap
configuration. That layout is completely predictable.

So, if at any time you want to figure out what the ehea_bmap address for
a particular *Linux* virtual address is, you just need to pretend that
you're creating the entire ehea_bmap, use the same algorithm and figure
out how you would have placed things, and use that result.
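
In code, that on-demand recomputation is roughly this (a sketch only,
assuming the busmap packs valid sections back to back starting at
EHEA_BUSMAP_START, which is how I read the current ehea_create_busmap()):

/* Re-derive the eHEA bus address for a kernel address by replaying the
 * busmap build order: valid sections are packed one after the other
 * starting at EHEA_BUSMAP_START.  Purely illustrative. */
static u64 ehea_calc_busaddr(void *kaddr)
{
	unsigned long target = __pa(kaddr) >> SECTION_SIZE_BITS;
	unsigned long nr, slot = 0;

	for (nr = 0; nr < NR_MEM_SECTIONS; nr++) {
		if (!valid_section_nr(nr))
			continue;
		if (nr == target)
			return EHEA_BUSMAP_START +
			       ((u64)slot << SECTION_SIZE_BITS) +
			       (__pa(kaddr) & ((1UL << SECTION_SIZE_BITS) - 1));
		slot++;
	}
	return ~0ULL;	/* address not backed by a valid section */
}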

Now, that's going to be a slow, crappy linear search (but maybe not as
slow as recreating the silly thing). So, you might eventually run into
some scalability problems with a lot of packets going around. But, I'd
be curious if you do in practice.

The other idea is that you create a mapping that is precisely 1:1 with
kernel memory. Let's say you have two sections present, 0 and 100. You
have a high_section_index of 100, and you vmalloc() a 100 entry array.

You need to create a *CONTIGUOUS* ehea map? Create one like this:

EHEA_VADDR->Linux Section
0->0
1->0
2->0
3->0
...
100->100

It's contiguous. Each area points to a valid Linux memory address.
It's also discernable in O(1) to what EHEA address a given Linux address
is mapped. You just have a couple of duplicate entries.
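
Building that map is trivial, in the section terms your driver already uses
(sketch only; slot_to_pa[] would hold the Linux physical address backing
each eHEA section, and I'm assuming section 0 is present as in the example):

#include <linux/mmzone.h>

static void ehea_create_identity_busmap(u64 *slot_to_pa,
					unsigned long high_section_index)
{
	unsigned long i;
	u64 backing = 0;	/* physical address of the last valid section */

	for (i = 0; i <= high_section_index; i++) {
		if (valid_section_nr(i))
			backing = (u64)section_nr_to_pfn(i) << PAGE_SHIFT;
		/* holes repeat the previous entry, keeping the eHEA space
		 * contiguous; Linux->eHEA translation stays identity */
		slot_to_pa[i] = backing;
	}
}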

-- Dave

2008-02-18 10:02:32

by Jan-Bernd Themann

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

switching to proper mail client...

Dave Hansen <[email protected]> wrote on 15.02.2008 17:55:38:

> I've been thinking about that, and I don't think you really *need* to
> keep a comprehensive map like that.
>
> When the memory is in a particular configuration (range of memory
> present along with unique set of holes) you get a unique ehea_bmap
> configuration. That layout is completely predictable.
>
> So, if at any time you want to figure out what the ehea_bmap address for
> a particular *Linux* virtual address is, you just need to pretend that
> you're creating the entire ehea_bmap, use the same algorithm and figure
> out how you would have placed things, and use that result.
>
> Now, that's going to be a slow, crappy linear search (but maybe not as
> slow as recreating the silly thing). So, you might eventually run into
> some scalability problems with a lot of packets going around. But, I'd
> be curious if you do in practice.

Up to 14 address translations per packet (sg_list) might be required on
the transmit side. On the receive side it is only 1. Most packets require
only very few translations (1, sometimes more). However, with more than
700,000 packets per second this approach does not seem reasonable from a
performance perspective when memory is fragmented as you described.

>
> The other idea is that you create a mapping that is precisely 1:1 with
> kernel memory. Let's say you have two sections present, 0 and 100. You
> have a high_section_index of 100, and you vmalloc() a 100 entry array.
>
> You need to create a *CONTIGUOUS* ehea map? Create one like this:
>
> EHEA_VADDR->Linux Section
> 0->0
> 1->0
> 2->0
> 3->0
> ...
> 100->100
>
> It's contiguous. Each area points to a valid Linux memory address.
> It's also discernable in O(1) to what EHEA address a given Linux address
> is mapped. You just have a couple of duplicate entries.

This has a serious issue with a constraint I mentioned in the previous mail:

"- MRs can have a maximum size of the memory available under linux"

The requirement that the memory region must not be larger than the
available memory for that partition is not met. The "create MR"
H_CALL will fail (we tried this and discussed it with FW development).


Regards,
Jan-Bernd & Christoph

2008-02-20 18:14:30

by Dave Hansen

Subject: Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier

On Mon, 2008-02-18 at 11:00 +0100, Jan-Bernd Themann wrote:
> Dave Hansen <[email protected]> wrote on 15.02.2008 17:55:38:
>
> > I've been thinking about that, and I don't think you really *need* to
> > keep a comprehensive map like that.
> >
> > When the memory is in a particular configuration (range of memory
> > present along with unique set of holes) you get a unique ehea_bmap
> > configuration. That layout is completely predictable.
> >
> > So, if at any time you want to figure out what the ehea_bmap address for
> > a particular *Linux* virtual address is, you just need to pretend that
> > you're creating the entire ehea_bmap, use the same algorithm and figure
> > out how you would have placed things, and use that result.
> >
> > Now, that's going to be a slow, crappy linear search (but maybe not as
> > slow as recreating the silly thing). So, you might eventually run into
> > some scalability problems with a lot of packets going around. But, I'd
> > be curious if you do in practice.
>
> Up to 14 addresses translation per packet (sg_list) might be required on
> the transmit side. On receive side it is only 1. Most packets require only
> very few translations (1 or sometimes more) translations. However,
> with more then 700.000 packets per second this approach does not seem
> reasonable from performance perspective when memory is fragmented as you
> described.

OK, but let's see the data. *SHOW* me that it's slow. If the algorithm
works, then perhaps we can simply speed it up with a little caching and
*MUCH* less memory overhead.

-- Dave