I'd like feedback on the following performance problem &
suggestions for a proper solution.
Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
part of the boot time is scanning ACPI tables. ACPI tables on UV systems
are located in RAM that is physically attached to node 0.
User programs (ex., acpidump) read the ACPI tables by mapping them thru
/dev/mem. Although mmap tries to map the tables as CACHED, there are
existing kernel UNCACHED mappings that conflict, so the tables end up
being mapped UNCACHED. (See the call to track_pfn_vma_new() in
remap_pfn_range()).
Much of the access is to small fields (bytes (checksums), shorts, etc).
Late in boot, there is significant scanning of the ACPI tables that takes
place from nodes other than zero. Since the tables are not cached, each
reference accesses physical memory that is attached to remote nodes. These
memory requests must cross the numalink interconnect which adds several
hundred nsec to each access. This slows the boot process. Access from
node 0, although faster, is still very slow.
The following experimental patch changes the kernel mapping for ACPI tables
to CACHED. This eliminates the page attribute conflict & allows users to map
the tables CACHEABLE. This significantly speeds up boot:
38 minutes without the patch
27 minutes with the patch
~30% improvement
Time to run ACPIDUMP on a large system:
527 seconds without the patch
8 seconds with the patch
I don't know if the patch in its current form is the correct solution. I'm
interested in feedback on how this should be solved. I expect there
are issues on other platforms so for now, the patch uses x86_platform_ops
to change mappings only on UV platforms (I'm paranoid :-).
I also need to experiment with early_ioremap'ing of the ACPI tables. I suspect
this is also mapped UNCACHED. There may be additional improvements if this
could be mapped CACHED. However, the potential performance gain is much
less since these references all occur from node 0.
Signed-off-by: Jack Steiner <[email protected]>
---
arch/x86/include/asm/x86_init.h | 2 ++
arch/x86/kernel/apic/x2apic_uv_x.c | 6 ++++++
arch/x86/kernel/x86_init.c | 3 +++
drivers/acpi/osl.c | 12 +++++++++---
4 files changed, 20 insertions(+), 3 deletions(-)
Index: linux/arch/x86/include/asm/x86_init.h
===================================================================
--- linux.orig/arch/x86/include/asm/x86_init.h 2010-07-21 16:53:30.226241589 -0500
+++ linux/arch/x86/include/asm/x86_init.h 2010-07-21 16:57:46.614872338 -0500
@@ -113,6 +113,7 @@ struct x86_cpuinit_ops {
/**
* struct x86_platform_ops - platform specific runtime functions
+ * @is_wb_acpi_tables E820 ACPI table are in WB memory
* @is_untracked_pat_range exclude from PAT logic
* @calibrate_tsc: calibrate TSC
* @get_wallclock: get time from HW clock like RTC etc.
@@ -120,6 +121,7 @@ struct x86_cpuinit_ops {
* @nmi_init enable NMI on cpus
*/
struct x86_platform_ops {
+ int (*is_wb_acpi_tables)(void);
int (*is_untracked_pat_range)(u64 start, u64 end);
unsigned long (*calibrate_tsc)(void);
unsigned long (*get_wallclock)(void);
Index: linux/arch/x86/kernel/apic/x2apic_uv_x.c
===================================================================
--- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:53:30.226241589 -0500
+++ linux/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:54:46.358866486 -0500
@@ -58,6 +58,11 @@ static int uv_is_untracked_pat_range(u64
return is_ISA_range(start, end) || is_GRU_range(start, end);
}
+static int uv_is_wb_acpi_tables(void)
+{
+ return 1;
+}
+
static int early_get_nodeid(void)
{
union uvh_node_id_u node_id;
@@ -81,6 +86,7 @@ static int __init uv_acpi_madt_oem_check
nodeid = early_get_nodeid();
x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
x86_platform.nmi_init = uv_nmi_init;
+ x86_platform.is_wb_acpi_tables = uv_is_wb_acpi_tables;
if (!strcmp(oem_table_id, "UVL"))
uv_system_type = UV_LEGACY_APIC;
else if (!strcmp(oem_table_id, "UVX"))
Index: linux/arch/x86/kernel/x86_init.c
===================================================================
--- linux.orig/arch/x86/kernel/x86_init.c 2010-07-21 16:53:30.226241589 -0500
+++ linux/arch/x86/kernel/x86_init.c 2010-07-21 16:58:17.106240870 -0500
@@ -71,7 +71,10 @@ struct x86_cpuinit_ops x86_cpuinit __cpu
static void default_nmi_init(void) { };
+static int default_wb_acpi_tables(void) {return 0;}
+
struct x86_platform_ops x86_platform = {
+ .is_wb_acpi_tables = default_wb_acpi_tables,
.is_untracked_pat_range = default_is_untracked_pat_range,
.calibrate_tsc = native_calibrate_tsc,
.get_wallclock = mach_get_cmos_time,
Index: linux/drivers/acpi/osl.c
===================================================================
--- linux.orig/drivers/acpi/osl.c 2010-07-21 16:53:30.226241589 -0500
+++ linux/drivers/acpi/osl.c 2010-07-21 17:58:20.370414172 -0500
@@ -293,12 +293,18 @@ acpi_os_map_memory(acpi_physical_address
printk(KERN_ERR PREFIX "Cannot map memory that high\n");
return NULL;
}
- if (acpi_gbl_permanent_mmap)
+ if (acpi_gbl_permanent_mmap) {
/*
* ioremap checks to ensure this is in reserved space
*/
- return ioremap((unsigned long)phys, size);
- else
+ if (x86_platform.is_wb_acpi_tables() &&
+ (e820_all_mapped(phys, phys + size, E820_RAM) ||
+ e820_all_mapped(phys, phys + size, E820_ACPI) ||
+ e820_all_mapped(phys, phys + size, E820_NVS)))
+ return ioremap_cache((unsigned long)phys, size);
+ else
+ return ioremap((unsigned long)phys, size);
+ } else
return __acpi_map_table((unsigned long)phys, size);
}
EXPORT_SYMBOL_GPL(acpi_os_map_memory);
> The following experimental patch changes the kernel mapping for ACPI tables
> to CACHED. This eliminates the page attibute conflict & allows users to map
> the tables CACHEABLE. This significantly speeds up boot:
>
> 38 minutes without the patch
> 27 minutes with the patch
> ~30% improvement
>
> Time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch
Interesting.
Can you detect a performance difference on a 1-node machine
that doesn't magnify the penalty of the remote uncached access?
thanks,
-Len Brown, Intel Open Source Technology Center
On Thu, 2010-07-22 at 23:22 +0800, Jack Steiner wrote:
> I'd like feedback on the following performance problem &
> suggestions for a proper solution.
>
>
> Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> are located in RAM memory that is physically attached to node 0.
>
> User programs (ex., acpidump) read the ACPI tables by mapping them thru
> /dev/mem. Although mmap tries to map the tables as CACHED, there are
> existing kernel UNCACHED mapping that conflict and the tables end up as
> being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> remap_pfn_range()).
>
> Much of the access is to small fields (bytes (checksums), shorts, etc).
> Late in boot, there is significant scanning of the ACPI tables that take
> place from nodes other than zero. Since the tables are not cached, each
> reference accesses physical memory that is attached to remote nodes. These
> memory requests must cross the numalink interconnect which adds several
> hundred nsec to each access. This slows the boot process. Access from
> node 0, although faster, is still very slow.
>
>
>
> The following experimental patch changes the kernel mapping for ACPI tables
> to CACHED. This eliminates the page attibute conflict & allows users to map
> the tables CACHEABLE. This significantly speeds up boot:
>
> 38 minutes without the patch
> 27 minutes with the patch
> ~30% improvement
>
> Time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch
>
>
Hi, Jack
From the above data it seems that the performance is improved
significantly after using the cached type for the ACPI region.
But some ACPI regions are used for communication between the OS and
BIOS. It may be inappropriate to map them as cached.
The ACPI spec has the following two types of ACPI address range.
a. AddressRangeACPI(E820_ACPI): This belongs to the ACPI Reclaim
Memory. This range is available RAM usable by the OS after it reads the
ACPI tables.
b. AddressRangeNVS(E820_NVS): This belongs to the ACPI NVS Memory.
This range of addresses is in use or reserved by the system and must not
be used by the operating system. And this region will also be used for
the communication between OS and BIOS.
From the above description, maybe the E820_ACPI region can be mapped as
cached. But this still depends on the BIOS. If some shared data
resides in the E820_ACPI region on some BIOS, we can't map the
E820_ACPI region as cached.
Thanks.
Yakui
> I don't know if the patch in it's current form is the correct solution. I'm
> interested in feedback on how this should be solved. I expect there
> are issues on other platforms so for now, the patch uses x86_platform_ops
> to change mappings only on UV platforms (I'm paranoid :-).
>
> I also need to experiment with early_ioremap'ing of the ACPI tables. I suspect
> this is also mapped UNCACHED. There may be additional improvements if this
> could be mapped CACHED. However, the potential performance gain is much
> less since these references all occur from node 0.
>
>
>
> Signed-off-by: Jack Steiner <[email protected]>
>
>
> ---
> arch/x86/include/asm/x86_init.h | 2 ++
> arch/x86/kernel/apic/x2apic_uv_x.c | 6 ++++++
> arch/x86/kernel/x86_init.c | 3 +++
> drivers/acpi/osl.c | 12 +++++++++---
> 4 files changed, 20 insertions(+), 3 deletions(-)
>
> Index: linux/arch/x86/include/asm/x86_init.h
> ===================================================================
> --- linux.orig/arch/x86/include/asm/x86_init.h 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/include/asm/x86_init.h 2010-07-21 16:57:46.614872338 -0500
> @@ -113,6 +113,7 @@ struct x86_cpuinit_ops {
>
> /**
> * struct x86_platform_ops - platform specific runtime functions
> + * @is_wb_acpi_tables E820 ACPI table are in WB memory
> * @is_untracked_pat_range exclude from PAT logic
> * @calibrate_tsc: calibrate TSC
> * @get_wallclock: get time from HW clock like RTC etc.
> @@ -120,6 +121,7 @@ struct x86_cpuinit_ops {
> * @nmi_init enable NMI on cpus
> */
> struct x86_platform_ops {
> + int (*is_wb_acpi_tables)(void);
> int (*is_untracked_pat_range)(u64 start, u64 end);
> unsigned long (*calibrate_tsc)(void);
> unsigned long (*get_wallclock)(void);
> Index: linux/arch/x86/kernel/apic/x2apic_uv_x.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:54:46.358866486 -0500
> @@ -58,6 +58,11 @@ static int uv_is_untracked_pat_range(u64
> return is_ISA_range(start, end) || is_GRU_range(start, end);
> }
>
> +static int uv_is_wb_acpi_tables(void)
> +{
> + return 1;
> +}
> +
> static int early_get_nodeid(void)
> {
> union uvh_node_id_u node_id;
> @@ -81,6 +86,7 @@ static int __init uv_acpi_madt_oem_check
> nodeid = early_get_nodeid();
> x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
> x86_platform.nmi_init = uv_nmi_init;
> + x86_platform.is_wb_acpi_tables = uv_is_wb_acpi_tables;
> if (!strcmp(oem_table_id, "UVL"))
> uv_system_type = UV_LEGACY_APIC;
> else if (!strcmp(oem_table_id, "UVX"))
> Index: linux/arch/x86/kernel/x86_init.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/x86_init.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/x86_init.c 2010-07-21 16:58:17.106240870 -0500
> @@ -71,7 +71,10 @@ struct x86_cpuinit_ops x86_cpuinit __cpu
>
> static void default_nmi_init(void) { };
>
> +static int default_wb_acpi_tables(void) {return 0;}
> +
> struct x86_platform_ops x86_platform = {
> + .is_wb_acpi_tables = default_wb_acpi_tables,
> .is_untracked_pat_range = default_is_untracked_pat_range,
> .calibrate_tsc = native_calibrate_tsc,
> .get_wallclock = mach_get_cmos_time,
> Index: linux/drivers/acpi/osl.c
> ===================================================================
> --- linux.orig/drivers/acpi/osl.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/drivers/acpi/osl.c 2010-07-21 17:58:20.370414172 -0500
> @@ -293,12 +293,18 @@ acpi_os_map_memory(acpi_physical_address
> printk(KERN_ERR PREFIX "Cannot map memory that high\n");
> return NULL;
> }
> - if (acpi_gbl_permanent_mmap)
> + if (acpi_gbl_permanent_mmap) {
> /*
> * ioremap checks to ensure this is in reserved space
> */
> - return ioremap((unsigned long)phys, size);
> - else
> + if (x86_platform.is_wb_acpi_tables() &&
> + (e820_all_mapped(phys, phys + size, E820_RAM) ||
> + e820_all_mapped(phys, phys + size, E820_ACPI) ||
> + e820_all_mapped(phys, phys + size, E820_NVS)))
> + return ioremap_cache((unsigned long)phys, size);
> + else
> + return ioremap((unsigned long)phys, size);
> + } else
> return __acpi_map_table((unsigned long)phys, size);
> }
> EXPORT_SYMBOL_GPL(acpi_os_map_memory);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* ykzhao <[email protected]> wrote:
> From the above description maybe the E820_ACPI region can be mapped as
> cached. But this still depends on the BIOS. If the some shared data resides
> in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> region as cached again.
I don't think we can do this safely unless some other OS (Windows) does it as
well. (The reason is that if some BIOS messes this up then it will cause nasty
bugs/problems only on Linux.)
But the benefits of caching are very clear and well measured by Jack, so we
want the feature. What we can do is to add an exception for 'known good' hw
vendors - i.e. something quite close to Jack's RFC patch, but implemented a
bit more cleanly:
Exposing x86_platform and e820 details to generic ACPI code isn't particularly
clean - there should be an ACPI accessor function for that: a new
acpi_table_can_be_cached(table) function or so.
In fact since __acpi_map_table(addr,size) is defined by architectures already,
this could be done purely within x86 code.
Thanks,
Ingo
On Fri, 2010-07-23 at 15:23 +0800, Ingo Molnar wrote:
> * ykzhao <[email protected]> wrote:
>
> > From the above description maybe the E820_ACPI region can be mapped as
> > cached. But this still depends on the BIOS. If the some shared data resides
> > in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> > region as cached again.
>
> I dont think we can do this safely unless some other OS (Windows) does it as
> well. (the reason is that if some BIOS messes this up then it will cause nasty
> bugs/problems only on Linux.)
>
Yes. We can't map the corresponding ACPI region as cached in the
following case:
  - No E820_ACPI region is reported by the BIOS. In that case the ACPI
    tables reside in the NVS region.
But if the BIOS follows the spec and reports separate
E820_ACPI/E820_NVS regions, maybe we can try mapping the E820_ACPI
region as cached - for example, on server machines. (The spec describes
the E820_ACPI region as reclaimable memory, which means that it can be
managed by the OS after the ACPI tables are loaded.)
Can we add a boot option to control whether the E820_ACPI region can be
mapped as cached type?
> But the benefits of caching are very clear and well measured by Jack, so we
> want the feature. What we can do is to add an exception for 'known good' hw
> vendors - i.e. something quite close to Jack's RFC patch, but implemented a
> bit more cleanly:
>
> Exposing x86_platform and e820 details to generic ACPI code isnt particularly
> clean - there should be an ACPI accessor function for that or so: a new
> acpi_table_can_be_cached(table) function or so.
Agree. The acpi_os_map_memory() function is also used on the IA64
platform. It seems more reasonable to use a wrapper function to check
whether the corresponding region can be mapped as cached.
>
> In fact since __acpi_map_table(addr,size) is defined by architectures already,
> this could be done purely within x86 code.
>
> Thanks,
>
> Ingo
On Thu, Jul 22, 2010 at 11:52:02AM -0400, Len Brown wrote:
> > The following experimental patch changes the kernel mapping for ACPI tables
> > to CACHED. This eliminates the page attibute conflict & allows users to map
> > the tables CACHEABLE. This significantly speeds up boot:
> >
> > 38 minutes without the patch
> > 27 minutes with the patch
> > ~30% improvement
> >
> > Time to run ACPIDUMP on a large system:
> > 527 seconds without the patch
> > 8 seconds with the patch
>
> Interesting.
>
> Can you detect a performance differene on a 1-node machine
> that doesn't magnify the penalty of the remote uncached access?
I timed acpidump on a smaller system running from both node 0 & a higher
node.
Serial number: UV-00000041
Partition number: 0
4 Blades
8 Nodes (Nehalem-EX sockets)
64 CPUs
60.87 Gb Memory Total
Times to run acpidump (average of 100 runs) show that cached runs 4X to 14X
faster than uncached, depending on the node it is run from. Since the
system is small, the total runtime is small.
baseline
.143 sec node 0
.479 sec node 7
ACPI tables mapped cached
.034 sec node 0
.037 sec node 7
I added trace code to remap_pfn_range() to see what ranges are mmapped.
The ranges are (first number is the number of times the range was mapped):
2 : paddr 0x78d1c000 - 0x79d1c000 DSDT @ 0x78d1c000
2 : paddr 0x78d1c000 - 0x9bd1c000 DSDT @ 0x78d1c000 << ???
4 : paddr 0x78d3f000 - 0x79d3f000 FACS @ 0x78d3f000
4 : paddr 0x78d6f000 - 0x79d6f000 DMAR @ 0x78d6f000
4 : paddr 0x78d70000 - 0x79d70000 SPCR @ 0x78d70000
4 : paddr 0x78d71000 - 0x79d71000 MCFG @ 0x78d71000
4 : paddr 0x78d72000 - 0x79d72000 SRAT @ 0x78d72000
4 : paddr 0x78d74000 - 0x79d74000 APIC @ 0x78d74000
4 : paddr 0x78d76000 - 0x79d76000 SLIT @ 0x78d76000
4 : paddr 0x78d78000 - 0x79d78000 HPET @ 0x78d78000
2 : paddr 0x78d79000 - 0x79d79000 SSDT @ 0x78d79000
2 : paddr 0x78d79000 - 0x7ed79000 SSDT @ 0x78d79000
4 : paddr 0x78d7f000 - 0x79d7f000 FACP @ 0x78d7f000
5 : paddr 0x78d80000 - 0x79d80000 ???
These ranges correspond to the following E820 entries
[ 0.000000] BIOS-e820: 000000000008f000 - 0000000000090000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078c61000 - 0000000078cd6000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078cd6000 - 0000000078d3f000 (ACPI data)
[ 0.000000] BIOS-e820: 0000000078d3f000 - 0000000078d61000 (ACPI NVS)
[ 0.000000] BIOS-e820: 0000000078d61000 - 0000000078d81000 (ACPI data)
[ 0.000000] BIOS-e820: 0000000078d81000 - 000000007cde1000 (usable)
and EFI entries:
[ 0.000000] EFI: mem136: type=9, attr=0xf, range=[0x0000000078cd6000-0x0000000078d3f000) (0MB)
[ 0.000000] EFI: mem137: type=10, attr=0xf, range=[0x0000000078d3f000-0x0000000078d61000) (0MB)
[ 0.000000] EFI: mem138: type=9, attr=0xf, range=[0x0000000078d61000-0x0000000078d6f000) (0MB)
[ 0.000000] EFI: mem139: type=9, attr=0xf, range=[0x0000000078d6f000-0x0000000078d81000) (0MB)
attr = 0xf ==> WB memory (UC WC WT also supported)
type 9 ==> EFI_ACPI_RECLAIM_MEMORY
type 10 ==> EFI_ACPI_MEMORY_NVS
On Thu, 22 Jul 2010, Jack Steiner wrote:
> Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> are located in RAM memory that is physically attached to node 0.
>
> User programs (ex., acpidump) read the ACPI tables by mapping them thru
> /dev/mem. Although mmap tries to map the tables as CACHED, there are
> existing kernel UNCACHED mapping that conflict and the tables end up as
> being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> remap_pfn_range()).
Well, as it was raised in this thread, ACPI tables are likely to be near RAM
regions used for IPC with the firmware or SMBIOS, and we have no idea of the
kind of crap that could happen if we enable caching on those areas.
OTOH, we *know* of systems that force us to copy the ACPI tables to regular
RAM, otherwise, the utterly broken BIOS corrupts the ACPI tables after the
kernel has loaded.
Couldn't we simply always copy all tables to regular RAM and mark THAT as
cacheable (since there will be no IPC regions in it)? For the tables that
are only used once, we can free the RAM later.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
On Fri, Jul 23, 2010 at 09:14:50PM -0300, Henrique de Moraes Holschuh wrote:
> Well, as it was raised in this thread, ACPI tables are likely to be near RAM
> regions used for IPC with the firmware or SMBIOS, and we have no idea of the
> kind of crap that could happen if we enable caching on those areas.
>
> OTOH, we *know* of systems that force us to copy the ACPI tables to regular
> RAM, otherwise, the utterly broken BIOS corrupts the ACPI tables after the
> kernel has loaded.
>
> Couldn't we simply always copy all tables to regular RAM and mark THAT as
> cacheable (since there will be no IPC regions in it)? For the tables that
> are only used once, we can free the RAM later.
I think this is reasonable. There's an argument that we shouldn't cache
operation regions that may be sitting next to the ACPI tables, but I
can't see any problems being caused by copying the tables to RAM.
--
Matthew Garrett | [email protected]
On Sat, 24 Jul 2010, Matthew Garrett wrote:
> On Fri, Jul 23, 2010 at 09:14:50PM -0300, Henrique de Moraes Holschuh wrote:
> > Well, as it was raised in this thread, ACPI tables are likely to be near RAM
> > regions used for IPC with the firmware or SMBIOS, and we have no idea of the
> > kind of crap that could happen if we enable caching on those areas.
> >
> > OTOH, we *know* of systems that force us to copy the ACPI tables to regular
> > RAM, otherwise, the utterly broken BIOS corrupts the ACPI tables after the
> > kernel has loaded.
> >
> > Couldn't we simply always copy all tables to regular RAM and mark THAT as
> > cacheable (since there will be no IPC regions in it)? For the tables that
> > are only used once, we can free the RAM later.
>
> I think this is reasonable. There's an argument that we shouldn't cache
> operation regions that may be sitting next to the ACPI tables, but I
> can't see any problems being caused by copying the tables to RAM.
Yes. And well-engineered platforms that are known to be safe (such as UV)
could just opt out of that and mark the ACPI tables directly as cacheable
if they want, if the penalty for an unnecessary copy-to-RAM [on these
systems] is high enough to merit it.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
On Fri, Jul 23, 2010 at 09:46:13AM +0800, ykzhao wrote:
> On Thu, 2010-07-22 at 23:22 +0800, Jack Steiner wrote:
> > I'd like feedback on the following performance problem &
> > suggestions for a proper solution.
> >
> >
> > Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> > part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> > are located in RAM memory that is physically attached to node 0.
> >
> > User programs (ex., acpidump) read the ACPI tables by mapping them thru
> > /dev/mem. Although mmap tries to map the tables as CACHED, there are
> > existing kernel UNCACHED mapping that conflict and the tables end up as
> > being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> > remap_pfn_range()).
> >
> > Much of the access is to small fields (bytes (checksums), shorts, etc).
> > Late in boot, there is significant scanning of the ACPI tables that take
> > place from nodes other than zero. Since the tables are not cached, each
> > reference accesses physical memory that is attached to remote nodes. These
> > memory requests must cross the numalink interconnect which adds several
> > hundred nsec to each access. This slows the boot process. Access from
> > node 0, although faster, is still very slow.
> >
> >
> >
> > The following experimental patch changes the kernel mapping for ACPI tables
> > to CACHED. This eliminates the page attibute conflict & allows users to map
> > the tables CACHEABLE. This significantly speeds up boot:
> >
> > 38 minutes without the patch
> > 27 minutes with the patch
> > ~30% improvement
> >
> > Time to run ACPIDUMP on a large system:
> > 527 seconds without the patch
> > 8 seconds with the patch
> >
> >
> Hi, Jack
> From the above data it seems that the performance is improved
> significantly after using cached type for ACPI region.
>
> But some ACPI region will be used for the communication between OS and
> BIOS. Maybe it is inappropriate to map them as cached.
>
> ACPI spec have the following two types of ACPI address range.
>
> a. AddressRangeACPI(E820_ACPI): This belongs to the ACPI Reclaim
> Memory. This range is available RAM usable by the OS after it reads the
> ACPI tables.
>
> b. AddressRangeNVS(E820_NVS): This belongs to the ACPI NVS Memory.
> This range of addresses is in use or reserved by the system and must not
> be used by the operating system. And this region will also be used for
> the communication between OS and BIOS.
(Sorry for taking so long to reply. I've been fighting a different set
of fires).
I don't see any issue here for platforms that actually place the ACPI tables
(both E820_ACPI & E820_NVS) in WB memory. This is definitely platform
specific & many platforms won't do this! However, on UV platforms all ACPI
memory is real WB memory. In fact, the BIOS references it as WB memory
during BIOS initialization.
If the platform supports it, what is the issue in having the OS reference
both E820_ACPI & E820_NVS ACPI tables as WB memory?
>
> >From the above description maybe the E820_ACPI region can be mapped as
> cached. But this still depends on the BIOS. If the some shared data
> resides in the E820_ACPI region on some BIOS, maybe we can't map the
> E820_ACPI region as cached again.
Agree. All of this is dependent on the BIOS. Whatever we do needs to
use the memory attributes set by BIOS for the memory regions.
AFAICT linux never reuses the E820_ACPI memory. It remains
reserved and is never reclaimed. Programs like "acpidump" expect the
E820_ACPI memory to remain intact.
>
> Thanks.
> Yakui
> > I don't know if the patch in it's current form is the correct solution. I'm
> > interested in feedback on how this should be solved. I expect there
> > are issues on other platforms so for now, the patch uses x86_platform_ops
> > to change mappings only on UV platforms (I'm paranoid :-).
> >
> > I also need to experiment with early_ioremap'ing of the ACPI tables. I suspect
> > this is also mapped UNCACHED. There may be additional improvements if this
> > could be mapped CACHED. However, the potential performance gain is much
> > less since these references all occur from node 0.
> >
> >
> >
> > Signed-off-by: Jack Steiner <[email protected]>
> >
> >
> > ---
> > arch/x86/include/asm/x86_init.h | 2 ++
> > arch/x86/kernel/apic/x2apic_uv_x.c | 6 ++++++
> > arch/x86/kernel/x86_init.c | 3 +++
> > drivers/acpi/osl.c | 12 +++++++++---
> > 4 files changed, 20 insertions(+), 3 deletions(-)
> >
> > Index: linux/arch/x86/include/asm/x86_init.h
> > ===================================================================
> > --- linux.orig/arch/x86/include/asm/x86_init.h 2010-07-21 16:53:30.226241589 -0500
> > +++ linux/arch/x86/include/asm/x86_init.h 2010-07-21 16:57:46.614872338 -0500
> > @@ -113,6 +113,7 @@ struct x86_cpuinit_ops {
> >
> > /**
> > * struct x86_platform_ops - platform specific runtime functions
> > + * @is_wb_acpi_tables E820 ACPI table are in WB memory
> > * @is_untracked_pat_range exclude from PAT logic
> > * @calibrate_tsc: calibrate TSC
> > * @get_wallclock: get time from HW clock like RTC etc.
> > @@ -120,6 +121,7 @@ struct x86_cpuinit_ops {
> > * @nmi_init enable NMI on cpus
> > */
> > struct x86_platform_ops {
> > + int (*is_wb_acpi_tables)(void);
> > int (*is_untracked_pat_range)(u64 start, u64 end);
> > unsigned long (*calibrate_tsc)(void);
> > unsigned long (*get_wallclock)(void);
> > Index: linux/arch/x86/kernel/apic/x2apic_uv_x.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:53:30.226241589 -0500
> > +++ linux/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:54:46.358866486 -0500
> > @@ -58,6 +58,11 @@ static int uv_is_untracked_pat_range(u64
> > return is_ISA_range(start, end) || is_GRU_range(start, end);
> > }
> >
> > +static int uv_is_wb_acpi_tables(void)
> > +{
> > + return 1;
> > +}
> > +
> > static int early_get_nodeid(void)
> > {
> > union uvh_node_id_u node_id;
> > @@ -81,6 +86,7 @@ static int __init uv_acpi_madt_oem_check
> > nodeid = early_get_nodeid();
> > x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
> > x86_platform.nmi_init = uv_nmi_init;
> > + x86_platform.is_wb_acpi_tables = uv_is_wb_acpi_tables;
> > if (!strcmp(oem_table_id, "UVL"))
> > uv_system_type = UV_LEGACY_APIC;
> > else if (!strcmp(oem_table_id, "UVX"))
> > Index: linux/arch/x86/kernel/x86_init.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/x86_init.c 2010-07-21 16:53:30.226241589 -0500
> > +++ linux/arch/x86/kernel/x86_init.c 2010-07-21 16:58:17.106240870 -0500
> > @@ -71,7 +71,10 @@ struct x86_cpuinit_ops x86_cpuinit __cpu
> >
> > static void default_nmi_init(void) { };
> >
> > +static int default_wb_acpi_tables(void) {return 0;}
> > +
> > struct x86_platform_ops x86_platform = {
> > + .is_wb_acpi_tables = default_wb_acpi_tables,
> > .is_untracked_pat_range = default_is_untracked_pat_range,
> > .calibrate_tsc = native_calibrate_tsc,
> > .get_wallclock = mach_get_cmos_time,
> > Index: linux/drivers/acpi/osl.c
> > ===================================================================
> > --- linux.orig/drivers/acpi/osl.c 2010-07-21 16:53:30.226241589 -0500
> > +++ linux/drivers/acpi/osl.c 2010-07-21 17:58:20.370414172 -0500
> > @@ -293,12 +293,18 @@ acpi_os_map_memory(acpi_physical_address
> > printk(KERN_ERR PREFIX "Cannot map memory that high\n");
> > return NULL;
> > }
> > - if (acpi_gbl_permanent_mmap)
> > + if (acpi_gbl_permanent_mmap) {
> > /*
> > * ioremap checks to ensure this is in reserved space
> > */
> > - return ioremap((unsigned long)phys, size);
> > - else
> > + if (x86_platform.is_wb_acpi_tables() &&
> > + (e820_all_mapped(phys, phys + size, E820_RAM) ||
> > + e820_all_mapped(phys, phys + size, E820_ACPI) ||
> > + e820_all_mapped(phys, phys + size, E820_NVS)))
> > + return ioremap_cache((unsigned long)phys, size);
> > + else
> > + return ioremap((unsigned long)phys, size);
> > + } else
> > return __acpi_map_table((unsigned long)phys, size);
> > }
> > EXPORT_SYMBOL_GPL(acpi_os_map_memory);
On Fri, Jul 23, 2010 at 09:23:01AM +0200, Ingo Molnar wrote:
>
> * ykzhao <[email protected]> wrote:
>
> > From the above description maybe the E820_ACPI region can be mapped as
> > cached. But this still depends on the BIOS. If the some shared data resides
> > in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> > region as cached again.
>
> I dont think we can do this safely unless some other OS (Windows) does it as
> well. (the reason is that if some BIOS messes this up then it will cause nasty
> bugs/problems only on Linux.)
>
> But the benefits of caching are very clear and well measured by Jack, so we
> want the feature. What we can do is to add an exception for 'known good' hw
> vendors - i.e. something quite close to Jack's RFC patch, but implemented a
> bit more cleanly:
>
> Exposing x86_platform and e820 details to generic ACPI code isnt particularly
> clean - there should be an ACPI accessor function for that or so: a new
> acpi_table_can_be_cached(table) function or so.
Agree. I am looking for the right set of abstractions for this.
>
> In fact since __acpi_map_table(addr,size) is defined by architectures already,
> this could be done purely within x86 code.
No. Unfortunately the function __acpi_map_table() is not called on the
path that does the permanent mappings. The code is (somewhat simplified):
drivers/acpi/osl.c:
acpi_os_map_memory(acpi_physical_address phys, acpi_size size)
{
if (acpi_gbl_permanent_mmap)
return ioremap((unsigned long)phys, size);
else
return __acpi_map_table((unsigned long)phys, size);
}
Early in boot, before "acpi_gbl_permanent_mmap" is set, __acpi_map_table()
is called to map tables. __acpi_map_table() calls early_ioremap() and all
early mappings are subsequently unmapped.
For the permanent mappings, we need a way to make the acpi code call
ioremap_cache() instead of ioremap() for all tables that are actually
in WB memory.
Timings made during boot show only a small benefit from having
__acpi_map_table() map tables cacheable. (I didn't check, but perhaps the
early mappings only check table IDs - not the full tables).
The performance benefit of WB is for the permanent mapping made after
acpi_gbl_permanent_mmap is set. For some reason, most of the time
consuming references occur after this point. In addition ALL offnode
references occur after this point.
--- jack
On Fri, Jul 23, 2010 at 10:26:05PM +0800, ykzhao wrote:
> On Fri, 2010-07-23 at 15:23 +0800, Ingo Molnar wrote:
> > * ykzhao <[email protected]> wrote:
> >
> > > From the above description maybe the E820_ACPI region can be mapped as
> > > cached. But this still depends on the BIOS. If the some shared data resides
> > > in the E820_ACPI region on some BIOS, maybe we can't map the E820_ACPI
> > > region as cached again.
> >
> > I dont think we can do this safely unless some other OS (Windows) does it as
> > well. (the reason is that if some BIOS messes this up then it will cause nasty
> > bugs/problems only on Linux.)
> >
>
> Yes. We can't map the corresponding ACPI region as cached under the
> following case:
> >No E820_ACPI region is reported by BIOS. In such case the ACPI
> table resides in the NVS region
>
> But if the BIOS can follow the spec and report the separated
> E820_ACPI/E820_NVS region, maybe we can give a try to map the E820_ACPI
> region as cached type. For example: the server machine.(The spec
> describes the E820_ACPI region as reclaimed memory, which means that it
> can be managed by OS after ACPI table is loaded).
This would work for the problem I'm trying to solve. On our platform, the
majority of the ACPI tables are in E820_ACPI regions, and mapping them as
cached solves most of the performance problem. However, even the E820_NVS
regions are mapped as WB by the BIOS on our platform. What is the advantage
of having the OS map them as UC, as long as the mapping choice is platform
specific?
>
> Can we add a boot option to control whether the E820_ACPI region can be
> mapped as cached type?
In general, I don't like boot options for things like this. They are prone
to misuse. Mapping as WB really is not a user option - it is a platform-specific
attribute & should be controlled by the OS or BIOS and transparent to the user.
In the patch I posted, I had the mapping controlled by platform specific
code. I originally had it as a x86_platform callout but other schemes may
be more appropriate.
Would it be better to add an acpi function that is called directly from the
UV initialization code. The new acpi function would set a flag that controls
the mapping done later by acpi_os_map_memory()?
>
> > But the benefits of caching are very clear and well measured by Jack, so we
> > want the feature. What we can do is to add an exception for 'known good' hw
> > vendors - i.e. something quite close to Jack's RFC patch, but implemented a
> > bit more cleanly:
> >
> > Exposing x86_platform and e820 details to generic ACPI code isnt particularly
> > clean - there should be an ACPI accessor function for that or so: a new
> > acpi_table_can_be_cached(table) function or so.
>
> Agree. The function of acpi_os_map_memory will also be used on IA64
> platform. It seems more reasonable to use a wrapper function to check
> whether the corresponding region can be mapped as cached type.
> >
> > In fact since __acpi_map_table(addr,size) is defined by architectures already,
> > this could be done purely within x86 code.
> >
> > Thanks,
> >
> > Ingo
On Fri, Jul 23, 2010 at 09:14:50PM -0300, Henrique de Moraes Holschuh wrote:
> On Thu, 22 Jul 2010, Jack Steiner wrote:
> > Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> > part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> > are located in RAM memory that is physically attached to node 0.
> >
> > User programs (ex., acpidump) read the ACPI tables by mapping them thru
> > /dev/mem. Although mmap tries to map the tables as CACHED, there are
> > existing kernel UNCACHED mapping that conflict and the tables end up as
> > being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> > remap_pfn_range()).
>
> Well, as it was raised in this thread, ACPI tables are likely to be near RAM
> regions used for IPC with the firmware or SMBIOS, and we have no idea of the
> kind of crap that could happen if we enable caching on those areas.
I'm certainly not suggesting that ALL platforms map ACPI tables as WB. That
would be a disaster. Only platforms where the BIOS specifically reports that
the tables are in WB memory should map them as WB.
>
> OTOH, we *know* of systems that force us to copy the ACPI tables to regular
> RAM, otherwise, the utterly broken BIOS corrupts the ACPI tables after the
> kernel has loaded.
>
> Couldn't we simply always copy all tables to regular RAM and mark THAT as
> cacheable (since there will be no IPC regions in it)? For the tables that
> are only used once, we can free the RAM later.
I'm really out of the area of the kernel that I understand :-)
What are the implications of copying the ACPI tables to another location?
Does BIOS ever make changes to their tables that would make it necessary
to update the OS copy?
While copying tables might sound attractive, it seems like there is a
ripple effect. For example, acpidump does the following:
- opens /sys/firmware/efi/systab
- reads the physical address of the ACPI20 table
(strace)
open("/sys/firmware/efi/systab", O_RDONLY) = 3
read(3, "MPS=0x0\nACPI20=0x78d76014\nACPI=0"..., 4096) = 83
(contents)
ACPI20=0x78d76014
SMBIOS=0x78c33000
- opens /dev/mem
(strace)
open("/dev/mem", O_RDONLY) = 4
mmap(NULL, 396, PROT_READ, MAP_PRIVATE, 4, 0x78d76000)
- mmaps the physical address of the ACPI20 table
- reads the table
If the ACPI tables are copied to a different location, I don't see a
clean/simple way to make this work.
It seems like the best approach would be to provide platform-specific hooks
in ACPI to map the tables as WB in the first place - IF the platform supports it.
--- jack
On 07/23/2010 07:26 AM, ykzhao wrote:
>
> Yes. We can't map the corresponding ACPI region as cached under the
> following case:
> >No E820_ACPI region is reported by BIOS. In such case the ACPI
> table resides in the NVS region
>
Why could we not map the NVS region as cached? That doesn't seem to
make sense. In practice, at least, on all BIOSes I've seen the NVS
region is just another hunk of RAM.
Sample from a real system:
BIOS-e820: 000000007d6b0000 - 000000007d6cc000 (ACPI data)
BIOS-e820: 000000007d6cc000 - 000000007d700000 (ACPI NVS)
Both are clearly RAM.
If you're not talking about the e820 NVS region, that might be a
different thing, but for the ROM region in the legacy area, the fixed
MTRRs are often set up to allow caching, and we should be able to map
them cacheable, e.g. on this system:
00000-9FFFF write-back
A0000-BFFFF uncachable
C0000-CFFFF write-protect
D0000-DFFFF uncachable
E0000-FFFFF write-protect
Clearly cacheable.
-hpa
On Thu, Jul 22, 2010 at 10:22:20AM -0500, Jack Steiner wrote:
> I'd like feedback on the following performance problem &
> suggestions for a proper solution.
I did some more digging & found something I should have seen earlier.
On EFI-enabled systems (like UV), the ACPI tables are already mapped as
WB memory. This is done in the EFI function efi_enter_virtual_mode().
The E820_ACPI & E820_NVS regions will be mapped as WB memory if the BIOS
sets the WB attribute in the EFI memmap entry for the chunk of memory.
The ACPI code in acpi_os_map_memory() is not currently aware of the EFI mapping
& currently maps the memory as UC. This seems like a bug.
In order to prevent attribute aliasing, I think the ACPI mapping needs to
be consistent with the EFI mapping. Otherwise, we will have the OS
referencing the memory as UC at the same time BIOS is referencing it as WB.
>
>
> Large SGI UV systems (3072p, 5TB) take a long time to boot. A significant
> part of the boot time is scanning ACPI tables. ACPI tables on UV systems
> are located in RAM memory that is physically attached to node 0.
>
> User programs (ex., acpidump) read the ACPI tables by mapping them thru
> /dev/mem. Although mmap tries to map the tables as CACHED, there are
> existing kernel UNCACHED mapping that conflict and the tables end up as
> being mapped UNCACHED. (See the call to track_pfn_vma_new() in
> remap_pfn_range()).
>
> Much of the access is to small fields (bytes (checksums), shorts, etc).
> Late in boot, there is significant scanning of the ACPI tables that take
> place from nodes other than zero. Since the tables are not cached, each
> reference accesses physical memory that is attached to remote nodes. These
> memory requests must cross the numalink interconnect which adds several
> hundred nsec to each access. This slows the boot process. Access from
> node 0, although faster, is still very slow.
>
>
>
> The following experimental patch changes the kernel mapping for ACPI tables
> to CACHED. This eliminates the page attribute conflict & allows users to map
> the tables CACHEABLE. This significantly speeds up boot:
>
> 38 minutes without the patch
> 27 minutes with the patch
> ~30% improvement
>
> Time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch
>
>
> I don't know if the patch in its current form is the correct solution. I'm
> interested in feedback on how this should be solved. I expect there
> are issues on other platforms so for now, the patch uses x86_platform_ops
> to change mappings only on UV platforms (I'm paranoid :-).
>
> I also need to experiment with early_ioremap'ing of the ACPI tables. I suspect
> this is also mapped UNCACHED. There may be additional improvements if this
> could be mapped CACHED. However, the potential performance gain is much
> less since these references all occur from node 0.
>
>
>
> Signed-off-by: Jack Steiner <[email protected]>
>
>
> ---
> arch/x86/include/asm/x86_init.h | 2 ++
> arch/x86/kernel/apic/x2apic_uv_x.c | 6 ++++++
> arch/x86/kernel/x86_init.c | 3 +++
> drivers/acpi/osl.c | 12 +++++++++---
> 4 files changed, 20 insertions(+), 3 deletions(-)
>
> Index: linux/arch/x86/include/asm/x86_init.h
> ===================================================================
> --- linux.orig/arch/x86/include/asm/x86_init.h 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/include/asm/x86_init.h 2010-07-21 16:57:46.614872338 -0500
> @@ -113,6 +113,7 @@ struct x86_cpuinit_ops {
>
> /**
> * struct x86_platform_ops - platform specific runtime functions
> + * @is_wb_acpi_tables E820 ACPI table are in WB memory
> * @is_untracked_pat_range exclude from PAT logic
> * @calibrate_tsc: calibrate TSC
> * @get_wallclock: get time from HW clock like RTC etc.
> @@ -120,6 +121,7 @@ struct x86_cpuinit_ops {
> * @nmi_init enable NMI on cpus
> */
> struct x86_platform_ops {
> + int (*is_wb_acpi_tables)(void);
> int (*is_untracked_pat_range)(u64 start, u64 end);
> unsigned long (*calibrate_tsc)(void);
> unsigned long (*get_wallclock)(void);
> Index: linux/arch/x86/kernel/apic/x2apic_uv_x.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/apic/x2apic_uv_x.c 2010-07-21 16:54:46.358866486 -0500
> @@ -58,6 +58,11 @@ static int uv_is_untracked_pat_range(u64
> return is_ISA_range(start, end) || is_GRU_range(start, end);
> }
>
> +static int uv_is_wb_acpi_tables(void)
> +{
> + return 1;
> +}
> +
> static int early_get_nodeid(void)
> {
> union uvh_node_id_u node_id;
> @@ -81,6 +86,7 @@ static int __init uv_acpi_madt_oem_check
> nodeid = early_get_nodeid();
> x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
> x86_platform.nmi_init = uv_nmi_init;
> + x86_platform.is_wb_acpi_tables = uv_is_wb_acpi_tables;
> if (!strcmp(oem_table_id, "UVL"))
> uv_system_type = UV_LEGACY_APIC;
> else if (!strcmp(oem_table_id, "UVX"))
> Index: linux/arch/x86/kernel/x86_init.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/x86_init.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/arch/x86/kernel/x86_init.c 2010-07-21 16:58:17.106240870 -0500
> @@ -71,7 +71,10 @@ struct x86_cpuinit_ops x86_cpuinit __cpu
>
> static void default_nmi_init(void) { };
>
> +static int default_wb_acpi_tables(void) {return 0;}
> +
> struct x86_platform_ops x86_platform = {
> + .is_wb_acpi_tables = default_wb_acpi_tables,
> .is_untracked_pat_range = default_is_untracked_pat_range,
> .calibrate_tsc = native_calibrate_tsc,
> .get_wallclock = mach_get_cmos_time,
> Index: linux/drivers/acpi/osl.c
> ===================================================================
> --- linux.orig/drivers/acpi/osl.c 2010-07-21 16:53:30.226241589 -0500
> +++ linux/drivers/acpi/osl.c 2010-07-21 17:58:20.370414172 -0500
> @@ -293,12 +293,18 @@ acpi_os_map_memory(acpi_physical_address
> printk(KERN_ERR PREFIX "Cannot map memory that high\n");
> return NULL;
> }
> - if (acpi_gbl_permanent_mmap)
> + if (acpi_gbl_permanent_mmap) {
> /*
> * ioremap checks to ensure this is in reserved space
> */
> - return ioremap((unsigned long)phys, size);
> - else
> + if (x86_platform.is_wb_acpi_tables() &&
> + (e820_all_mapped(phys, phys + size, E820_RAM) ||
> + e820_all_mapped(phys, phys + size, E820_ACPI) ||
> + e820_all_mapped(phys, phys + size, E820_NVS)))
> + return ioremap_cache((unsigned long)phys, size);
> + else
> + return ioremap((unsigned long)phys, size);
> + } else
> return __acpi_map_table((unsigned long)phys, size);
> }
> EXPORT_SYMBOL_GPL(acpi_os_map_memory);
On Tue, Aug 17, 2010 at 7:49 AM, Jack Steiner <[email protected]> wrote:
>
> I'm certainly not suggesting that ALL platforms map ACPI tables as WB. That
> would be a disaster. Only platforms where the BIOS specifically reports that
> the tables are in WB memory should map them as WB.
Hmm. I actually suspect we _should_ map ACPI tables as WB all the
time. I'm not actually seeing any reason why we should ever map them
uncacheable, because as far as I can tell there are exactly two
realistic situations:
- they are in RAM to begin with. I would pretty much expect this to
be true 99.9% of the time. Everybody uses compressed BIOS flash images
and uncompresses the image into RAM anyway, because (a) bigger flash
chips are another 25¢, so nobody can afford that on a PC motherboard
(b) the flash interface is ridiculously slow anyway, and you don't
want to execute your BIOS off it, and (c) the BIOS almost always
actually _changes_ the tables depending on various BIOS settings, so
mapping the tables as anything but RAM wouldn't work _anyway_.
- even if they aren't in RAM, it's likely fine to let the dang things
be in the cache. On at least some platforms (old ones), if they aren't
in RAM, the system logic will override any MTRR/pageattribute issues
anyway.
So I think we should just map those things WB by default. Maybe with
some way to override it (possibly automatically). It sounds like it's
a big enough performance issue even on smaller systems (0.1 seconds is
quite a bit of the boot time on some systems).
Linus
On 07/23/2010 05:14 PM, Henrique de Moraes Holschuh wrote:
>
> Well, as it was raised in this thread, ACPI tables are likely to be near RAM
> regions used for IPC with the firmware or SMBIOS, and we have no idea of the
> kind of crap that could happen if we enable caching on those areas.
>
I'm really not sure I buy that argument -- at least not on x86: if that
is the case, then when PAT is off (and we fall down to MTRR-only
control) then we'd have the same failures. If we mark them cacheable
and the MTRRs say uncachable, then we will *still* not cache them (since
MTRR UC overrides PAT WB -- in fact "PAT off" really just means ALL the
pagetables are marked WB.)
In that sense it is probably *safer* to map them WB, since the firmware
if it uses page tables at all is extremely likely to have all the cache
control bits at zero (meaning WB) -- and if it doesn't use page tables,
they are functionally zero by default (MTRR control only.)
So I think it'd be safer to map them cacheable -- regardless of if we
want to copy them to RAM or not.
-hpa
Map ACPI tables as WB on x86_64. No substantive changes to IA64.
Signed-off-by: Jack Steiner <[email protected]>
---
V2 - Change the patch to unconditionally map ACPI tables as WB on x86_64.
I'm still somewhat paranoid about this because of the potential impact on
some platforms with weird BIOSes. However, note that on EFI-enabled
systems (like UV), the ACPI tables are already mapped as WB memory.
This is done in the EFI function efi_enter_virtual_mode().
The ACPI code in acpi_os_map_memory() is not currently aware of the
EFI mapping & currently maps the memory as UC. This seems like a bug.
arch/ia64/kernel/acpi.c | 5 +++++
arch/x86/kernel/acpi/boot.c | 14 ++++++++++++++
drivers/acpi/osl.c | 2 +-
include/linux/acpi.h | 1 +
4 files changed, 21 insertions(+), 1 deletion(-)
Index: linux/arch/ia64/kernel/acpi.c
===================================================================
--- linux.orig/arch/ia64/kernel/acpi.c 2010-08-26 09:32:46.000000000 -0500
+++ linux/arch/ia64/kernel/acpi.c 2010-08-26 11:51:05.544732478 -0500
@@ -172,6 +172,11 @@ char *__init __acpi_map_table(unsigned l
return __va(phys_addr);
}
+char *__init __acpi_map_table_permanent(unsigned long phys_addr, unsigned long size)
+{
+ return ioremap(phys_addr, size);
+}
+
void __init __acpi_unmap_table(char *map, unsigned long size)
{
}
Index: linux/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux.orig/arch/x86/kernel/acpi/boot.c 2010-08-26 09:32:48.000000000 -0500
+++ linux/arch/x86/kernel/acpi/boot.c 2010-08-26 11:57:27.408724846 -0500
@@ -167,6 +167,20 @@ void __init __acpi_unmap_table(char *map
early_iounmap(map, size);
}
+/*
+ * Permanently map memory for ACPI. Map ACPI tables and RAM as WB,
+ * other regions as UC.
+ */
+char *__init __acpi_map_table_permanent(unsigned long phys, unsigned long size)
+{
+ if (e820_all_mapped(phys, phys + size, E820_RAM) ||
+ e820_all_mapped(phys, phys + size, E820_ACPI) ||
+ e820_all_mapped(phys, phys + size, E820_NVS))
+ return ioremap_cache((unsigned long)phys, size);
+ else
+ return ioremap(phys, size);
+}
+
#ifdef CONFIG_X86_LOCAL_APIC
static int __init acpi_parse_madt(struct acpi_table_header *table)
{
Index: linux/drivers/acpi/osl.c
===================================================================
--- linux.orig/drivers/acpi/osl.c 2010-08-26 09:32:48.276653117 -0500
+++ linux/drivers/acpi/osl.c 2010-08-26 11:45:33.488606489 -0500
@@ -271,7 +271,7 @@ acpi_os_map_memory(acpi_physical_address
/*
* ioremap checks to ensure this is in reserved space
*/
- return ioremap((unsigned long)phys, size);
+ return __acpi_map_table_permanent((unsigned long)phys, size);
else
return __acpi_map_table((unsigned long)phys, size);
}
Index: linux/include/linux/acpi.h
===================================================================
--- linux.orig/include/linux/acpi.h 2010-08-26 09:32:51.000000000 -0500
+++ linux/include/linux/acpi.h 2010-08-26 11:41:14.484709188 -0500
@@ -77,6 +77,7 @@ typedef int (*acpi_table_handler) (struc
typedef int (*acpi_table_entry_handler) (struct acpi_subtable_header *header, const unsigned long end);
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
+char * __acpi_map_table_permanent (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
int acpi_boot_init (void);
On Tue, 17 Aug 2010, Jack Steiner wrote:
> On EFI-enabled systems (like UV), the ACPI tables are already mapped as
> WB memory. This is done in the EFI function efi_enter_virtual_mode().
>
> The E820_ACPI & E820_NVS regions will be mapped as WB memory if the BIOS
> sets the WB attribute in the EFI memmap entry for the chunk of memory.
>
> The ACPI code in acpi_os_map_memory() is not currently aware of the EFI mapping
> & currently maps the memory as UC. This seems like a bug.
>
> In order to prevent attribute aliasing, I think the ACPI mapping needs to
> be consistent with the EFI mapping. Otherwise, we will have the OS
> referencing the memory as UC at the same time BIOS is referencing it as WB.
When would the firmware touch those tables after booting the OS?
We once found a Toshiba laptop where the BIOS
scribbles on the DSDT at run-time.
We presumed from this that Windows
must snapshot the tables at boot and run from a copy.
We decided not to copy the tables, but to use them in-place.
So we added a check for run-time corruption of the tables
and we've seen it fire only on that one Toshiba box
(which now boots with "acpi=copy_dsdt").
thanks,
Len Brown, Intel Open Source Technology Center
On 08/26/2010 10:17 AM, Jack Steiner wrote:
> Map ACPI tables as WB on x86_64. No substantive changes to IA64.
>
>
> Signed-off-by: Jack Steiner <[email protected]>
>
> ---
> V2 - Change the patch to unconditionally map ACPI tables as WB on x86_64.
> I'm still somewhat paranoid about this because of the potential impact on
> some platforms with weird BIOSes. However, note that on EFI-enabled
> systems (like UV), the ACPI tables are already mapped as WB memory.
> This is done in the EFI function efi_enter_virtual_mode().
> The ACPI code in acpi_os_map_memory() is not currently aware of the
> EFI mapping & currently maps the memory as UC. This seems like a bug.
>
I really think that it's all a consequence of the fact that ioremap()
became implicitly UC when we added PAT support (being the more
conservative choice.)
Len, want to take this one or should I?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Aug 26, 2010 at 11:08:02AM -0700, H. Peter Anvin wrote:
> On 08/26/2010 10:17 AM, Jack Steiner wrote:
> > Map ACPI tables as WB on x86_64. No substantive changes to IA64.
> >
> >
> > Signed-off-by: Jack Steiner <[email protected]>
> >
> > ---
> > V2 - Change the patch to unconditionally map ACPI tables as WB on x86_64.
> > I'm still somewhat paranoid about this because of the potential impact on
> > some platforms with weird BIOSes. However, note that on EFI-enabled
> > systems (like UV), the ACPI tables are already mapped as WB memory.
> > This is done in the EFI function efi_enter_virtual_mode().
> > The ACPI code in acpi_os_map_memory() is not currently aware of the
> > EFI mapping & currently maps the memory as UC. This seems like a bug.
> >
>
> I really think that it's all a consequence of the fact that ioremap()
> became implicitly UC when we added PAT support (being the more
> conservative choice.)
>
> Len, want to take this one or should I?
>
> -hpa
What is the upstream status of this patch? See the LKML discussion at:
http://marc.info/?l=linux-kernel&m=128206079905826&w=2
http://marc.info/?l=linux-acpi&m=128284304032481&w=2
We would like to get this patch into the distros but need upstream acceptance first.
This patch substantially reduces the time to run ACPIDUMP on a large system:
527 seconds without the patch
8 seconds with the patch
Is there something I should be doing? I can resend the patch if needed.
--- jack
On 12/08/2010 01:22 PM, Jack Steiner wrote:
>
> This patch substantially reduces the time to run ACPIDUMP on a large system:
> 527 seconds without the patch
> 8 seconds with the patch
>
This is probably the worst possible motivation you can give here.
Bootup time is much more of an issue.
However, this really needs Len's ack.
-hpa
On Wed, Dec 08, 2010 at 05:27:39PM -0800, H. Peter Anvin wrote:
> On 12/08/2010 01:22 PM, Jack Steiner wrote:
> >
> > This patch substantially reduces the time to run ACPIDUMP on a large system:
> > 527 seconds without the patch
> > 8 seconds with the patch
> >
>
> This is probably the worst possible motivation you can give here.
> Bootup time is much more of an issue.
>
> However, this really needs Len's ack.
Sorry - should have included more info from the original posting.
The boot time speedup is the most critical. From the original mail:
...
The following experimental patch changes the kernel mapping for ACPI tables
to CACHED. This eliminates the page attribute conflict & allows users to map
the tables CACHEABLE. This significantly speeds up boot:
38 minutes without the patch
27 minutes with the patch
~30% improvement
...
(Since the original posting, we've made additional reductions in
boot times. The absolute improvement from this patch is still the same
but the percentage improvement is now larger).
This is on a large SGI system. Boot time is reduced on smaller systems
but obviously the improvement is much less.
See the original mail for more details.
Len - the original patch had the cached/uncached mapping controlled by a
platform attribute (I'm paranoid). However, the community consensus was
that cached mappings were safe for all platforms.
--- jack
Jack,
I agree that we should try to map cached always,
and that we should try do the same thing on all platforms.
I favor continuing to map the tables in-place.
Yes, Toshiba has a quirk and we copy the tables there.
But frankly, we have no idea what the heck Toshiba is doing
and we don't want every platform to pay the price because
of some vendor specific evil.
Please send an upstream ready version of this RFC
(checkpatch clean), labeled with "PATCH".
thanks,
Len Brown, Intel Open Source Technology Center
ps. acpidump can be updated to use /sys/firmware/acpi/tables
to get rid of the mmap stuff. That is someplace on our todo list...