This series adds L1 and L2 caches support for error detection and
correction for Amazon's Annapurna Labs SoCs.
Alpine SoCs supports L1 and L2 single bit correction and two bits detection
capability based on ARM implementation.
Changes since v2:
-----------------
- Use BIT for single bit instead of GENMASK
- Use BIT_ULL and GENMASK_ULL for 64bit vector
- Fix the mod_name/ctrl_name.
Changes since v1:
-----------------
- Split into two drivers
- Get cpu-mask according to l2-cache handler from devicetree
- Remove parameter casting
- Use GENMASK() in bit mask
- Use FIELD_GET()
- Update define description PLRU_RAM -> PF_RAM
- Use sys_reg() and read_sysreg_s()
- Remove all write/read wrappers
- Check fatal field to set if the error correctable or not
- Remove un-relevant information from error prints.
- Update smp_call_function_single() call function to wait
- remove usage of get_online_cpus/put_online_cpus
- Use on_each_cpu() and smp_call_function_any() instead of loop with for_each_cpu.
- use buffer for error prints and pass to edac API
- Remove edac_op_state set
- Add for loop to report on repeated errors of the same type
- Fix error name of the TLB to be L2_TLB as written in ARM TRM
- Minor change in Kconfig
- Minor changes in commit message
Hanna Hawa (4):
dt-bindings: EDAC: Add Amazon's Annapurna Labs L1 EDAC
edac: Add support for Amazon's Annapurna Labs L1 EDAC
dt-bindings: EDAC: Add Amazon's Annapurna Labs L2 EDAC
edac: Add support for Amazon's Annapurna Labs L2 EDAC
.../devicetree/bindings/edac/amazon,al-l1-edac.txt | 14 ++
.../devicetree/bindings/edac/amazon,al-l2-edac.txt | 20 +++
MAINTAINERS | 12 ++
drivers/edac/Kconfig | 16 ++
drivers/edac/Makefile | 2 +
drivers/edac/al_l1_edac.c | 156 +++++++++++++++++
drivers/edac/al_l2_edac.c | 187 +++++++++++++++++++++
7 files changed, 407 insertions(+)
create mode 100644 Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
create mode 100644 Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt
create mode 100644 drivers/edac/al_l1_edac.c
create mode 100644 drivers/edac/al_l2_edac.c
--
2.7.4
Document Amazon's Annapurna Labs L1 EDAC SoC binding.
Signed-off-by: Hanna Hawa <[email protected]>
---
.../devicetree/bindings/edac/amazon,al-l1-edac.txt | 14 ++++++++++++++
1 file changed, 14 insertions(+)
create mode 100644 Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
diff --git a/Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt b/Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
new file mode 100644
index 0000000..2ae8370
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
@@ -0,0 +1,14 @@
+* Amazon's Annapurna Labs L1 EDAC
+
+Amazon's Annapurna Labs SoCs supports L1 single bit correction and
+two bits detection capability based on ARM implementation.
+
+Required properties:
+- compatible:
+ should be "amazon,al-l1-edac".
+
+Example:
+
+ al_l1_edac {
+ compatible = "amazon,al-l1-edac";
+ };
--
2.7.4
Document Amazon's Annapurna Labs L2 EDAC SoC binding.
Signed-off-by: Hanna Hawa <[email protected]>
---
.../devicetree/bindings/edac/amazon,al-l2-edac.txt | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
create mode 100644 Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt
diff --git a/Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt b/Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt
new file mode 100644
index 0000000..7b0b734
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt
@@ -0,0 +1,20 @@
+* Amazon's Annapurna Labs L2 EDAC
+
+Amazon's Annapurna Labs SoCs supports L2 single bit correction and
+two bits detection capability based on ARM implementation.
+
+Required properties:
+- compatible:
+ should be "amazon,al-l2-edac".
+- l2-cache:
+ Phandle to L2 cache handler.
+ This property is used to compare with the CPU node property
+ 'next-level-cache' to create cpu-mask with all CPUs that
+ share same L2 cache.
+
+Example:
+
+ al_l2_edac {
+ compatible = "amazon,al-l2-edac";
+ l2-cache = <&cluster0_l2>;
+ };
--
2.7.4
Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
report L2 errors.
Signed-off-by: Hanna Hawa <[email protected]>
---
MAINTAINERS | 6 ++
drivers/edac/Kconfig | 8 ++
drivers/edac/Makefile | 1 +
drivers/edac/al_l2_edac.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 202 insertions(+)
create mode 100644 drivers/edac/al_l2_edac.c
diff --git a/MAINTAINERS b/MAINTAINERS
index fd29ea6..a6dcf3d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -749,6 +749,12 @@ S: Maintained
F: drivers/edac/al_l1_edac.c
F: Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
+AMAZON ANNAPURNA LABS L2 EDAC
+M: Hanna Hawa <[email protected]>
+S: Maintained
+F: drivers/edac/al_l2_edac.c
+F: Documentation/devicetree/bindings/edac/amazon,al-l2-edac.txt
+
AMAZON ANNAPURNA LABS THERMAL MMIO DRIVER
M: Talel Shenhar <[email protected]>
S: Maintained
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 58b92bc..8bbb745 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -82,6 +82,14 @@ config EDAC_AL_L1
for Amazon's Annapurna Labs SoCs.
This driver detects errors of L1 caches.
+config EDAC_AL_L2
+ bool "Amazon's Annapurna Labs L2 EDAC"
+ depends on ARCH_ALPINE
+ help
+ Support for L2 error detection and correction
+ for Amazon's Annapurna Labs SoCs.
+ This driver detects errors of L2 caches.
+
config EDAC_AMD64
tristate "AMD64 (Opteron, Athlon64)"
depends on AMD_NB && EDAC_DECODE_MCE
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index caa2dc9..60a6b8b 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -23,6 +23,7 @@ edac_mce_amd-y := mce_amd.o
obj-$(CONFIG_EDAC_DECODE_MCE) += edac_mce_amd.o
obj-$(CONFIG_EDAC_AL_L1) += al_l1_edac.o
+obj-$(CONFIG_EDAC_AL_L2) += al_l2_edac.o
obj-$(CONFIG_EDAC_AMD76X) += amd76x_edac.o
obj-$(CONFIG_EDAC_CPC925) += cpc925_edac.o
obj-$(CONFIG_EDAC_I5000) += i5000_edac.o
diff --git a/drivers/edac/al_l2_edac.c b/drivers/edac/al_l2_edac.c
new file mode 100644
index 0000000..bae1ea9
--- /dev/null
+++ b/drivers/edac/al_l2_edac.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/of.h>
+
+#include "edac_device.h"
+#include "edac_module.h"
+
+#define DRV_NAME "al_l2_edac"
+
+/* Same bit assignments of L2MERRSR_EL1 in ARM CA57/CA72 */
+#define ARM_CA57_L2MERRSR_EL1 sys_reg(3, 1, 15, 2, 3)
+#define ARM_CA57_L2MERRSR_RAMID GENMASK(30, 24)
+#define ARM_CA57_L2_TAG_RAM 0x10
+#define ARM_CA57_L2_DATA_RAM 0x11
+#define ARM_CA57_L2_SNOOP_RAM 0x12
+#define ARM_CA57_L2_DIRTY_RAM 0x14
+#define ARM_CA57_L2_INC_PF_RAM 0x18
+#define ARM_CA57_L2MERRSR_VALID BIT(31)
+#define ARM_CA57_L2MERRSR_REPEAT GENMASK_ULL(39, 32)
+#define ARM_CA57_L2MERRSR_OTHER GENMASK_ULL(47, 40)
+#define ARM_CA57_L2MERRSR_FATAL BIT_ULL(63)
+
+#define AL_L2_EDAC_MSG_MAX 256
+
+struct al_l2_edac {
+ cpumask_t cluster_cpus;
+};
+
+static void al_l2_edac_l2merrsr(void *arg)
+{
+ struct edac_device_ctl_info *edac_dev = arg;
+ int cpu, i;
+ u32 ramid, repeat, other, fatal;
+ u64 val = read_sysreg_s(ARM_CA57_L2MERRSR_EL1);
+ char msg[AL_L2_EDAC_MSG_MAX];
+ int space, count;
+ char *p;
+
+ if (!(FIELD_GET(ARM_CA57_L2MERRSR_VALID, val)))
+ return;
+
+ cpu = smp_processor_id();
+ ramid = FIELD_GET(ARM_CA57_L2MERRSR_RAMID, val);
+ repeat = FIELD_GET(ARM_CA57_L2MERRSR_REPEAT, val);
+ other = FIELD_GET(ARM_CA57_L2MERRSR_OTHER, val);
+ fatal = FIELD_GET(ARM_CA57_L2MERRSR_FATAL, val);
+
+ space = sizeof(msg);
+ p = msg;
+ count = snprintf(p, space, "CPU%d L2 %serror detected", cpu,
+ (fatal) ? "Fatal " : "");
+ p += count;
+ space -= count;
+
+ switch (ramid) {
+ case ARM_CA57_L2_TAG_RAM:
+ count = snprintf(p, space, " RAMID='L2 Tag RAM'");
+ break;
+ case ARM_CA57_L2_DATA_RAM:
+ count = snprintf(p, space, " RAMID='L2 Data RAM'");
+ break;
+ case ARM_CA57_L2_SNOOP_RAM:
+ count = snprintf(p, space, " RAMID='L2 Snoop RAM'");
+ break;
+ case ARM_CA57_L2_DIRTY_RAM:
+ count = snprintf(p, space, " RAMID='L2 Dirty RAM'");
+ break;
+ case ARM_CA57_L2_INC_PF_RAM:
+ count = snprintf(p, space, " RAMID='L2 internal metadat'");
+ break;
+ default:
+ count = snprintf(p, space, " RAMID='unknown'");
+ break;
+ }
+
+ p += count;
+ space -= count;
+
+ count = snprintf(p, space,
+ " repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)",
+ repeat, other, val);
+
+ for (i = 0; i < repeat; i++) {
+ if (fatal)
+ edac_device_handle_ue(edac_dev, 0, 0, msg);
+ else
+ edac_device_handle_ce(edac_dev, 0, 0, msg);
+ }
+
+ write_sysreg_s(0, ARM_CA57_L2MERRSR_EL1);
+}
+
+static void al_l2_edac_check(struct edac_device_ctl_info *edac_dev)
+{
+ struct al_l2_edac *al_l2 = edac_dev->pvt_info;
+
+ smp_call_function_any(&al_l2->cluster_cpus, al_l2_edac_l2merrsr,
+ edac_dev, 1);
+}
+
+static int al_l2_edac_probe(struct platform_device *pdev)
+{
+ struct edac_device_ctl_info *edac_dev;
+ struct al_l2_edac *al_l2;
+ struct device *dev = &pdev->dev;
+ int ret, i;
+
+ edac_dev = edac_device_alloc_ctl_info(sizeof(*al_l2),
+ (char *)dev_name(dev), 1, "L", 1,
+ 2, NULL, 0,
+ edac_device_alloc_index());
+ if (IS_ERR(edac_dev))
+ return -ENOMEM;
+
+ al_l2 = edac_dev->pvt_info;
+ edac_dev->edac_check = al_l2_edac_check;
+ edac_dev->dev = dev;
+ edac_dev->mod_name = DRV_NAME;
+ edac_dev->dev_name = dev_name(dev);
+ edac_dev->ctl_name = "L2 cache";
+ platform_set_drvdata(pdev, edac_dev);
+
+ for_each_online_cpu(i) {
+ struct device_node *cpu;
+ struct device_node *cpu_cache, *l2_cache;
+
+ cpu = of_get_cpu_node(i, NULL);
+ cpu_cache = of_find_next_cache_node(cpu);
+ l2_cache = of_parse_phandle(dev->of_node, "l2-cache", 0);
+
+ if (cpu_cache == l2_cache)
+ cpumask_set_cpu(i, &al_l2->cluster_cpus);
+ }
+
+ if (cpumask_empty(&al_l2->cluster_cpus)) {
+ dev_err(dev, "CPU mask is empty for this L2 cache\n");
+ ret = -EINVAL;
+ goto err;
+ }
+
+ ret = edac_device_add_device(edac_dev);
+ if (ret) {
+ dev_err(dev, "Failed to add L2 edac device\n");
+ goto err;
+ }
+
+ return 0;
+
+err:
+ edac_device_free_ctl_info(edac_dev);
+
+ return ret;
+}
+
+static int al_l2_edac_remove(struct platform_device *pdev)
+{
+ struct edac_device_ctl_info *edac_dev = platform_get_drvdata(pdev);
+
+ edac_device_del_device(edac_dev->dev);
+ edac_device_free_ctl_info(edac_dev);
+
+ return 0;
+}
+
+static const struct of_device_id al_l2_edac_of_match[] = {
+ { .compatible = "amazon,al-l2-edac" },
+ {}
+};
+MODULE_DEVICE_TABLE(of, al_l2_edac_of_match);
+
+static struct platform_driver al_l2_edac_driver = {
+ .probe = al_l2_edac_probe,
+ .remove = al_l2_edac_remove,
+ .driver = {
+ .name = DRV_NAME,
+ .of_match_table = al_l2_edac_of_match,
+ },
+};
+module_platform_driver(al_l2_edac_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Hanna Hawa <[email protected]>");
+MODULE_DESCRIPTION("Amazon's Annapurna Lab's L2 EDAC Driver");
--
2.7.4
Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
report L1 errors.
Signed-off-by: Hanna Hawa <[email protected]>
---
MAINTAINERS | 6 ++
drivers/edac/Kconfig | 8 +++
drivers/edac/Makefile | 1 +
drivers/edac/al_l1_edac.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 171 insertions(+)
create mode 100644 drivers/edac/al_l1_edac.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 77eae44..fd29ea6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -743,6 +743,12 @@ F: drivers/tty/serial/altera_jtaguart.c
F: include/linux/altera_uart.h
F: include/linux/altera_jtaguart.h
+AMAZON ANNAPURNA LABS L1 EDAC
+M: Hanna Hawa <[email protected]>
+S: Maintained
+F: drivers/edac/al_l1_edac.c
+F: Documentation/devicetree/bindings/edac/amazon,al-l1-edac.txt
+
AMAZON ANNAPURNA LABS THERMAL MMIO DRIVER
M: Talel Shenhar <[email protected]>
S: Maintained
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 200c04c..58b92bc 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -74,6 +74,14 @@ config EDAC_GHES
In doubt, say 'Y'.
+config EDAC_AL_L1
+ bool "Amazon's Annapurna Labs L1 EDAC"
+ depends on ARCH_ALPINE
+ help
+ Support for L1 error detection and correction
+ for Amazon's Annapurna Labs SoCs.
+ This driver detects errors of L1 caches.
+
config EDAC_AMD64
tristate "AMD64 (Opteron, Athlon64)"
depends on AMD_NB && EDAC_DECODE_MCE
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index 165ca65e..caa2dc9 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_EDAC_GHES) += ghes_edac.o
edac_mce_amd-y := mce_amd.o
obj-$(CONFIG_EDAC_DECODE_MCE) += edac_mce_amd.o
+obj-$(CONFIG_EDAC_AL_L1) += al_l1_edac.o
obj-$(CONFIG_EDAC_AMD76X) += amd76x_edac.o
obj-$(CONFIG_EDAC_CPC925) += cpc925_edac.o
obj-$(CONFIG_EDAC_I5000) += i5000_edac.o
diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
new file mode 100644
index 0000000..70510ea
--- /dev/null
+++ b/drivers/edac/al_l1_edac.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#include <linux/bitfield.h>
+
+#include "edac_device.h"
+#include "edac_module.h"
+
+#define DRV_NAME "al_l1_edac"
+
+/* Same bit assignments of CPUMERRSR_EL1 in ARM CA57/CA72 */
+#define ARM_CA57_CPUMERRSR_EL1 sys_reg(3, 1, 15, 2, 2)
+#define ARM_CA57_CPUMERRSR_RAM_ID GENMASK(30, 24)
+#define ARM_CA57_L1_I_TAG_RAM 0x00
+#define ARM_CA57_L1_I_DATA_RAM 0x01
+#define ARM_CA57_L1_D_TAG_RAM 0x08
+#define ARM_CA57_L1_D_DATA_RAM 0x09
+#define ARM_CA57_L2_TLB_RAM 0x18
+#define ARM_CA57_CPUMERRSR_VALID BIT(31)
+#define ARM_CA57_CPUMERRSR_REPEAT GENMASK_ULL(39, 32)
+#define ARM_CA57_CPUMERRSR_OTHER GENMASK_ULL(47, 40)
+#define ARM_CA57_CPUMERRSR_FATAL BIT_ULL(63)
+
+#define AL_L1_EDAC_MSG_MAX 256
+
+static void al_l1_edac_cpumerrsr(void *arg)
+{
+ struct edac_device_ctl_info *edac_dev = arg;
+ int cpu, i;
+ u32 ramid, repeat, other, fatal;
+ u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
+ char msg[AL_L1_EDAC_MSG_MAX];
+ int space, count;
+ char *p;
+
+ if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
+ return;
+
+ cpu = smp_processor_id();
+ ramid = FIELD_GET(ARM_CA57_CPUMERRSR_RAM_ID, val);
+ repeat = FIELD_GET(ARM_CA57_CPUMERRSR_REPEAT, val);
+ other = FIELD_GET(ARM_CA57_CPUMERRSR_OTHER, val);
+ fatal = FIELD_GET(ARM_CA57_CPUMERRSR_FATAL, val);
+
+ space = sizeof(msg);
+ p = msg;
+ count = snprintf(p, space, "CPU%d L1 %serror detected", cpu,
+ (fatal) ? "Fatal " : "");
+ p += count;
+ space -= count;
+
+ switch (ramid) {
+ case ARM_CA57_L1_I_TAG_RAM:
+ count = snprintf(p, space, " RAMID='L1-I Tag RAM'");
+ break;
+ case ARM_CA57_L1_I_DATA_RAM:
+ count = snprintf(p, space, " RAMID='L1-I Data RAM'");
+ break;
+ case ARM_CA57_L1_D_TAG_RAM:
+ count = snprintf(p, space, " RAMID='L1-D Tag RAM'");
+ break;
+ case ARM_CA57_L1_D_DATA_RAM:
+ count = snprintf(p, space, " RAMID='L1-D Data RAM'");
+ break;
+ case ARM_CA57_L2_TLB_RAM:
+ count = snprintf(p, space, " RAMID='L2 TLB RAM'");
+ break;
+ default:
+ count = snprintf(p, space, " RAMID='unknown'");
+ break;
+ }
+
+ p += count;
+ space -= count;
+ count = snprintf(p, space,
+ " repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)",
+ repeat, other, val);
+
+ for (i = 0; i < repeat; i++) {
+ if (fatal)
+ edac_device_handle_ue(edac_dev, 0, 0, msg);
+ else
+ edac_device_handle_ce(edac_dev, 0, 0, msg);
+ }
+
+ write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);
+}
+
+static void al_l1_edac_check(struct edac_device_ctl_info *edac_dev)
+{
+ on_each_cpu(al_l1_edac_cpumerrsr, edac_dev, 1);
+}
+
+static int al_l1_edac_probe(struct platform_device *pdev)
+{
+ struct edac_device_ctl_info *edac_dev;
+ struct device *dev = &pdev->dev;
+ int ret;
+
+ edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
+ 1, 1, NULL, 0,
+ edac_device_alloc_index());
+ if (IS_ERR(edac_dev))
+ return -ENOMEM;
+
+ edac_dev->edac_check = al_l1_edac_check;
+ edac_dev->dev = dev;
+ edac_dev->mod_name = DRV_NAME;
+ edac_dev->dev_name = dev_name(dev);
+ edac_dev->ctl_name = "L1 cache";
+ platform_set_drvdata(pdev, edac_dev);
+
+ ret = edac_device_add_device(edac_dev);
+ if (ret) {
+ dev_err(dev, "Failed to add L1 edac device\n");
+ goto err;
+ }
+
+ return 0;
+err:
+ edac_device_free_ctl_info(edac_dev);
+
+ return ret;
+}
+
+static int al_l1_edac_remove(struct platform_device *pdev)
+{
+ struct edac_device_ctl_info *edac_dev = platform_get_drvdata(pdev);
+
+ edac_device_del_device(edac_dev->dev);
+ edac_device_free_ctl_info(edac_dev);
+
+ return 0;
+}
+
+static const struct of_device_id al_l1_edac_of_match[] = {
+ { .compatible = "amazon,al-l1-edac" },
+ {}
+};
+MODULE_DEVICE_TABLE(of, al_l1_edac_of_match);
+
+static struct platform_driver al_l1_edac_driver = {
+ .probe = al_l1_edac_probe,
+ .remove = al_l1_edac_remove,
+ .driver = {
+ .name = DRV_NAME,
+ .of_match_table = al_l1_edac_of_match,
+ },
+};
+module_platform_driver(al_l1_edac_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Hanna Hawa <[email protected]>");
+MODULE_DESCRIPTION("Amazon's Annapurna Lab's L1 EDAC Driver");
--
2.7.4
Hi Hanna,
On 15/07/2019 14:24, Hanna Hawa wrote:
> Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
> report L1 errors.
> diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
> new file mode 100644
> index 0000000..70510ea
> --- /dev/null
> +++ b/drivers/edac/al_l1_edac.c
> @@ -0,0 +1,156 @@
> +#include <linux/bitfield.h>
You need <linux/smp.h> for on-each_cpu().
> +#include "edac_device.h"
> +#include "edac_module.h"
You need <asm/sysreg.h> for the sys_reg() macro. The ARCH_ALPINE dependency doesn't stop
this from being built on 32bit arm, where this sys_reg() won't work/exist.
[...]
> +static void al_l1_edac_cpumerrsr(void *arg)
> +{
> + struct edac_device_ctl_info *edac_dev = arg;
> + int cpu, i;
> + u32 ramid, repeat, other, fatal;
> + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
> + char msg[AL_L1_EDAC_MSG_MAX];
> + int space, count;
> + char *p;
> + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
> + return;
> + space = sizeof(msg);
> + p = msg;
> + count = snprintf(p, space, "CPU%d L1 %serror detected", cpu,
> + (fatal) ? "Fatal " : "");
> + p += count;
> + space -= count;
snprintf() will return the number of characters it would have generated, even if that is
more than space. If this happen, space becomes negative, p points outside msg[] and msg[]
isn't NULL terminated...
It looks like you want scnprintf(), which returns the number of characters written to buf
instead. (I don't see how 256 characters would be printed by this code)
> + switch (ramid) {
> + case ARM_CA57_L1_I_TAG_RAM:
> + count = snprintf(p, space, " RAMID='L1-I Tag RAM'");
> + break;
> + case ARM_CA57_L1_I_DATA_RAM:
> + count = snprintf(p, space, " RAMID='L1-I Data RAM'");
> + break;
> + case ARM_CA57_L1_D_TAG_RAM:
> + count = snprintf(p, space, " RAMID='L1-D Tag RAM'");
> + break;
> + case ARM_CA57_L1_D_DATA_RAM:
> + count = snprintf(p, space, " RAMID='L1-D Data RAM'");
> + break;
> + case ARM_CA57_L2_TLB_RAM:
> + count = snprintf(p, space, " RAMID='L2 TLB RAM'");
> + break;
> + default:
> + count = snprintf(p, space, " RAMID='unknown'");
> + break;
> + }
> +
> + p += count;
> + space -= count;
> + count = snprintf(p, space,
> + " repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)",
> + repeat, other, val);
> +
> + for (i = 0; i < repeat; i++) {
> + if (fatal)
> + edac_device_handle_ue(edac_dev, 0, 0, msg);
> + else
> + edac_device_handle_ce(edac_dev, 0, 0, msg);
> + }
> +
> + write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);
Writing 0 just after you've read the value would minimise the window where repeat could
have increased behind your back, or another error was counted as other, when it could have
been reported more accurately.
> +}
> +static int al_l1_edac_probe(struct platform_device *pdev)
> +{
> + struct edac_device_ctl_info *edac_dev;
> + struct device *dev = &pdev->dev;
> + int ret;
> +
> + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
> + 1, 1, NULL, 0,
> + edac_device_alloc_index());
> + if (IS_ERR(edac_dev))
edac_device_alloc_ctl_info() returns NULL, or dev_ctl, which comes from kzalloc(). I think
you need to check for NULL here, IS_ERR() only catches the -errno range. (there is an
IS_ERR_OR_NULL() if you really needed both)
> + return -ENOMEM;
With the header-includes and edac_device_alloc_ctl_info() NULL check:
Reviewed-by: James Morse <[email protected]>
Thanks,
James
On 7/26/2019 7:49 PM, James Morse wrote:
> Hi Hanna,
>
> On 15/07/2019 14:24, Hanna Hawa wrote:
>> Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
>> report L1 errors.
>
>> diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
>> new file mode 100644
>> index 0000000..70510ea
>> --- /dev/null
>> +++ b/drivers/edac/al_l1_edac.c
>> @@ -0,0 +1,156 @@
>
>> +#include <linux/bitfield.h>
>
> You need <linux/smp.h> for on-each_cpu().
>
>> +#include "edac_device.h"
>> +#include "edac_module.h"
>
> You need <asm/sysreg.h> for the sys_reg() macro. The ARCH_ALPINE dependency doesn't stop
> this from being built on 32bit arm, where this sys_reg() won't work/exist.
Will fix.
>
> [...]
>
>> +static void al_l1_edac_cpumerrsr(void *arg)
>> +{
>> + struct edac_device_ctl_info *edac_dev = arg;
>> + int cpu, i;
>> + u32 ramid, repeat, other, fatal;
>> + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
>> + char msg[AL_L1_EDAC_MSG_MAX];
>> + int space, count;
>> + char *p;
>> + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
>> + return;
>> + space = sizeof(msg);
>> + p = msg;
>> + count = snprintf(p, space, "CPU%d L1 %serror detected", cpu,
>> + (fatal) ? "Fatal " : "");
>> + p += count;
>> + space -= count;
>
> snprintf() will return the number of characters it would have generated, even if that is
> more than space. If this happen, space becomes negative, p points outside msg[] and msg[]
> isn't NULL terminated...
>
> It looks like you want scnprintf(), which returns the number of characters written to buf
> instead. (I don't see how 256 characters would be printed by this code)
Will use scnprintf() instead.
>
>
>> + switch (ramid) {
>> + case ARM_CA57_L1_I_TAG_RAM:
>> + count = snprintf(p, space, " RAMID='L1-I Tag RAM'");
>> + break;
>> + case ARM_CA57_L1_I_DATA_RAM:
>> + count = snprintf(p, space, " RAMID='L1-I Data RAM'");
>> + break;
>> + case ARM_CA57_L1_D_TAG_RAM:
>> + count = snprintf(p, space, " RAMID='L1-D Tag RAM'");
>> + break;
>> + case ARM_CA57_L1_D_DATA_RAM:
>> + count = snprintf(p, space, " RAMID='L1-D Data RAM'");
>> + break;
>> + case ARM_CA57_L2_TLB_RAM:
>> + count = snprintf(p, space, " RAMID='L2 TLB RAM'");
>> + break;
>> + default:
>> + count = snprintf(p, space, " RAMID='unknown'");
>> + break;
>> + }
>> +
>> + p += count;
>> + space -= count;
>> + count = snprintf(p, space,
>> + " repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)",
>> + repeat, other, val);
>> +
>> + for (i = 0; i < repeat; i++) {
>> + if (fatal)
>> + edac_device_handle_ue(edac_dev, 0, 0, msg);
>> + else
>> + edac_device_handle_ce(edac_dev, 0, 0, msg);
>> + }
>> +
>> + write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);
>
> Writing 0 just after you've read the value would minimise the window where repeat could
> have increased behind your back, or another error was counted as other, when it could have
> been reported more accurately.
Good point, I will move the write after checking the valid bit.
>
>
>> +}
>
>
>> +static int al_l1_edac_probe(struct platform_device *pdev)
>> +{
>> + struct edac_device_ctl_info *edac_dev;
>> + struct device *dev = &pdev->dev;
>> + int ret;
>> +
>> + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
>> + 1, 1, NULL, 0,
>> + edac_device_alloc_index());
>> + if (IS_ERR(edac_dev))
>
> edac_device_alloc_ctl_info() returns NULL, or dev_ctl, which comes from kzalloc(). I think
> you need to check for NULL here, IS_ERR() only catches the -errno range. (there is an
> IS_ERR_OR_NULL() if you really needed both)
Will fix.
>
>
>> + return -ENOMEM;
>
>
> With the header-includes and edac_device_alloc_ctl_info() NULL check:
> Reviewed-by: James Morse <[email protected]>
Thanks James.
Thanks,
Hanna
>
>
> Thanks,
>
> James
>
On 15.07.19 16:24:07, Hanna Hawa wrote:
> Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
> report L1 errors.
>
> Signed-off-by: Hanna Hawa <[email protected]>
> Reviewed-by: James Morse <[email protected]>
> ---
> MAINTAINERS | 6 ++
> drivers/edac/Kconfig | 8 +++
> drivers/edac/Makefile | 1 +
> drivers/edac/al_l1_edac.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 171 insertions(+)
> create mode 100644 drivers/edac/al_l1_edac.c
> diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
> new file mode 100644
> index 0000000..70510ea
> --- /dev/null
> +++ b/drivers/edac/al_l1_edac.c
[...]
> +static void al_l1_edac_cpumerrsr(void *arg)
Could this being named to something meaningful, such as
*_read_status() or so?
> +{
> + struct edac_device_ctl_info *edac_dev = arg;
> + int cpu, i;
> + u32 ramid, repeat, other, fatal;
> + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
> + char msg[AL_L1_EDAC_MSG_MAX];
> + int space, count;
> + char *p;
> +
> + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
> + return;
[...]
> +static void al_l1_edac_check(struct edac_device_ctl_info *edac_dev)
> +{
> + on_each_cpu(al_l1_edac_cpumerrsr, edac_dev, 1);
> +}
> +
> +static int al_l1_edac_probe(struct platform_device *pdev)
> +{
> + struct edac_device_ctl_info *edac_dev;
> + struct device *dev = &pdev->dev;
> + int ret;
> +
> + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
This type cast looks broken. dev_name() is a constant string already.
Other drivers do not use the dynamically generated dev_name() string
here, instead a fix string such as mod_name or ctl_name could be used.
edac_device_alloc_ctl_info() later generates a unique instance name
derived from name + index.
Regarding the type, this seems to be an API issue of edac_device_
alloc_ctl_info() that should actually use const char* in its
interface. So if needed (from what I wrote above it is not) the type
in the argument list needs to be changed instead.
> + 1, 1, NULL, 0,
> + edac_device_alloc_index());
> + if (IS_ERR(edac_dev))
> + return -ENOMEM;
Use the original error code instead.
> +
> + edac_dev->edac_check = al_l1_edac_check;
> + edac_dev->dev = dev;
> + edac_dev->mod_name = DRV_NAME;
> + edac_dev->dev_name = dev_name(dev);
> + edac_dev->ctl_name = "L1 cache";
Should not contain spaces and maybe a bit more specific.
> + platform_set_drvdata(pdev, edac_dev);
> +
> + ret = edac_device_add_device(edac_dev);
> + if (ret) {
> + dev_err(dev, "Failed to add L1 edac device\n");
Move this printk below to the error path and maybe print the error
code. You do not cover the -ENOMEM failure.
-Robert
> + goto err;
> + }
> +
> + return 0;
> +err:
> + edac_device_free_ctl_info(edac_dev);
> +
> + return ret;
> +}
On 15.07.19 16:24:09, Hanna Hawa wrote:
> Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
> report L2 errors.
>
> Signed-off-by: Hanna Hawa <[email protected]>
> ---
> MAINTAINERS | 6 ++
> drivers/edac/Kconfig | 8 ++
> drivers/edac/Makefile | 1 +
> drivers/edac/al_l2_edac.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 202 insertions(+)
> create mode 100644 drivers/edac/al_l2_edac.c
From a brief look at it, it seems some of my comments from 2/4 apply
here too. Please look through it.
-Robert
On 9/3/2019 10:24 AM, Robert Richter wrote:
> On 15.07.19 16:24:07, Hanna Hawa wrote:
>> Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
>> report L1 errors.
>>
>> Signed-off-by: Hanna Hawa <[email protected]>
>> Reviewed-by: James Morse <[email protected]>
>> ---
>> MAINTAINERS | 6 ++
>> drivers/edac/Kconfig | 8 +++
>> drivers/edac/Makefile | 1 +
>> drivers/edac/al_l1_edac.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 171 insertions(+)
>> create mode 100644 drivers/edac/al_l1_edac.c
>
>> diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
>> new file mode 100644
>> index 0000000..70510ea
>> --- /dev/null
>> +++ b/drivers/edac/al_l1_edac.c
>
> [...]
>
>> +static void al_l1_edac_cpumerrsr(void *arg)
>
> Could this being named to something meaningful, such as
> *_read_status() or so?
>
>> +{
>> + struct edac_device_ctl_info *edac_dev = arg;
>> + int cpu, i;
>> + u32 ramid, repeat, other, fatal;
>> + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
>> + char msg[AL_L1_EDAC_MSG_MAX];
>> + int space, count;
>> + char *p;
>> +
>> + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
>> + return;
>
> [...]
>
>> +static void al_l1_edac_check(struct edac_device_ctl_info *edac_dev)
>> +{
>> + on_each_cpu(al_l1_edac_cpumerrsr, edac_dev, 1);
>> +}
>> +
>> +static int al_l1_edac_probe(struct platform_device *pdev)
>> +{
>> + struct edac_device_ctl_info *edac_dev;
>> + struct device *dev = &pdev->dev;
>> + int ret;
>> +
>> + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
>
> This type cast looks broken. dev_name() is a constant string already.
>
> Other drivers do not use the dynamically generated dev_name() string
> here, instead a fix string such as mod_name or ctl_name could be used.
> edac_device_alloc_ctl_info() later generates a unique instance name
> derived from name + index.
Ack, will fix and use DRV_NAME.
>
> Regarding the type, this seems to be an API issue of edac_device_
> alloc_ctl_info() that should actually use const char* in its
> interface. So if needed (from what I wrote above it is not) the type
> in the argument list needs to be changed instead.
>
>> + 1, 1, NULL, 0,
>> + edac_device_alloc_index());
>> + if (IS_ERR(edac_dev))
>> + return -ENOMEM;
>
> Use the original error code instead.
Actually it return NULL in case of failure, it was changed in v5 to
check if error/NULL.
>
>> +
>> + edac_dev->edac_check = al_l1_edac_check;
>> + edac_dev->dev = dev;
>> + edac_dev->mod_name = DRV_NAME;
>> + edac_dev->dev_name = dev_name(dev);
>> + edac_dev->ctl_name = "L1 cache";
>
> Should not contain spaces and maybe a bit more specific.
L1_cache_ecc_err? or L1_cache_err?
>
>> + platform_set_drvdata(pdev, edac_dev);
>> +
>> + ret = edac_device_add_device(edac_dev);
>> + if (ret) {
>> + dev_err(dev, "Failed to add L1 edac device\n");
>
> Move this printk below to the error path and maybe print the error
> code. You do not cover the -ENOMEM failure.
Ack.
>
> -Robert
>
>> + goto err;
>> + }
>> +
>> + return 0;
>> +err:
>> + edac_device_free_ctl_info(edac_dev);
>> +
>> + return ret;
>> +}
On 9/3/2019 10:27 AM, Robert Richter wrote:
> On 15.07.19 16:24:09, Hanna Hawa wrote:
>> Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
>> report L2 errors.
>>
>> Signed-off-by: Hanna Hawa <[email protected]>
>> ---
>> MAINTAINERS | 6 ++
>> drivers/edac/Kconfig | 8 ++
>> drivers/edac/Makefile | 1 +
>> drivers/edac/al_l2_edac.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 202 insertions(+)
>> create mode 100644 drivers/edac/al_l2_edac.c
>
> From a brief look at it, it seems some of my comments from 2/4 apply
> here too. Please look through it.
Thanks for your review, will look and fix on top of v5.
Thanks,
Hanna
>
> -Robert
>
On 03.09.19 11:28:41, Hawa, Hanna wrote:
> On 9/3/2019 10:27 AM, Robert Richter wrote:
> > On 15.07.19 16:24:09, Hanna Hawa wrote:
> > > Adds support for Amazon's Annapurna Labs L2 EDAC driver to detect and
> > > report L2 errors.
> > >
> > > Signed-off-by: Hanna Hawa <[email protected]>
> > > ---
> > > MAINTAINERS | 6 ++
> > > drivers/edac/Kconfig | 8 ++
> > > drivers/edac/Makefile | 1 +
> > > drivers/edac/al_l2_edac.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 202 insertions(+)
> > > create mode 100644 drivers/edac/al_l2_edac.c
> >
> > From a brief look at it, it seems some of my comments from 2/4 apply
> > here too. Please look through it.
>
> Thanks for your review, will look and fix on top of v5.
I have later seen your newer versions posted which haven't made it
into my inbox. But looking at the changelog my comments are still
valid. Thanks for addressing them.
From the changelog:
"Changes since v1:
-----------------
- Split into two drivers"
This is good, but recent practice is also to have all the drivers for
the same piece of hardware in a single file, see e.g. thunderx_edac.c.
I don't know how detailed this was discussed already, but I would
prefer to join the code.
-Robert
On Tue, Sep 03, 2019 at 08:46:24AM +0000, Robert Richter wrote:
> This is good, but recent practice is also to have all the drivers for
> the same piece of hardware in a single file, see e.g. thunderx_edac.c.
> I don't know how detailed this was discussed already, but I would
> prefer to join the code.
This is no longer needed anymore. Check out this thread:
https://lkml.kernel.org/r/[email protected]
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On 03.09.19 10:58:16, Borislav Petkov wrote:
> On Tue, Sep 03, 2019 at 08:46:24AM +0000, Robert Richter wrote:
> > This is good, but recent practice is also to have all the drivers for
> > the same piece of hardware in a single file, see e.g. thunderx_edac.c.
> > I don't know how detailed this was discussed already, but I would
> > prefer to join the code.
>
> This is no longer needed anymore. Check out this thread:
>
> https://lkml.kernel.org/r/[email protected]
Thanks for the pointer, looks good to me.
-Robert