2021-02-15 19:41:00

by Don Bollinger

[permalink] [raw]
Subject: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

optoe is an i2c based driver that supports read/write access to all
the pages (tables) of MSA standard SFP and similar devices (conforming
to the SFF-8472 spec), MSA standard QSFP and similar devices (conforming
to the SFF-8636 spec) and CMIS and similar devices (conforming to the
Common Management Interface Specfication).

These devices provide identification, operational status and control
registers via an EEPROM model. These devices support one or 3 fixed pages
(128 bytes) of data, and one page that is selected via a page register on
the first fixed page. Thus the driver's main task is to map these pages
onto a simple linear address space for user space management applications.
See the driver code for a detailed layout.

Several variants and predecessors of this driver exist outside the kernel
tree. Some of them don't support pages at all, only accessing the first
256 bytes of EEPROM. None of them handle all three specifications, none
of them support pages beyond page 3. This is a problem since critical
monitoring data on CMIS devices is on page 0x11. Some of them don't
support pages at all, only accessing the first 256 bytes of data. optoe
supports the full architected 256 page address space of these devices.
optoe is currently in production on multiple platforms running SONiC
(Microsoft's Software for Open Networking in the Cloud) and ONL (Open
Network Linux). They have requested that optoe be submitted upstream.

The EEPROM data is accessible to user space and kernel consumers via the
nvmem interface.

optoe devices can be configured via device tree or the 'new_device'
paradigm.

See Documentation/misc-devices/optoe.rst for details on configuring
optoe and accessing the EEPROM.

Signed-off-by: Don Bollinger <[email protected]>
---
Change in v2:
- made function optoe_make_regmap() static to fix compiler warnings
Reported-by: kernel test robot <[email protected]>
---
Documentation/devicetree/bindings/eeprom/optoe.txt | 26 +
Documentation/misc-devices/optoe.rst | 127 +++
drivers/misc/eeprom/Kconfig | 24 +
drivers/misc/eeprom/Makefile | 1 +
drivers/misc/eeprom/optoe.c | 931 +++++++++++++++++++++
5 files changed, 1109 insertions(+)
create mode 100644 Documentation/devicetree/bindings/eeprom/optoe.txt
create mode 100644 Documentation/misc-devices/optoe.rst
create mode 100644 drivers/misc/eeprom/optoe.c

diff --git a/Documentation/devicetree/bindings/eeprom/optoe.txt b/Documentation/devicetree/bindings/eeprom/optoe.txt
new file mode 100644
index 000000000000..3a8c350cf2f5
--- /dev/null
+++ b/Documentation/devicetree/bindings/eeprom/optoe.txt
@@ -0,0 +1,26 @@
+EEPROMs on SFP/QSFP/CMIS optoelectronic modules using SFF-8472, SFF-8676,
+CMIS (Common Management Interface Spec) devices.
+
+Required properties:
+- compatible: shall be one of :
+ 'optoe,optoe1' - for QSFP class devices, adhering to SFF-8636
+ 'optoe,optoe2' - for SFP class devices, adhering to SFF-8472
+ 'optoe,optoe3' - for CMIS devices (newer QSFP class devices)
+- reg: 0x50 The only valid value is 0x50, as all three standards specify that
+ the device is at i2c address 0x50. (optoe allocates an i2c dummy to access
+ the data at i2c address 0x51.)
+
+Optional property:
+- port_name: can be set to any string up to 19 characters. Note that the
+ actual mapping between i2c busses and network ports is platform dependent
+ and varies widely. The 'port_name' property provides a way to associate
+ specific network ports with their associated hardware ports.
+
+Example:
+ #address-cells = <1>;
+ #size-cells = <0>;
+ optoe@50 {
+ compatible = "optoe,optoe2";
+ reg = <0x50>;
+ port_name = "port1";
+ };
diff --git a/Documentation/misc-devices/optoe.rst b/Documentation/misc-devices/optoe.rst
new file mode 100644
index 000000000000..2ffce09f18bb
--- /dev/null
+++ b/Documentation/misc-devices/optoe.rst
@@ -0,0 +1,127 @@
+============================================================
+optoe - EEPROMs on SFP/QSFP/CMIS optoelectronic modules
+============================================================
+
+Author: Don Bollinger ([email protected])
+
+Description:
+============
+
+Optoe is an i2c based driver that supports read/write access to all
+the pages (tables) of MSA standard SFP and similar devices (conforming
+to the SFF-8472 spec), MSA standard QSFP and similar devices (conforming
+to the SFF-8436 or SFF-8636 spec) and CMIS devices (conforming to the
+Common Management Interface Specification).
+
+i2c based optoelectronic transceivers (SPF, QSFP, etc) provide identification,
+operational status, and control registers via an EEPROM model. Unlike the
+EEPROMs that at24 supports, these devices access data beyond byte 256 via
+a page select register, which must be managed by the driver. optoe
+represents the EEPROM as a linear address space, managing the page register
+as needed. On QSFP and CMIS devices, the first 256 bytes are page 0, followed
+by 128 byte pages 1-128. On SFP devices, the first 256 bytes are from
+i2c address 0x50, followed by 256 bytes from i2c address 0x51, followed
+by 128 byte pages 1-128. See the driver code for a more detailed
+explanation.
+
+The EEPROM data is accessible via an nvmem file, e.g.
+ /sys/bus/nvmem/devices/port1/nvmem
+
+The EEPROM data is also accessible within the kernel via nvmem calls e.g.
+ nvmem = nvmem_device_get(dev, "port1");
+ err = nvmem_device_read(nvmem, offset, length, buffer);
+
+The EEPROM data is also accessible via a bin_attribute file called 'eeprom',
+e.g. /sys/bus/i2c/devices/24-0050/eeprom
+
+Device class:
+=============
+
+Note that SFP, QSFP and CMIS type devices are not interchangeable. The
+driver expects the correct ID (optoe<n>) for each port (each i2c device).
+It does not check because the port will often be empty, and the only way
+to check is to interrogate the device. Incorrect choice of ID will lead
+to CORRECT data being reported for the first 256 bytes (for any ID, for
+any actual class device). Data beyond 256 bytes will be INCORRECT if
+the device doesn't match the optoe<n> type specified.
+
+The device class (1 = QSFP, 2 = SFP, 3 = CMIS) can be read from
+/sys/bus/i2c/devices/<i2c device>/dev_class. It can also be modified
+by writing to the same file. This can be useful when upgrading from QSFP
+type devices to CMIS devices (they may have the same form factor) or
+when developing with plug-in adapters to convert QSFP ports for SFP
+devices. It is also useful on development hardware that has both types
+of connectors attached to the same i2c bus.
+
+Port name:
+==========
+
+optoe maintains a 'port_name' for each device being managed. The port name
+is the device name in the nvmem directory, and the dev_name parameter in
+the nvmem_device_get() APIs.
+
+port_name can be set via a device tree property: port_name = "port1";
+
+port_name can also be read/changed by reading/writing to the sysfs file
+/sys/bus/i2c/devices/<i2c device>/port_name. port_name can be any string,
+up to 19 characters. If not initialized, port_name will match the i2c
+device name e.g. 1-0050. If port_name is changed, the nvmem device will
+be changed to match.
+
+Device Tree:
+============
+
+optoe can be instantiated via Device Tree as a child of the i2c bus
+that the device is attached to.
+
+Required properties:
+- compatible: shall be one of :
+ 'optoe,optoe1' - for QSFP class devices, adhering to SFF-8636
+ 'optoe,optoe2' - for SFP class devices, adhering to SFF-8472
+ 'optoe,optoe3' - for CMIS devices (newer QSFP class devices)
+- reg: 0x50 The only valid value is 0x50, as all three standards specify that
+ the device is at i2c address 0x50. (optoe allocates an i2c dummy to access
+ the data at i2c address 0x51.)
+
+Optional property:
+- port_name: can be set to any string up to 19 characters. Note that the
+ actual mapping between i2c busses and network ports is platform dependent
+ and varies widely. The 'port_name' property provides a way to associate
+ specific network ports with their associated hardware ports.
+
+Example:
+ #address-cells = <1>;
+ #size-cells = <0>;
+ optoe@50 {
+ compatible = "optoe,optoe2";
+ reg = <0x50>;
+ port_name = "port1";
+ };
+
+
+Dynamic installation:
+=====================
+
+Alternatively, optoe can be instantiated with 'new_device', per the convention
+described in Documentation/i2c/instantiating-devices. It wants one of
+three possible device identifiers, as described above under 'compatible'.
+Use 'optoe1' to indicate this is a QSFP type device, use 'optoe2' to
+indicate this is an SFP type device, use 'optoe3' to indicate this is a
+CMIS type device.
+
+Example:
+# echo optoe1 0x50 > /sys/bus/i2c/devices/i2c-64/new_device
+# echo port54 > /sys/bus/i2c/devices/i2c-64/port_name
+
+This will add a QSFP type device to i2c bus i2c-64, and name it 'port54'
+
+Example:
+# echo optoe2 0x50 > /sys/bus/i2c/devices/i2c-11/new_device
+# echo port1 > /sys/bus/i2c/devices/i2c-11/port_name
+
+This will add an SFP type device to i2c bus i2c-11, and name it 'port1'
+
+The second parameter to new_device is an i2c address, and MUST be 0x50 for
+this driver to work properly. This is part of the spec for these devices.
+(It is not necessary to create a device at 0x51 for SFP type devices, the
+driver does that automatically.)
diff --git a/drivers/misc/eeprom/Kconfig b/drivers/misc/eeprom/Kconfig
index 0f791bfdc1f5..17911266d3f6 100644
--- a/drivers/misc/eeprom/Kconfig
+++ b/drivers/misc/eeprom/Kconfig
@@ -129,4 +129,28 @@ config EEPROM_EE1004
This driver can also be built as a module. If so, the module
will be called ee1004.

+config EEPROM_OPTOE
+ tristate "read/write access to SFP*, QSFP* and CMIS EEPROMs"
+ depends on I2C && SYSFS
+ select NVMEM
+ select NVMEM_SYSFS
+ select REGMAP_I2C
+ help
+ If you say yes here you get support for read and write access to
+ the EEPROM of SFP, QSFP and CMIS type optical and copper
+ transceivers. This includes all devices which conform to SFF-8436
+ and SFF-8636 (QSFP, QSFP+, QSFP28, etc), SFF-8472 (SFP, SFP+, SFP28,
+ SFP-DWDM, etc) and Common Management Interface Spec (CMIS, including
+ QSFP-DD, OSFP, later QSFP*, etc).
+
+ These devices are usually found in network switches.
+
+ This driver only manages read/write access to the EEPROM. Access
+ to the I/O pins (reset, LPMODE, etc) is still platform specific.
+
+ This driver can also be built as a module. If so, the module
+ will be called optoe.
+
+ If unsure, say N.
+
endmenu
diff --git a/drivers/misc/eeprom/Makefile b/drivers/misc/eeprom/Makefile
index a9b4b6579b75..e22ef2c05896 100644
--- a/drivers/misc/eeprom/Makefile
+++ b/drivers/misc/eeprom/Makefile
@@ -8,3 +8,4 @@ obj-$(CONFIG_EEPROM_93XX46) += eeprom_93xx46.o
obj-$(CONFIG_EEPROM_DIGSY_MTC_CFG) += digsy_mtc_eeprom.o
obj-$(CONFIG_EEPROM_IDT_89HPESX) += idt_89hpesx.o
obj-$(CONFIG_EEPROM_EE1004) += ee1004.o
+obj-$(CONFIG_EEPROM_OPTOE) += optoe.o
diff --git a/drivers/misc/eeprom/optoe.c b/drivers/misc/eeprom/optoe.c
new file mode 100644
index 000000000000..5f6db43ca6be
--- /dev/null
+++ b/drivers/misc/eeprom/optoe.c
@@ -0,0 +1,931 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * optoe.c - A driver to read and write the EEPROM on optical transceivers
+ * (SFP, QSFP, CMIS (Common Management Interface Spec)
+ * and similar I2C based devices)
+ *
+ * Copyright (C) 2014 Cumulus networks Inc.
+ * Copyright (C) 2017 Finisar Corp.
+ */
+
+/*
+ * Description:
+ * a) Optical transceiver EEPROM read/write transactions are just like
+ * the at24 eeproms managed by the at24.c i2c driver
+ * b) The register/memory layout is up to 256 128 byte pages defined by
+ * a "pages valid" register and switched via a "page select"
+ * register as explained in below diagram.
+ * c) 256 bytes are mapped at a time. 'Lower page 00h' is the first 128
+ * bytes of address space, and always references the same
+ * location, independent of the page select register.
+ * All mapped pages are mapped into the upper 128 bytes
+ * (offset 128-255) of the i2c address.
+ * d) Devices with one I2C address (eg QSFP, CMIS) use I2C address 0x50
+ * (A0h in the spec), and map all pages in the upper 128 bytes
+ * of that address.
+ * e) Devices with two I2C addresses (eg SFP) have 256 bytes of data
+ * at I2C address 0x50, and 256 bytes of data at I2C address
+ * 0x51 (A2h in the spec). Page selection and paged access
+ * only apply to this second I2C address (0x51).
+ * e) The address space is presented, by the driver, as a linear
+ * address space. For devices with one I2C client at address
+ * 0x50 (eg QSFP, CMIS), offset 0-127 are in the lower
+ * half of address 50/A0h/optoe_client. Offset 128-255 are in
+ * page 0, 256-383 are page 1, etc. More generally, offset
+ * 'n' resides in page (n/128)-1. ('page -1' is the lower
+ * half, offset 0-127).
+ * f) For devices with two I2C clients at address 0x50 and 0x51 (eg SFP),
+ * the address space places offset 0-127 in the lower
+ * half of 50/A0/optoe_client, offset 128-255 in the upper
+ * half. Offset 256-383 is in the lower half of 51/A2/dummy.
+ * Offset 384-511 is in page 0, in the upper half of 51/A2/...
+ * Offset 512-639 is in page 1, in the upper half of 51/A2/...
+ * Offset 'n' is in page (n/128)-3 (for n > 383)
+ *
+ * One I2c addressed (eg QSFP, CMIS) Memory Map
+ *
+ * 2-Wire Serial Address: 1010000x
+ *
+ * Lower Page 00h (128 bytes)
+ * =====================
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * | |
+ * |Page Select Byte(127)|
+ * =====================
+ * |
+ * |
+ * V
+ * ------------------------------------------------------------
+ * | | | |
+ * | | | |
+ * V V V V
+ * ------------ -------------- --------------- --------------
+ * | | | | | | | |
+ * | Upper | | Upper | | Upper | | Upper |
+ * | Page 00h | | Page 01h | | Page 02h | | Page 03h |
+ * | | | (Optional) | | (Optional) | | (Optional |
+ * | | | | | | | for Cable |
+ * | | | | | | | Assemblies) |
+ * | ID | | AST | | User | | |
+ * | Fields | | Table | | EEPROM Data | | |
+ * | | | | | | | |
+ * | | | | | | | |
+ * | | | | | | | |
+ * ------------ -------------- --------------- --------------
+ *
+ * The SFF 8636 (QSFP) spec only defines the 4 pages described above.
+ * In anticipation of future applications and devices, this driver
+ * supports access to the full architected range, 256 pages.
+ *
+ * The CMIS (Common Management Interface Specification) defines use of
+ * considerably more pages (at least to page 0xAF), which this driver
+ * supports.
+ *
+ * NOTE: This version of the driver ONLY SUPPORTS BANK 0 PAGES on CMIS
+ * devices.
+ *
+ **/
+
+/* #define DEBUG 1 */
+
+#include <linux/delay.h>
+#include <linux/i2c.h>
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/nvmem-provider.h>
+#include <linux/regmap.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+/* The maximum length of a port name */
+#define MAX_PORT_NAME_LEN 20
+
+/* fundamental unit of addressing for EEPROM */
+#define OPTOE_PAGE_SIZE 128
+/*
+ * Single address devices (eg QSFP, CMIS) have 256 pages, plus the unpaged
+ * low 128 bytes. If the device does not support paging, it is
+ * only 2 'pages' long.
+ */
+#define OPTOE_ARCH_PAGES 256
+#define ONE_ADDR_EEPROM_SIZE ((1 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
+#define ONE_ADDR_EEPROM_UNPAGED_SIZE (2 * OPTOE_PAGE_SIZE)
+/*
+ * Dual address devices (eg SFP) have 256 pages, plus the unpaged
+ * low 128 bytes, plus 256 bytes at 0x50. If the device does not
+ * support paging, it is 4 'pages' long.
+ */
+#define TWO_ADDR_EEPROM_SIZE ((3 + OPTOE_ARCH_PAGES) * OPTOE_PAGE_SIZE)
+#define TWO_ADDR_EEPROM_UNPAGED_SIZE (4 * OPTOE_PAGE_SIZE)
+#define TWO_ADDR_NO_0X51_SIZE (2 * OPTOE_PAGE_SIZE)
+
+/* a few constants to find our way around the EEPROM */
+#define OPTOE_PAGE_SELECT_REG 0x7F
+#define ONE_ADDR_PAGEABLE_REG 0x02
+#define QSFP_NOT_PAGEABLE (1<<2)
+#define CMIS_NOT_PAGEABLE (1<<7)
+#define TWO_ADDR_PAGEABLE_REG 0x40
+#define TWO_ADDR_PAGEABLE (1<<4)
+#define TWO_ADDR_0X51_REG 92
+#define TWO_ADDR_0X51_SUPP (1<<6)
+#define OPTOE_READ_OP 0
+#define OPTOE_WRITE_OP 1
+#define OPTOE_EOF 0 /* used for access beyond end of device */
+
+/* define optoe_client structure to tie the i2c client to its regmap */
+struct optoe_client {
+ struct i2c_client *client;
+ struct regmap *regmap;
+};
+
+struct optoe_data {
+ char port_name[MAX_PORT_NAME_LEN];
+ u32 byte_len; /* architected size of EEPROM */
+
+ /*
+ * Lock protects against activities from other Linux tasks,
+ * but not from changes by other I2C masters.
+ */
+ struct mutex lock;
+ struct attribute_group attr_group;
+ struct nvmem_device *nvmem;
+
+ unsigned int write_max;
+
+ /* dev_class: ONE_ADDR (QSFP), TWO_ADDR (SFP), or CMIS */
+ int dev_class;
+
+ /* client at 0x50 */
+ struct optoe_client optoe_client;
+ /* dummy at 0x51 for SFP devices */
+ struct optoe_client optoe_dummy;
+};
+
+/*
+ * specs often allow 5 msec for a page write, sometimes 20 msec;
+ * it's important to recover from write timeouts.
+ */
+static unsigned int write_timeout = 25;
+
+/*
+ * flags to distinguish one-address (QSFP family) from two-address (SFP family)
+ * and one-address Common Management Interface Specification (CMIS family)
+ */
+#define ONE_ADDR 1
+#define TWO_ADDR 2
+#define CMIS_ADDR 3
+
+static const struct i2c_device_id optoe_ids[] = {
+ { "optoe1", ONE_ADDR },
+ { "optoe2", TWO_ADDR },
+ { "optoe3", CMIS_ADDR },
+ { /* END OF LIST */ }
+};
+MODULE_DEVICE_TABLE(i2c, optoe_ids);
+
+/*-------------------------------------------------------------------------*/
+/*
+ * optoe_translate_offset() computes the addressing information to be used for
+ * a given r/w request.
+ *
+ * Task is to calculate the client (optoe_client for addr 50,
+ * optoe_dummy for addr 51) the page, and the offset.
+ *
+ * Handles both single address (QSFP), two address (SFP) and CMIS devices.
+ * For SFP, offset 0-255 are on optoe_client, >255 is on optoe_dummy
+ * Offset 256-383 are on the lower half of optoe_dummy
+ * Pages are accessible on the upper half of optoe_dummy
+ * Offset >383 are in 128 byte pages mapped into the upper half
+ *
+ * For QSFP and CMIS, all offsets are on optoe_client
+ * offset 0-127 are on the lower half of optoe_client (no paging)
+ * Pages are accessible on the upper half of optoe_client.
+ * Offset >127 are in 128 byte pages mapped into the upper half
+ *
+ * Callers must not read/write beyond the end of a client or a page
+ * without recomputing the client/page. Hence offset (within page)
+ * plus length must be less than or equal to 128. (Note that this
+ * routine does not have access to the length of the call, hence
+ * cannot do the validity check.)
+ *
+ * Offset within Lower Page 00h and Upper Page 00h are not recomputed
+ */
+
+static uint8_t optoe_translate_offset(struct optoe_data *optoe,
+ loff_t *offset, struct optoe_client **optoe_client)
+{
+ unsigned int page = 0;
+
+ *optoe_client = &optoe->optoe_client;
+
+ /* if SFP style, offset > 255, shift to i2c addr 0x51 */
+ if (optoe->dev_class == TWO_ADDR) {
+ if (*offset > 255) {
+ /* like QSFP, but shifted dummy client */
+ *optoe_client = &optoe->optoe_dummy;
+ *offset -= 256;
+ }
+ }
+
+ /*
+ * if offset is in the range 0-128...
+ * page doesn't matter (using lower half), return 0.
+ * offset is already correct (don't add 128 to get to paged area)
+ */
+ if (*offset < OPTOE_PAGE_SIZE)
+ return page;
+
+ /* note, page will always be positive since *offset >= 128 */
+ page = (*offset >> 7)-1;
+ /*
+ * OPTOE_PAGE_SIZE puts offset in the top half (the paged area),
+ * offset within the top half is last 7 bits
+ */
+ *offset = OPTOE_PAGE_SIZE + (*offset & 0x7f);
+
+ return page; /* note also returning client and offset */
+}
+
+static int optoe_regmap_rw(struct optoe_data *optoe,
+ struct optoe_client *optoe_client,
+ char *buf,
+ unsigned int offset,
+ size_t count, int opcode)
+{
+ unsigned long timeout, access_time;
+ struct i2c_client *client = optoe_client->client;
+ struct regmap *regmap = optoe_client->regmap;
+ int ret;
+
+ /*
+ * Accesses fail if the previous write didn't complete yet. We may
+ * loop a few times until this one succeeds, waiting at least
+ * long enough for one entire page write to work.
+ */
+ timeout = jiffies + msecs_to_jiffies(write_timeout);
+ do {
+ access_time = jiffies;
+
+ if (opcode == OPTOE_READ_OP) {
+ ret = regmap_bulk_read(regmap, offset, buf, count);
+ } else {
+ /* write_max is always 1 in this driver */
+ if (count > optoe->write_max)
+ count = optoe->write_max;
+ ret = regmap_bulk_write(regmap, offset, buf, count);
+ }
+ dev_dbg(&client->dev, "regmap %s %zu@%d --> %d (%lu)\n",
+ (opcode == OPTOE_READ_OP) ? "read" : "write",
+ count, offset, ret, jiffies);
+ if (!ret) /* regmap_bulk_calls returns 0 on success */
+ return count;
+
+ usleep_range(1000, 2000);
+ } while (time_before(access_time, timeout));
+
+ return -ETIMEDOUT;
+}
+
+static int optoe_eeprom_update_client(struct optoe_data *optoe,
+ char *buf, loff_t off,
+ size_t count, int opcode)
+{
+ struct optoe_client *optoe_client;
+ uint8_t page = 0;
+ loff_t phy_offset = off;
+ int retval = 0;
+ int ret = 0;
+ int status;
+ struct device *dev = &optoe->optoe_client.client->dev;
+
+ /* translate offset into page, 'offset within page' */
+ page = optoe_translate_offset(optoe, &phy_offset, &optoe_client);
+ dev_dbg(dev,
+ "%s off %lld page:%d phy_offset:%lld, count:%ld, opcode:%d\n",
+ __func__, off, page, phy_offset, (long) count, opcode);
+ /* set the page register */
+ if (page > 0) {
+ ret = optoe_regmap_rw(optoe, optoe_client, &page,
+ OPTOE_PAGE_SELECT_REG, 1, OPTOE_WRITE_OP);
+ if (ret < 0) {
+ dev_dbg(dev,
+ "Page register write, page %d failed:%d!\n",
+ page, ret);
+ return ret;
+ }
+ }
+
+ /* read/write the data */
+ while (count) {
+
+ status = optoe_regmap_rw(optoe, optoe_client,
+ buf, phy_offset, count, opcode);
+
+ if (status <= 0) {
+ if (retval == 0)
+ retval = status;
+ break;
+ }
+ buf += status;
+ phy_offset += status;
+ count -= status;
+ retval += status;
+ }
+
+
+ /*
+ * return the page register to page 0 - why?
+ * We either have to set the page register to 0 on every access
+ * to it, or restore it to 0 whenever we change it. Otherwise,
+ * accesses to page 0 would actually go to whatever the last page
+ * was. Assume more accesses to page 0 than all other pages
+ * combined, so less total accesses if we always leave it at page 0
+ */
+ if (page > 0) {
+ page = 0;
+ ret = optoe_regmap_rw(optoe, optoe_client, &page,
+ OPTOE_PAGE_SELECT_REG, 1, OPTOE_WRITE_OP);
+ if (ret < 0) {
+ dev_err(dev,
+ "Restore page register to 0 failed:%d!\n", ret);
+ /* error only if nothing has been transferred */
+ if (retval == 0)
+ retval = ret;
+ }
+ }
+ return retval;
+}
+
+/*
+ * Figure out if this access is within the range of supported pages.
+ * Note this is called on every access because we don't know if the
+ * module has been replaced since the last call.
+ *
+ * Returns updated len for this access:
+ * - entire access is legal, original len is returned.
+ * - access begins legal but is too long, len is truncated to fit.
+ * - initial offset exceeds supported pages, return OPTOE_EOF (zero)
+ */
+static int optoe_page_legal(struct optoe_data *optoe,
+ loff_t off, size_t len)
+{
+ struct optoe_client *optoe_client = &optoe->optoe_client;
+ u8 regval;
+ int not_pageable;
+ int status;
+ size_t maxlen;
+ struct device *dev = &optoe_client->client->dev;
+
+ if (off < 0)
+ return -EINVAL;
+ if (optoe->dev_class == TWO_ADDR) {
+ /* SFP case */
+ /* if only using addr 0x50 (first 256 bytes) we're good */
+ if ((off + len) <= TWO_ADDR_NO_0X51_SIZE)
+ return len;
+ /* if offset exceeds possible pages, we're not good */
+ if (off >= TWO_ADDR_EEPROM_SIZE)
+ return OPTOE_EOF;
+ /* in between, are pages supported? */
+ status = optoe_regmap_rw(optoe, optoe_client, &regval,
+ TWO_ADDR_PAGEABLE_REG, 1, OPTOE_READ_OP);
+ if (status < 0)
+ return status; /* error out (no module?) */
+ if (regval & TWO_ADDR_PAGEABLE) {
+ /* Pages supported, trim len to the end of pages */
+ maxlen = TWO_ADDR_EEPROM_SIZE - off;
+ } else {
+ /* pages not supported, trim len to unpaged size */
+ if (off >= TWO_ADDR_EEPROM_UNPAGED_SIZE)
+ return OPTOE_EOF;
+
+ /* will be accessing addr 0x51, is that supported? */
+ /* byte 92, bit 6 implies DDM support, 0x51 support */
+ status = optoe_regmap_rw(optoe, optoe_client, &regval,
+ TWO_ADDR_0X51_REG, 1,
+ OPTOE_READ_OP);
+ if (status < 0)
+ return status;
+ if (regval & TWO_ADDR_0X51_SUPP) {
+ /* addr 0x51 is OK */
+ maxlen = TWO_ADDR_EEPROM_UNPAGED_SIZE - off;
+ } else {
+ /* addr 0x51 NOT supported, trim to 256 max */
+ if (off >= TWO_ADDR_NO_0X51_SIZE)
+ return OPTOE_EOF;
+ maxlen = TWO_ADDR_NO_0X51_SIZE - off;
+ }
+ }
+ len = (len > maxlen) ? maxlen : len;
+ } else {
+ /* QSFP case, CMIS case */
+ /* if no pages needed, we're good */
+ if ((off + len) <= ONE_ADDR_EEPROM_UNPAGED_SIZE)
+ return len;
+ /* if offset exceeds possible pages, we're not good */
+ if (off >= ONE_ADDR_EEPROM_SIZE)
+ return OPTOE_EOF;
+ /* in between, are pages supported? */
+ status = optoe_regmap_rw(optoe, optoe_client, &regval,
+ ONE_ADDR_PAGEABLE_REG, 1, OPTOE_READ_OP);
+ if (status < 0)
+ return status; /* error out (no module?) */
+
+ /*
+ * note CMIS put the pageable bit in the same register
+ * as QSFP, but at a different bit :-(
+ */
+ if (optoe->dev_class == ONE_ADDR)
+ not_pageable = QSFP_NOT_PAGEABLE;
+ else
+ not_pageable = CMIS_NOT_PAGEABLE;
+ if (regval & not_pageable) {
+ /* pages not supported, trim len to unpaged size */
+ if (off >= ONE_ADDR_EEPROM_UNPAGED_SIZE)
+ return OPTOE_EOF;
+ maxlen = ONE_ADDR_EEPROM_UNPAGED_SIZE - off;
+ } else {
+ /* Pages supported, trim len to the end of pages */
+ maxlen = ONE_ADDR_EEPROM_SIZE - off;
+ }
+ len = (len > maxlen) ? maxlen : len;
+ }
+ dev_dbg(dev, "page_legal, class %d, off %lld len %ld\n",
+ optoe->dev_class, off, (long) len);
+ return len;
+}
+
+static int optoe_read_write(struct optoe_data *optoe,
+ char *buf, loff_t off, size_t len, int opcode)
+{
+ struct i2c_client *client = optoe->optoe_client.client;
+ int chunk;
+ int status = 0;
+ int retval;
+ size_t pending_len = 0, chunk_len = 0;
+ loff_t chunk_offset = 0, chunk_start_offset = 0;
+ loff_t chunk_end_offset = 0;
+
+ dev_dbg(&client->dev, "%s: off %lld len:%ld, opcode:%s\n",
+ __func__, off, (long) len,
+ (opcode == OPTOE_READ_OP) ? "r" : "w");
+ if (unlikely(!len))
+ return len;
+
+ /*
+ * Read data from chip, protecting against concurrent updates
+ * from this host, but not from other I2C masters.
+ */
+ mutex_lock(&optoe->lock);
+
+ /*
+ * Confirm this access fits within the device suppored addr range
+ */
+ status = optoe_page_legal(optoe, off, len);
+
+ /*
+ * returning 0 (OPTOE_EOF) on a write call gets into an infinite
+ * loop with the regmap/i2c code. Returning an error on a read call
+ * will show up as an error with 'cat <eeprom file>'. So, writing
+ * past EOF is an error, reading past EOF is just '0'
+ */
+ if ((status == OPTOE_EOF) && (opcode == OPTOE_WRITE_OP))
+ status = -EINVAL;
+
+ if ((status == OPTOE_EOF) || (status < 0)) {
+ mutex_unlock(&optoe->lock);
+ return status;
+ }
+ len = status;
+
+ /*
+ * For each (128 byte) chunk involved in this request, issue a
+ * separate call to optoe_eeprom_update_client(), to
+ * ensure that each access recalculates the client/page
+ * and writes the page register as needed.
+ * Note that chunk to page mapping is confusing, is different for
+ * QSFP/CMIS and SFP, and never needs to be done. Don't try!
+ */
+ pending_len = len; /* amount remaining to transfer */
+ retval = 0; /* amount transferred */
+ for (chunk = off >> 7; chunk <= (off + len - 1) >> 7; chunk++) {
+
+ /*
+ * Compute the offset and number of bytes to be read/written
+ *
+ * 1. start at an offset not equal to 0 (within the chunk)
+ * and read/write less than the rest of the chunk
+ * 2. start at an offset not equal to 0 and read/write the rest
+ * of the chunk
+ * 3. start at offset 0 (within the chunk) and read/write less
+ * than entire chunk
+ * 4. start at offset 0 (within the chunk), and read/write
+ * the entire chunk
+ */
+ chunk_start_offset = chunk * OPTOE_PAGE_SIZE;
+ chunk_end_offset = chunk_start_offset + OPTOE_PAGE_SIZE;
+
+ if (chunk_start_offset < off) {
+ chunk_offset = off;
+ if ((off + pending_len) < chunk_end_offset)
+ chunk_len = pending_len;
+ else
+ chunk_len = chunk_end_offset - off;
+ } else {
+ chunk_offset = chunk_start_offset;
+ if (pending_len < OPTOE_PAGE_SIZE)
+ chunk_len = pending_len;
+ else
+ chunk_len = OPTOE_PAGE_SIZE;
+ }
+
+ /*
+ * note: chunk_offset is from the start of the EEPROM,
+ * not the start of the chunk
+ */
+ status = optoe_eeprom_update_client(optoe, buf,
+ chunk_offset, chunk_len, opcode);
+ if (status != chunk_len) {
+ /* This is another 'no device present' path */
+ dev_dbg(&client->dev,
+ "o_u_c: chunk %d c_offset %lld c_len %ld failed %d!\n",
+ chunk, chunk_offset, (long) chunk_len, status);
+ if (status > 0)
+ retval += status;
+ if (retval == 0)
+ retval = status;
+ break;
+ }
+ buf += status;
+ pending_len -= status;
+ retval += status;
+ }
+ mutex_unlock(&optoe->lock);
+
+ return retval;
+}
+
+static int optoe_nvmem_read(void *priv, unsigned int off,
+ void *buf, size_t count)
+{
+ struct optoe_data *optoe = priv;
+
+ return optoe_read_write(optoe, buf, off, count, OPTOE_READ_OP);
+}
+
+static int optoe_nvmem_write(void *priv, unsigned int off,
+ void *buf, size_t count)
+{
+ struct optoe_data *optoe = priv;
+
+ return optoe_read_write(optoe, buf, off, count, OPTOE_WRITE_OP);
+}
+
+static int optoe_remove(struct i2c_client *client)
+{
+ struct optoe_data *optoe;
+
+ optoe = i2c_get_clientdata(client);
+ sysfs_remove_group(&client->dev.kobj, &optoe->attr_group);
+ kfree(optoe);
+ return 0;
+}
+
+
+/*
+ * optoe_make_regmap creates the regmap for the client.
+ * IMPORTANT: Don't call the regmap read/write calls directly
+ * for these devices. These devices are paged, and you have to
+ * set the page register before accessing the data in that page.
+ * Use the nvmem interfaces, those read/write calls use this
+ * driver to manage pages correctly.
+ */
+static struct regmap *optoe_make_regmap(struct i2c_client *client)
+{
+ struct regmap_config regmap_config = { };
+ struct regmap *regmap;
+
+ /* setup a minimal regmap - 8 bits, 8 bit addresses */
+ regmap_config.val_bits = 8;
+ regmap_config.reg_bits = 8;
+
+ /* I'll handle the locking */
+ regmap_config.disable_locking = true;
+ regmap = devm_regmap_init_i2c(client, &regmap_config);
+ return regmap;
+}
+
+/*
+ * optoe_make_nvmem() unregisters the existing optoe->nvmem if it
+ * exists, then registers a new one. Convenient when the size
+ * of an EEPROM device changes.
+ */
+static int optoe_make_nvmem(struct optoe_data *optoe)
+{
+ struct nvmem_config nvmem_config = { };
+ struct i2c_client *client = optoe->optoe_client.client;
+ struct device *dev = &client->dev;
+
+ nvmem_config.name = optoe->port_name;
+ /* NVMEM_DEVID_NONE tells nvmem not to append '0' to name */
+ nvmem_config.id = NVMEM_DEVID_NONE;
+ nvmem_config.dev = dev;
+ nvmem_config.read_only = false;
+ nvmem_config.root_only = false;
+ nvmem_config.owner = THIS_MODULE;
+ nvmem_config.compat = true;
+ nvmem_config.base_dev = dev;
+ nvmem_config.reg_read = optoe_nvmem_read;
+ nvmem_config.reg_write = optoe_nvmem_write;
+ nvmem_config.priv = optoe;
+ nvmem_config.stride = 1;
+ nvmem_config.word_size = 1;
+ nvmem_config.size = optoe->byte_len;
+ if (optoe->nvmem)
+ devm_nvmem_unregister(dev, optoe->nvmem);
+ optoe->nvmem = devm_nvmem_register(dev, &nvmem_config);
+ dev_info(dev, "%u byte class %d EEPROM\n",
+ optoe->byte_len, optoe->dev_class);
+ return 0;
+}
+
+static ssize_t dev_class_show(struct device *dev,
+ struct device_attribute *dattr, char *buf)
+{
+ struct i2c_client *client = to_i2c_client(dev);
+ struct optoe_data *optoe = i2c_get_clientdata(client);
+ ssize_t count;
+
+ mutex_lock(&optoe->lock);
+ count = sprintf(buf, "%d\n", optoe->dev_class);
+ mutex_unlock(&optoe->lock);
+
+ return count;
+}
+
+static ssize_t dev_class_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct i2c_client *client = to_i2c_client(dev);
+ struct optoe_data *optoe = i2c_get_clientdata(client);
+ int dev_class;
+ struct regmap *regmap;
+ ssize_t err;
+
+ /*
+ * dev_class is actually the number of i2c addresses used, thus
+ * legal values are "1" (QSFP class) and "2" (SFP class)
+ * And... CMIS spec is 1 i2c address, but puts the pageable
+ * bit in a different location, so CMIS devices are "3"
+ */
+
+ if (kstrtoint(buf, 0, &dev_class) != 0 ||
+ dev_class < 1 || dev_class > 3)
+ return -EINVAL;
+
+ if (optoe->dev_class == dev_class) /* no change, NOP */
+ return(count);
+
+ mutex_lock(&optoe->lock);
+ if (dev_class == TWO_ADDR) {
+ /* SFP family */
+ /* if it doesn't exist, create 0x51 i2c address */
+ if (!optoe->optoe_dummy.client) {
+ optoe->optoe_dummy.client =
+ devm_i2c_new_dummy_device(dev,
+ client->adapter,
+ 0x51);
+ if (!optoe->optoe_dummy.client) {
+ dev_err(&client->dev,
+ "address 0x51 unavailable\n");
+ mutex_unlock(&optoe->lock);
+ return -EADDRINUSE;
+ }
+ regmap = optoe_make_regmap(
+ optoe->optoe_dummy.client);
+ if (IS_ERR(regmap)) {
+ mutex_unlock(&optoe->lock);
+ return PTR_ERR(regmap);
+ }
+ optoe->optoe_dummy.regmap = regmap;
+ }
+ optoe->byte_len = TWO_ADDR_EEPROM_SIZE;
+ } else {
+ /* one-address (eg QSFP) and CMIS family */
+ /* note, no need to delete the dummy i2c device */
+ optoe->byte_len = ONE_ADDR_EEPROM_SIZE;
+ }
+ optoe->dev_class = dev_class;
+ optoe_make_nvmem(optoe); /* updates the reported size of EEPROM */
+ err = (IS_ERR(optoe->nvmem)) ? PTR_ERR(optoe->nvmem) : 0;
+ mutex_unlock(&optoe->lock);
+
+ return err;
+}
+
+static ssize_t port_name_show(struct device *dev,
+ struct device_attribute *dattr, char *buf)
+{
+ struct i2c_client *client = to_i2c_client(dev);
+ struct optoe_data *optoe = i2c_get_clientdata(client);
+ ssize_t count;
+
+ mutex_lock(&optoe->lock);
+ count = sprintf(buf, "%s\n", optoe->port_name);
+ mutex_unlock(&optoe->lock);
+
+ return count;
+}
+
+static ssize_t port_name_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct i2c_client *client = to_i2c_client(dev);
+ struct optoe_data *optoe = i2c_get_clientdata(client);
+ char port_name[MAX_PORT_NAME_LEN];
+
+ /* no checking, this value is not used except by port_name_show */
+
+ if (sscanf(buf, "%19s", port_name) != 1)
+ return -EINVAL;
+
+ mutex_lock(&optoe->lock);
+ strcpy(optoe->port_name, port_name);
+ optoe_make_nvmem(optoe); /* updates the name of the nvmem */
+ mutex_unlock(&optoe->lock);
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(port_name);
+
+static DEVICE_ATTR_RW(dev_class);
+
+
+static struct attribute *optoe_attrs[] = {
+ &dev_attr_port_name.attr,
+ &dev_attr_dev_class.attr,
+ NULL,
+};
+
+static struct attribute_group optoe_attr_group = {
+ .attrs = optoe_attrs,
+};
+
+static int optoe_probe(struct i2c_client *client)
+{
+ struct regmap *regmap;
+ int err;
+ struct optoe_data *optoe;
+ const char *of_port_name = NULL;
+ struct device *dev = &client->dev;
+
+ if (client->addr != 0x50) {
+ dev_dbg(dev, "probe, bad i2c addr: 0x%x\n", client->addr);
+ err = -EINVAL;
+ goto exit;
+ }
+
+ optoe = kzalloc(sizeof(struct optoe_data), GFP_KERNEL);
+ if (!optoe) {
+ err = -ENOMEM;
+ goto exit;
+ }
+ mutex_init(&optoe->lock);
+
+ /*
+ * if the port_name property is defined, use it, else use the
+ * device name as port_name
+ */
+ if (device_property_present(dev, "port_name")) {
+ err = device_property_read_string(dev, "port_name",
+ &of_port_name);
+ if (err)
+ of_port_name = dev_name(dev);
+ } else {
+ of_port_name = dev_name(dev);
+ }
+ memcpy(optoe->port_name, of_port_name, MAX_PORT_NAME_LEN);
+
+ /* determine whether this is a one-address or two-address module */
+ if (strcmp(client->name, "optoe1") == 0) {
+ /* QSFP family */
+ optoe->dev_class = ONE_ADDR;
+ optoe->byte_len = ONE_ADDR_EEPROM_SIZE;
+ } else if (strcmp(client->name, "optoe2") == 0) {
+ /* SFP family */
+ optoe->dev_class = TWO_ADDR;
+ optoe->byte_len = TWO_ADDR_EEPROM_SIZE;
+ } else if (strcmp(client->name, "optoe3") == 0) {
+ /* CMIS spec */
+ optoe->dev_class = CMIS_ADDR;
+ optoe->byte_len = ONE_ADDR_EEPROM_SIZE;
+ } else { /* those were the only choices */
+ err = -EINVAL;
+ goto exit;
+ }
+
+ /*
+ * Old application notes recommend 1 byte writes for some
+ * modules. This could probably be lifted, but lacking
+ * a broad base of devices and systems to test, I'm leaving
+ * this as is. If this is ever raised, the max would automatically
+ * be 128 bytes as anything larger would cross page
+ * boundaries with wraparound effects.
+ * TODO: Consider making this a device tree property
+ */
+
+ optoe->write_max = 1;
+
+ regmap = optoe_make_regmap(client);
+ if (IS_ERR(regmap)) {
+ err = PTR_ERR(regmap);
+ goto exit;
+ }
+
+ optoe->optoe_client.client = client;
+ optoe->optoe_client.regmap = regmap;
+
+ /* SFF-8472 spec requires that the second I2C address be 0x51 */
+ if (optoe->dev_class == TWO_ADDR) {
+ optoe->optoe_dummy.client =
+ devm_i2c_new_dummy_device(dev, client->adapter, 0x51);
+ if (!optoe->optoe_dummy.client) {
+ dev_err(dev, "address 0x51 unavailable\n");
+ err = -EADDRINUSE;
+ goto err_struct;
+ }
+ regmap = optoe_make_regmap(optoe->optoe_dummy.client);
+ if (IS_ERR(regmap)) {
+ err = PTR_ERR(regmap);
+ goto err_struct;
+ }
+ optoe->optoe_dummy.regmap = regmap;
+ }
+
+ optoe->attr_group = optoe_attr_group;
+
+ err = sysfs_create_group(&client->dev.kobj, &optoe->attr_group);
+ if (err) {
+ dev_err(dev, "failed to create sysfs attribute group.\n");
+ goto err_struct;
+ }
+
+ i2c_set_clientdata(client, optoe);
+ optoe_make_nvmem(optoe);
+ if (IS_ERR(optoe->nvmem)) {
+ err = PTR_ERR(optoe->nvmem);
+ goto err_struct;
+ }
+
+ return 0;
+
+err_struct:
+ kfree(optoe);
+exit:
+ dev_dbg(dev, "probe error %d\n", err);
+
+ return err;
+}
+
+/*-------------------------------------------------------------------------*/
+
+static struct i2c_driver optoe_driver = {
+ .driver = {
+ .name = "optoe",
+ .owner = THIS_MODULE,
+ },
+ .probe_new = optoe_probe,
+ .remove = optoe_remove,
+ .id_table = optoe_ids,
+};
+
+static int __init optoe_init(void)
+{
+ return i2c_add_driver(&optoe_driver);
+}
+module_init(optoe_init);
+
+static void __exit optoe_exit(void)
+{
+ i2c_del_driver(&optoe_driver);
+}
+module_exit(optoe_exit);
+
+MODULE_DESCRIPTION("Driver for optical transceiver (SFP/QSFP/CMIS) EEPROMs");
+MODULE_AUTHOR("DON BOLLINGER <[email protected]>");
+MODULE_LICENSE("GPL");

base-commit: a2ea4e1d9091cd8bc69f1c42c15bedc38618f04c
--
2.11.0


2021-02-26 22:36:46

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> optoe is an i2c based driver that supports read/write access to all
> the pages (tables) of MSA standard SFP and similar devices (conforming
> to the SFF-8472 spec), MSA standard QSFP and similar devices (conforming
> to the SFF-8636 spec) and CMIS and similar devices (conforming to the
> Common Management Interface Specfication).

Hi Don

Please make sure you Cc: netdev. This is networking stuff.

And we have seen this code before, and the netdev Maintainers have
argued against it before.

> These devices provide identification, operational status and control
> registers via an EEPROM model. These devices support one or 3 fixed pages
> (128 bytes) of data, and one page that is selected via a page register on
> the first fixed page. Thus the driver's main task is to map these pages
> onto a simple linear address space for user space management applications.
> See the driver code for a detailed layout.

I assume you have seen the work NVIDIA submitted last week? This idea
of linear pages is really restrictive and we are moving away from it.

> The EEPROM data is accessible to user space and kernel consumers via the
> nvmem interface.

ethtool -m ?

In the past, this code has been NACKed because it does not integrate
into the networking stack. Is this attempt any different?

Thanks
Andrew

2021-02-27 03:06:11

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, Feb 26, 2021 at 2:35 PM -0800, Andrew Lunn wrote:
> On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > optoe is an i2c based driver that supports read/write access to all
> > the pages (tables) of MSA standard SFP and similar devices (conforming
> > to the SFF-8472 spec), MSA standard QSFP and similar devices
> > (conforming to the SFF-8636 spec) and CMIS and similar devices
> > (conforming to the Common Management Interface Specfication).
>
> Hi Don
>
> Please make sure you Cc: netdev. This is networking stuff.

Will do.

> And we have seen this code before, and the netdev Maintainers have
> argued against it before.

Yes, though I have tried to make it more acceptable, and also more useful to
you...

> > These devices provide identification, operational status and control
> > registers via an EEPROM model. These devices support one or 3 fixed
> > pages
> > (128 bytes) of data, and one page that is selected via a page register
> > on the first fixed page. Thus the driver's main task is to map these
> > pages onto a simple linear address space for user space management
> applications.
> > See the driver code for a detailed layout.
>
> I assume you have seen the work NVIDIA submitted last week? This idea of
> linear pages is really restrictive and we are moving away from it.

No, I haven't seen it. I can't seem to locate anything in the past month on
LMKL from NVIDIA. Please point me to it.

What are you using instead of mapping the pages into a linear address space?
Perhaps I can provide a different interface to call with some other mapping.

> > The EEPROM data is accessible to user space and kernel consumers via
> > the nvmem interface.
>
> ethtool -m ?

This is actually below ethtool. Ethtool has to fetch the data from the
eeprom using an appropriate i2c interface (these are defined to be i2c
devices). And, to fetch the data from any but the first 256 bytes the code
ethtool calls must deal with the page register on the device. This driver
handles the page register, for both SFP and QSFP/CMIS devices. This driver
would be a useful path for ethtool to use when someone decides to make that
data available. Note for example, CMIS devices put the DOM data
(per-channel Tx power, Rx Power, laser bias) on page 0x11. When you want
that data, you'll have to go past 256 bytes and deal with the page register.
optoe will do that for you.

> In the past, this code has been NACKed because it does not integrate into
> the networking stack. Is this attempt any different?

Yes. I have updated the driver with all the latest changes in at24.c, the
primary eeprom driver. That update includes use of the nvmem interface,
which means this driver can now be called by kernel code. I believe it
would be useful and easy to replace the sfp.c code that accesses the eeprom
with nvmem calls to this driver. By doing so, you will be able to access
the additional pages of data on those devices. You would also get the
benefit of sharing common code the other eeprom drivers. As part of that
cleanup, the explicit sysfs creation of an eeprom device has been removed.
Full disclosure, the nvmem interface creates that device now.

> Thanks
> Andrew

One more point, I have been requested by the SONiC team to upstream this
driver. It is on their short list of kernel code that is not upstream, and
they want to fix that. They are not consumers of the netdev interface, but
they (and other NOS providers) have a need for a driver to access the eeprom
data. Greg KH was involved in that request, and I related your concerns to
him, as well as my position that this is an eeprom driver with partners that
need it independent of netdev. His response:

GKH> I can't remember the actual patch anymore, but you are right, it's
"just"
GKH> poking around in the EEPROM for the device, and doing what you want in
GKH> userspace, which really should be fine. Submit it and I'll be glad to
review it
GKH> under that type of functionality being added.

He didn't say he would override your position, but he suggested it was
appropriate to submit it. So, here it is.

Thus:
1. There are existing NOS platforms that need and use this functionality,
they want it in the upstream kernel.
2. This driver is better than sfp.c, and could easily be plugged in to
improve netdev and ethtool access to SFP/QSFP/CMIS EEPROM data.

Don

2021-02-27 16:27:13

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > I assume you have seen the work NVIDIA submitted last week? This idea of
> > linear pages is really restrictive and we are moving away from it.
>
> No, I haven't seen it. I can't seem to locate anything in the past month on
> LMKL from NVIDIA. Please point me to it.

[RFC PATCH net-next 0/5] ethtool: Extend module EEPROM dump
Message-Id: <[email protected]>

b4 should be able to fetch it for you, using that message id.

Clearly, we don't want two different kernel APIs for doing the same
thing. This new KAPI is still in its early days. You can contribute to
it, and make it work for your use case. If i understand correctly, you
are using Linux as a bootloader, and running the complete switch
driver in userspace, not making use of the Linux network stack. This
is not something the netdev community likes, but if you work within
the networking KAPI, rather than adding parallel KAPI, we can probably
work together. I think the biggest problem you have is identifiers.
Since you don't have the SFP associated to a netdev, the current IOCTL
interface which us the netdev name as an identifier does not work. But
the new code is netlink based. The identifier is just an attribute in
the message. See if you can use an alternative attribute which
directly identifies the SFP, not the netdev. It is O.K. to instantiate
an SFP device and then not make use of it in PHYLINK. So this should
work.

Andrew

2021-03-04 05:10:33

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, Feb 26, 2021 at 6:46 PM -0800, Don Bollinger wrote:
> On Fri, Feb 26, 2021 at 2:35 PM -0800, Andrew Lunn wrote:
> > On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > > optoe is an i2c based driver that supports read/write access to all
> > > the pages (tables) of MSA standard SFP and similar devices
> > > (conforming to the SFF-8472 spec), MSA standard QSFP and similar
> > > devices (conforming to the SFF-8636 spec) and CMIS and similar
> > > devices (conforming to the Common Management Interface
> Specfication).
> >
> > Hi Don
> >
> > And we have seen this code before, and the netdev Maintainers have
> > argued against it before.
>
> Yes, though I have tried to make it more acceptable, and also more useful
to
> you...
>
> > > These devices provide identification, operational status and control
> > > registers via an EEPROM model. These devices support one or 3 fixed
> > > pages
> > > (128 bytes) of data, and one page that is selected via a page
> > > register on the first fixed page. Thus the driver's main task is to
> > > map these pages onto a simple linear address space for user space
> > > management
> > applications.
> > > See the driver code for a detailed layout.
> >
> > I assume you have seen the work NVIDIA submitted last week? This idea
> > of linear pages is really restrictive and we are moving away from it.

Yes, I see at least most of it in your response in the netdev archive. The
linear pages are really a very simple mapping, but they also provide a very
simple access model. I can readily accommodate Moshe's addressing (i2c
addr, page, bank, offset, len). Details below...

> > > The EEPROM data is accessible to user space and kernel consumers via
> > > the nvmem interface.
> >
> > ethtool -m ?
>
> This is actually below ethtool. Ethtool has to fetch the data from the
eeprom
> using an appropriate i2c interface (these are defined to be i2c devices).
And,
> to fetch the data from any but the first 256 bytes the code ethtool calls
must
> deal with the page register on the device. This driver handles the page
> register, for both SFP and QSFP/CMIS devices. This driver would be a
useful
> path for ethtool to use when someone decides to make that data available.
> Note for example, CMIS devices put the DOM data (per-channel Tx power,
> Rx Power, laser bias) on page 0x11. When you want that data, you'll have
to
> go past 256 bytes and deal with the page register.
> optoe will do that for you.

To be more specific, optoe is only replacing the functionality of
drivers/net/phy/sfp.c, the functions of sfp_i2c_read() and sfp_i2c_write().
These are the routines at the very bottom of the ethtool stack that actually
execute the i2c calls to get the data. The existing routines are very
limited, in that they don't handle pages at all. Hence they can only reach
256 bytes of QSFP EEPROM data and 512 bytes of SFP EEPROM data. I can
propose a shorter cleaner replacement for each of those routines which will
provide access to the rest of the data on those devices.

ALL of the ethtool stack remains unchanged, works exactly the same, and will
now provide access to more data on the EEPROMs. We may have to remove some
range checks to allow requests to pages beyond page 0.

Note that Moshe's RFC does not include the actual access routines, the
equivalent of sfp_i2c_read/write. That will in fact require managing pages.
Using optoe, those routines are a few lines of code to map his
addr/page/bank/offset and a call to the new sfp_i2c_read/write calls.

And, all of the i2c/EEPROM accesses go through the same code that routinely
handles the rest of the EEPROMs in the system, with all of the architectural
consolidation, i2c anomaly handling, and simplified support inherent in
sharing common code.

> > In the past, this code has been NACKed because it does not integrate
> > into the networking stack. Is this attempt any different?
>
> Yes. I have updated the driver with all the latest changes in at24.c, the
> primary eeprom driver. That update includes use of the nvmem interface,
> which means this driver can now be called by kernel code. I believe it
would
> be useful and easy to replace the sfp.c code that accesses the eeprom with
> nvmem calls to this driver. By doing so, you will be able to access the
> additional pages of data on those devices. You would also get the benefit
of
> sharing common code the other eeprom drivers. As part of that cleanup,
the
> explicit sysfs creation of an eeprom device has been removed.
> Full disclosure, the nvmem interface creates that device now.
>
> > Thanks
> > Andrew
>
> One more point, I have been requested by the SONiC team to upstream this
> driver. It is on their short list of kernel code that is not upstream,
and they
> want to fix that. They are not consumers of the netdev interface, but
they
> (and other NOS providers) have a need for a driver to access the eeprom
> data. Greg KH was involved in that request, and I related your concerns
to
> him, as well as my position that this is an eeprom driver with partners
that
> need it independent of netdev. His response:
>
> GKH> I can't remember the actual patch anymore, but you are right, it's
> "just"
> GKH> poking around in the EEPROM for the device, and doing what you want
> GKH> in userspace, which really should be fine. Submit it and I'll be
> GKH> glad to
> review it
> GKH> under that type of functionality being added.
>
> He didn't say he would override your position, but he suggested it was
> appropriate to submit it. So, here it is.
>
> Thus:
> 1. There are existing NOS platforms that need and use this functionality,
> they want it in the upstream kernel.
> 2. This driver is better than sfp.c, and could easily be plugged in to
improve
> netdev and ethtool access to SFP/QSFP/CMIS EEPROM data.
>
> Don

2021-03-04 05:21:02

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> To be more specific, optoe is only replacing the functionality of
> drivers/net/phy/sfp.c, the functions of sfp_i2c_read() and sfp_i2c_write().
> These are the routines at the very bottom of the ethtool stack that actually
> execute the i2c calls to get the data. The existing routines are very
> limited, in that they don't handle pages at all. Hence they can only reach
> 256 bytes of QSFP EEPROM data and 512 bytes of SFP EEPROM data. I can
> propose a shorter cleaner replacement for each of those routines which will
> provide access to the rest of the data on those devices.

drivers/net/phy/sfp.c is not the only code making use of this KAPI.
Any MAC driver can implement the ethtool op calls for reading SFP
memory. The MAC driver can either directly reply because it has the
SFP hidden behind firmware, or it can call into the sfp.c code,
because Linux is driving the SFP.

Moshe is working on the Mellonox MAC drivers. As you say, the current
sfp.c code is very limited. But once Moshe code is merged, i will do
the work needed to extend sfp.c to fully support the KAPI. It will
then work for many more MAC drivers, those using phylink.

For me, the KAPI is the important thing, and less so how the
implementation underneath works. Ideally, we want one KAPI for
accessing SFP EEPROMs. Up until now, that KAPI is the ethtool IOCTL.
But that KAPI is now showing its age, and it causing us problems. So
we need to replace that KAPI. ethtool has recently moved to using
netlink messages. So any replacement should be based on netlink. The
whole network stack is pretty much controlled via netlink. So you will
find it very difficult to argue for any other form of KAPI within the
netdev community. Since optoe's KAPI is not netlink based, it is very
unlikely to be accepted.

But netlink is much more flexible than the older IOCTL interface.
Please work with us to ensure this new KAPI can work with your use
cases.

Andrew

2021-03-05 19:10:18

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Mon, Mar 1, 2021 at 12:46 PM-0800, Andrew Lunn wrote:
> > To be more specific, optoe is only replacing the functionality of
> > drivers/net/phy/sfp.c, the functions of sfp_i2c_read() and
sfp_i2c_write().
> > These are the routines at the very bottom of the ethtool stack that
> > actually execute the i2c calls to get the data. The existing routines
> > are very limited, in that they don't handle pages at all. Hence they
> > can only reach
> > 256 bytes of QSFP EEPROM data and 512 bytes of SFP EEPROM data. I can
> > propose a shorter cleaner replacement for each of those routines which
> > will provide access to the rest of the data on those devices.
>
> drivers/net/phy/sfp.c is not the only code making use of this KAPI.
> Any MAC driver can implement the ethtool op calls for reading SFP memory.
> The MAC driver can either directly reply because it has the SFP hidden
> behind firmware, or it can call into the sfp.c code, because Linux is
driving the
> SFP.

OK, I have checked with my partners, including NOS vendors and switch
platform vendors. They are not using the netdev framework, they are
basically not using kernel networking for managing the networking through
tens to over a hundred network ports at 10G to 400G speeds. The kernel is
not the source of truth regarding the state of network devices. I know
netdev *could* manage these systems, and that you are working toward that
goal, that is not the approach they are taking nor the direction they are
heading. I am not disparaging netdev, I respect and value the contribution
to linux networking. It's all good. It just isn't the direction my
partners are going at this time.

You have described this architecture in the past as a 'bootloader'. In fact
Linux is the operating system running on those switches. It is not
temporary (eg loading the real OS and going away). It is allocating memory,
dispatching processes and threads, handling interrupts, hosting docker
containers, and running the proprietary network APIs that manage the
networks. In this architecture, the optical modules are managed by the OS,
through drivers. The network APIs interact through these drivers. For much
of the configuration data, there are configuration files that match up
device hardware (e.g. Low Power Mode GPIO lines and Tx disable lines) and
i2c buses (through layers of i2c muxes) with switch ports as seen by the
switch silicon. Network management software (user space apps) is
responsible for enabling, configuring and monitoring optical modules to
match the config files. Kernel drivers provide the access to the GPIO lines
and the EEPROM control registers.

Notably, in this architecture, there are actually NO kernel consumers of the
module EEPROM data. The version of optoe that is in production in these NOS
and switch environments does not even have an entry point callable by the
kernel. All of the consumers are accessing the data via the sysfs file in
/sys/bus/i2c/devices/*. I have closely modeled the updated version of
optoe on the at24 driver (drivers/misc/eeprom/at24.c). Thus, the KAPI is
actually the same as used by other eeprom drivers. It is an eeprom, it is
accessed by the nvmem interfaces, in both kernel and user space.

My primary motivation for creating optoe is to consolidate a bunch of
different implementations by different vendors, to add page support which
most implementations lacked, to extend the reach to all of the architected
pages (the standards describe them as proprietary, not forbidden), to
provide write access, and to enable CMIS devices. Those goals apply to the
netdev/netlink environment as well. I added the kernel access via nvmem to
facilitate adoption in your network stack, to achieve the same goals
(standardization and improvement of access to module EEPROMs).

>
> Moshe is working on the Mellonox MAC drivers. As you say, the current
sfp.c

I love Moshe's KAPI, and am reviewing and commenting on it to ensure its
success.

> code is very limited. But once Moshe code is merged, i will do the work
> needed to extend sfp.c to fully support the KAPI. It will then work for
many
> more MAC drivers, those using phylink.

One piece of that work could be to replace the contents of
drivers/net/phy/sfp.c, functions sfp_i2c_read() and sfp_i2c_write() with
nvmem calls to optoe. That would be an easy change, and provide all of the
features of optoe (pages, access to all of the EEPROM, write support, CMIS),
without writing and maintaining that i2c access code. The actual i2c calls
would be handled by the same code that is supporting at24.

It is plausible that platform vendors would choose not to implement their
own version of these functions if the generic sfp_i2c_read/write worked for
them. Fewer implementations of the same code, with more capability in the
common implementation, is obviously beneficial.

> For me, the KAPI is the important thing, and less so how the
implementation
> underneath works. Ideally, we want one KAPI for accessing SFP EEPROMs.
> Up until now, that KAPI is the ethtool IOCTL.
> But that KAPI is now showing its age, and it causing us problems. So we
need
> to replace that KAPI. ethtool has recently moved to using netlink
messages.
> So any replacement should be based on netlink. The whole network stack is
> pretty much controlled via netlink. So you will find it very difficult to
argue for
> any other form of KAPI within the netdev community. Since optoe's KAPI is
> not netlink based, it is very unlikely to be accepted.
>
> But netlink is much more flexible than the older IOCTL interface.
> Please work with us to ensure this new KAPI can work with your use cases.

I accept all your points from the netdev/netlink perspective. To that end,
I offer optoe as an upgrade to the default implementation of
sfp_i2c_read/write.

I also have partners using a different architecture, for whom a
netdev/netlink based solution would not be useful. These partners have been
using a sysfs based approach to module EEPROMs and have no motivation to
change that. This version of optoe is using the standard eeprom access
method (nvmem) to provide this access.

Acknowledging your objections, I nonetheless request that optoe be accepted
into upstream as an eeprom driver in drivers/misc/eeprom. It is a
legitimate driver, with a legitimate user community, which deserves the
benefits of being managed as a legitimate part of the linux kernel.

>
> Andrew

Don

2021-03-05 22:59:16

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, 5 Mar 2021 11:07:08 -0800 Don Bollinger wrote:
> Acknowledging your objections, I nonetheless request that optoe be accepted
> into upstream as an eeprom driver in drivers/misc/eeprom. It is a
> legitimate driver, with a legitimate user community, which deserves the
> benefits of being managed as a legitimate part of the linux kernel.

It's in the best interest of the community to standardize on how
we expect things to operate. You're free to do whatever you want
in your proprietary systems but please don't expect us to accept
a parallel, in now way superior method of accessing SFPs.

2021-03-06 02:46:46

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, 5 Mar 2021 2:55 PM -0800 Jakub Kicinski wrote:
> On Fri, 5 Mar 2021 11:07:08 -0800 Don Bollinger wrote:
> > Acknowledging your objections, I nonetheless request that optoe be
> > accepted into upstream as an eeprom driver in drivers/misc/eeprom. It
> > is a legitimate driver, with a legitimate user community, which
> > deserves the benefits of being managed as a legitimate part of the linux
> kernel.
>
> It's in the best interest of the community to standardize on how we expect
> things to operate. You're free to do whatever you want in your proprietary
> systems but please don't expect us to accept a parallel, in now way
superior
> method of accessing SFPs.

These are not proprietary systems. The Network Operating Systems that use
optoe are open source projects, Linux based, available on github, and
contributing to the Linux source (see for example
https://github.com/Azure/SONiC). The switches that these NOSs run on are
open spec systems
(https://www.opencompute.org/wiki/Networking/SpecsAndDesigns). The fact
that they use the SDK from the switch silicon vendor should not mean they
are banished from the Linux community.

I am proposing acceptance of a commonly used implementation for accessing
SFPs because the one used by the netdev/netlink community does not fit the
architecture of the white box NOS/switch community. I am not trying to pick
sides, I am trying to make optoe available to both communities, to improve
EEPROM access for both of them.

Don

2021-03-06 03:36:31

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> I am proposing acceptance of a commonly used implementation for accessing
> SFPs because the one used by the netdev/netlink community does not fit the
> architecture of the white box NOS/switch community.

Please could you expand on this. I've given suggests as to how the new
netlink KAPI could be used for this use case, without being attached
to a netdev. And you have said nothing about why it cannot be made to
work. You cannot argue the architecture does not fit without actually
saying why it does not fit.

Andrew

2021-03-12 19:06:37

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, 5 Mar 2021 19:32 -0800 Andrew Lunn wrote:
> > I am proposing acceptance of a commonly used implementation for
> > accessing SFPs because the one used by the netdev/netlink community
> > does not fit the architecture of the white box NOS/switch community.
>
> Please could you expand on this. I've given suggests as to how the new
> netlink KAPI could be used for this use case, without being attached to a
> netdev. And you have said nothing about why it cannot be made to work.
> You cannot argue the architecture does not fit without actually saying why
it
> does not fit.
>
> Andrew

Sorry it took some time to clarify this for myself. I'm using SONiC (the
NOS
Microsoft uses to run the switches in its Azure cloud) as my example. They
are users of optoe, and they actually initiated the request to push optoe
upstream. Other white box NOS vendors are similar.

SONiC manages all aspects of SFP/QSFP/CMIS interaction through user
space. They have specified an API that is implemented by switch platform
vendors, that provides things like presence detection, LowPower mode
up/down/status, raw access to EEPROM content, interpretation of EEPROM
content (including TxPower/RxPower/bias, high/low alarm/warning thresholds,
static content like serial number and part number, and dozens of other
items).
This interface is implemented in python scripts provided by the switch
platform
vendor. Those scripts encode the mapping of CPLD i2c muxes to i2c buses to
port numbers, specific to each switch.

At the bottom of that python stack, all EEPROM access goes through
open/seek/read/close access to the optoe managed file in
/sys/bus/i2c/devices/{num}-0050/eeprom.

You're not going to like this, but ethtool -e and ethtool -m both return
' Ethernet0 Cannot get EEPROM data: Operation not supported', for
every port managed by the big switch silicon.

So, my users are using Linux for all the usual Linux things (memory
management, process management, I/O, IPC, containers), but they don't
use Linux networking to manage the ports that are managed by the big
switch silicon. (Linux networking is still in use for the management port
that talks to the management processor running Linux.)

optoe provides the device EEPROM foundation for this architecture, but
requires the sysfs interface (via nvmem) to provide it. optoe can also
easily
provide the default EEPROM access for the netdev/netlink interface through
the old and new KAPI.

Don

2021-03-12 20:01:28

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> This interface is implemented in python scripts provided by the switch
> platform
> vendor. Those scripts encode the mapping of CPLD i2c muxes to i2c buses to
> port numbers, specific to each switch.
>
> At the bottom of that python stack, all EEPROM access goes through
> open/seek/read/close access to the optoe managed file in
> /sys/bus/i2c/devices/{num}-0050/eeprom.

And this python stack is all open source? So you should be able to
throw away parts of the bottom end and replace it with a different
KAPI, and nobody will notice? In fact, this is probably how it was
designed. Anybody working with out of tree code knows what gets merged
later is going to be different because of review comments. And KAPI
code is even more likely to be different. So nobody really expected
optoe to get merged as is.

> You're not going to like this, but ethtool -e and ethtool -m both
> return ' Ethernet0 Cannot get EEPROM data: Operation not supported',
> for every port managed by the big switch silicon.

You are still missing what i said. The existing IOCTL interface needs
a network interface name. But there is no reason why you cannot extend
the new netlink KAPI to take an alternative identifier, sfp42. That
maps directly to the SFP device, without using an interface name. Your
pile of python can directly use the netlink API, the ethtool command
does not need to make use of this form of identifier, and you don't
need to "screen scrape" ethtool.

It seems very unlikely optoe is going to get merged. The network
maintainers are against it, due to KAPI issues. I'm trying to point
out a path you can take to get code merged. But it is up to you if you
decided to follow it.

Andrew

2021-03-13 21:44:02

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > This interface is implemented in python scripts provided by the switch
> > platform vendor. Those scripts encode the mapping of CPLD i2c muxes
> > to i2c buses to port numbers, specific to each switch.
> >
> > At the bottom of that python stack, all EEPROM access goes through
> > open/seek/read/close access to the optoe managed file in
> > /sys/bus/i2c/devices/{num}-0050/eeprom.
>
> And this python stack is all open source? So you should be able to throw

It is open in the sense that it is accessible to the world on github and the
maintainers would presumably entertain contributions.

In practice, these are developed and maintained by the white box switch
vendors. There is one implementation for each platform on each supported
NOS. There is lots of commonality among them, but each is nonetheless
unique. Optoe has been adopted across this ecosystem as a common piece that
has been factored out of many of the platform specific implementations.

> away parts of the bottom end and replace it with a different KAPI, and
> nobody will notice? In fact, this is probably how it was designed. Anybody

Actually everyone who touches this code would notice, each implementation
would have to be modified, with literally no benefit to this community.

> working with out of tree code knows what gets merged later is going to be
> different because of review comments. And KAPI code is even more likely to
> be different. So nobody really expected optoe to get merged as is.

The list of 'nobody' includes myself, my switch platform partners, my NOS
partners and Greg KH. I did expect to accommodate constructive review of
the code, which I have already done (this is v2).

>
> > You're not going to like this, but ethtool -e and ethtool -m both
> > return ' Ethernet0 Cannot get EEPROM data: Operation not supported',
> > for every port managed by the big switch silicon.
>
> You are still missing what i said. The existing IOCTL interface needs a
network
> interface name. But there is no reason why you cannot extend the new
> netlink KAPI to take an alternative identifier, sfp42. That maps directly
to the
> SFP device, without using an interface name. Your pile of python can
directly
> use the netlink API, the ethtool command does not need to make use of this
> form of identifier, and you don't need to "screen scrape" ethtool.

It is just software, your proposal is certainly technically feasible. It
provides no benefit to the community that is using optoe.

optoe is using a perfectly good KAPI, the nvmem interface that is being
developed and maintained by the folks who manage the EEPROM drivers in the
kernel. It has been updated since the prior submittal in 2018 to use the
nvmem interface and the regmap interface, both from the at24.c driver. This
community isn't using the rest of the netdev/netlink interfaces, and has
adopted (before I wrote optoe) a perfectly reasonable approach of writing a
simple driver to access these simple devices.

optoe does not undermine the netlink KAPI that Moshe is working on. If your
community is interested, it could adopt optoe, WITH your KAPI, to
consolidate and improve module EEPROM access for mainstream netdev
consumers. I am eager to collaborate on the fairly simple integration.

>
> It seems very unlikely optoe is going to get merged. The network
maintainers
> are against it, due to KAPI issues. I'm trying to point out a path you can
take
> to get code merged. But it is up to you if you decided to follow it.

Thank you, I decline. I respectfully request that optoe be accepted as a
useful implementation of the EEPROM driver, with the same access methods as
other EEPROM drivers, customized for the unique memory layout of SFP, QSFP
and CMIS devices. I remain open to improvements in the implementation, but
my community finds no value in an implementation that removes the standard
EEPROM access via sysfs and open/seek/read/close calls.

>
> Andrew

Don

2021-03-15 17:42:36

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Sat, 13 Mar 2021 13:35:56 -0800 Don Bollinger wrote:
> > away parts of the bottom end and replace it with a different KAPI, and
> > nobody will notice? In fact, this is probably how it was designed. Anybody
>
> Actually everyone who touches this code would notice, each implementation
> would have to be modified, with literally no benefit to this community.

You keep saying that kernel API is "of no benefit to this community"
yet you don't want to accept the argument that your code is of no
benefit to the upstream community.

> optoe does not undermine the netlink KAPI that Moshe is working on.

It does, although it may be hard to grasp for a vendor who can just EoL
a product at will once nobody is paying for it. We aim to provide
uniform support for all networking devices and an infinite backward
compatibility guarantee.

People will try to use optoe-based tools on the upstream drivers and
they won't work. Realistically we will need to support both APIs.

> If your community is interested, it could adopt optoe, WITH your
> KAPI, to consolidate and improve module EEPROM access for mainstream
> netdev consumers. I am eager to collaborate on the fairly simple
> integration.

Nacked-by: Jakub Kicinski <[email protected]>

Please move on.

2021-03-15 20:29:21

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Mon, 15 Mar 2021 10:40 -0800 Jakub Kicinski wrote:
> On Sat, 13 Mar 2021 13:35:56 -0800 Don Bollinger wrote:
> > > away parts of the bottom end and replace it with a different KAPI,
> > > and nobody will notice? In fact, this is probably how it was
> > > designed. Anybody
> >
> > Actually everyone who touches this code would notice, each
> > implementation would have to be modified, with literally no benefit to
this
> community.
>
> You keep saying that kernel API is "of no benefit to this community"
> yet you don't want to accept the argument that your code is of no benefit
to
> the upstream community.

I have offered, in every response, to collaborate with the simple
integration to use optoe as the default upstream driver to access the module
EEPROMs. optoe would be superior to the existing default routines in sfp.c
and would allow multiple existing vendor specific upstream drivers to adopt
the default. That would reduce the code base they maintain and provide
better access to module eeproms. I already adopted the existing EEPROM api
to make that integration easy (nvmem). I'm reluctant to submit the changes
to sfp.c because I have no expertise in that stack and no platform to test
it on.

>
> > optoe does not undermine the netlink KAPI that Moshe is working on.
>
> It does, although it may be hard to grasp for a vendor who can just EoL a
> product at will once nobody is paying for it. We aim to provide uniform
> support for all networking devices and an infinite backward compatibility
> guarantee.

I aim to provide uniform support for module EEPROM devices, with no less
reason to expect an infinite backward compatibility guarantee. (Infinite is
a bit much, that technology will inevitably become uninteresting to my
community as well as yours.)

>
> People will try to use optoe-based tools on the upstream drivers and they
> won't work. Realistically we will need to support both APIs.

I assume they "won't work" because of new requirements or newly discovered
defects. At that point optoe would be fixed by people who care, just like
any other upstream code. If your stack adopts optoe, through Moshe's KAPI,
then both communities benefit from ongoing support to maintain and enhance
EEPROM access. If your stack does not adopt optoe, then my community will
manage the support, since they need and use the code.

As for "both APIs", the first API is Moshe's, which we both support
(politically and technically). The second is nvmem, which is supported by
the EEPROM driver folks, led by the support for the at24 driver. I'm
calling the routines they created for this purpose, I'm not creating a new
API.

Bottom line: "Realistically we will need to support both APIs" even if
optoe does not get accepted.

>
> > If your community is interested, it could adopt optoe, WITH your KAPI,
> > to consolidate and improve module EEPROM access for mainstream netdev
> > consumers. I am eager to collaborate on the fairly simple
> > integration.
>
> Nacked-by: Jakub Kicinski <[email protected]>
>
> Please move on.

My community still has useful code that they would like in the upstream
kernel.

Don

2021-03-17 18:17:38

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> I have offered, in every response, to collaborate with the simple
> integration to use optoe as the default upstream driver to access the module
> EEPROMs. optoe would be superior to the existing default routines in sfp.c

Actually, i'm not sure they would be. Since the KAPI issues are pretty
much a NACK on their own, i didn't bother raising other issues. Both
Russell King and I has issues with quirks and hotplug.

Our experience is that a number of SFPs are broken, they don't follow
the standard. Some you cannot perform more than 16 bytes reads without
them locking up. Others will perform a 16 byte read, but only give you
one useful byte of data. So you have to read enough of the EEPROM a
byte at a time to get the vendor and product strings in order to
determine what quirks need to be applied. optoe has nothing like
this. Either you don't care and only support well behaved SFPs, or you
have the quirk handling in user space, in the various vendor code
blobs, repeated again and again. To make optoe generically usable, you
are going to have to push the quirk handling into optoe. The
brokenness should be hidden from userspace.

And then you repeat all the quirk handling sfp.c has. That does not
scale, we don't want the same quirks in two different places. However,
because SFPs are hot pluggable, you need to re-evaluate the quirks
whenever there is a hot-plug event. optoe has no idea if there has
been a hotplug event, since it does not have access to the GPIOs. Your
user space vendor code might know, it has access to the GPIOs. So
maybe you could add an IOCTL call or something, to let optoe know the
module has changed and it needs to update its quirks. Or for every
user space read, you actually re-read the vendor IDs and refresh the
quirks before performing the read the user actually wants. That all
seems ugly and is missing from the current patch.

I fully agree with Jakub NACK.

Andrew


2021-03-20 16:13:12

by Pali Rohár

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

Hello Don!

I have read whole discussion and your EEPROM patch proposal. But for me
it looks like some kernel glue code for some old legacy / proprietary
access method which does not have any usage outside of that old code.

Your code does not contain any quirks which are needed to read different
EEPROMs in different SFP modules. As Andrew wrote there are lot of
broken SFPs which needs special handling and this logic is already
implemented in sfp.c and sfp-bus.c kernel drivers. These drivers then
export EEPROM content to userspace via ethtool -m API in unified way and
userspace does not implement any quirks (nor does not have to deal with
quirks).

If you try to read EEPROM "incorrectly" then SFP module with its EEPROM
chip (or emulation of chip) locks and is fully unusable after you unplug
it and plug it again. Kernel really should not export API to userspace
which can cause "damage" to SFP modules. And currently it does *not* do
it.

I have contributed code for some GPON SFP modules, so their EEPROM can
be correctly read and exported to userspace via ethtool -m. So I know
that this is very fragile area and needs to be properly handled.

So I do not see any reason why can be a new optoe method in _current_
form useful. It does not implemented required things for handling
different EEPROM modules.

I would rather suggest you to use ethtool -m IOCTL API and in case it is
not suitable for QSFP (e.g. because of paging system) then extend this
API.

There were already proposals for using netlink socket interface which is
today de-facto standard interface for new code. sysfs API for such thing
really looks like some legacy code and nowadays we have better access
methods.

If you want, I can help you with these steps and make patches to be in
acceptable state; not written in "legacy" style. As I'm working with
GPON SFP modules with different EEPROMs in them, I'm really interested
in any improvements in their EEPROM area.

2021-03-23 18:48:16

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Tue, Mar 23, 2021 at 7:12AM-0700, Greg KH wrote:
> On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > optoe is an i2c based driver that supports read/write access to all
> > the pages (tables) of MSA standard SFP and similar devices (conforming
> > to the SFF-8472 spec), MSA standard QSFP and similar devices
> > (conforming to the SFF-8636 spec) and CMIS and similar devices
> > (conforming to the Common Management Interface Specfication).
>

I promise not to engage in a drawn out email exchange over this, but I would
like to appeal your decision, just once...

> Given this thread, I think that using the SFP interface/api in the kernel
> already seems like the best idea forward.
>
> That being said, your api here is whack, and I couldn't accept it anyway.

I don't understand. I don't mean you are wrong, I literally don't
understand what is whack about it. The interface is provided by nvmem. I
modeled the calls on at24. The layout of the data provided by the driver is
exactly the same layout that ethtool provides (device, offset, length).
Mapping i2c address, page and offset is exactly what ethtool provides. So,
which part of this is whack?

>
> Not for the least being it's not even documented in Documentation/ABI/
like
> all sysfs files have to be :)

This could obviously be fixed. I wasn't aware of this directory. Now that
you've pointed it out, I see that nvmem is actually documented there, which
is the API I am using. I document that optoe uses the nvmem interface, and
the mapping of paged memory to linear memory in my patch in
Documentation/misc-devices/optoe.rst. If you think it would be useful, I
could provide similar information in Documentation/ABI/stable.

>
> And it feels like you are abusing sysfs for things it was not ment for,
you
> might want to look into configfs?

I'm using nvmem, which in turn uses sysfs, just like at24. Why should optoe
be different? I would think it is actually better to use the same API (and
code) as at24, and NOT to put it in a different place.

>
> But really, these are networking devices, so they should be controllable
using
> the standard networking apis, not one-off sysfs files. Moving to the
Linux-
> standard tools is a good thing, and will work out better in the end
instead of
> having to encode lots of device-specific state in userspace like this
"raw" api
> seems to require.

This is the real issue. It turns out, on these switches, there are two
kinds of networking. Linux kernel networking handles one port, of 1Gb (or
less), which functions as a management port. This is typically used for
console access. It is configured and managed as an ordinary network port,
with a kernel network driver and the usual networking utilities. 'ip addr'
will show this port as well as loopback ports. The linux kernel has no
visibility to the switch networking ports. 'ip addr' will not show any of
the switch networking ports.

The switch functions, switching at 25Tb/s, are completely invisible to the
linux kernel. The switch ASIC is managed by a device driver provided by the
ASIC vendor. That driver is driven by management code from the ASIC vendor
and a host of network applications. Multiple vendors compete to provide the
best, most innovative, most secure, easiest... network capabilities on top
of this architecture. NONE of them use a kernel network driver, or the
layers of control or management that the linux kernel offers. On these
systems, if you ask ethtool to provide EEPROM data, you get 'function not
implemented'.

On these systems, SFP/QSFP/CMIS devices are actually not 'networking
devices' from a Linux kernel perspective. They are GPIO targets and EEPROM
memory. Switch networking just needs the kernel to toggle the GPIO lines
and read/write the EEPROM. optoe is just trying to read/write the EEPROM.

One last note... The networking folks need a better SFP/QSFP/CMIS EEPROM
driver to access more pages, and to support the new CMIS standard. optoe
could provide that, and I would love to collaborate on integrating optoe
into that stack. It wouldn't be hard, and it would be useful. I am not
opposed to netdev/netlink/phylink. I have been commenting on Moshe's KAPI
proposal, with several of my suggestions adopted there. I am not insisting
on my approach INSTEAD of theirs. I am insisting on my approach IN ADDITION
TO theirs. Without the nvmem interface, optoe is useless to my community.

>
> thanks,
>
> greg k-h

Thanks

Don

2021-03-23 19:13:27

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Tue, Mar 23, 2021 at 12:00AM -0700, Greg KH wrote:
> On Tue, Mar 23, 2021 at 11:43:55AM -0700, Don Bollinger wrote:
> > On Tue, Mar 23, 2021 at 7:12AM-0700, Greg KH wrote:
> > > On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > > > optoe is an i2c based driver that supports read/write access to
> > > > all the pages (tables) of MSA standard SFP and similar devices
> > > > (conforming to the SFF-8472 spec), MSA standard QSFP and similar
> > > > devices (conforming to the SFF-8636 spec) and CMIS and similar
> > > > devices (conforming to the Common Management Interface
> Specfication).
> > >
> >
> > I promise not to engage in a drawn out email exchange over this, but I
> > would like to appeal your decision, just once...

Thanks for your response. As promised, I'm done.

Is there a correct protocol for withdrawing a patch, or does it just get
abandoned? Still trying to be a good citizen.

Don

> >
> > > Given this thread, I think that using the SFP interface/api in the
> > > kernel already seems like the best idea forward.
> > >
> > > That being said, your api here is whack, and I couldn't accept it
anyway.
> >
> > I don't understand. I don't mean you are wrong, I literally don't
> > understand what is whack about it. The interface is provided by
> > nvmem. I modeled the calls on at24. The layout of the data provided
> > by the driver is exactly the same layout that ethtool provides (device,
> offset, length).
> > Mapping i2c address, page and offset is exactly what ethtool provides.
> > So, which part of this is whack?
>
> It's sysfs. Does nvmem use sysfs for device discovery and enablement?
>
> nvmem is just a "raw" maping of hardware (memory) to userspace.
>
> You have a "real" device here that you are trying to also map to
userspace,
> but when you just expose the "raw" registers (i.e. memory) to userspace,
> you are forcing userspace to handle all of the device differences, instead
of
> the kernel.
>
> That's fine, for some things, but for anything with a standard, that's not
ok,
> that's what a kernel is for.
>
> In other words, you could do what you want today probably with a UIO
> driver, just get the kernel out of the way and do it all in userspace.
> But that's not a viable or suportable api in the long-run for any standard
> hardware type.
>
> > > Not for the least being it's not even documented in
> > > Documentation/ABI/
> > like
> > > all sysfs files have to be :)
> >
> > This could obviously be fixed. I wasn't aware of this directory. Now
> > that you've pointed it out, I see that nvmem is actually documented
> > there, which is the API I am using. I document that optoe uses the
> > nvmem interface, and the mapping of paged memory to linear memory in
> > my patch in Documentation/misc-devices/optoe.rst. If you think it
> > would be useful, I could provide similar information in
> Documentation/ABI/stable.
>
> Again, nvmem in sysfs is just a dump of the hardware memory. That should
> not be how to control a switch device.
>
> > > And it feels like you are abusing sysfs for things it was not ment
> > > for,
> > you
> > > might want to look into configfs?
> >
> > I'm using nvmem, which in turn uses sysfs, just like at24. Why should
> > optoe be different? I would think it is actually better to use the
> > same API (and
> > code) as at24, and NOT to put it in a different place.
>
> at24 too is just an eeprom behind an i2c bus. Accessing it for simple
things is
> fine for userspace, but not for a standard device type.
>
> The networking developers have said that they feel the kernel should
> properly control devices like this, with a standard api. And I agree with
them
> (note, I'm biased, I like standard APIs, heck, I've even written specs for
> them...) Doing "raw" hardware accesses is great fun for things like
one-off
> devices (I have Linux running in a keyboard for something like that, also
as
> my doorbell), but doing this for a "real"
> set of devices is not ok.
>
> Again, it's the difference between the UIO interface and a real ethernet
> driver in the kernel. You could just say "all PCI network devices should
use
> the UIO interface and put the hardware-specific logic in userspace", but
> that's not what we (i.e. the Linux kernel developers) feel is the proper
way
> to handle the abstraction of device types.
>
> Again, we are kernel developers, we like nice hardware abstractions.
> Bonus is that it lets new hardware companies create new devices and no
> userspace modifications are needed! I think history is on our side here
:)
>
> > > But really, these are networking devices, so they should be
> > > controllable
> > using
> > > the standard networking apis, not one-off sysfs files. Moving to
> > > the
> > Linux-
> > > standard tools is a good thing, and will work out better in the end
> > instead of
> > > having to encode lots of device-specific state in userspace like
> > > this
> > "raw" api
> > > seems to require.
> >
> > This is the real issue. It turns out, on these switches, there are
> > two kinds of networking. Linux kernel networking handles one port, of
> > 1Gb (or less), which functions as a management port. This is
> > typically used for console access. It is configured and managed as an
> > ordinary network port, with a kernel network driver and the usual
> networking utilities. 'ip addr'
> > will show this port as well as loopback ports. The linux kernel has
> > no visibility to the switch networking ports. 'ip addr' will not show
> > any of the switch networking ports.
> >
> > The switch functions, switching at 25Tb/s, are completely invisible to
> > the linux kernel. The switch ASIC is managed by a device driver
> > provided by the ASIC vendor. That driver is driven by management code
> > from the ASIC vendor and a host of network applications. Multiple
> > vendors compete to provide the best, most innovative, most secure,
> > easiest... network capabilities on top of this architecture. NONE of
> > them use a kernel network driver, or the layers of control or
> > management that the linux kernel offers. On these systems, if you ask
> > ethtool to provide EEPROM data, you get 'function not implemented'.
> >
> > On these systems, SFP/QSFP/CMIS devices are actually not 'networking
> > devices' from a Linux kernel perspective. They are GPIO targets and
> > EEPROM memory. Switch networking just needs the kernel to toggle the
> > GPIO lines and read/write the EEPROM. optoe is just trying to
read/write
> the EEPROM.
>
> That sounds like hell. Let's create a proper api for everyone to use, and
NOT
> provide raw access to random device eeproms (i.e. memory). I thought that
> is what switchdev was for. If it is somehow lacking, I'm sure that
patches are
> gladly accepted.
>
> Heck, I did a review of the switchdev api and code a long time ago in
> response to some companies complaining of just this thing. Sad to see
they
> never took my advice of "send patches to get your hardware supported in
> that api", and persisted in wanting "raw memory" access instead.
>
> > One last note... The networking folks need a better SFP/QSFP/CMIS
> > EEPROM driver to access more pages, and to support the new CMIS
> standard.
>
> Great, work on that!
>
> But raw eeprom/nvram/ram access is not that api.
>
> Again, UIO vs. "struct net_device". Think of it that way.
>
> thanks,
>
> greg k-h

2021-03-23 19:16:20

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Tue, Mar 23, 2021 at 12:08:32PM -0700, Don Bollinger wrote:
> On Tue, Mar 23, 2021 at 12:00AM -0700, Greg KH wrote:
> > On Tue, Mar 23, 2021 at 11:43:55AM -0700, Don Bollinger wrote:
> > > On Tue, Mar 23, 2021 at 7:12AM-0700, Greg KH wrote:
> > > > On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > > > > optoe is an i2c based driver that supports read/write access to
> > > > > all the pages (tables) of MSA standard SFP and similar devices
> > > > > (conforming to the SFF-8472 spec), MSA standard QSFP and similar
> > > > > devices (conforming to the SFF-8636 spec) and CMIS and similar
> > > > > devices (conforming to the Common Management Interface
> > Specfication).
> > > >
> > >
> > > I promise not to engage in a drawn out email exchange over this, but I
> > > would like to appeal your decision, just once...
>
> Thanks for your response. As promised, I'm done.
>
> Is there a correct protocol for withdrawing a patch, or does it just get
> abandoned? Still trying to be a good citizen.

Patches just get abandonded, nothing special to do here, we have mailing
lists littered with them :)

thanks,

greg k-h

2021-03-23 22:34:54

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > Our experience is that a number of SFPs are broken, they don't follow the
> > standard. Some you cannot perform more than 16 bytes reads without them
> > locking up. Others will perform a 16 byte read, but only give you one
> useful
> > byte of data. So you have to read enough of the EEPROM a byte at a time to
> > get the vendor and product strings in order to determine what quirks need
> > to be applied. optoe has nothing like this. Either you don't care and only
> > support well behaved SFPs, or you have the quirk handling in user space,
> in
> > the various vendor code blobs, repeated again and again. To make optoe
> > generically usable, you are going to have to push the quirk handling into
> > optoe. The brokenness should be hidden from userspace.
>
> Interesting. I would throw away such devices. That's why switch vendors
> publish supported parts lists.
>
> Can you point me to the code that is handling those quirks? Since I haven't
> seen those problems, I don't know what they are and how to address them.

Take a look in drivers/net/phy/sfp.c

commit f0b4f847673299577c29b71d3f3acd3c313d81b7
Author: Pali Roh?r <[email protected]>
Date: Mon Jan 25 16:02:28 2021 +0100

net: sfp: add mode quirk for GPON module Ubiquiti U-Fiber Instant

The Ubiquiti U-Fiber Instant SFP GPON module has nonsensical information
stored in its EEPROM. It claims to support all transceiver types including
10G Ethernet. Clear all claimed modes and set only 1000baseX_Full, which is
the only one supported.

This module has also phys_id set to SFF, and the SFP subsystem currently
does not allow to use SFP modules detected as SFFs. Add exception for this
module so it can be detected as supported.

This change finally allows to detect and use SFP GPON module Ubiquiti
U-Fiber Instant on Linux system.

EEPROM content of this SFP module is (where XX is serial number):

00: 02 04 0b ff ff ff ff ff ff ff ff 03 0c 00 14 c8 ???........??.??
10: 00 00 00 00 55 42 4e 54 20 20 20 20 20 20 20 20 ....UBNT
20: 20 20 20 20 00 18 e8 29 55 46 2d 49 4e 53 54 41 .??)UF-INSTA
30: 4e 54 20 20 20 20 20 20 34 20 20 20 05 1e 00 36 NT 4 ??.6
40: 00 06 00 00 55 42 4e 54 XX XX XX XX XX XX XX XX .?..UBNTXXXXXXXX
50: 20 20 20 20 31 34 30 31 32 33 20 20 60 80 02 41 140123 `??A


commit 426c6cbc409cbda9ab1a9dbf15d3c2ef947eb8c1
Author: Pali Roh?r <[email protected]>
Date: Mon Jan 25 16:02:27 2021 +0100

net: sfp: add workaround for Realtek RTL8672 and RTL9601C chips

The workaround for VSOL V2801F brand based GPON SFP modules added in commit
0d035bed2a4a ("net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0
workaround") works only for IDs added explicitly to the list. Since there
are rebranded modules where OEM vendors put different strings into the
vendor name field, we cannot base workaround on IDs only.

Moreover the issue which the above mentioned commit tried to work around is
generic not only to VSOL based modules, but rather to all GPON modules
based on Realtek RTL8672 and RTL9601C chips.

These include at least the following GPON modules:
* V-SOL V2801F
* C-Data FD511GX-RM0
* OPTON GP801R
* BAUDCOM BD-1234-SFM
* CPGOS03-0490 v2.0
* Ubiquiti U-Fiber Instant
* EXOT EGS1

These Realtek chips have broken EEPROM emulator which for N-byte read
operation returns just the first byte of EEPROM data, followed by N-1
zeros.

Introduce a new function, sfp_id_needs_byte_io(), which detects SFP modules
with broken EEPROM emulator based on N-1 zeros and switch to 1 byte EEPROM
reading operation.

Function sfp_i2c_read() now always uses single byte reading when it is
required and when function sfp_hwmon_probe() detects single byte access,
it disables registration of hwmon device, because in this case we cannot
reliably and atomically read 2 bytes as is required by the standard for
retrieving values from diagnostic area.

(These Realtek chips are broken in a way that violates SFP standards for
diagnostic interface. Kernel in this case simply cannot do anything less
of skipping registration of the hwmon interface.)

This patch fixes reading of EEPROM content from SFP modules based on
Realtek RTL8672 and RTL9601C chips. Diagnostic interface of EEPROM stays
broken and cannot be fixed.

commit 624407d2cf14ff58e53bf4b2af9595c4f21d606e
Author: Russell King <[email protected]>
Date: Sun Jan 10 10:58:32 2021 +0000

net: sfp: cope with SFPs that set both LOS normal and LOS inverted

The SFP MSA defines two option bits in byte 65 to indicate how the
Rx_LOS signal on SFP pin 8 behaves:

bit 2 - Loss of Signal implemented, signal inverted from standard
definition in SFP MSA (often called "Signal Detect").
bit 1 - Loss of Signal implemented, signal as defined in SFP MSA
(often called "Rx_LOS").

Clearly, setting both bits results in a meaningless situation: it would
mean that LOS is implemented in both the normal sense (1 = signal loss)
and inverted sense (0 = signal loss).

Unfortunately, there are modules out there which set both bits, which
will be initially interpret as "inverted" sense, and then, if the LOS
signal changes state, we will toggle between LINK_UP and WAIT_LOS
states.

Change our LOS handling to give well defined behaviour: only interpret
these bits as meaningful if exactly one is set, otherwise treat it as
if LOS is not implemented.

ommit 0d035bed2a4a6c4878518749348be61bf082d12a
Author: Russell King <[email protected]>
Date: Wed Dec 9 11:22:49 2020 +0000

net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0 workaround

Add a workaround for the detection of VSOL V2801F / CarlitoxxPro
CPGOS03-0490 v2.0 GPON module which CarlitoxxPro states needs single
byte I2C reads to the EEPROM.

Pali Roh?r reports that he also has a CarlitoxxPro-based V2801F module,
which reports a manufacturer of "OEM". This manufacturer can't be
matched as it appears in many different modules, so also match the part
number too.

etc.

Now, it could be you only work with TOR switches and their SFP
modules. And they follow the standard in a better way than others. But
the kernel is used in many more environments that just data centers.
We need to support SFPs in FTTH boxes, in aircraft inflight
entertainment systems, industrial control etc.


Andrew

2021-03-24 02:26:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> optoe is an i2c based driver that supports read/write access to all
> the pages (tables) of MSA standard SFP and similar devices (conforming
> to the SFF-8472 spec), MSA standard QSFP and similar devices (conforming
> to the SFF-8636 spec) and CMIS and similar devices (conforming to the
> Common Management Interface Specfication).

Given this thread, I think that using the SFP interface/api in the
kernel already seems like the best idea forward.

That being said, your api here is whack, and I couldn't accept it
anyway.

Not for the least being it's not even documented in Documentation/ABI/
like all sysfs files have to be :)

And it feels like you are abusing sysfs for things it was not ment for,
you might want to look into configfs?

But really, these are networking devices, so they should be controllable
using the standard networking apis, not one-off sysfs files. Moving to
the Linux-standard tools is a good thing, and will work out better in
the end instead of having to encode lots of device-specific state in
userspace like this "raw" api seems to require.

thanks,

greg k-h

2021-03-24 07:17:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Tue, Mar 23, 2021 at 11:43:55AM -0700, Don Bollinger wrote:
> On Tue, Mar 23, 2021 at 7:12AM-0700, Greg KH wrote:
> > On Mon, Feb 15, 2021 at 11:38:21AM -0800, Don Bollinger wrote:
> > > optoe is an i2c based driver that supports read/write access to all
> > > the pages (tables) of MSA standard SFP and similar devices (conforming
> > > to the SFF-8472 spec), MSA standard QSFP and similar devices
> > > (conforming to the SFF-8636 spec) and CMIS and similar devices
> > > (conforming to the Common Management Interface Specfication).
> >
>
> I promise not to engage in a drawn out email exchange over this, but I would
> like to appeal your decision, just once...
>
> > Given this thread, I think that using the SFP interface/api in the kernel
> > already seems like the best idea forward.
> >
> > That being said, your api here is whack, and I couldn't accept it anyway.
>
> I don't understand. I don't mean you are wrong, I literally don't
> understand what is whack about it. The interface is provided by nvmem. I
> modeled the calls on at24. The layout of the data provided by the driver is
> exactly the same layout that ethtool provides (device, offset, length).
> Mapping i2c address, page and offset is exactly what ethtool provides. So,
> which part of this is whack?

It's sysfs. Does nvmem use sysfs for device discovery and enablement?

nvmem is just a "raw" maping of hardware (memory) to userspace.

You have a "real" device here that you are trying to also map to
userspace, but when you just expose the "raw" registers (i.e. memory) to
userspace, you are forcing userspace to handle all of the device
differences, instead of the kernel.

That's fine, for some things, but for anything with a standard, that's
not ok, that's what a kernel is for.

In other words, you could do what you want today probably with a UIO
driver, just get the kernel out of the way and do it all in userspace.
But that's not a viable or suportable api in the long-run for any
standard hardware type.

> > Not for the least being it's not even documented in Documentation/ABI/
> like
> > all sysfs files have to be :)
>
> This could obviously be fixed. I wasn't aware of this directory. Now that
> you've pointed it out, I see that nvmem is actually documented there, which
> is the API I am using. I document that optoe uses the nvmem interface, and
> the mapping of paged memory to linear memory in my patch in
> Documentation/misc-devices/optoe.rst. If you think it would be useful, I
> could provide similar information in Documentation/ABI/stable.

Again, nvmem in sysfs is just a dump of the hardware memory. That
should not be how to control a switch device.

> > And it feels like you are abusing sysfs for things it was not ment for,
> you
> > might want to look into configfs?
>
> I'm using nvmem, which in turn uses sysfs, just like at24. Why should optoe
> be different? I would think it is actually better to use the same API (and
> code) as at24, and NOT to put it in a different place.

at24 too is just an eeprom behind an i2c bus. Accessing it for simple
things is fine for userspace, but not for a standard device type.

The networking developers have said that they feel the kernel should
properly control devices like this, with a standard api. And I agree
with them (note, I'm biased, I like standard APIs, heck, I've even
written specs for them...) Doing "raw" hardware accesses is great fun
for things like one-off devices (I have Linux running in a keyboard for
something like that, also as my doorbell), but doing this for a "real"
set of devices is not ok.

Again, it's the difference between the UIO interface and a real ethernet
driver in the kernel. You could just say "all PCI network devices
should use the UIO interface and put the hardware-specific logic in
userspace", but that's not what we (i.e. the Linux kernel developers)
feel is the proper way to handle the abstraction of device types.

Again, we are kernel developers, we like nice hardware abstractions.
Bonus is that it lets new hardware companies create new devices and no
userspace modifications are needed! I think history is on our side
here :)

> > But really, these are networking devices, so they should be controllable
> using
> > the standard networking apis, not one-off sysfs files. Moving to the
> Linux-
> > standard tools is a good thing, and will work out better in the end
> instead of
> > having to encode lots of device-specific state in userspace like this
> "raw" api
> > seems to require.
>
> This is the real issue. It turns out, on these switches, there are two
> kinds of networking. Linux kernel networking handles one port, of 1Gb (or
> less), which functions as a management port. This is typically used for
> console access. It is configured and managed as an ordinary network port,
> with a kernel network driver and the usual networking utilities. 'ip addr'
> will show this port as well as loopback ports. The linux kernel has no
> visibility to the switch networking ports. 'ip addr' will not show any of
> the switch networking ports.
>
> The switch functions, switching at 25Tb/s, are completely invisible to the
> linux kernel. The switch ASIC is managed by a device driver provided by the
> ASIC vendor. That driver is driven by management code from the ASIC vendor
> and a host of network applications. Multiple vendors compete to provide the
> best, most innovative, most secure, easiest... network capabilities on top
> of this architecture. NONE of them use a kernel network driver, or the
> layers of control or management that the linux kernel offers. On these
> systems, if you ask ethtool to provide EEPROM data, you get 'function not
> implemented'.
>
> On these systems, SFP/QSFP/CMIS devices are actually not 'networking
> devices' from a Linux kernel perspective. They are GPIO targets and EEPROM
> memory. Switch networking just needs the kernel to toggle the GPIO lines
> and read/write the EEPROM. optoe is just trying to read/write the EEPROM.

That sounds like hell. Let's create a proper api for everyone to use,
and NOT provide raw access to random device eeproms (i.e. memory). I
thought that is what switchdev was for. If it is somehow lacking, I'm
sure that patches are gladly accepted.

Heck, I did a review of the switchdev api and code a long time ago in
response to some companies complaining of just this thing. Sad to see
they never took my advice of "send patches to get your hardware
supported in that api", and persisted in wanting "raw memory" access
instead.

> One last note... The networking folks need a better SFP/QSFP/CMIS EEPROM
> driver to access more pages, and to support the new CMIS standard.

Great, work on that!

But raw eeprom/nvram/ram access is not that api.

Again, UIO vs. "struct net_device". Think of it that way.

thanks,

greg k-h

2021-03-24 07:26:51

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > I have offered, in every response, to collaborate with the simple
> > integration to use optoe as the default upstream driver to access the
> > module EEPROMs. optoe would be superior to the existing default
> > routines in sfp.c
>
> Actually, i'm not sure they would be. Since the KAPI issues are pretty
much a
> NACK on their own, i didn't bother raising other issues. Both Russell King
and
> I has issues with quirks and hotplug.
>
> Our experience is that a number of SFPs are broken, they don't follow the
> standard. Some you cannot perform more than 16 bytes reads without them
> locking up. Others will perform a 16 byte read, but only give you one
useful
> byte of data. So you have to read enough of the EEPROM a byte at a time to
> get the vendor and product strings in order to determine what quirks need
> to be applied. optoe has nothing like this. Either you don't care and only
> support well behaved SFPs, or you have the quirk handling in user space,
in
> the various vendor code blobs, repeated again and again. To make optoe
> generically usable, you are going to have to push the quirk handling into
> optoe. The brokenness should be hidden from userspace.

Interesting. I would throw away such devices. That's why switch vendors
publish supported parts lists.

Can you point me to the code that is handling those quirks? Since I haven't
seen those problems, I don't know what they are and how to address them.

Note there are a VAST number of data items in those EEPROMs, including
proprietary capabilities. Many of the items are configuration dependent,
and mean different things depending on the value of other data items. Most
of these items are not of any interest to kernel networking. I try to
minimize the size of the kernel footprint and move those decoding and
management functions to user space.

>
> And then you repeat all the quirk handling sfp.c has. That does not scale,
we
> don't want the same quirks in two different places. However, because SFPs
> are hot pluggable, you need to re-evaluate the quirks whenever there is a
> hot-plug event. optoe has no idea if there has been a hotplug event, since
it
> does not have access to the GPIOs. Your user space vendor code might
> know, it has access to the GPIOs. So maybe you could add an IOCTL call or
> something, to let optoe know the module has changed and it needs to
> update its quirks. Or for every user space read, you actually re-read the
> vendor IDs and refresh the quirks before performing the read the user
> actually wants. That all seems ugly and is missing from the current patch.

Actually I do need to know whether the device supports paging, that's the
only device state I need. Since I don't detect hotplug events, I read the
'paging supported' bit on every read that changes the page register.

There is a GPIO line to detect 'presence', which presumably could be
accessed via device tree configuration with the GPIO driver. I haven't
figured out how to connect those pieces so I just read the page register on
every access. Adding that would be a useful feature.

>
> I fully agree with Jakub NACK.
>
> Andrew

Don

2021-03-26 18:45:09

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > > Our experience is that a number of SFPs are broken, they don't
> > > follow the standard. Some you cannot perform more than 16 bytes
> > > reads without them locking up. Others will perform a 16 byte read,
> > > but only give you one
> > useful
> > > byte of data. So you have to read enough of the EEPROM a byte at a
> > > time to get the vendor and product strings in order to determine
> > > what quirks need to be applied. optoe has nothing like this. Either
> > > you don't care and only support well behaved SFPs, or you have the
> > > quirk handling in user space,
> > in
> > > the various vendor code blobs, repeated again and again. To make
> > > optoe generically usable, you are going to have to push the quirk
> > > handling into optoe. The brokenness should be hidden from userspace.
> >
> > Interesting. I would throw away such devices. That's why switch
> > vendors publish supported parts lists.
> >
> > Can you point me to the code that is handling those quirks? Since I
> > haven't seen those problems, I don't know what they are and how to
> address them.
>
> Take a look in drivers/net/phy/sfp.c

Thanks for the pointers, I appreciate the chance to understand exactly what
quirks are under discussion.

It turns out that those quirks are not relevant to my patch. I don't mean
you are wrong, or that they don't have to be handled for your use cases,
just that my patch does not need to deal with them.

If optoe were adopted into your framework, it would replace the routines
sfp_i2c_read() and sfp_i2c_write(). By the time these two routines are
called, all of the quirk handling has already been applied (there is no
quirk handling in these routines today). The data requests have been broken
down as necessary into short reads or targeted reads for each device. Those
calls are then sent to the module EEPROM with no further need for quirk
handling. So, optoe does not need to handle quirks to fit into your code.
The quirks are handled, without optoe being involved.

In my community, the SFP/QSFP/CMIS devices (32 to 128 of them per switch)
often cost more than the switch itself. Consumers (both vendors and
customers) extensively test these devices to ensure correct and reliable
operation. Then they buy them literally by the millions. Quirks lead to
quick rejection in favor of reliable parts from reliable vendors. In this
environment, for completely different reasons, optoe does not need to handle
quirks.

>
> commit f0b4f847673299577c29b71d3f3acd3c313d81b7
> Author: Pali Roh?r <[email protected]>
> Date: Mon Jan 25 16:02:28 2021 +0100
>
> net: sfp: add mode quirk for GPON module Ubiquiti U-Fiber Instant
>
> The Ubiquiti U-Fiber Instant SFP GPON module has nonsensical
information
> stored in its EEPROM. It claims to support all transceiver types
including
> 10G Ethernet. Clear all claimed modes and set only 1000baseX_Full,
which
> is
> the only one supported.
>
> This module has also phys_id set to SFF, and the SFP subsystem
currently
> does not allow to use SFP modules detected as SFFs. Add exception for
this
> module so it can be detected as supported.

In my community, all interpretation of the SFP/QSFP/CMIS content is done in
user space. So, these issues don't affect optoe. I know you find that
distasteful, but it is what my community wants and needs.

>
> This change finally allows to detect and use SFP GPON module Ubiquiti
> U-Fiber Instant on Linux system.
>
> EEPROM content of this SFP module is (where XX is serial number):
>
> 00: 02 04 0b ff ff ff ff ff ff ff ff 03 0c 00 14 c8
???........??.??
> 10: 00 00 00 00 55 42 4e 54 20 20 20 20 20 20 20 20 ....UBNT
> 20: 20 20 20 20 00 18 e8 29 55 46 2d 49 4e 53 54 41
.??)UF-INSTA
> 30: 4e 54 20 20 20 20 20 20 34 20 20 20 05 1e 00 36 NT 4
??.6
> 40: 00 06 00 00 55 42 4e 54 XX XX XX XX XX XX XX XX
.?..UBNTXXXXXXXX
> 50: 20 20 20 20 31 34 30 31 32 33 20 20 60 80 02 41 140123
`??A
>
>
> commit 426c6cbc409cbda9ab1a9dbf15d3c2ef947eb8c1
> Author: Pali Roh?r <[email protected]>
> Date: Mon Jan 25 16:02:27 2021 +0100
>
> net: sfp: add workaround for Realtek RTL8672 and RTL9601C chips
>
> The workaround for VSOL V2801F brand based GPON SFP modules added
> in commit
> 0d035bed2a4a ("net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0
> workaround") works only for IDs added explicitly to the list. Since
there
> are rebranded modules where OEM vendors put different strings into the
> vendor name field, we cannot base workaround on IDs only.

You might be fighting a losing battle here. There are companies that buy
cheap SFP devices and reprogram the EEPROM to identify them as higher
quality parts from reliable vendors. Checking the Vendor ID and part number
may tell you have a high quality part, but the code inside still has the
quirks you are trying to avoid.

I developed optoe while working for a (high quality) vendor of these
modules. Their support team runs into these counterfeit parts all the time.
(I don't work for them or anyone else any more.)

>
> Moreover the issue which the above mentioned commit tried to work
> around is
> generic not only to VSOL based modules, but rather to all GPON modules
> based on Realtek RTL8672 and RTL9601C chips.
>
> These include at least the following GPON modules:
> * V-SOL V2801F
> * C-Data FD511GX-RM0
> * OPTON GP801R
> * BAUDCOM BD-1234-SFM
> * CPGOS03-0490 v2.0
> * Ubiquiti U-Fiber Instant
> * EXOT EGS1
>
> These Realtek chips have broken EEPROM emulator which for N-byte read
> operation returns just the first byte of EEPROM data, followed by N-1
> zeros.
>
> Introduce a new function, sfp_id_needs_byte_io(), which detects SFP
> modules
> with broken EEPROM emulator based on N-1 zeros and switch to 1 byte
> EEPROM
> reading operation.
>
> Function sfp_i2c_read() now always uses single byte reading when it is
> required and when function sfp_hwmon_probe() detects single byte
> access,
> it disables registration of hwmon device, because in this case we
cannot
> reliably and atomically read 2 bytes as is required by the standard
for
> retrieving values from diagnostic area.
>
> (These Realtek chips are broken in a way that violates SFP standards
for
> diagnostic interface. Kernel in this case simply cannot do anything
less
> of skipping registration of the hwmon interface.)
>
> This patch fixes reading of EEPROM content from SFP modules based on
> Realtek RTL8672 and RTL9601C chips. Diagnostic interface of EEPROM
stays
> broken and cannot be fixed.
>
> commit 624407d2cf14ff58e53bf4b2af9595c4f21d606e
> Author: Russell King <[email protected]>
> Date: Sun Jan 10 10:58:32 2021 +0000
>
> net: sfp: cope with SFPs that set both LOS normal and LOS inverted
>
> The SFP MSA defines two option bits in byte 65 to indicate how the
> Rx_LOS signal on SFP pin 8 behaves:
>
> bit 2 - Loss of Signal implemented, signal inverted from standard
> definition in SFP MSA (often called "Signal Detect").
> bit 1 - Loss of Signal implemented, signal as defined in SFP MSA
> (often called "Rx_LOS").
>
> Clearly, setting both bits results in a meaningless situation: it
would
> mean that LOS is implemented in both the normal sense (1 = signal
loss)
> and inverted sense (0 = signal loss).
>
> Unfortunately, there are modules out there which set both bits, which
> will be initially interpret as "inverted" sense, and then, if the LOS
> signal changes state, we will toggle between LINK_UP and WAIT_LOS
> states.
>
> Change our LOS handling to give well defined behaviour: only interpret
> these bits as meaningful if exactly one is set, otherwise treat it as
> if LOS is not implemented.
>
> ommit 0d035bed2a4a6c4878518749348be61bf082d12a
> Author: Russell King <[email protected]>
> Date: Wed Dec 9 11:22:49 2020 +0000
>
> net: sfp: VSOL V2801F / CarlitoxxPro CPGOS03-0490 v2.0 workaround
>
> Add a workaround for the detection of VSOL V2801F / CarlitoxxPro
> CPGOS03-0490 v2.0 GPON module which CarlitoxxPro states needs single
> byte I2C reads to the EEPROM.
>
> Pali Roh?r reports that he also has a CarlitoxxPro-based V2801F
module,
> which reports a manufacturer of "OEM". This manufacturer can't be
> matched as it appears in many different modules, so also match the
part
> number too.
>
> etc.
>
> Now, it could be you only work with TOR switches and their SFP modules.
> And they follow the standard in a better way than others. But the kernel
is
> used in many more environments that just data centers.
> We need to support SFPs in FTTH boxes, in aircraft inflight entertainment
> systems, industrial control etc.

Totally agree. As a replacement for sfp_i2c_read() and sfp_i2c_write(),
optoe would achieve everything the existing code does, plus it would support
paged devices. Works the same, but covers more capabilities.

AND, for TOR switches, and their SFP AND QSFP AND CMIS modules, it works
too, even when they don't adopt linux kernel networking. Unfortunately you
have decided that their architecture, handling networking outside the
kernel, is not allowed. My community does not get to have the perfectly
good code they use to access these devices accepted into the mainline
kernel.

>
>
> Andrew

Don

2021-03-26 18:45:09

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> Hello Don!
>
> I have read whole discussion and your EEPROM patch proposal. But for me it
> looks like some kernel glue code for some old legacy / proprietary access
> method which does not have any usage outside of that old code.

I don't know if 'kernel glue code' is good or bad. It is a driver. It
locks access to a device so it can perform multiple accesses without
interference. It organizes the data on a weird device into a simple linear
address space that can be accessed with open(), seek(), read() and write()
calls.

As for 'old code', this code and variations of it are under active
development by multiple Network OS vendors and multiple switch vendors, and
in production on hundreds of thousands of switches with millions of
SFP/QSFP/CMIS devices. This stuff is running the biggest clouds in the
world.

>
> Your code does not contain any quirks which are needed to read different
> EEPROMs in different SFP modules. As Andrew wrote there are lot of broken
> SFPs which needs special handling and this logic is already implemented in
> sfp.c and sfp-bus.c kernel drivers. These drivers then export EEPROM
> content to userspace via ethtool -m API in unified way and userspace does
> not implement any quirks (nor does not have to deal with quirks).

As a technical matter, you handle those quirks in the code that interprets
EEPROM data. You have figured out what devices have what quirks, then coded
up solutions to make them work. Then, after all the quirk handling is done,
you call the actual access routines (sfp_i2c_read() and sfp_i2c_write()) to
access the module EEPROMs. My code works the same way, except in my
community all the interpretation of EEPROM data is done in user space. You
may not like that, but it is not clear why you should care how my community
chooses to handle the data. In this architecture, the only thing needed
from the kernel is the equivalent of sfp_i2c_read() and sfp_i2c_write, which
optoe provides. The key point here is that my community wants the kernel to
just access the data. No interpretation, no identification, no special
cases.

>
> If you try to read EEPROM "incorrectly" then SFP module with its EEPROM
> chip (or emulation of chip) locks and is fully unusable after you unplug
it and
> plug it again. Kernel really should not export API to userspace which can
> cause "damage" to SFP modules. And currently it does *not* do it.

In my community, such devices are not tolerated. Modules which can be
"damaged" should be thrown away.

Please be clear, I am not disagreeing with your implementation. For your
GPON devices, you handle this in kernel code. Cool. Keep it that way.
Just don't make my community implement that where it is not needed and not
wanted. Optoe just accesses the device. Other layers handle interpretation
and quirks. Your handling is in sfp.c, mine is in user space. Not right or
wrong, just different. Both work.

>
> I have contributed code for some GPON SFP modules, so their EEPROM can
> be correctly read and exported to userspace via ethtool -m. So I know that
> this is very fragile area and needs to be properly handled.

My code is in use in cloud datacenters and campus switches around the world.
I know it needs to be properly handled.

>
> So I do not see any reason why can be a new optoe method in _current_
> form useful. It does not implemented required things for handling
different
> EEPROM modules.

optoe would be useful in your current environment as a replacement for
sfp_i2c_read() and sfp_i2c_write(). Those routines just access the EEPROM.
They don't identify or interpret or implement quirk handling. Neither does
optoe.

AND, optoe is useful to my community. An ethtool -m solution could of
course be implemented, and all of the user space code that currently
accesses module EEPROM could be rewritten, but there would be no value in
that to my community. What they have works just fine.

>
> I would rather suggest you to use ethtool -m IOCTL API and in case it is
not
> suitable for QSFP (e.g. because of paging system) then extend this API.

optoe already handles QSFP and CMIS just fine. The API does not need to be
extended for pages. Indeed, the ethtool API has already implemented the
same linear address space to flatten out the two i2c addresses plus pages in
the various SFP/QSFP/CMIS specs. optoe flattens the same way.

>
> There were already proposals for using netlink socket interface which is
> today de-facto standard interface for new code. sysfs API for such thing
> really looks like some legacy code and nowadays we have better access
> methods.

netlink is the de-facto standard interface for kernel networking. My
community does not have kernel networking (except for a puny management
port). All of the switch networking is handled by the switch ASIC, and in
user space network management software. Which is 'better' is complicated,
depending on the needs and requirements of the application. A large and
vibrant community is using a different architecture. All I am asking for is
to submit a simple kernel driver that this community has found useful.

>
> If you want, I can help you with these steps and make patches to be in
> acceptable state; not written in "legacy" style. As I'm working with GPON
SFP
> modules with different EEPROMs in them, I'm really interested in any
> improvements in their EEPROM area.

You will find this odd, but I don't actually have any way to test anything
using the kernel network stack with these devices. The only hardware I have
treats SFP/QSFP/CMIS devices as i2c addresses. There is no concept of these
being network devices in the linux kernels I'm running. So, I'll turn your
offer around... I can help you improve your EEPROM access code, unless you
already have all the good stuff. The things optoe does that
sfp_i2c_read/write() don't do are:

Page support: Check whether pages are supported, then lock the device,
write the page register as needed, and return it to 0 when finished.
Check legal access: Based on page support, and which spec (SFP, QSFP,
CMIS), figure out if the offset and length are legal. This is more
interesting when flattening to a linear address space, less interesting in
the new API with i2c_addr, page, bank, offset, len all specified.
Support all 256 architected pages: There are interesting capabilities that
vendors provide in spec-approved 'proprietary' pages. Don't ask, don't
interpret, don't deny access. Just execute the requested read/write.

Don

2021-03-26 19:48:58

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> In my community, the SFP/QSFP/CMIS devices (32 to 128 of them per switch)
> often cost more than the switch itself. Consumers (both vendors and
> customers) extensively test these devices to ensure correct and reliable
> operation. Then they buy them literally by the millions. Quirks lead to
> quick rejection in favor of reliable parts from reliable vendors. In this
> environment, for completely different reasons, optoe does not need to handle
> quirks.

Well, if optoe were to be merged, it would not be just for your
community. It has to work for everybody who wants to use the Linux
kernel with an SFP. You cannot decide to add a KAPI which just
supports a subset of SFPs. It needs to support as many as possible,
warts and all.

So how would you handle these SFPs with the optoe KAPI?

Andrew

2021-03-26 20:17:48

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > In my community, the SFP/QSFP/CMIS devices (32 to 128 of them per
> > switch) often cost more than the switch itself. Consumers (both
> > vendors and
> > customers) extensively test these devices to ensure correct and
> > reliable operation. Then they buy them literally by the millions.
> > Quirks lead to quick rejection in favor of reliable parts from
> > reliable vendors. In this environment, for completely different
> > reasons, optoe does not need to handle quirks.
>
> Well, if optoe were to be merged, it would not be just for your community.
It
> has to work for everybody who wants to use the Linux kernel with an SFP.
> You cannot decide to add a KAPI which just supports a subset of SFPs. It
> needs to support as many as possible, warts and all.
>
> So how would you handle these SFPs with the optoe KAPI?

Just like they are handled now. Folks who use your stack would filter
through sfp.c, with all the quirk handling, and eventually call optoe for
the actual device access. Folks who don't use kernel networking would call
optoe directly and limit themselves to well behaved SFPs, or would handle
quirks in user space. You think that's dumb, but there is clearly a market
for that approach. It is working, at scale, today.

BTW, why can't we have a driver that only handles devices that work
correctly? Indeed, why would we tolerate an endless stream of special cases
for each new device that comes along?

Don

2021-03-26 20:39:19

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, Mar 26, 2021 at 01:16:14PM -0700, Don Bollinger wrote:
> > > In my community, the SFP/QSFP/CMIS devices (32 to 128 of them per
> > > switch) often cost more than the switch itself. Consumers (both
> > > vendors and
> > > customers) extensively test these devices to ensure correct and
> > > reliable operation. Then they buy them literally by the millions.
> > > Quirks lead to quick rejection in favor of reliable parts from
> > > reliable vendors. In this environment, for completely different
> > > reasons, optoe does not need to handle quirks.
> >
> > Well, if optoe were to be merged, it would not be just for your community.
> It
> > has to work for everybody who wants to use the Linux kernel with an SFP.
> > You cannot decide to add a KAPI which just supports a subset of SFPs. It
> > needs to support as many as possible, warts and all.
> >
> > So how would you handle these SFPs with the optoe KAPI?
>
> Just like they are handled now. Folks who use your stack would filter
> through sfp.c, with all the quirk handling, and eventually call optoe for
> the actual device access. Folks who don't use kernel networking would call
> optoe directly and limit themselves to well behaved SFPs, or would handle
> quirks in user space. You think that's dumb, but there is clearly a market
> for that approach. It is working, at scale, today.
>
> BTW, why can't we have a driver

You keep missing the point. I always refer to the KAPI. The driver we
can rework and rework, throw away and reimplement, as much as we want.
The KAPI cannot be changed, it is ABI. It is pretty much frozen the
day the code is first committed.

The optoe KAPI needs to handle these 'interesting' SFP modules. The
KAPI design needs to be flexible enough that the driver underneath it
can be extended to support these SFPs. The code does not need to be
there, but the KAPI design needs to allow it. And i personally cannot
see how the optoe KAPI can efficiently support these SFPs.

Andrew

2021-03-26 21:15:05

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, Mar 26, 2021 at 01:37PM -0700, Andrew Lunn wrote:
> On Fri, Mar 26, 2021 at 01:16:14PM -0700, Don Bollinger wrote:
> > > > In my community, the SFP/QSFP/CMIS devices (32 to 128 of them per
> > > > switch) often cost more than the switch itself. Consumers (both
> > > > vendors and
> > > > customers) extensively test these devices to ensure correct and
> > > > reliable operation. Then they buy them literally by the millions.
> > > > Quirks lead to quick rejection in favor of reliable parts from
> > > > reliable vendors. In this environment, for completely different
> > > > reasons, optoe does not need to handle quirks.
> > >
> > > Well, if optoe were to be merged, it would not be just for your
> community.
> > It
> > > has to work for everybody who wants to use the Linux kernel with an
> SFP.
> > > You cannot decide to add a KAPI which just supports a subset of
> > > SFPs. It needs to support as many as possible, warts and all.
> > >
> > > So how would you handle these SFPs with the optoe KAPI?
> >
> > Just like they are handled now. Folks who use your stack would filter
> > through sfp.c, with all the quirk handling, and eventually call optoe
> > for the actual device access. Folks who don't use kernel networking
> > would call optoe directly and limit themselves to well behaved SFPs,
> > or would handle quirks in user space. You think that's dumb, but
> > there is clearly a market for that approach. It is working, at scale,
today.
> >
> > BTW, why can't we have a driver
>
> You keep missing the point. I always refer to the KAPI. The driver we can
> rework and rework, throw away and reimplement, as much as we want.
> The KAPI cannot be changed, it is ABI. It is pretty much frozen the day
the
> code is first committed.

Maybe I don't understand what you mean by KAPI. The KAPI that optoe exposes
is in two parts.

First, it makes the EEPROM accessible via the nvmem() interface, an existing
KAPI that I call from optoe. at24 implemented it, I made use of it. This
interface exposes EEPROM data to user space through a defined sysfs() file.
I didn't invent this, nor am I proposing it, it already exists.

Second, it specifies how the data in the EEPROM is accessed. It says that
low memory is in offset 0-127, and paged memory starts at offset (128 +
(page * 128)). For SFP devices, memory at i2c address 0x50 is the first 256
bytes, and everything at 0x51 is pushed up 256 bytes. This is exactly the
behavior of ethtool. Again, I didn't invent this, I adopted the existing
convention for how to flatten the SFP/QSFP/CMIS address space.

With these two parts, EEPROM data can be accessed by standard open(2),
seek(2), read(2), write(2) calls. Nothing special there, the actual syntax
is as old school and standard as you can possibly get.

So, what is wrong with that KAPI?

The only thing wrong I can see is that it doesn't use the kernel network
stack. Here's where "you keep missing the point". The kernel network stack
is not being used in these systems. Implementing EEPROM access through
ethtool is of course possible, but it is a lot of work with no benefit on
these systems. The simple approach works. Let me use it.

>
> The optoe KAPI needs to handle these 'interesting' SFP modules. The KAPI
> design needs to be flexible enough that the driver underneath it can be
> extended to support these SFPs. The code does not need to be there, but
> the KAPI design needs to allow it. And i personally cannot see how the
optoe
> KAPI can efficiently support these SFPs.

Help me understand. Your KAPI specifies ethtool as the KAPI, and the
ethtool/netlink stack as the interface through which the data flows. As I
see it, your KAPI only specifies i2c address, page, bank, offset and length.
How does your KAPI support interesting SFP modules but mine does not? optoe
could be reworked ad infinitum to implement support for quirks, just as
yours can. Neither has any hint of quirks in the KAPI.

Don

2021-03-26 21:57:54

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> The only thing wrong I can see is that it doesn't use the kernel network
> stack. Here's where "you keep missing the point". The kernel network stack
> is not being used in these systems. Implementing EEPROM access through
> ethtool is of course possible, but it is a lot of work with no benefit on
> these systems. The simple approach works. Let me use it.
>
> >
> > The optoe KAPI needs to handle these 'interesting' SFP modules. The KAPI
> > design needs to be flexible enough that the driver underneath it can be
> > extended to support these SFPs. The code does not need to be there, but
> > the KAPI design needs to allow it. And i personally cannot see how the
> optoe
> > KAPI can efficiently support these SFPs.
>
> Help me understand. Your KAPI specifies ethtool as the KAPI, and the
> ethtool/netlink stack as the interface through which the data flows.

Nearly. It specifies netlink and the netlink messages definitions.
They happen to be in the ethtool group, but there is no requirement to
use ethtool(1).

But there is a bit more to it.

Either the MAC driver implements the needed code, or it defers to the
generic sfp.c code. In both cases, you have access to the GPIO
lines.

If you are using the generic sfp.c, the Device Tree definitions of the
GPIOs become part of the KAPI. DT is consider ABI.

So the low level code knows when there was been a hotplug. It then can
determine what quirks to apply for all future reads until the module
is unplugged.

Your KAPI is missing how optoe gets access to the GPIOs. Without
knowing if the module has been hotplugged, in a robust manor, you have
problems with quirks. For every userspace read, you need to assume the
module has been changed, read the ID information from the EEPROM a
byte at a time, figure out what quirks to apply, and then do the user
requested read. I doubt that is good for performance. The design which
has been chosen is that userspace is monitoring the GPIO lines. So to
make it efficient, your KAPI need some way to pass down that the
module has/has not been hot-plugged since the last read.

Or do you see some other way to implement these quirks?

Andrew

2021-03-26 22:34:00

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS



> -----Original Message-----
> From: Andrew Lunn [mailto:[email protected]]
> Sent: Friday, March 26, 2021 2:54 PM
> To: Don Bollinger <[email protected]>
> Cc: 'Jakub Kicinski' <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; 'netdev' <[email protected]>; 'Moshe
> Shemesh' <[email protected]>
> Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS
> EEPROMS
>
> > The only thing wrong I can see is that it doesn't use the kernel
> > network stack. Here's where "you keep missing the point". The kernel
> > network stack is not being used in these systems. Implementing EEPROM
> > access through ethtool is of course possible, but it is a lot of work
> > with no benefit on these systems. The simple approach works. Let me
use
> it.
> >
> > >
> > > The optoe KAPI needs to handle these 'interesting' SFP modules. The
> > > KAPI design needs to be flexible enough that the driver underneath
> > > it can be extended to support these SFPs. The code does not need to
> > > be there, but the KAPI design needs to allow it. And i personally
> > > cannot see how the
> > optoe
> > > KAPI can efficiently support these SFPs.
> >
> > Help me understand. Your KAPI specifies ethtool as the KAPI, and the
> > ethtool/netlink stack as the interface through which the data flows.
>
> Nearly. It specifies netlink and the netlink messages definitions.
> They happen to be in the ethtool group, but there is no requirement to use
> ethtool(1).
>
> But there is a bit more to it.
>
> Either the MAC driver implements the needed code, or it defers to the
> generic sfp.c code. In both cases, you have access to the GPIO lines.
>
> If you are using the generic sfp.c, the Device Tree definitions of the
GPIOs
> become part of the KAPI. DT is consider ABI.
>
> So the low level code knows when there was been a hotplug. It then can
> determine what quirks to apply for all future reads until the module is
> unplugged.

Got it. All good. I agree, I would like optoe to have access to the GPIO
lines, but it is a "nice to have", not a requirement...

>
> Your KAPI is missing how optoe gets access to the GPIOs. Without knowing
if

Right. It doesn't get access to the GPIOs. So, that is not part of its
KAPI.

> the module has been hotplugged, in a robust manor, you have problems
> with quirks. For every userspace read, you need to assume the module has
> been changed, read the ID information from the EEPROM a byte at a time,
> figure out what quirks to apply, and then do the user requested read. I
doubt

Again, optoe does not deal with quirks. Your code filters them out before
calling optoe or sfp_i2c_read. My stack does not deal with them. In my
community, quirky devices are not tolerated. In the 4 years this code has
been in use, nobody has ever asked me to accommodate a weird module.

I inherited a limitation of writing one byte at a time from the ancestor of
optoe, which I have kept. I don't know if it is needed, but someone once
thought it was required. I apply it universally, all devices of all types.

I do need one item of state from the EEPROM, the register that tells me
whether pages are supported by the device. Due to the hotplug risk, I
actually do read that register once for each access to non-zero pages.
(Once per read or write call.) Access to GPIOs would eliminate that. It
turns out the vast majority of calls are to page 0 or lower memory, and the
performance penalty is at most 25% since there is also a page write, data
access and page write in the same call. The penalty goes down as the number
of bytes read increases. Overall, it has never come up as an issue. People
don't expect these things to be fast.

> that is good for performance. The design which has been chosen is that
> userspace is monitoring the GPIO lines. So to make it efficient, your KAPI
> need some way to pass down that the module has/has not been hot-
> plugged since the last read.

Nope. optoe does not need to know, it assumes a new device every time it is
accessed.

>
> Or do you see some other way to implement these quirks?

What I have works. Your consumers get quirk handling, mine don't need it.
No compromise.

>
> Andrew

Don

2021-03-27 12:42:59

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Fri, Mar 26, 2021 at 02:09:36PM -0700, Don Bollinger wrote:
> > You keep missing the point. I always refer to the KAPI. The driver we can
> > rework and rework, throw away and reimplement, as much as we want.
> > The KAPI cannot be changed, it is ABI. It is pretty much frozen the day
> the
> > code is first committed.
>
> Maybe I don't understand what you mean by KAPI. The KAPI that optoe exposes
> is in two parts.
>
> First, it makes the EEPROM accessible via the nvmem() interface, an existing
> KAPI that I call from optoe. at24 implemented it, I made use of it. This
> interface exposes EEPROM data to user space through a defined sysfs() file.
> I didn't invent this, nor am I proposing it, it already exists.

Again, a "raw" interface to a device that is just memory-mapping all of
the device information directly is no sort of a real KABI at all.

It is no different from trying to use /dev/mem/ to write a networking
driver, just because you can mmap in the device's configuration space to
userspace.

That is not a real api, it is only using the kernel as a "pass-through"
which works fine for one-off devices, and other oddities, but is not a
unified user/kernel api for a class of device types at all.

thanks,

greg k-h

2021-03-27 15:32:15

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> What I have works. Your consumers get quirk handling, mine don't need it.
> No compromise.

Hi Don

All this discussion is now a mute point. GregKH has spoken.

But i'm sure there are some on the side lines, eating popcorn, maybe
learning from the discussion.

Would you think it is O.K. to add a KAPI which works for 3 1/2" SCSI
disks, but not 2", because you only make machines with 3 1/2" bays?

This is an extreme, absurd example, but hopefully you get the
point. We don't design KAPIs with the intention to only work for a
subset of devices. It needs to work with as many devices as possible,
even if the first implementation below the KAPI is limited to just a
subset.

Anyway, i'm gratefull you have looked at the new ethtool netlink
KAPI. It will be better for your contributions. And i hope you can
make use of it in the future. But i think this discussion about optoe
in mainline is over.

Andrew

2021-03-27 21:21:54

by Don Bollinger

[permalink] [raw]
Subject: RE: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

> > What I have works. Your consumers get quirk handling, mine don't need
it.
> > No compromise.
>
> Hi Don
>
> All this discussion is now a mute point. GregKH has spoken.

Ack. I actually checked in with Greg a couple of days ago and got that
answer. I just thought the discussion was going in an interesting direction
and didn't want to end it yet. Response below is in the same vein.

>
> But i'm sure there are some on the side lines, eating popcorn, maybe
> learning from the discussion.

Honestly not sure what they are learning from the discussion. I think what
I learned is not what you think you taught me.

>
> Would you think it is O.K. to add a KAPI which works for 3 1/2" SCSI
disks, but
> not 2", because you only make machines with 3 1/2" bays?

No. Not sure how the analogy works. QSFP is a larger form factor than SFP,
and the EEPROM layout changed at the same time, but optoe and my community
had no problem accommodating both. CMIS changed the EEPROM layout again,
but it was easily accommodated.

>
> This is an extreme, absurd example, but hopefully you get the point. We
> don't design KAPIs with the intention to only work for a subset of
devices. It
> needs to work with as many devices as possible, even if the first
> implementation below the KAPI is limited to just a subset.

Let me run with your example... Suppose I used those 3 1/2" SCSI disks to
build storage servers. Let's call them RAID devices. Innovation follows, I
figure out how to make these devices hot swappable, hot backup, massively
parallel... Innovation follows and I evolve this into a distributed
architecture with redundant data, encrypted, compressed, distributed across
servers and data centers. Massively parallel, synchronized, access to the
data anywhere in the world, at bandwidth limited only by the size of your
network pipe. Let's call it RIDSS (Redundant, Independent, Distributed
Storage Servers). And I can use 2" disks. Or non-spinning storage.

My fancy technology thinks the storage is dumb. I present a
track/sector/length and I get back the bits. Just for grins, let's say I
can also query the device for a list of bad blocks (sectors, whatever).

Recently, with the addition of 2" devices, YOU figured out that you can
stuff several disks into a small space, and manage them with software and
call it RAID. You build a nice abstraction for storage, which contains
individual disks, and handles parallelism, redundancy, hot swap. Very cool,
works well, a genuine innovation.

I've been using Linux for RIDSS for years, I get the memo that my linux
driver to access these SCSI devices should be in the upstream kernel.
Oddly, there are no drivers in the kernel that I can just present
track/sector/length and get back the bits. So, I offer mine. The answer is
no, that is a RAID device, you need to access the device through RAID data
structures and APIs.

Sorry, no, that is not a RAID device. That is a storage device. You use it
for RAID storage, I use it for RIDSS storage. We need a layered
architecture with raw data access at the bottom (we both need the same
track/sector/length access). Beyond that, your RAID stack, while brilliant,
has virtually nothing to do with my RIDSS stack. They sound superficially
similar, but the technology and architecture are wildly different. The RAID
stack is unnecessary for my RIDSS architecture, and requires widespread
changes to my software that yield no benefit.

So, no, I don't get your point. I think there is value in a simple layered
architecture, where access to the module EEPROM is independent of the
consumer of that access. You can access it for kernel networking, which is
useful, innovative, valuable. I can access it for TOR switches which do not
use kernel networking but are nonetheless useful, innovative, valuable.
Your decision to reject optoe means the TOR community has to maintain this
simple bit of kernel code outside the mainline tree. The judges have ruled,
case closed.

>
> Anyway, i'm gratefull you have looked at the new ethtool netlink KAPI. It
will
> be better for your contributions. And i hope you can make use of it in the

Thanks for the acknowledgement.

> future. But i think this discussion about optoe in mainline is over.

This discussion is indeed over if you say it is. I'll be moving on :-(.

>
> Andrew

Don

2021-03-29 22:15:30

by Pali Rohár

[permalink] [raw]
Subject: Re: [PATCH v2] eeprom/optoe: driver to read/write SFP/QSFP/CMIS EEPROMS

On Friday 26 March 2021 11:43:10 Don Bollinger wrote:
> > Hello Don!
> >
> > I have read whole discussion and your EEPROM patch proposal. But for me it
> > looks like some kernel glue code for some old legacy / proprietary access
> > method which does not have any usage outside of that old code.
>
> I don't know if 'kernel glue code' is good or bad. It is a driver. It
> locks access to a device so it can perform multiple accesses without
> interference. It organizes the data on a weird device into a simple linear
> address space that can be accessed with open(), seek(), read() and write()
> calls.
>
> As for 'old code', this code and variations of it are under active
> development by multiple Network OS vendors and multiple switch vendors, and
> in production on hundreds of thousands of switches with millions of
> SFP/QSFP/CMIS devices. This stuff is running the biggest clouds in the
> world.
>
> >
> > Your code does not contain any quirks which are needed to read different
> > EEPROMs in different SFP modules. As Andrew wrote there are lot of broken
> > SFPs which needs special handling and this logic is already implemented in
> > sfp.c and sfp-bus.c kernel drivers. These drivers then export EEPROM
> > content to userspace via ethtool -m API in unified way and userspace does
> > not implement any quirks (nor does not have to deal with quirks).
>
> As a technical matter, you handle those quirks in the code that interprets
> EEPROM data. You have figured out what devices have what quirks, then coded
> up solutions to make them work. Then, after all the quirk handling is done,
> you call the actual access routines (sfp_i2c_read() and sfp_i2c_write()) to
> access the module EEPROMs. My code works the same way, except in my
> community all the interpretation of EEPROM data is done in user space. You
> may not like that, but it is not clear why you should care how my community
> chooses to handle the data. In this architecture, the only thing needed
> from the kernel is the equivalent of sfp_i2c_read() and sfp_i2c_write, which
> optoe provides. The key point here is that my community wants the kernel to
> just access the data. No interpretation, no identification, no special
> cases.
>
> >
> > If you try to read EEPROM "incorrectly" then SFP module with its EEPROM
> > chip (or emulation of chip) locks and is fully unusable after you unplug
> it and
> > plug it again. Kernel really should not export API to userspace which can
> > cause "damage" to SFP modules. And currently it does *not* do it.
>
> In my community, such devices are not tolerated. Modules which can be
> "damaged" should be thrown away.

Well, those tested / problematic modules are not damaged by real. They
just stop working until you do power off / on cycle and unplug / plug
them again.

But I agree with you. Russel wrote about those SFP modules that they
should have biohazard label :-)

The best would be if those modules disappear, but that would not happen
in near future.

These modules are already used by lot of users and users wanted support
for them. Hence we wrote what was possible.

The main issue is that we do know know any well-behaved GPON SFP module.

GPON is now very common technology for internet access, so it is not
obvious that more people are asking about support for GPON HW compatible
with Linux kernel.

> Please be clear, I am not disagreeing with your implementation. For your
> GPON devices, you handle this in kernel code. Cool. Keep it that way.
> Just don't make my community implement that where it is not needed and not
> wanted. Optoe just accesses the device. Other layers handle interpretation
> and quirks. Your handling is in sfp.c, mine is in user space. Not right or
> wrong, just different. Both work.
>
> >
> > I have contributed code for some GPON SFP modules, so their EEPROM can
> > be correctly read and exported to userspace via ethtool -m. So I know that
> > this is very fragile area and needs to be properly handled.
>
> My code is in use in cloud datacenters and campus switches around the world.
> I know it needs to be properly handled.
>
> >
> > So I do not see any reason why can be a new optoe method in _current_
> > form useful. It does not implemented required things for handling
> different
> > EEPROM modules.
>
> optoe would be useful in your current environment as a replacement for
> sfp_i2c_read() and sfp_i2c_write(). Those routines just access the EEPROM.
> They don't identify or interpret or implement quirk handling. Neither does
> optoe.

Now that we know that there are SFP modules which needs special handling
without it they lock themselves on i2c bus, we really cannot export to
userspace interface which allows this locking.

I understand that you do not care for these broken GPON SFP modules, but
"generic" API which would be part of mainline kernel really cannot have
known issue that can cause some modules to lock by just simple "read"
operation.

> AND, optoe is useful to my community. An ethtool -m solution could of
> course be implemented, and all of the user space code that currently
> accesses module EEPROM could be rewritten, but there would be no value in
> that to my community. What they have works just fine.

I understand that your API is useful for you and your existing projects.

> >
> > I would rather suggest you to use ethtool -m IOCTL API and in case it is
> not
> > suitable for QSFP (e.g. because of paging system) then extend this API.
>
> optoe already handles QSFP and CMIS just fine. The API does not need to be
> extended for pages. Indeed, the ethtool API has already implemented the
> same linear address space to flatten out the two i2c addresses plus pages in
> the various SFP/QSFP/CMIS specs. optoe flattens the same way.
>
> >
> > There were already proposals for using netlink socket interface which is
> > today de-facto standard interface for new code. sysfs API for such thing
> > really looks like some legacy code and nowadays we have better access
> > methods.
>
> netlink is the de-facto standard interface for kernel networking. My
> community does not have kernel networking (except for a puny management
> port). All of the switch networking is handled by the switch ASIC, and in
> user space network management software. Which is 'better' is complicated,
> depending on the needs and requirements of the application. A large and
> vibrant community is using a different architecture. All I am asking for is
> to submit a simple kernel driver that this community has found useful.

I think that this is the main issue here. You are proposing a new API
for kernel networking code without being user of kernel networking
subsystem. Obviously your code is not directly designed for kernel
networking (as you do not use it and do not need it) and that is why
proposed new API looks like it is -- it perfectly fits your usage.

But for those who are using kernel networking code, proposed new API
does not "fit" into existing networking APIs.

Proposing merging a new API, which is going to replace some existing
API, needs very good arguments. Existing API in this case is ethtool -m.
Replacing it by optoe needs to explain in details why ethtool -m API
cannot be extended for a new functionality which is provided by optoe.

Kernel really do not need two APIs for userspace which provides same set
of functionality.

But I think that main issue is here the fact, that you have a lot of
userspace code already in production which is using this proposed kernel
API and you do not have capacity to change existing code in production
to use different kernel API.

Note that having something in production for years and baked by lot of
vendors, does not mean that would be automatically merged into mainline
kernel.

Anyway, kernel has already API for 'network switching in ASIC' with any
other HW offloading (vlan tagging, ...), so it is possible to use kernel
code and kernel drivers to manage big data center switches. IIRC this
DSA switch architecture was developed initially for Marvell switches.

> >
> > If you want, I can help you with these steps and make patches to be in
> > acceptable state; not written in "legacy" style. As I'm working with GPON
> SFP
> > modules with different EEPROMs in them, I'm really interested in any
> > improvements in their EEPROM area.
>
> You will find this odd, but I don't actually have any way to test anything
> using the kernel network stack with these devices. The only hardware I have
> treats SFP/QSFP/CMIS devices as i2c addresses. There is no concept of these
> being network devices in the linux kernels I'm running. So, I'll turn your
> offer around... I can help you improve your EEPROM access code, unless you
> already have all the good stuff. The things optoe does that
> sfp_i2c_read/write() don't do are:
>
> Page support: Check whether pages are supported, then lock the device,
> write the page register as needed, and return it to 0 when finished.
> Check legal access: Based on page support, and which spec (SFP, QSFP,
> CMIS), figure out if the offset and length are legal. This is more
> interesting when flattening to a linear address space, less interesting in
> the new API with i2c_addr, page, bank, offset, len all specified.
> Support all 256 architected pages: There are interesting capabilities that
> vendors provide in spec-approved 'proprietary' pages. Don't ask, don't
> interpret, don't deny access. Just execute the requested read/write.
>
> Don
>