2013-05-09 22:11:25

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 00/14] Add blockconsole version 1.1 (try 3)

Blockconsole is a console driver very roughly similar to netconsole.
Instead of sending messages out via UDP, they are written to a block
device. Typically a USB stick is chosen, although in principle any
block device will do.

In most cases blockconsole is useful where netconsole is not, i.e.
single machines without network access or without an accessable
netconsole capture server. When using both blockconsole and
netconsole, I have found netconsole to sometimes create a mess under
high message load (sysrq-t, etc.) while blockconsole does not.

Most importantly, a number of bugs were identified and fixed that
would have been unexplained machine reboots without blockconsole.

More highlights:
* reasonably small and self-contained code,
* some 100+ machine years of runtime,
* nice tutorial with a 30-sec guide for the impatient.

Special thanks to Borislav Petkov for many improvements and kicking my
behind to provide a proper git tree and resend patches.

Git tree is on kernel.org and I intend to keep it stable, as people
seem to be using it already. It has been in -next since Mar 7.

git://git.kernel.org/pub/scm/linux/kernel/git/joern/bcon2.git

Joern Engel (10):
do_mounts: constify name_to_dev_t parameter
add blockconsole version 1.1
printk: add CON_ALLDATA console flag
netconsole: use CON_ALLDATA
blockconsole: use CON_ALLDATA
bcon: add a release work struct
bcon: check for hdparm in bcon_tail
bcon: remove version 1.0 support
bcon: Fix wrap-around behaviour
netconsole: s/syslogd/cancd/ in documentation

Takashi Iwai (4):
blockconsole: Allow to pass a device file path to bcon_tail
blockconsole: Fix undefined MAX_RT_PRIO
blockconsole: Rename device_lock with bc_device_lock
blockconsole: Mark a local work struct static

Documentation/block/blockconsole.txt | 94 ++++
Documentation/block/blockconsole/bcon_tail | 82 +++
Documentation/block/blockconsole/mkblockconsole | 29 ++
Documentation/networking/netconsole.txt | 16 +-
block/partitions/Makefile | 1 +
block/partitions/blockconsole.c | 22 +
block/partitions/check.c | 3 +
block/partitions/check.h | 3 +
drivers/block/Kconfig | 6 +
drivers/block/Makefile | 1 +
drivers/block/blockconsole.c | 618 +++++++++++++++++++++++
drivers/net/netconsole.c | 2 +-
include/linux/blockconsole.h | 7 +
include/linux/console.h | 1 +
include/linux/mount.h | 2 +-
init/do_mounts.c | 2 +-
kernel/printk.c | 5 +-
17 files changed, 885 insertions(+), 9 deletions(-)
create mode 100644 Documentation/block/blockconsole.txt
create mode 100755 Documentation/block/blockconsole/bcon_tail
create mode 100755 Documentation/block/blockconsole/mkblockconsole
create mode 100644 block/partitions/blockconsole.c
create mode 100644 drivers/block/blockconsole.c
create mode 100644 include/linux/blockconsole.h

--
1.7.10.4


2013-05-09 22:11:14

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 10/14] blockconsole: Fix undefined MAX_RT_PRIO

From: Takashi Iwai <[email protected]>

Signed-off-by: Takashi Iwai <[email protected]>
---
drivers/block/blockconsole.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index e88b8ee..c22272f 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -16,6 +16,7 @@
#include <linux/string.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
+#include <linux/sched/rt.h>
#include <linux/ctype.h>

#define BLOCKCONSOLE_MAGIC "\nLinux blockconsole version 1.1\n"
--
1.7.10.4

2013-05-09 22:11:23

by Jörn Engel

[permalink] [raw]
Subject: [PATCH] mpt2sas/mpt3sas: prevent double free on error path

I noticed this one when list_del was called with poisoned list
pointers, but the real problem is a double-free (and a use-after-free
just before that).

Both _scsih_probe_boot_devices() and _scsih_sas_device_add() put the
sas_device onto a list, thereby giving up control. Next they call
mpt2sas_transport_port_add() and will list_del and free the object on
errors. If some other function already did the list_del and free, it
will happen again.

This patch adds reference counting to prevent the double free. One
reference count goes to the caller of mpt2sas_transport_port_add(), the
second to the list. Whoever removes the object from the list gets to
drop one reference count. _scsih_probe_boot_devices() and
_scsih_sas_device_add() get a second reference count to ensure the
object is not freed while they are still accessing it.

To prevent the double list_del(), I changed the code to list_del_init()
and added a list_empty() check before that. Since the
list_empty/list_del_init is always called under a lock, this should be
safe.

I hate the complexity this patch adds, but see no alternative to it.

mpt2sas0: failure at drivers/scsi/mpt2sas/mpt2sas_transport.c:708/mpt2sas_transport_port_add()!
general protection fault: 0000 [#1] SMP
CPU 9
Pid: 3097, comm: kworker/u:11 Tainted: G W O 3.6.10+ #31392.trunk /0JP31P
RIP: 0010:[<ffffffffa0309744>] [<ffffffffa0309744>] _scsih_sas_device_remove+0x54/0x90 [mpt2sas]
RSP: 0018:ffff881fed4d7ab0 EFLAGS: 00010046
RAX: dead000000200200 RBX: ffff881ff6a5cd88 RCX: 00000000000010e8
RDX: ffff881ff7dab800 RSI: ffff881ff7daba00 RDI: dead000000100100
RBP: ffff881fed4d7ad0 R08: dead000000200200 R09: ffff880fff802200
R10: ffffffffa0317980 R11: 0000000000000000 R12: ffff881ff7daba00
R13: 0000000000000286 R14: 500605ba006c9d09 R15: ffff881ff7daba00
FS: 0000000000000000(0000) GS:ffff88203fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8ac89ec458 CR3: 0000001ff4c5c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:11 (pid: 3097, threadinfo ffff881fed4d6000, task ffff881402f3d9c0)
Stack:
0000000000000401 ffff881ff6a5c6b0 0000000000000401 0000000000000016
ffff881fed4d7bb0 ffffffffa030f93e 0000000000000000 ffff881ff6a5cd88
0012000e0f000008 006c9d090002000b 00180009500605ba 0000040100000016
Call Trace:
[<ffffffffa030f93e>] _scsih_add_device.clone.32+0x2fe/0x420 [mpt2sas]
[<ffffffffa03126e5>] _scsih_sas_topology_change_event.clone.38+0x285/0x620 [mpt2sas]
[<ffffffff81078c90>] ? load_balance+0x100/0x7a0
[<ffffffffa0312a80>] ? _scsih_sas_topology_change_event.clone.38+0x620/0x620 [mpt2sas]
[<ffffffffa0312d8a>] _firmware_event_work+0x30a/0xfc0 [mpt2sas]
[<ffffffff810015cc>] ? __switch_to+0x14c/0x410
[<ffffffff8106dfdb>] ? finish_task_switch+0x4b/0xf0
[<ffffffffa0312a80>] ? _scsih_sas_topology_change_event.clone.38+0x620/0x620 [mpt2sas]
[<ffffffff8105bf40>] process_one_work+0x140/0x500
[<ffffffff8105d354>] worker_thread+0x194/0x510
[<ffffffff8106dfdb>] ? finish_task_switch+0x4b/0xf0
[<ffffffff8105d1c0>] ? manage_workers+0x320/0x320
[<ffffffff8106282e>] kthread+0x9e/0xb0
[<ffffffff815bef44>] kernel_thread_helper+0x4/0x10
[<ffffffff815b5e5d>] ? retint_restore_args+0x13/0x13
[<ffffffff81062790>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815bef40>] ? gs_change+0x13/0x13

Signed-off-by: Joern Engel <[email protected]>
---
drivers/scsi/mpt2sas/mpt2sas_base.h | 1 +
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 57 ++++++++++++++++++++++++++++------
drivers/scsi/mpt3sas/mpt3sas_base.h | 1 +
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 57 ++++++++++++++++++++++++++++------
4 files changed, 98 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.h b/drivers/scsi/mpt2sas/mpt2sas_base.h
index 543d8d6..ceb7d41 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_base.h
+++ b/drivers/scsi/mpt2sas/mpt2sas_base.h
@@ -367,6 +367,7 @@ struct _sas_device {
u16 slot;
u8 phy;
u8 responding;
+ struct kref kref;
};

/**
diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index c6bdc92..217660c 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -570,6 +570,19 @@ _scsih_sas_device_find_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
return NULL;
}

+static void free_sas_device(struct kref *kref)
+{
+ struct _sas_device *sas_device = container_of(kref, struct _sas_device,
+ kref);
+
+ kfree(sas_device);
+}
+
+static void put_sas_device(struct _sas_device *sas_device)
+{
+ kref_put(&sas_device->kref, free_sas_device);
+}
+
/**
* _scsih_sas_device_remove - remove sas_device from list.
* @ioc: per adapter object
@@ -583,14 +596,19 @@ _scsih_sas_device_remove(struct MPT2SAS_ADAPTER *ioc,
struct _sas_device *sas_device)
{
unsigned long flags;
+ int was_on_list = 0;

if (!sas_device)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
- list_del(&sas_device->list);
- kfree(sas_device);
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
+ if (was_on_list)
+ put_sas_device(sas_device);
}


@@ -612,6 +630,8 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
"(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
sas_device->handle, (unsigned long long)sas_device->sas_address));

+ /* Get an extra refcount... */
+ kref_get(&sas_device->kref);
spin_lock_irqsave(&ioc->sas_device_lock, flags);
list_add_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
@@ -630,6 +650,12 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
sas_device->sas_address_parent);
_scsih_sas_device_remove(ioc, sas_device);
}
+ /*
+ * ...and drop it again. If an error already happend, this is the
+ * final put and we free the object now. Otherwise whoever removes
+ * the object from the list will do the final put and free.
+ */
+ put_sas_device(sas_device);
}

/**
@@ -5270,6 +5296,7 @@ _scsih_add_device(struct MPT2SAS_ADAPTER *ioc, u16 handle, u8 phy_num, u8 is_pd)
return -1;
}

+ kref_init(&sas_device->kref);
sas_device->handle = handle;
if (_scsih_get_sas_address(ioc, le16_to_cpu
(sas_device_pg0.ParentDevHandle),
@@ -5341,7 +5368,7 @@ _scsih_remove_device(struct MPT2SAS_ADAPTER *ioc,
"handle(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
sas_device->handle, (unsigned long long)
sas_device->sas_address));
- kfree(sas_device);
+ put_sas_device(sas_device);
}
/**
* _scsih_device_remove_by_handle - removing device object by handle
@@ -5355,16 +5382,21 @@ _scsih_device_remove_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}

@@ -5381,6 +5413,7 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;
@@ -5388,10 +5421,14 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = mpt2sas_scsih_sas_device_find_by_sas_address(ioc,
sas_address);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}
#ifdef CONFIG_SCSI_MPT2SAS_LOGGING
@@ -7805,6 +7842,7 @@ _scsih_probe_boot_devices(struct MPT2SAS_ADAPTER *ioc)
handle = sas_device->handle;
sas_address_parent = sas_device->sas_address_parent;
sas_address = sas_device->sas_address;
+ kref_get(&sas_device->kref);
list_move_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);

@@ -7819,6 +7857,7 @@ _scsih_probe_boot_devices(struct MPT2SAS_ADAPTER *ioc)
sas_address_parent);
_scsih_sas_device_remove(ioc, sas_device);
}
+ put_sas_device(sas_device);
}
}

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
index 994656c..fff7fe7 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -293,6 +293,7 @@ struct _sas_device {
u8 phy;
u8 responding;
u8 fast_path;
+ struct kref kref;
};

/**
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 6421a06..54fdd7c 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -569,6 +569,19 @@ _scsih_sas_device_find_by_handle(struct MPT3SAS_ADAPTER *ioc, u16 handle)
return NULL;
}

+static void free_sas_device(struct kref *kref)
+{
+ struct _sas_device *sas_device = container_of(kref, struct _sas_device,
+ kref);
+
+ kfree(sas_device);
+}
+
+static void put_sas_device(struct _sas_device *sas_device)
+{
+ kref_put(&sas_device->kref, free_sas_device);
+}
+
/**
* _scsih_sas_device_remove - remove sas_device from list.
* @ioc: per adapter object
@@ -582,14 +595,19 @@ _scsih_sas_device_remove(struct MPT3SAS_ADAPTER *ioc,
struct _sas_device *sas_device)
{
unsigned long flags;
+ int was_on_list = 0;

if (!sas_device)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
- list_del(&sas_device->list);
- kfree(sas_device);
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
+ if (was_on_list)
+ put_sas_device(sas_device);
}

/**
@@ -604,16 +622,21 @@ _scsih_device_remove_by_handle(struct MPT3SAS_ADAPTER *ioc, u16 handle)
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}

@@ -630,6 +653,7 @@ mpt3sas_device_remove_by_sas_address(struct MPT3SAS_ADAPTER *ioc,
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;
@@ -637,10 +661,14 @@ mpt3sas_device_remove_by_sas_address(struct MPT3SAS_ADAPTER *ioc,
spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = mpt3sas_scsih_sas_device_find_by_sas_address(ioc,
sas_address);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}

@@ -663,6 +691,8 @@ _scsih_sas_device_add(struct MPT3SAS_ADAPTER *ioc,
ioc->name, __func__, sas_device->handle,
(unsigned long long)sas_device->sas_address));

+ /* Get an extra refcount... */
+ kref_get(&sas_device->kref);
spin_lock_irqsave(&ioc->sas_device_lock, flags);
list_add_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
@@ -682,6 +712,12 @@ _scsih_sas_device_add(struct MPT3SAS_ADAPTER *ioc,
sas_device->sas_address_parent);
_scsih_sas_device_remove(ioc, sas_device);
}
+ /*
+ * ...and drop it again. If an error already happend, this is the
+ * final put and we free the object now. Otherwise whoever removes
+ * the object from the list will do the final put and free.
+ */
+ put_sas_device(sas_device);
}

/**
@@ -4859,6 +4895,7 @@ _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8 phy_num,
return 0;
}

+ kref_init(&sas_device->kref);
sas_device->handle = handle;
if (_scsih_get_sas_address(ioc,
le16_to_cpu(sas_device_pg0.ParentDevHandle),
@@ -4935,7 +4972,7 @@ _scsih_remove_device(struct MPT3SAS_ADAPTER *ioc,
sas_device->handle, (unsigned long long)
sas_device->sas_address));

- kfree(sas_device);
+ put_sas_device(sas_device);
}

#ifdef CONFIG_SCSI_MPT3SAS_LOGGING
@@ -7521,6 +7558,7 @@ _scsih_probe_boot_devices(struct MPT3SAS_ADAPTER *ioc)
handle = sas_device->handle;
sas_address_parent = sas_device->sas_address_parent;
sas_address = sas_device->sas_address;
+ kref_get(&sas_device->kref);
list_move_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);

@@ -7533,6 +7571,7 @@ _scsih_probe_boot_devices(struct MPT3SAS_ADAPTER *ioc)
sas_address_parent);
_scsih_sas_device_remove(ioc, sas_device);
}
+ put_sas_device(sas_device);
}
}

--
1.7.10.4

2013-05-09 22:11:34

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 09/14] bcon: remove version 1.0 support

Very few machines ever ran with 1.0 format and by now I doubt whether a
single one still does. No need to carry that code along.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/block/blockconsole.c | 19 ++-----------------
1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index b4730f8..e88b8ee 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -18,7 +18,6 @@
#include <linux/sched.h>
#include <linux/ctype.h>

-#define BLOCKCONSOLE_MAGIC_OLD "\nLinux blockconsole version 1.0\n"
#define BLOCKCONSOLE_MAGIC "\nLinux blockconsole version 1.1\n"
#define BCON_UUID_OFS (32)
#define BCON_ROUND_OFS (41)
@@ -234,18 +233,6 @@ static void bcon_advance_write_bytes(struct blockconsole *bc, int bytes)
}
}

-static int bcon_convert_old_format(struct blockconsole *bc)
-{
- bc->uuid = get_random_int();
- bc->round = 0;
- bc->console_bytes = bc->write_bytes = 0;
- bcon_advance_console_bytes(bc, 0); /* To skip the header */
- bcon_advance_write_bytes(bc, 0); /* To wrap around, if necessary */
- bcon_erase_segment(bc);
- pr_info("converted %s from old format\n", bc->devname);
- return 0;
-}
-
static int bcon_find_end_of_log(struct blockconsole *bc)
{
u64 start = 0, end = bc->max_bytes, middle;
@@ -258,8 +245,8 @@ static int bcon_find_end_of_log(struct blockconsole *bc)
return err;
/* Second sanity check, out of sheer paranoia */
version = bcon_magic_present(sec0);
- if (version == 10)
- return bcon_convert_old_format(bc);
+ if (!version)
+ return -EINVAL;

bc->uuid = simple_strtoull(sec0 + BCON_UUID_OFS, NULL, 16);
bc->round = simple_strtoull(sec0 + BCON_ROUND_OFS, NULL, 16);
@@ -618,8 +605,6 @@ int bcon_magic_present(const void *data)
{
size_t len = strlen(BLOCKCONSOLE_MAGIC);

- if (!memcmp(data, BLOCKCONSOLE_MAGIC_OLD, len))
- return 10;
if (memcmp(data, BLOCKCONSOLE_MAGIC, len))
return 0;
if (!is_four_byte_hex(data + BCON_UUID_OFS))
--
1.7.10.4

2013-05-09 22:11:44

by Jörn Engel

[permalink] [raw]
Subject: [PATCH] mpt2sas: prevent double free on error path

I noticed this one when list_del was called with poisoned list
pointers, but the real problem is a double-free (and a use-after-free
just before that).

Both _scsih_probe_boot_devices() and _scsih_sas_device_add() put the
sas_device onto a list, thereby giving up control. Next they call
mpt2sas_transport_port_add() and will list_del and free the object on
errors. If some other function already did the list_del and free, it
will happen again.

This patch adds reference counting to prevent the double free. One
reference count goes to the caller of mpt2sas_transport_port_add(), the
second to the list. Whoever removes the object from the list gets to
drop one reference count. _scsih_probe_boot_devices() and
_scsih_sas_device_add() get a second reference count to ensure the
object is not freed while they are still accessing it.

To prevent the double list_del(), I changed the code to list_del_init()
and added a list_empty() check before that. Since the
list_empty/list_del_init is always called under a lock, this should be
safe.

I hate the complexity this patch adds, but see no alternative to it.

mpt2sas0: failure at drivers/scsi/mpt2sas/mpt2sas_transport.c:708/mpt2sas_transport_port_add()!
general protection fault: 0000 [#1] SMP
CPU 9
Pid: 3097, comm: kworker/u:11 Tainted: G W O 3.6.10+ #31392.trunk /0JP31P
RIP: 0010:[<ffffffffa0309744>] [<ffffffffa0309744>] _scsih_sas_device_remove+0x54/0x90 [mpt2sas]
RSP: 0018:ffff881fed4d7ab0 EFLAGS: 00010046
RAX: dead000000200200 RBX: ffff881ff6a5cd88 RCX: 00000000000010e8
RDX: ffff881ff7dab800 RSI: ffff881ff7daba00 RDI: dead000000100100
RBP: ffff881fed4d7ad0 R08: dead000000200200 R09: ffff880fff802200
R10: ffffffffa0317980 R11: 0000000000000000 R12: ffff881ff7daba00
R13: 0000000000000286 R14: 500605ba006c9d09 R15: ffff881ff7daba00
FS: 0000000000000000(0000) GS:ffff88203fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8ac89ec458 CR3: 0000001ff4c5c000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:11 (pid: 3097, threadinfo ffff881fed4d6000, task ffff881402f3d9c0)
Stack:
0000000000000401 ffff881ff6a5c6b0 0000000000000401 0000000000000016
ffff881fed4d7bb0 ffffffffa030f93e 0000000000000000 ffff881ff6a5cd88
0012000e0f000008 006c9d090002000b 00180009500605ba 0000040100000016
Call Trace:
[<ffffffffa030f93e>] _scsih_add_device.clone.32+0x2fe/0x420 [mpt2sas]
[<ffffffffa03126e5>] _scsih_sas_topology_change_event.clone.38+0x285/0x620 [mpt2sas]
[<ffffffff81078c90>] ? load_balance+0x100/0x7a0
[<ffffffffa0312a80>] ? _scsih_sas_topology_change_event.clone.38+0x620/0x620 [mpt2sas]
[<ffffffffa0312d8a>] _firmware_event_work+0x30a/0xfc0 [mpt2sas]
[<ffffffff810015cc>] ? __switch_to+0x14c/0x410
[<ffffffff8106dfdb>] ? finish_task_switch+0x4b/0xf0
[<ffffffffa0312a80>] ? _scsih_sas_topology_change_event.clone.38+0x620/0x620 [mpt2sas]
[<ffffffff8105bf40>] process_one_work+0x140/0x500
[<ffffffff8105d354>] worker_thread+0x194/0x510
[<ffffffff8106dfdb>] ? finish_task_switch+0x4b/0xf0
[<ffffffff8105d1c0>] ? manage_workers+0x320/0x320
[<ffffffff8106282e>] kthread+0x9e/0xb0
[<ffffffff815bef44>] kernel_thread_helper+0x4/0x10
[<ffffffff815b5e5d>] ? retint_restore_args+0x13/0x13
[<ffffffff81062790>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815bef40>] ? gs_change+0x13/0x13

Signed-off-by: Joern Engel <[email protected]>
---
drivers/scsi/mpt2sas/mpt2sas_base.h | 1 +
drivers/scsi/mpt2sas/mpt2sas_scsih.c | 55 ++++++++++++++++++++++++++++------
2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.h b/drivers/scsi/mpt2sas/mpt2sas_base.h
index 543d8d6..ceb7d41 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_base.h
+++ b/drivers/scsi/mpt2sas/mpt2sas_base.h
@@ -367,6 +367,7 @@ struct _sas_device {
u16 slot;
u8 phy;
u8 responding;
+ struct kref kref;
};

/**
diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
index c6bdc92..43b3a98 100644
--- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
+++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
@@ -570,6 +570,18 @@ _scsih_sas_device_find_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
return NULL;
}

+static void free_sas_device(struct kref *kref)
+{
+ struct _sas_device *sas_device = container_of(kref, struct _sas_device,
+ kref);
+ kfree(sas_device);
+}
+
+static void put_sas_device(struct _sas_device *sas_device)
+{
+ kref_put(&sas_device->kref, free_sas_device);
+}
+
/**
* _scsih_sas_device_remove - remove sas_device from list.
* @ioc: per adapter object
@@ -583,14 +595,19 @@ _scsih_sas_device_remove(struct MPT2SAS_ADAPTER *ioc,
struct _sas_device *sas_device)
{
unsigned long flags;
+ int was_on_list = 0;

if (!sas_device)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
- list_del(&sas_device->list);
- kfree(sas_device);
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
+ if (was_on_list)
+ put_sas_device(sas_device);
}


@@ -613,6 +630,8 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
sas_device->handle, (unsigned long long)sas_device->sas_address));

spin_lock_irqsave(&ioc->sas_device_lock, flags);
+ /* Get an extra refcount... */
+ kref_get(&sas_device->kref);
list_add_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);

@@ -630,6 +649,12 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
sas_device->sas_address_parent);
_scsih_sas_device_remove(ioc, sas_device);
}
+ /*
+ * ...and drop it again. If an error already happend, this is the
+ * final put and we free the object now. Otherwise whoever removes
+ * the object from the list will do the final put and free.
+ */
+ put_sas_device(sas_device);
}

/**
@@ -5270,6 +5295,7 @@ _scsih_add_device(struct MPT2SAS_ADAPTER *ioc, u16 handle, u8 phy_num, u8 is_pd)
return -1;
}

+ kref_init(&sas_device->kref);
sas_device->handle = handle;
if (_scsih_get_sas_address(ioc, le16_to_cpu
(sas_device_pg0.ParentDevHandle),
@@ -5341,7 +5367,7 @@ _scsih_remove_device(struct MPT2SAS_ADAPTER *ioc,
"handle(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
sas_device->handle, (unsigned long long)
sas_device->sas_address));
- kfree(sas_device);
+ put_sas_device(sas_device);
}
/**
* _scsih_device_remove_by_handle - removing device object by handle
@@ -5355,16 +5381,21 @@ _scsih_device_remove_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;

spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}

@@ -5381,6 +5412,7 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
{
struct _sas_device *sas_device;
unsigned long flags;
+ int was_on_list = 0;

if (ioc->shost_recovery)
return;
@@ -5388,10 +5420,14 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
spin_lock_irqsave(&ioc->sas_device_lock, flags);
sas_device = mpt2sas_scsih_sas_device_find_by_sas_address(ioc,
sas_address);
- if (sas_device)
- list_del(&sas_device->list);
+ if (sas_device) {
+ if (!list_empty(&sas_device->list)) {
+ list_del_init(&sas_device->list);
+ was_on_list = 1;
+ }
+ }
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
- if (sas_device)
+ if (was_on_list)
_scsih_remove_device(ioc, sas_device);
}
#ifdef CONFIG_SCSI_MPT2SAS_LOGGING
@@ -7805,6 +7841,7 @@ _scsih_probe_boot_devices(struct MPT2SAS_ADAPTER *ioc)
handle = sas_device->handle;
sas_address_parent = sas_device->sas_address_parent;
sas_address = sas_device->sas_address;
+ kref_get(&sas_device->kref);
list_move_tail(&sas_device->list, &ioc->sas_device_list);
spin_unlock_irqrestore(&ioc->sas_device_lock, flags);

--
1.7.10.4

2013-05-09 22:11:53

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 01/14] do_mounts: constify name_to_dev_t parameter

Signed-off-by: Joern Engel <[email protected]>
---
include/linux/mount.h | 2 +-
init/do_mounts.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..6b5fa77 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -74,6 +74,6 @@ extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
extern void mark_mounts_for_expiry(struct list_head *mounts);

-extern dev_t name_to_dev_t(char *name);
+extern dev_t name_to_dev_t(const char *name);

#endif /* _LINUX_MOUNT_H */
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 1d1b634..da96f85 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -202,7 +202,7 @@ done:
* bangs.
*/

-dev_t name_to_dev_t(char *name)
+dev_t name_to_dev_t(const char *name)
{
char s[32];
char *p;
--
1.7.10.4

2013-05-09 22:12:00

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 11/14] blockconsole: Rename device_lock with bc_device_lock

From: Takashi Iwai <[email protected]>

Avoid the name conflict with device_lock() defined in linux/device.h.

Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Joern Engel <[email protected]>
---
drivers/block/blockconsole.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index c22272f..40cc96e 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -546,17 +546,17 @@ static void bcon_create_fuzzy(const char *name)
}
}

-static DEFINE_SPINLOCK(device_lock);
+static DEFINE_SPINLOCK(bcon_device_lock);
static char scanned_devices[80];

static void bcon_do_add(struct work_struct *work)
{
char local_devices[80], *name, *remainder = local_devices;

- spin_lock(&device_lock);
+ spin_lock(&bcon_device_lock);
memcpy(local_devices, scanned_devices, sizeof(local_devices));
memset(scanned_devices, 0, sizeof(scanned_devices));
- spin_unlock(&device_lock);
+ spin_unlock(&bcon_device_lock);

while (remainder && remainder[0]) {
name = strsep(&remainder, ",");
@@ -573,11 +573,11 @@ void bcon_add(const char *name)
* to go pick it up asap. Once it is picked up, the buffer is empty
* again, so hopefully it will suffice for all sane users.
*/
- spin_lock(&device_lock);
+ spin_lock(&bcon_device_lock);
if (scanned_devices[0])
strncat(scanned_devices, ",", sizeof(scanned_devices));
strncat(scanned_devices, name, sizeof(scanned_devices));
- spin_unlock(&device_lock);
+ spin_unlock(&bcon_device_lock);
schedule_work(&bcon_add_work);
}

--
1.7.10.4

2013-05-09 22:12:14

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 08/14] blockconsole: Allow to pass a device file path to bcon_tail

From: Takashi Iwai <[email protected]>

... instead of always looking through all devices.

Minor tweak: Moved the "CANDIDATES=" line below Takashi's new code.

Signed-off-by: Takashi Iwai <[email protected]>
Signed-off-by: Joern Engel <[email protected]>
---
Documentation/block/blockconsole/bcon_tail | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/Documentation/block/blockconsole/bcon_tail b/Documentation/block/blockconsole/bcon_tail
index eb3524b..70926c6 100755
--- a/Documentation/block/blockconsole/bcon_tail
+++ b/Documentation/block/blockconsole/bcon_tail
@@ -58,6 +58,21 @@ end_of_log() {
# HEADER contains a newline, so the funny quoting is necessary
HEADER='
Linux blockconsole version 1.1'
+
+DEV="$1"
+if [ -n "$DEV" ]; then
+ if [ ! -b "$DEV" ]; then
+ echo "bcon_tail: No block device file $DEV"
+ exit 1
+ fi
+ if [ "`head -c32 $DEV`" != "$HEADER" ]; then
+ echo "bcon_tail: Invalid device file $DEV"
+ exit 1
+ fi
+ end_of_log $DEV
+ exit 0
+fi
+
CANDIDATES=`lsscsi |sed 's|.*/dev|/dev|'`

for DEV in $CANDIDATES; do
--
1.7.10.4

2013-05-09 22:12:10

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 02/14] add blockconsole version 1.1

Console driver similar to netconsole, except it writes to a block
device. Can be useful in a setup where netconsole, for whatever
reasons, is impractical.

Changes since version 1.0:
- Header format overhaul, addressing several annoyances when actually
using blockconsole for production.
- Steve Hodgson added a panic notifier.
- Added improvements and cleanups from Borislav Petkov.

Signed-off-by: Steve Hodgson <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Joern Engel <[email protected]>
---
Documentation/block/blockconsole.txt | 94 ++++
Documentation/block/blockconsole/bcon_tail | 62 +++
Documentation/block/blockconsole/mkblockconsole | 29 ++
block/partitions/Makefile | 1 +
block/partitions/blockconsole.c | 22 +
block/partitions/check.c | 3 +
block/partitions/check.h | 3 +
drivers/block/Kconfig | 6 +
drivers/block/Makefile | 1 +
drivers/block/blockconsole.c | 621 +++++++++++++++++++++++
include/linux/blockconsole.h | 7 +
11 files changed, 849 insertions(+)
create mode 100644 Documentation/block/blockconsole.txt
create mode 100755 Documentation/block/blockconsole/bcon_tail
create mode 100755 Documentation/block/blockconsole/mkblockconsole
create mode 100644 block/partitions/blockconsole.c
create mode 100644 drivers/block/blockconsole.c
create mode 100644 include/linux/blockconsole.h

diff --git a/Documentation/block/blockconsole.txt b/Documentation/block/blockconsole.txt
new file mode 100644
index 0000000..2b45516
--- /dev/null
+++ b/Documentation/block/blockconsole.txt
@@ -0,0 +1,94 @@
+started by Jörn Engel <[email protected]> 2012.03.17
+
+Blocksonsole for the impatient
+==============================
+
+1. Find an unused USB stick and prepare it for blockconsole by writing
+ the blockconsole signature to it:
+ $ ./mkblockconsole /dev/<usb_stick>
+
+2. USB stick is ready for use, replug it so that the kernel can start
+ logging to it.
+
+3. After you've done logging, read out the logs from it like this:
+ $ ./bcon_tail
+
+ This creates a file called /var/log/bcon.<UUID> which contains the
+ last 16M of the logs. Open it with a sane editor like vim which
+ can display zeroed gaps as a single line and start staring at the
+ logs.
+ For the really impatient, use:
+ $ vi `./bcon_tail`
+
+Introduction:
+=============
+
+This module logs kernel printk messages to block devices, e.g. usb
+sticks. It allows after-the-fact debugging when the main
+disk/filesystem fails and serial consoles and netconsole are
+impractical.
+
+It can currently only be used built-in. Blockconsole hooks into the
+partition scanning code and will bring up configured block devices as
+soon as possible. While this doesn't allow capture of early kernel
+panics, it does capture most of the boot process.
+
+Block device configuration:
+==================================
+
+Blockconsole has no configuration parameter. In order to use a block
+device for logging, the blockconsole header has to be written to the
+device in question. Logging to partitions is not supported.
+
+The example program mkblockconsole can be used to generate such a
+header on a device.
+
+Header format:
+==============
+
+A legal header looks like this:
+
+Linux blockconsole version 1.1
+818cf322
+00000000
+00000000
+
+It consists of a newline, the "Linux blockconsole version 1.1" string
+plus three numbers on separate lines each. Numbers are all 32bit,
+represented as 8-byte hex strings, with letters in lowercase. The
+first number is a uuid for this particular console device. Just pick
+a random number when generating the device. The second number is a
+wrap counter and unlikely to ever increment. The third is a tile
+counter, with a tile being one megabyte in size.
+
+Miscellaneous notes:
+====================
+
+Blockconsole will write a new header for every tile or once every
+megabyte. The header starts with a newline in order to ensure the
+"Linux blockconsole...' string always ends up at the beginning of a
+line if you read the blockconsole in a text editor.
+
+The blockconsole header is constructed such that opening the log
+device in a text editor, ignoring memory constraints due to large
+devices, should just work and be reasonably non-confusing to readers.
+However, the example program bcon_tail can be used to copy the last 16
+tiles of the log device to /var/log/bcon.<uuid>, which should be much
+easier to handle.
+
+The wrap counter is used by blockconsole to determine where to
+continue logging after a reboot. New logs will be written to the
+first tile that wasn't written to by the last instance of
+blockconsole. Similarly bcon_tail is doing a binary search to find
+the end of the log.
+
+Writing to the log device is strictly circular. This should give
+optimal performance and reliability on cheap devices, like usb sticks.
+
+Writing to block devices has to happen in sector granularity, while
+kernel logging happens in byte granularity. In order not to lose
+messages in important cases like kernel crashes, a timer will write
+out partial sectors if no new messages appear for a while. The
+unwritten part of the sector will be filled with spaces and a single
+newline. In a quiet system, these empty lines can make up the bulk of
+the log.
diff --git a/Documentation/block/blockconsole/bcon_tail b/Documentation/block/blockconsole/bcon_tail
new file mode 100755
index 0000000..b4bd660
--- /dev/null
+++ b/Documentation/block/blockconsole/bcon_tail
@@ -0,0 +1,62 @@
+#!/bin/bash
+
+TAIL_LEN=16
+TEMPLATE=/tmp/bcon_template
+BUF=/tmp/bcon_buf
+
+if [ $(whoami) != "root" ]; then
+ echo "You need to be root to run this."
+ exit 1
+fi
+
+if [ -z "$(which lsscsi)" ]; then
+ echo "You need to install the lsscsi package on your distro."
+ exit 1
+fi
+
+end_of_log() {
+ DEV=$1
+ UUID=`head -c40 $DEV|tail -c8`
+ LOGFILE=/var/log/bcon.$UUID
+ SECTORS=`hdparm -g $DEV|grep sectors|sed 's/.*sectors = \([0-9]*\).*/\1/'`
+ #MSIZE=`expr $SECTORS / 2048`
+ dd if=$DEV iflag=direct bs=512 2>/dev/null|head -c50 > $TEMPLATE
+ #START, MIDDLE and END are in sectors
+ START=0
+ MIDDLE=$SECTORS
+ END=$SECTORS
+ while true; do
+ MIDDLE=`expr \( \( $END + $START \) / 4096 \) \* 2048`
+ if [ $MIDDLE -eq $START ]; then
+ break
+ fi
+ dd if=$DEV iflag=direct bs=512 count=1 skip=$MIDDLE 2>/dev/null|head -c50 > $BUF
+ if diff -q $BUF $TEMPLATE > /dev/null; then
+ START=$MIDDLE
+ else
+ END=$MIDDLE
+ fi
+ done
+ #switch to megabytes
+ END=`expr $END / 2048`
+ START=`expr $START / 2048`
+ if [ $START -lt $TAIL_LEN ]; then
+ START=0
+ else
+ START=`expr $START - $TAIL_LEN + 1`
+ fi
+ LEN=`expr $END - $START`
+ dd if=$DEV iflag=direct bs=1M count=$LEN skip=$START >$LOGFILE 2>/dev/null
+ echo $LOGFILE
+}
+
+# HEADER contains a newline, so the funny quoting is necessary
+HEADER='
+Linux blockconsole version 1.1'
+CANDIDATES=`lsscsi |sed 's|.*/dev|/dev|'`
+
+for DEV in $CANDIDATES; do
+ if [ "`head -c32 $DEV`" == "$HEADER" ]; then
+ end_of_log $DEV
+ fi
+done
diff --git a/Documentation/block/blockconsole/mkblockconsole b/Documentation/block/blockconsole/mkblockconsole
new file mode 100755
index 0000000..c48f995
--- /dev/null
+++ b/Documentation/block/blockconsole/mkblockconsole
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# handle the case when using files for logging instead of
+# real devices (with kvm, for example)
+DD_OPTS="conv=notrunc"
+
+if [ ! $# -eq 1 ]; then
+ echo "Usage: $0 <dev>"
+ exit 1
+elif mount|fgrep -q $1; then
+ echo Device appears to be mounted - aborting
+ exit 1
+else
+ dd if=/dev/zero of=$1 bs=1M count=1 $DD_OPTS
+ # The funky formatting is actually needed!
+ UUID=`head -c4 /dev/urandom |hexdump -e '/4 "%08x"'`
+ echo > /tmp/$UUID
+ echo 'Linux blockconsole version 1.1' >> /tmp/$UUID
+ echo "$UUID" >> /tmp/$UUID
+ echo 00000000 >> /tmp/$UUID
+ echo 00000000 >> /tmp/$UUID
+ for i in `seq 452`; do echo -n " " >> /tmp/$UUID; done
+ echo >> /tmp/$UUID
+
+ dd if=/tmp/$UUID of=$1 $DD_OPTS
+ rm /tmp/$UUID
+ sync
+ exit 0
+fi
diff --git a/block/partitions/Makefile b/block/partitions/Makefile
index 03af8ea..bf26d4a 100644
--- a/block/partitions/Makefile
+++ b/block/partitions/Makefile
@@ -18,3 +18,4 @@ obj-$(CONFIG_IBM_PARTITION) += ibm.o
obj-$(CONFIG_EFI_PARTITION) += efi.o
obj-$(CONFIG_KARMA_PARTITION) += karma.o
obj-$(CONFIG_SYSV68_PARTITION) += sysv68.o
+obj-$(CONFIG_BLOCKCONSOLE) += blockconsole.o
diff --git a/block/partitions/blockconsole.c b/block/partitions/blockconsole.c
new file mode 100644
index 0000000..79796a8
--- /dev/null
+++ b/block/partitions/blockconsole.c
@@ -0,0 +1,22 @@
+#include <linux/blockconsole.h>
+
+#include "check.h"
+
+int blockconsole_partition(struct parsed_partitions *state)
+{
+ Sector sect;
+ void *data;
+ int err = 0;
+
+ data = read_part_sector(state, 0, &sect);
+ if (!data)
+ return -EIO;
+ if (!bcon_magic_present(data))
+ goto out;
+
+ bcon_add(state->name);
+ err = 1;
+out:
+ put_dev_sector(sect);
+ return err;
+}
diff --git a/block/partitions/check.c b/block/partitions/check.c
index bc90867..c3e325b 100644
--- a/block/partitions/check.c
+++ b/block/partitions/check.c
@@ -41,6 +41,9 @@ static int (*check_part[])(struct parsed_partitions *) = {
* Probe partition formats with tables at disk address 0
* that also have an ADFS boot block at 0xdc0.
*/
+#ifdef CONFIG_BLOCKCONSOLE
+ blockconsole_partition,
+#endif
#ifdef CONFIG_ACORN_PARTITION_ICS
adfspart_check_ICS,
#endif
diff --git a/block/partitions/check.h b/block/partitions/check.h
index 52b1003..064bf6f 100644
--- a/block/partitions/check.h
+++ b/block/partitions/check.h
@@ -50,3 +50,6 @@ put_partition(struct parsed_partitions *p, int n, sector_t from, sector_t size)

extern int warn_no_part;

+#ifdef CONFIG_BLOCKCONSOLE
+int blockconsole_partition(struct parsed_partitions *state);
+#endif
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 824e09c..06eb42f 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -544,4 +544,10 @@ config BLK_DEV_RBD

If unsure, say N.

+config BLOCKCONSOLE
+ bool "Block device console logging support"
+ help
+ This enables logging to block devices.
+ See <file:Documentation/block/blockconsole.txt> for details.
+
endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 17e82df..99c5c2e 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -40,5 +40,6 @@ obj-$(CONFIG_XEN_BLKDEV_BACKEND) += xen-blkback/
obj-$(CONFIG_BLK_DEV_DRBD) += drbd/
obj-$(CONFIG_BLK_DEV_RBD) += rbd.o
obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX) += mtip32xx/
+obj-$(CONFIG_BLOCKCONSOLE) += blockconsole.o

swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
new file mode 100644
index 0000000..7f8ac5b
--- /dev/null
+++ b/drivers/block/blockconsole.c
@@ -0,0 +1,621 @@
+/*
+ * Blockconsole - write kernel console to a block device
+ *
+ * Copyright (C) 2012 Joern Engel <[email protected]>
+ */
+#include <linux/bio.h>
+#include <linux/blockconsole.h>
+#include <linux/console.h>
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/kthread.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/workqueue.h>
+#include <linux/sched.h>
+#include <linux/ctype.h>
+
+#define BLOCKCONSOLE_MAGIC_OLD "\nLinux blockconsole version 1.0\n"
+#define BLOCKCONSOLE_MAGIC "\nLinux blockconsole version 1.1\n"
+#define BCON_UUID_OFS (32)
+#define BCON_ROUND_OFS (41)
+#define BCON_TILE_OFS (50)
+#define BCON_HEADERSIZE (50)
+#define BCON_LONG_HEADERSIZE (59) /* with tile index */
+
+#define PAGE_COUNT (256)
+#define SECTOR_COUNT (PAGE_COUNT * (PAGE_SIZE >> 9))
+#define CACHE_PAGE_MASK (PAGE_COUNT - 1)
+#define CACHE_SECTOR_MASK (SECTOR_COUNT - 1)
+#define CACHE_SIZE (PAGE_COUNT << PAGE_SHIFT)
+#define CACHE_MASK (CACHE_SIZE - 1)
+#define SECTOR_SHIFT (9)
+#define SECTOR_SIZE (1 << SECTOR_SHIFT)
+#define SECTOR_MASK (~(SECTOR_SIZE-1))
+#define PG_SECTOR_MASK ((PAGE_SIZE >> 9) - 1)
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+struct bcon_bio {
+ struct bio bio;
+ struct bio_vec bvec;
+ void *sector;
+ int in_flight;
+};
+
+struct blockconsole {
+ char devname[32];
+ spinlock_t end_io_lock;
+ struct timer_list pad_timer;
+ int error_count;
+ struct kref kref;
+ u64 console_bytes;
+ u64 write_bytes;
+ u64 max_bytes;
+ u32 round;
+ u32 uuid;
+ struct bcon_bio bio_array[SECTOR_COUNT];
+ struct page *pages;
+ struct bcon_bio zero_bios[PAGE_COUNT];
+ struct page *zero_page;
+ struct block_device *bdev;
+ struct console console;
+ struct work_struct unregister_work;
+ struct task_struct *writeback_thread;
+ struct notifier_block panic_block;
+};
+
+static void bcon_get(struct blockconsole *bc)
+{
+ kref_get(&bc->kref);
+}
+
+static void bcon_release(struct kref *kref)
+{
+ struct blockconsole *bc = container_of(kref, struct blockconsole, kref);
+
+ __free_pages(bc->zero_page, 0);
+ __free_pages(bc->pages, 8);
+ invalidate_mapping_pages(bc->bdev->bd_inode->i_mapping, 0, -1);
+ blkdev_put(bc->bdev, FMODE_READ|FMODE_WRITE);
+ kfree(bc);
+}
+
+static void bcon_put(struct blockconsole *bc)
+{
+ kref_put(&bc->kref, bcon_release);
+}
+
+static int __bcon_console_ofs(u64 console_bytes)
+{
+ return console_bytes & ~SECTOR_MASK;
+}
+
+static int bcon_console_ofs(struct blockconsole *bc)
+{
+ return __bcon_console_ofs(bc->console_bytes);
+}
+
+static int __bcon_console_sector(u64 console_bytes)
+{
+ return (console_bytes >> SECTOR_SHIFT) & CACHE_SECTOR_MASK;
+}
+
+static int bcon_console_sector(struct blockconsole *bc)
+{
+ return __bcon_console_sector(bc->console_bytes);
+}
+
+static int bcon_write_sector(struct blockconsole *bc)
+{
+ return (bc->write_bytes >> SECTOR_SHIFT) & CACHE_SECTOR_MASK;
+}
+
+static void clear_sector(void *sector)
+{
+ memset(sector, ' ', 511);
+ memset(sector + 511, 10, 1);
+}
+
+static void bcon_init_first_page(struct blockconsole *bc)
+{
+ char *buf = page_address(bc->pages);
+ size_t len = strlen(BLOCKCONSOLE_MAGIC);
+ u32 tile = bc->console_bytes >> 20; /* We overflow after 4TB - fine */
+
+ clear_sector(buf);
+ memcpy(buf, BLOCKCONSOLE_MAGIC, len);
+ sprintf(buf + BCON_UUID_OFS, "%08x", bc->uuid);
+ sprintf(buf + BCON_ROUND_OFS, "%08x", bc->round);
+ sprintf(buf + BCON_TILE_OFS, "%08x", tile);
+ /* replace NUL with newline */
+ buf[BCON_UUID_OFS + 8] = 10;
+ buf[BCON_ROUND_OFS + 8] = 10;
+ buf[BCON_TILE_OFS + 8] = 10;
+}
+
+static void bcon_advance_console_bytes(struct blockconsole *bc, int bytes)
+{
+ u64 old, new;
+
+ do {
+ old = bc->console_bytes;
+ new = old + bytes;
+ if (new >= bc->max_bytes)
+ new = 0;
+ if ((new & CACHE_MASK) == 0) {
+ bcon_init_first_page(bc);
+ new += BCON_LONG_HEADERSIZE;
+ }
+ } while (cmpxchg64(&bc->console_bytes, old, new) != old);
+}
+
+static void request_complete(struct bio *bio, int err)
+{
+ complete((struct completion *)bio->bi_private);
+}
+
+static int sync_read(struct blockconsole *bc, u64 ofs)
+{
+ struct bio bio;
+ struct bio_vec bio_vec;
+ struct completion complete;
+
+ bio_init(&bio);
+ bio.bi_io_vec = &bio_vec;
+ bio_vec.bv_page = bc->pages;
+ bio_vec.bv_len = SECTOR_SIZE;
+ bio_vec.bv_offset = 0;
+ bio.bi_vcnt = 1;
+ bio.bi_idx = 0;
+ bio.bi_size = SECTOR_SIZE;
+ bio.bi_bdev = bc->bdev;
+ bio.bi_sector = ofs >> SECTOR_SHIFT;
+ init_completion(&complete);
+ bio.bi_private = &complete;
+ bio.bi_end_io = request_complete;
+
+ submit_bio(READ, &bio);
+ wait_for_completion(&complete);
+ return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
+}
+
+static void bcon_erase_segment(struct blockconsole *bc)
+{
+ int i;
+
+ for (i = 0; i < PAGE_COUNT; i++) {
+ struct bcon_bio *bcon_bio = bc->zero_bios + i;
+ struct bio *bio = &bcon_bio->bio;
+
+ /*
+ * If the last erase hasn't finished yet, just skip it. The log
+ * will look messy, but that's all.
+ */
+ rmb();
+ if (bcon_bio->in_flight)
+ continue;
+ bio_init(bio);
+ bio->bi_io_vec = &bcon_bio->bvec;
+ bio->bi_vcnt = 1;
+ bio->bi_size = PAGE_SIZE;
+ bio->bi_bdev = bc->bdev;
+ bio->bi_private = bc;
+ bio->bi_idx = 0;
+ bio->bi_sector = (bc->write_bytes + i * PAGE_SIZE) >> 9;
+ bcon_bio->in_flight = 1;
+ wmb();
+ /* We want the erase to go to the device first somehow */
+ submit_bio(WRITE | REQ_SOFTBARRIER, bio);
+ }
+}
+
+static void bcon_advance_write_bytes(struct blockconsole *bc, int bytes)
+{
+ bc->write_bytes += bytes;
+ if (bc->write_bytes >= bc->max_bytes) {
+ bc->write_bytes = 0;
+ bcon_init_first_page(bc);
+ bc->round++;
+ }
+}
+
+static int bcon_convert_old_format(struct blockconsole *bc)
+{
+ bc->uuid = get_random_int();
+ bc->round = 0;
+ bc->console_bytes = bc->write_bytes = 0;
+ bcon_advance_console_bytes(bc, 0); /* To skip the header */
+ bcon_advance_write_bytes(bc, 0); /* To wrap around, if necessary */
+ bcon_erase_segment(bc);
+ pr_info("converted %s from old format\n", bc->devname);
+ return 0;
+}
+
+static int bcon_find_end_of_log(struct blockconsole *bc)
+{
+ u64 start = 0, end = bc->max_bytes, middle;
+ void *sec0 = bc->bio_array[0].sector;
+ void *sec1 = bc->bio_array[1].sector;
+ int err, version;
+
+ err = sync_read(bc, 0);
+ if (err)
+ return err;
+ /* Second sanity check, out of sheer paranoia */
+ version = bcon_magic_present(sec0);
+ if (version == 10)
+ return bcon_convert_old_format(bc);
+
+ bc->uuid = simple_strtoull(sec0 + BCON_UUID_OFS, NULL, 16);
+ bc->round = simple_strtoull(sec0 + BCON_ROUND_OFS, NULL, 16);
+
+ memcpy(sec1, sec0, BCON_HEADERSIZE);
+ for (;;) {
+ middle = (start + end) / 2;
+ middle &= ~CACHE_MASK;
+ if (middle == start)
+ break;
+ err = sync_read(bc, middle);
+ if (err)
+ return err;
+ if (memcmp(sec1, sec0, BCON_HEADERSIZE)) {
+ /* If the two differ, we haven't written that far yet */
+ end = middle;
+ } else {
+ start = middle;
+ }
+ }
+ bc->console_bytes = bc->write_bytes = end;
+ bcon_advance_console_bytes(bc, 0); /* To skip the header */
+ bcon_advance_write_bytes(bc, 0); /* To wrap around, if necessary */
+ bcon_erase_segment(bc);
+ return 0;
+}
+
+static void bcon_unregister(struct work_struct *work)
+{
+ struct blockconsole *bc = container_of(work, struct blockconsole,
+ unregister_work);
+
+ atomic_notifier_chain_unregister(&panic_notifier_list, &bc->panic_block);
+ unregister_console(&bc->console);
+ del_timer_sync(&bc->pad_timer);
+ kthread_stop(bc->writeback_thread);
+ /* No new io will be scheduled anymore now */
+ bcon_put(bc);
+}
+
+#define BCON_MAX_ERRORS 10
+static void bcon_end_io(struct bio *bio, int err)
+{
+ struct bcon_bio *bcon_bio = container_of(bio, struct bcon_bio, bio);
+ struct blockconsole *bc = bio->bi_private;
+ unsigned long flags;
+
+ /*
+ * We want to assume the device broken and free this console if
+ * we accumulate too many errors. But if errors are transient,
+ * we also want to forget about them once writes succeed again.
+ * Oh, and we only want to reset the counter if it hasn't reached
+ * the limit yet, so we don't bcon_put() twice from here.
+ */
+ spin_lock_irqsave(&bc->end_io_lock, flags);
+ if (err) {
+ if (bc->error_count++ == BCON_MAX_ERRORS) {
+ pr_info("no longer logging to %s\n", bc->devname);
+ schedule_work(&bc->unregister_work);
+ }
+ } else {
+ if (bc->error_count && bc->error_count < BCON_MAX_ERRORS)
+ bc->error_count = 0;
+ }
+ /*
+ * Add padding (a bunch of spaces and a newline) early so bcon_pad
+ * only has to advance a pointer.
+ */
+ clear_sector(bcon_bio->sector);
+ bcon_bio->in_flight = 0;
+ spin_unlock_irqrestore(&bc->end_io_lock, flags);
+ bcon_put(bc);
+}
+
+static void bcon_writesector(struct blockconsole *bc, int index)
+{
+ struct bcon_bio *bcon_bio = bc->bio_array + index;
+ struct bio *bio = &bcon_bio->bio;
+
+ rmb();
+ if (bcon_bio->in_flight)
+ return;
+ bcon_get(bc);
+
+ bio_init(bio);
+ bio->bi_io_vec = &bcon_bio->bvec;
+ bio->bi_vcnt = 1;
+ bio->bi_size = SECTOR_SIZE;
+ bio->bi_bdev = bc->bdev;
+ bio->bi_private = bc;
+ bio->bi_end_io = bcon_end_io;
+
+ bio->bi_idx = 0;
+ bio->bi_sector = bc->write_bytes >> 9;
+ bcon_bio->in_flight = 1;
+ wmb();
+ submit_bio(WRITE, bio);
+}
+
+static int bcon_writeback(void *_bc)
+{
+ struct blockconsole *bc = _bc;
+ struct sched_param(sp);
+
+ sp.sched_priority = MAX_RT_PRIO - 1; /* Highest realtime prio */
+ sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+ if (kthread_should_stop())
+ break;
+ while (bcon_write_sector(bc) != bcon_console_sector(bc)) {
+ bcon_writesector(bc, bcon_write_sector(bc));
+ bcon_advance_write_bytes(bc, SECTOR_SIZE);
+ if (bcon_write_sector(bc) == 0)
+ bcon_erase_segment(bc);
+ }
+ }
+ return 0;
+}
+
+static void bcon_pad(unsigned long data)
+{
+ struct blockconsole *bc = (void *)data;
+ unsigned int n;
+
+ /*
+ * We deliberately race against bcon_write here. If we lose the race,
+ * our padding is no longer where we expected it to be, i.e. it is
+ * no longer a bunch of spaces with a newline at the end. There could
+ * not be a newline at all or it could be somewhere in the middle.
+ * Either way, the log corruption is fairly obvious to spot and ignore
+ * for human readers.
+ */
+ n = SECTOR_SIZE - bcon_console_ofs(bc);
+ if (n != SECTOR_SIZE) {
+ bcon_advance_console_bytes(bc, n);
+ wake_up_process(bc->writeback_thread);
+ }
+}
+
+static void bcon_write(struct console *console, const char *msg,
+ unsigned int len)
+{
+ struct blockconsole *bc = container_of(console, struct blockconsole,
+ console);
+ unsigned int n;
+ u64 console_bytes;
+ int i;
+
+ while (len) {
+ console_bytes = bc->console_bytes;
+ i = __bcon_console_sector(console_bytes);
+ rmb();
+ if (bc->bio_array[i].in_flight)
+ break;
+ n = min_t(int, len, SECTOR_SIZE -
+ __bcon_console_ofs(console_bytes));
+ memcpy(bc->bio_array[i].sector +
+ __bcon_console_ofs(console_bytes), msg, n);
+ len -= n;
+ msg += n;
+ bcon_advance_console_bytes(bc, n);
+ }
+ wake_up_process(bc->writeback_thread);
+ mod_timer(&bc->pad_timer, jiffies + HZ);
+}
+
+static void bcon_init_bios(struct blockconsole *bc)
+{
+ int i;
+
+ for (i = 0; i < SECTOR_COUNT; i++) {
+ int page_index = i >> (PAGE_SHIFT - SECTOR_SHIFT);
+ struct page *page = bc->pages + page_index;
+ struct bcon_bio *bcon_bio = bc->bio_array + i;
+ struct bio_vec *bvec = &bcon_bio->bvec;
+
+ bcon_bio->in_flight = 0;
+ bcon_bio->sector = page_address(bc->pages + page_index)
+ + SECTOR_SIZE * (i & PG_SECTOR_MASK);
+ clear_sector(bcon_bio->sector);
+ bvec->bv_page = page;
+ bvec->bv_len = SECTOR_SIZE;
+ bvec->bv_offset = SECTOR_SIZE * (i & PG_SECTOR_MASK);
+ }
+}
+
+static void bcon_init_zero_bio(struct blockconsole *bc)
+{
+ int i;
+
+ memset(page_address(bc->zero_page), 0, PAGE_SIZE);
+ for (i = 0; i < PAGE_COUNT; i++) {
+ struct bcon_bio *bcon_bio = bc->zero_bios + i;
+ struct bio_vec *bvec = &bcon_bio->bvec;
+
+ bcon_bio->in_flight = 0;
+ bvec->bv_page = bc->zero_page;
+ bvec->bv_len = PAGE_SIZE;
+ bvec->bv_offset = 0;
+ }
+}
+
+static int blockconsole_panic(struct notifier_block *this, unsigned long event,
+ void *ptr)
+{
+ struct blockconsole *bc = container_of(this, struct blockconsole,
+ panic_block);
+ unsigned int n;
+
+ n = SECTOR_SIZE - bcon_console_ofs(bc);
+ if (n != SECTOR_SIZE)
+ bcon_advance_console_bytes(bc, n);
+ bcon_writeback(bc);
+ return NOTIFY_DONE;
+}
+
+static int bcon_create(const char *devname)
+{
+ const fmode_t mode = FMODE_READ | FMODE_WRITE;
+ struct blockconsole *bc;
+ int err;
+
+ bc = kzalloc(sizeof(*bc), GFP_KERNEL);
+ if (!bc)
+ return -ENOMEM;
+ memset(bc->devname, ' ', sizeof(bc->devname));
+ strlcpy(bc->devname, devname, sizeof(bc->devname));
+ spin_lock_init(&bc->end_io_lock);
+ strcpy(bc->console.name, "bcon");
+ bc->console.flags = CON_PRINTBUFFER | CON_ENABLED;
+ bc->console.write = bcon_write;
+ bc->bdev = blkdev_get_by_path(devname, mode, NULL);
+#ifndef MODULE
+ if (IS_ERR(bc->bdev)) {
+ dev_t devt = name_to_dev_t(devname);
+ if (devt)
+ bc->bdev = blkdev_get_by_dev(devt, mode, NULL);
+ }
+#endif
+ if (IS_ERR(bc->bdev))
+ goto out;
+ bc->pages = alloc_pages(GFP_KERNEL, 8);
+ if (!bc->pages)
+ goto out;
+ bc->zero_page = alloc_pages(GFP_KERNEL, 0);
+ if (!bc->zero_page)
+ goto out1;
+ bcon_init_bios(bc);
+ bcon_init_zero_bio(bc);
+ setup_timer(&bc->pad_timer, bcon_pad, (unsigned long)bc);
+ bc->max_bytes = bc->bdev->bd_inode->i_size & ~CACHE_MASK;
+ err = bcon_find_end_of_log(bc);
+ if (err)
+ goto out2;
+ kref_init(&bc->kref); /* This reference gets freed on errors */
+ bc->writeback_thread = kthread_run(bcon_writeback, bc, "bcon_%s",
+ devname);
+ if (IS_ERR(bc->writeback_thread))
+ goto out2;
+ INIT_WORK(&bc->unregister_work, bcon_unregister);
+ register_console(&bc->console);
+ bc->panic_block.notifier_call = blockconsole_panic;
+ bc->panic_block.priority = INT_MAX;
+ atomic_notifier_chain_register(&panic_notifier_list, &bc->panic_block);
+ pr_info("now logging to %s at %llx\n", devname, bc->console_bytes >> 20);
+
+ return 0;
+
+out2:
+ __free_pages(bc->zero_page, 0);
+out1:
+ __free_pages(bc->pages, 8);
+out:
+ kfree(bc);
+ /* Not strictly correct, be the caller doesn't care */
+ return -ENOMEM;
+}
+
+static void bcon_create_fuzzy(const char *name)
+{
+ char *longname;
+ int err;
+
+ err = bcon_create(name);
+ if (err) {
+ longname = kzalloc(strlen(name) + 6, GFP_KERNEL);
+ if (!longname)
+ return;
+ strcpy(longname, "/dev/");
+ strcat(longname, name);
+ bcon_create(longname);
+ kfree(longname);
+ }
+}
+
+static DEFINE_SPINLOCK(device_lock);
+static char scanned_devices[80];
+
+static void bcon_do_add(struct work_struct *work)
+{
+ char local_devices[80], *name, *remainder = local_devices;
+
+ spin_lock(&device_lock);
+ memcpy(local_devices, scanned_devices, sizeof(local_devices));
+ memset(scanned_devices, 0, sizeof(scanned_devices));
+ spin_unlock(&device_lock);
+
+ while (remainder && remainder[0]) {
+ name = strsep(&remainder, ",");
+ bcon_create_fuzzy(name);
+ }
+}
+
+DECLARE_WORK(bcon_add_work, bcon_do_add);
+
+void bcon_add(const char *name)
+{
+ /*
+ * We add each name to a small static buffer and ask for a workqueue
+ * to go pick it up asap. Once it is picked up, the buffer is empty
+ * again, so hopefully it will suffice for all sane users.
+ */
+ spin_lock(&device_lock);
+ if (scanned_devices[0])
+ strncat(scanned_devices, ",", sizeof(scanned_devices));
+ strncat(scanned_devices, name, sizeof(scanned_devices));
+ spin_unlock(&device_lock);
+ schedule_work(&bcon_add_work);
+}
+
+/*
+ * Check if we have an 8-digit hex number followed by newline
+ */
+static bool is_four_byte_hex(const void *data)
+{
+ const char *str = data;
+ int len = 0;
+
+ while (isxdigit(*str) && len++ < 9)
+ str++;
+
+ if (len != 8)
+ return false;
+
+ /* str should point to a \n now */
+ if (*str != 0xa)
+ return false;
+
+ return true;
+}
+
+int bcon_magic_present(const void *data)
+{
+ size_t len = strlen(BLOCKCONSOLE_MAGIC);
+
+ if (!memcmp(data, BLOCKCONSOLE_MAGIC_OLD, len))
+ return 10;
+ if (memcmp(data, BLOCKCONSOLE_MAGIC, len))
+ return 0;
+ if (!is_four_byte_hex(data + BCON_UUID_OFS))
+ return 0;
+ if (!is_four_byte_hex(data + BCON_ROUND_OFS))
+ return 0;
+ if (!is_four_byte_hex(data + BCON_TILE_OFS))
+ return 0;
+ return 11;
+}
diff --git a/include/linux/blockconsole.h b/include/linux/blockconsole.h
new file mode 100644
index 0000000..114f7c5
--- /dev/null
+++ b/include/linux/blockconsole.h
@@ -0,0 +1,7 @@
+#ifndef LINUX_BLOCKCONSOLE_H
+#define LINUX_BLOCKCONSOLE_H
+
+int bcon_magic_present(const void *data);
+void bcon_add(const char *name);
+
+#endif
--
1.7.10.4

2013-05-09 22:13:00

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 03/14] printk: add CON_ALLDATA console flag

For consoles like netconsole and blockconsole the loglevel filtering
really doesn't make any sense. If a line gets printed at all, please
send it down to that console, no questions asked.

For vga_con, it is a completely different matter, as the user sitting in
front of his console could get spammed by messages while trying to login
or similar. So ignore_loglevel doesn't work as a one-size-fits-all
approach. Add a per-console flag instead so that netconsole and
blockconsole can opt-in.

Signed-off-by: Joern Engel <[email protected]>
---
include/linux/console.h | 1 +
kernel/printk.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index dedb082..eed92ad 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -116,6 +116,7 @@ static inline int con_debug_leave(void)
#define CON_BOOT (8)
#define CON_ANYTIME (16) /* Safe to call when cpu is offline */
#define CON_BRL (32) /* Used for a braille device */
+#define CON_ALLDATA (64) /* per-console ignore_loglevel */

struct console {
char name[16];
diff --git a/kernel/printk.c b/kernel/printk.c
index 267ce78..5221c59 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -1261,8 +1261,6 @@ static void call_console_drivers(int level, const char *text, size_t len)

trace_console(text, 0, len, len);

- if (level >= console_loglevel && !ignore_loglevel)
- return;
if (!console_drivers)
return;

@@ -1276,6 +1274,9 @@ static void call_console_drivers(int level, const char *text, size_t len)
if (!cpu_online(smp_processor_id()) &&
!(con->flags & CON_ANYTIME))
continue;
+ if (level >= console_loglevel && !ignore_loglevel &&
+ !(con->flags & CON_ALLDATA))
+ continue;
con->write(con, text, len);
}
}
--
1.7.10.4

2013-05-09 22:13:11

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 04/14] netconsole: use CON_ALLDATA

Netconsole should really see every message ever printed. The
alternative is to try debugging with information like this:
[166135.633974] Stack:
[166135.634016] Call Trace:
[166135.634029] <IRQ>
[166135.634156] <EOI>
[166135.634177] Code: 00 00 55 48 89 e5 0f 1f 44 00 00 ff 15 31 49 80 00 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
[166135.634384] 48 8b 14 25 98 24 01 00 48 8d 14 92 48 8d 04 bd 00 00 00 00

Signed-off-by: Joern Engel <[email protected]>
---
drivers/net/netconsole.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 6989ebe..77783fe 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -718,7 +718,7 @@ static void write_msg(struct console *con, const char *msg, unsigned int len)

static struct console netconsole = {
.name = "netcon",
- .flags = CON_ENABLED,
+ .flags = CON_ENABLED | CON_ALLDATA,
.write = write_msg,
};

--
1.7.10.4

2013-05-09 22:13:29

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] mpt2sas/mpt3sas: prevent double free on error path

On Thu, 9 May 2013 16:42:58 -0400, Joern Engel wrote:
> Subject: [PATCH] mpt2sas/mpt3sas: prevent double free on error path

The two mtp[23]sas patches shouldn't have been in here. Sorry about
the noise!

Jörn

--
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

2013-05-09 22:13:34

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 05/14] blockconsole: use CON_ALLDATA

Blockconsole should really see every message ever printed. The
alternative is to try debugging with information like this:
[166135.633974] Stack:
[166135.634016] Call Trace:
[166135.634029] <IRQ>
[166135.634156] <EOI>
[166135.634177] Code: 00 00 55 48 89 e5 0f 1f 44 00 00 ff 15 31 49 80 00 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
[166135.634384] 48 8b 14 25 98 24 01 00 48 8d 14 92 48 8d 04 bd 00 00 00 00

Signed-off-by: Joern Engel <[email protected]>
---
drivers/block/blockconsole.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index 7f8ac5b..32f6c62 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -481,7 +481,7 @@ static int bcon_create(const char *devname)
strlcpy(bc->devname, devname, sizeof(bc->devname));
spin_lock_init(&bc->end_io_lock);
strcpy(bc->console.name, "bcon");
- bc->console.flags = CON_PRINTBUFFER | CON_ENABLED;
+ bc->console.flags = CON_PRINTBUFFER | CON_ENABLED | CON_ALLDATA;
bc->console.write = bcon_write;
bc->bdev = blkdev_get_by_path(devname, mode, NULL);
#ifndef MODULE
--
1.7.10.4

2013-05-09 22:13:45

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 06/14] bcon: add a release work struct

The final bcon_put() can be called from atomic context, by way of
bio_endio(). In that case we would sleep in invalidate_mapping_pages(),
with the usual unhappy results.

In nearly a year of production use, I have only seen a matching
backtrace once. There was a second known issue that could be reproduced
by "yes h > /proc/sysrq-trigger" and concurrently pulling and replugging
the blockconsole device. It took be somewhere around 30 pulls and sore
thumbs to reproduce and I never found the time to get to the bottom of
it. Quite likely the two issues are identical.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/block/blockconsole.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index 32f6c62..b4730f8 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -65,6 +65,7 @@ struct blockconsole {
struct block_device *bdev;
struct console console;
struct work_struct unregister_work;
+ struct work_struct release_work;
struct task_struct *writeback_thread;
struct notifier_block panic_block;
};
@@ -74,9 +75,10 @@ static void bcon_get(struct blockconsole *bc)
kref_get(&bc->kref);
}

-static void bcon_release(struct kref *kref)
+static void __bcon_release(struct work_struct *work)
{
- struct blockconsole *bc = container_of(kref, struct blockconsole, kref);
+ struct blockconsole *bc = container_of(work, struct blockconsole,
+ release_work);

__free_pages(bc->zero_page, 0);
__free_pages(bc->pages, 8);
@@ -85,6 +87,14 @@ static void bcon_release(struct kref *kref)
kfree(bc);
}

+static void bcon_release(struct kref *kref)
+{
+ struct blockconsole *bc = container_of(kref, struct blockconsole, kref);
+
+ /* bcon_release can be called from atomic context */
+ schedule_work(&bc->release_work);
+}
+
static void bcon_put(struct blockconsole *bc)
{
kref_put(&bc->kref, bcon_release);
@@ -512,6 +522,7 @@ static int bcon_create(const char *devname)
if (IS_ERR(bc->writeback_thread))
goto out2;
INIT_WORK(&bc->unregister_work, bcon_unregister);
+ INIT_WORK(&bc->release_work, __bcon_release);
register_console(&bc->console);
bc->panic_block.notifier_call = blockconsole_panic;
bc->panic_block.priority = INT_MAX;
--
1.7.10.4

2013-05-09 22:13:51

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 07/14] bcon: check for hdparm in bcon_tail

From: Borislav Petkov <[email protected]>
Signed-off-by: Joern Engel <[email protected]>
---
Documentation/block/blockconsole/bcon_tail | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/Documentation/block/blockconsole/bcon_tail b/Documentation/block/blockconsole/bcon_tail
index b4bd660..eb3524b 100755
--- a/Documentation/block/blockconsole/bcon_tail
+++ b/Documentation/block/blockconsole/bcon_tail
@@ -14,6 +14,11 @@ if [ -z "$(which lsscsi)" ]; then
exit 1
fi

+if [ -z "$(which hdparm)" ]; then
+ echo "You need to install the hdparm package on your distro."
+ exit 1
+fi
+
end_of_log() {
DEV=$1
UUID=`head -c40 $DEV|tail -c8`
--
1.7.10.4

2013-05-09 22:14:08

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 13/14] bcon: Fix wrap-around behaviour

This seems to have broken around the introduction of format 1.1. When
wrapping around, we should increment the wrap counter before writing it
out, not after.

Signed-off-by: Joern Engel <[email protected]>
---
drivers/block/blockconsole.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index 01ddbc6..65f8ace 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -229,8 +229,8 @@ static void bcon_advance_write_bytes(struct blockconsole *bc, int bytes)
bc->write_bytes += bytes;
if (bc->write_bytes >= bc->max_bytes) {
bc->write_bytes = 0;
- bcon_init_first_page(bc);
bc->round++;
+ bcon_init_first_page(bc);
}
}

--
1.7.10.4

2013-05-09 22:14:47

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 14/14] netconsole: s/syslogd/cancd/ in documentation

Using syslogd to capture netconsole is known to be broken, see for
example https://bugzilla.redhat.com/show_bug.cgi?id=432160 or any of the
many other bug reports. We should not advertise it, much less as a
first choice. The fact that syslogd tends to initially work makes it
worse, as that creates false hope.

Cancd is a syslog-for-netconsole of sorts and in my experience works
better than any alternative for non-trivial setups, i.e. more than a
single machine sending netconsole traffic.

Since my hacked-up version of cancd is no longer compatible with
Oracle's original, I linked to both.

Signed-off-by: Joern Engel <[email protected]>
---
Documentation/networking/netconsole.txt | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt
index 2e9e0ae2..c59d2bf 100644
--- a/Documentation/networking/netconsole.txt
+++ b/Documentation/networking/netconsole.txt
@@ -54,9 +54,7 @@ address.
The remote host has several options to receive the kernel messages,
for example:

-1) syslogd
-
-2) netcat
+1) netcat

On distributions using a BSD-based netcat version (e.g. Fedora,
openSUSE and Ubuntu) the listening port must be specified without
@@ -65,10 +63,20 @@ for example:
'nc -u -l -p <port>' / 'nc -u -l <port>' or
'netcat -u -l -p <port>' / 'netcat -u -l <port>'

-3) socat
+2) socat

'socat udp-recv:<port> -'

+3) cancd
+
+ A daemon written specifically for netconsole that is good at capturing
+ output from many machines. Using netcat for several machines either
+ interleaves output from all machines or requires the use of per-machine
+ ports.
+
+ https://git.kernel.org/cgit/linux/kernel/git/joern/cancd.git/
+ https://oss.oracle.com/projects/cancd/
+
Dynamic reconfiguration:
========================

--
1.7.10.4

2013-05-09 22:14:53

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 0/9] Add blockconsole version 1.1 (try 2)

Blockconsole is a console driver very roughly similar to netconsole.
Instead of sending messages out via UDP, they are written to a block
device. Typically a USB stick is chosen, although in principle any
block device will do.

In most cases blockconsole is useful where netconsole is not, i.e.
single machines without network access or without an accessable
netconsole capture server. When using both blockconsole and
netconsole, I have found netconsole to sometimes create a mess under
high message load (sysrq-t, etc.) while blockconsole does not.

Most importantly, a number of bugs were identified and fixed that
would have been unexplained machine reboots without blockconsole.

More highlights:
* reasonably small and self-contained code,
* some 100+ machine years of runtime,
* nice tutorial with a 30-sec guide for the impatient.

Special thanks to Borislav Petkov for many improvements and kicking my
behind to provide a proper git tree and resend patches.

A number of cleanup patches could be folded into the main patch, but I
decided not to mess with git history and leave any further mistakes
for the world to laugh at:
git://git.kernel.org/pub/scm/linux/kernel/git/joern/bcon2.git

Joern Engel (8):
do_mounts: constify name_to_dev_t parameter
add blockconsole version 1.1
printk: add CON_ALLDATA console flag
netconsole: use CON_ALLDATA
blockconsole: use CON_ALLDATA
bcon: add a release work struct
bcon: check for hdparm in bcon_tail
bcon: remove version 1.0 support

Takashi Iwai (1):
blockconsole: Allow to pass a device file path to bcon_tail

Documentation/block/blockconsole.txt | 94 ++++
Documentation/block/blockconsole/bcon_tail | 82 +++
Documentation/block/blockconsole/mkblockconsole | 29 ++
block/partitions/Makefile | 1 +
block/partitions/blockconsole.c | 22 +
block/partitions/check.c | 3 +
block/partitions/check.h | 3 +
drivers/block/Kconfig | 6 +
drivers/block/Makefile | 1 +
drivers/block/blockconsole.c | 617 +++++++++++++++++++++++
drivers/net/netconsole.c | 2 +-
include/linux/blockconsole.h | 7 +
include/linux/console.h | 1 +
include/linux/mount.h | 2 +-
init/do_mounts.c | 2 +-
kernel/printk.c | 5 +-
16 files changed, 872 insertions(+), 5 deletions(-)
create mode 100644 Documentation/block/blockconsole.txt
create mode 100755 Documentation/block/blockconsole/bcon_tail
create mode 100755 Documentation/block/blockconsole/mkblockconsole
create mode 100644 block/partitions/blockconsole.c
create mode 100644 drivers/block/blockconsole.c
create mode 100644 include/linux/blockconsole.h

--
1.7.10.4

2013-05-09 22:11:13

by Jörn Engel

[permalink] [raw]
Subject: [PATCH 12/14] blockconsole: Mark a local work struct static

From: Takashi Iwai <[email protected]>

Signed-off-by: Takashi Iwai <[email protected]>
---
drivers/block/blockconsole.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/blockconsole.c b/drivers/block/blockconsole.c
index 40cc96e..01ddbc6 100644
--- a/drivers/block/blockconsole.c
+++ b/drivers/block/blockconsole.c
@@ -564,7 +564,7 @@ static void bcon_do_add(struct work_struct *work)
}
}

-DECLARE_WORK(bcon_add_work, bcon_do_add);
+static DECLARE_WORK(bcon_add_work, bcon_do_add);

void bcon_add(const char *name)
{
--
1.7.10.4

2013-05-22 20:48:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 01/14] do_mounts: constify name_to_dev_t parameter

On Thu, 9 May 2013 16:42:57 -0400 Joern Engel <[email protected]> wrote:

> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -74,6 +74,6 @@ extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
> extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
> extern void mark_mounts_for_expiry(struct list_head *mounts);
>
> -extern dev_t name_to_dev_t(char *name);
> +extern dev_t name_to_dev_t(const char *name);
>
> #endif /* _LINUX_MOUNT_H */
> diff --git a/init/do_mounts.c b/init/do_mounts.c
> index 1d1b634..da96f85 100644
> --- a/init/do_mounts.c
> +++ b/init/do_mounts.c
> @@ -202,7 +202,7 @@ done:
> * bangs.
> */
>
> -dev_t name_to_dev_t(char *name)
> +dev_t name_to_dev_t(const char *name)
> {
> char s[32];
> char *p;

A changelog would be nice. There are 1.1 billion places in the kernel
where we could make such a change - why was this one made?

2013-05-22 20:48:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mpt2sas/mpt3sas: prevent double free on error path

On Thu, 9 May 2013 16:42:58 -0400 Joern Engel <[email protected]> wrote:

> I noticed this one when list_del was called with poisoned list
> pointers, but the real problem is a double-free (and a use-after-free
> just before that).
>
> Both _scsih_probe_boot_devices() and _scsih_sas_device_add() put the
> sas_device onto a list, thereby giving up control. Next they call
> mpt2sas_transport_port_add() and will list_del and free the object on
> errors. If some other function already did the list_del and free, it
> will happen again.
>
> This patch adds reference counting to prevent the double free. One
> reference count goes to the caller of mpt2sas_transport_port_add(), the
> second to the list. Whoever removes the object from the list gets to
> drop one reference count. _scsih_probe_boot_devices() and
> _scsih_sas_device_add() get a second reference count to ensure the
> object is not freed while they are still accessing it.
>
> To prevent the double list_del(), I changed the code to list_del_init()
> and added a list_empty() check before that. Since the
> list_empty/list_del_init is always called under a lock, this should be
> safe.
>
> I hate the complexity this patch adds, but see no alternative to it.

It's regrettable that this bugfix is joined at the hip with a new
kernel feature.

> index 543d8d6..ceb7d41 100644
> --- a/drivers/scsi/mpt2sas/mpt2sas_base.h
> +++ b/drivers/scsi/mpt2sas/mpt2sas_base.h
> @@ -367,6 +367,7 @@ struct _sas_device {
> u16 slot;
> u8 phy;
> u8 responding;
> + struct kref kref;
> };
>
> /**
> diff --git a/drivers/scsi/mpt2sas/mpt2sas_scsih.c b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> index c6bdc92..217660c 100644
> --- a/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> +++ b/drivers/scsi/mpt2sas/mpt2sas_scsih.c
> @@ -570,6 +570,19 @@ _scsih_sas_device_find_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
> return NULL;
> }
>
> +static void free_sas_device(struct kref *kref)
> +{
> + struct _sas_device *sas_device = container_of(kref, struct _sas_device,
> + kref);

yuk. Easily fixed:

struct _sas_device *sas_device;

sas_device = container_of(kref, struct _sas_device, kref);


> + kfree(sas_device);
> +}
> +
> +static void put_sas_device(struct _sas_device *sas_device)
> +{
> + kref_put(&sas_device->kref, free_sas_device);
> +}
> +
> /**
> * _scsih_sas_device_remove - remove sas_device from list.
> * @ioc: per adapter object
> @@ -583,14 +596,19 @@ _scsih_sas_device_remove(struct MPT2SAS_ADAPTER *ioc,
> struct _sas_device *sas_device)
> {
> unsigned long flags;
> + int was_on_list = 0;
>
> if (!sas_device)
> return;
>
> spin_lock_irqsave(&ioc->sas_device_lock, flags);
> - list_del(&sas_device->list);
> - kfree(sas_device);
> + if (!list_empty(&sas_device->list)) {
> + list_del_init(&sas_device->list);
> + was_on_list = 1;
> + }
> spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> + if (was_on_list)
> + put_sas_device(sas_device);
> }

This looks a bit awkward.

- list_del() on an empty list_head is acceptable, although
inefficient when used in a high-frequency codepath (whcih this
isn't). So the code could be made simpler-looking.

- afaict put_sas_device() can be called from under that lock, so why
bother with the was_on_list deferral thing?

- it's strange that this function can ever be called when the device
is not on the list. Perhaps somethnig else is screwy?


> @@ -612,6 +630,8 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
> "(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
> sas_device->handle, (unsigned long long)sas_device->sas_address));
>
> + /* Get an extra refcount... */

>From the annals of unuseful comments ;)

> + kref_get(&sas_device->kref);
> spin_lock_irqsave(&ioc->sas_device_lock, flags);
> list_add_tail(&sas_device->list, &ioc->sas_device_list);
> spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> @@ -630,6 +650,12 @@ _scsih_sas_device_add(struct MPT2SAS_ADAPTER *ioc,
> sas_device->sas_address_parent);
> _scsih_sas_device_remove(ioc, sas_device);
> }
> + /*
> + * ...and drop it again. If an error already happend, this is the

"happened"

> + * final put and we free the object now. Otherwise whoever removes
> + * the object from the list will do the final put and free.
> + */
> + put_sas_device(sas_device);
> }

But if this kref_get/put_sas_device pair were not here, this function's
_scsih_sas_device_remove() could free the device for us, if that
was_on_list oddity wasn't there.

Perhaps it would help to write down the refcounting rules. "one for
the object, one for the presence of a list"?

> /**
> @@ -5270,6 +5296,7 @@ _scsih_add_device(struct MPT2SAS_ADAPTER *ioc, u16 handle, u8 phy_num, u8 is_pd)
> return -1;
> }
>
> + kref_init(&sas_device->kref);
> sas_device->handle = handle;
> if (_scsih_get_sas_address(ioc, le16_to_cpu
> (sas_device_pg0.ParentDevHandle),
> @@ -5341,7 +5368,7 @@ _scsih_remove_device(struct MPT2SAS_ADAPTER *ioc,
> "handle(0x%04x), sas_addr(0x%016llx)\n", ioc->name, __func__,
> sas_device->handle, (unsigned long long)
> sas_device->sas_address));
> - kfree(sas_device);
> + put_sas_device(sas_device);
> }
> /**
> * _scsih_device_remove_by_handle - removing device object by handle
> @@ -5355,16 +5382,21 @@ _scsih_device_remove_by_handle(struct MPT2SAS_ADAPTER *ioc, u16 handle)
> {
> struct _sas_device *sas_device;
> unsigned long flags;
> + int was_on_list = 0;
>
> if (ioc->shost_recovery)
> return;
>
> spin_lock_irqsave(&ioc->sas_device_lock, flags);
> sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
> - if (sas_device)
> - list_del(&sas_device->list);
> + if (sas_device) {
> + if (!list_empty(&sas_device->list)) {
> + list_del_init(&sas_device->list);
> + was_on_list = 1;
> + }
> + }
> spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> - if (sas_device)
> + if (was_on_list)
> _scsih_remove_device(ioc, sas_device);
> }

hm, duplication. Should this call _scsih_sas_device_remove()?


> @@ -5381,6 +5413,7 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
> {
> struct _sas_device *sas_device;
> unsigned long flags;
> + int was_on_list = 0;
>
> if (ioc->shost_recovery)
> return;
> @@ -5388,10 +5421,14 @@ mpt2sas_device_remove_by_sas_address(struct MPT2SAS_ADAPTER *ioc,
> spin_lock_irqsave(&ioc->sas_device_lock, flags);
> sas_device = mpt2sas_scsih_sas_device_find_by_sas_address(ioc,
> sas_address);
> - if (sas_device)
> - list_del(&sas_device->list);
> + if (sas_device) {
> + if (!list_empty(&sas_device->list)) {
> + list_del_init(&sas_device->list);
> + was_on_list = 1;
> + }
> + }
> spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> - if (sas_device)
> + if (was_on_list)
> _scsih_remove_device(ioc, sas_device);
> }

eek, there it is again!

> ...
>
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -569,6 +569,19 @@ _scsih_sas_device_find_by_handle(struct MPT3SAS_ADAPTER *ioc, u16 handle)
> return NULL;
> }
>
> +static void free_sas_device(struct kref *kref)
> +{
> + struct _sas_device *sas_device = container_of(kref, struct _sas_device,
> + kref);

A ditto here.

> + kfree(sas_device);
> +}
> +
> +static void put_sas_device(struct _sas_device *sas_device)
> +{
> + kref_put(&sas_device->kref, free_sas_device);
> +}
> +
> /**
> * _scsih_sas_device_remove - remove sas_device from list.
> * @ioc: per adapter object
> @@ -582,14 +595,19 @@ _scsih_sas_device_remove(struct MPT3SAS_ADAPTER *ioc,
> struct _sas_device *sas_device)
> {
> unsigned long flags;
> + int was_on_list = 0;
>
> if (!sas_device)
> return;
>
> spin_lock_irqsave(&ioc->sas_device_lock, flags);
> - list_del(&sas_device->list);
> - kfree(sas_device);
> + if (!list_empty(&sas_device->list)) {
> + list_del_init(&sas_device->list);
> + was_on_list = 1;
> + }
> spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> + if (was_on_list)
> + put_sas_device(sas_device);
> }

etcetera.


I dunno, it all seems quite duplicative and awkward. Is there
something exceptional about the problem space here, or is it that the
code is just screwed up and this fix doesn't really unscrew it?

2013-05-22 20:48:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/14] add blockconsole version 1.1

On Thu, 9 May 2013 16:43:00 -0400 Joern Engel <[email protected]> wrote:

> Console driver similar to netconsole, except it writes to a block
> device. Can be useful in a setup where netconsole, for whatever
> reasons, is impractical.

It would be useful to provide a description of how the code works. How
it avoids sleeping memory allocations on the disk-writing path. What
the kernel thread does. General overview of data flow. How we avoid
recursion when it's called from within block/driver/etc code paths.
What's all that special-case handling of panic() for.

>
> ...
>
> Documentation/block/blockconsole.txt | 94 ++++
> Documentation/block/blockconsole/bcon_tail | 62 +++
> Documentation/block/blockconsole/mkblockconsole | 29 ++

We really need somewhere better to put userspace tools.

> block/partitions/Makefile | 1 +
> block/partitions/blockconsole.c | 22 +
> block/partitions/check.c | 3 +
> block/partitions/check.h | 3 +
> drivers/block/Kconfig | 6 +
> drivers/block/Makefile | 1 +
> drivers/block/blockconsole.c | 621 +++++++++++++++++++++++
>
> ...
>
> +#define CACHE_MASK (CACHE_SIZE - 1)
> +#define SECTOR_SHIFT (9)
> +#define SECTOR_SIZE (1 << SECTOR_SHIFT)
> +#define SECTOR_MASK (~(SECTOR_SIZE-1))
> +#define PG_SECTOR_MASK ((PAGE_SIZE >> 9) - 1)
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

This normally goes before the #includes, making the undef unnecessary.

> +struct bcon_bio {
> + struct bio bio;
> + struct bio_vec bvec;
> + void *sector;
> + int in_flight;
> +};
>
> ...
>
> +static int sync_read(struct blockconsole *bc, u64 ofs)
> +{
> + struct bio bio;
> + struct bio_vec bio_vec;
> + struct completion complete;
> +
> + bio_init(&bio);
> + bio.bi_io_vec = &bio_vec;
> + bio_vec.bv_page = bc->pages;
> + bio_vec.bv_len = SECTOR_SIZE;
> + bio_vec.bv_offset = 0;
> + bio.bi_vcnt = 1;
> + bio.bi_idx = 0;
> + bio.bi_size = SECTOR_SIZE;
> + bio.bi_bdev = bc->bdev;
> + bio.bi_sector = ofs >> SECTOR_SHIFT;
> + init_completion(&complete);
> + bio.bi_private = &complete;
> + bio.bi_end_io = request_complete;
> +
> + submit_bio(READ, &bio);
> + wait_for_completion(&complete);
> + return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
> +}

It wouldn't hurt to have a few explanatory comments over these functions.

>
> ...
>
> +static int bcon_convert_old_format(struct blockconsole *bc)
> +{
> + bc->uuid = get_random_int();
> + bc->round = 0;
> + bc->console_bytes = bc->write_bytes = 0;
> + bcon_advance_console_bytes(bc, 0); /* To skip the header */
> + bcon_advance_write_bytes(bc, 0); /* To wrap around, if necessary */
> + bcon_erase_segment(bc);
> + pr_info("converted %s from old format\n", bc->devname);
> + return 0;
> +}

It's strange to have an upgrade-from-old-format feature in something
which hasn't even been merged yet. Can we just ditch all this stuff?

>
> ...
>
> +#define BCON_MAX_ERRORS 10
> +static void bcon_end_io(struct bio *bio, int err)
> +{
> + struct bcon_bio *bcon_bio = container_of(bio, struct bcon_bio, bio);
> + struct blockconsole *bc = bio->bi_private;
> + unsigned long flags;
> +
> + /*
> + * We want to assume the device broken and free this console if
> + * we accumulate too many errors. But if errors are transient,
> + * we also want to forget about them once writes succeed again.
> + * Oh, and we only want to reset the counter if it hasn't reached
> + * the limit yet, so we don't bcon_put() twice from here.
> + */
> + spin_lock_irqsave(&bc->end_io_lock, flags);
> + if (err) {
> + if (bc->error_count++ == BCON_MAX_ERRORS) {
> + pr_info("no longer logging to %s\n", bc->devname);

pr_info() may be too low a severity level?

How does the code handle the obvious recursion concern here?

> + schedule_work(&bc->unregister_work);
> + }
> + } else {
> + if (bc->error_count && bc->error_count < BCON_MAX_ERRORS)
> + bc->error_count = 0;
> + }
> + /*
> + * Add padding (a bunch of spaces and a newline) early so bcon_pad
> + * only has to advance a pointer.
> + */
> + clear_sector(bcon_bio->sector);
> + bcon_bio->in_flight = 0;
> + spin_unlock_irqrestore(&bc->end_io_lock, flags);
> + bcon_put(bc);
> +}
> +
> +static void bcon_writesector(struct blockconsole *bc, int index)
> +{
> + struct bcon_bio *bcon_bio = bc->bio_array + index;
> + struct bio *bio = &bcon_bio->bio;
> +
> + rmb();
> + if (bcon_bio->in_flight)
> + return;
> + bcon_get(bc);
> +
> + bio_init(bio);
> + bio->bi_io_vec = &bcon_bio->bvec;
> + bio->bi_vcnt = 1;
> + bio->bi_size = SECTOR_SIZE;
> + bio->bi_bdev = bc->bdev;
> + bio->bi_private = bc;
> + bio->bi_end_io = bcon_end_io;
> +
> + bio->bi_idx = 0;
> + bio->bi_sector = bc->write_bytes >> 9;
> + bcon_bio->in_flight = 1;
> + wmb();
> + submit_bio(WRITE, bio);
> +}
> +
> +static int bcon_writeback(void *_bc)

So this is actually a kernel thread service loop. The poorly-chosen
name and absence of code comments make this unobvious.

> +{
> + struct blockconsole *bc = _bc;
> + struct sched_param(sp);
> +
> + sp.sched_priority = MAX_RT_PRIO - 1; /* Highest realtime prio */
> + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> + for (;;) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule();
> + if (kthread_should_stop())
> + break;

This looks wrong. The sequence should be

set_current_state(TASK_INTERRUPTIBLE);
if (!kthread_should_stop())
schedule();

to avoid missed-wakeup races.

> + while (bcon_write_sector(bc) != bcon_console_sector(bc)) {
> + bcon_writesector(bc, bcon_write_sector(bc));
> + bcon_advance_write_bytes(bc, SECTOR_SIZE);
> + if (bcon_write_sector(bc) == 0)
> + bcon_erase_segment(bc);
> + }
> + }
> + return 0;
> +}
> +
> +static void bcon_pad(unsigned long data)
> +{
> + struct blockconsole *bc = (void *)data;
> + unsigned int n;
> +
> + /*
> + * We deliberately race against bcon_write here. If we lose the race,
> + * our padding is no longer where we expected it to be, i.e. it is
> + * no longer a bunch of spaces with a newline at the end. There could
> + * not be a newline at all or it could be somewhere in the middle.
> + * Either way, the log corruption is fairly obvious to spot and ignore
> + * for human readers.
> + */
> + n = SECTOR_SIZE - bcon_console_ofs(bc);
> + if (n != SECTOR_SIZE) {
> + bcon_advance_console_bytes(bc, n);
> + wake_up_process(bc->writeback_thread);
> + }
> +}

wake_up_process() from within printk is problematic - it'll deadlock if
the caller is holding certain scheduler locks. That's why
wake_up_klogd() does weird things.

> +static void bcon_write(struct console *console, const char *msg,
> + unsigned int len)
> +{
> + struct blockconsole *bc = container_of(console, struct blockconsole,
> + console);

struct blockconsole *bc;

bc = container_of(console, struct blockconsole, console);

> + unsigned int n;
> + u64 console_bytes;
> + int i;
> +
> + while (len) {
> + console_bytes = bc->console_bytes;
> + i = __bcon_console_sector(console_bytes);
> + rmb();
> + if (bc->bio_array[i].in_flight)
> + break;
> + n = min_t(int, len, SECTOR_SIZE -
> + __bcon_console_ofs(console_bytes));

Yes, the types are a bit random in all this code. A mix of int,
unsigned, u64 in various places, often where one would expect a size_t.

> + memcpy(bc->bio_array[i].sector +
> + __bcon_console_ofs(console_bytes), msg, n);
> + len -= n;
> + msg += n;
> + bcon_advance_console_bytes(bc, n);
> + }
> + wake_up_process(bc->writeback_thread);
> + mod_timer(&bc->pad_timer, jiffies + HZ);
> +}
> +
> +static void bcon_init_bios(struct blockconsole *bc)

This code really needs some comments :(

Oh I see. It's BIOs, not BIOS. Had me fooled there.

> +{
> + int i;
> +
> + for (i = 0; i < SECTOR_COUNT; i++) {
> + int page_index = i >> (PAGE_SHIFT - SECTOR_SHIFT);
> + struct page *page = bc->pages + page_index;
> + struct bcon_bio *bcon_bio = bc->bio_array + i;
> + struct bio_vec *bvec = &bcon_bio->bvec;
> +
> + bcon_bio->in_flight = 0;
> + bcon_bio->sector = page_address(bc->pages + page_index)
> + + SECTOR_SIZE * (i & PG_SECTOR_MASK);
> + clear_sector(bcon_bio->sector);
> + bvec->bv_page = page;
> + bvec->bv_len = SECTOR_SIZE;
> + bvec->bv_offset = SECTOR_SIZE * (i & PG_SECTOR_MASK);
> + }
> +}
>
> ...
>
> +static int blockconsole_panic(struct notifier_block *this, unsigned long event,
> + void *ptr)
> +{
> + struct blockconsole *bc = container_of(this, struct blockconsole,
> + panic_block);

ditto

> + unsigned int n;
> +
> + n = SECTOR_SIZE - bcon_console_ofs(bc);
> + if (n != SECTOR_SIZE)
> + bcon_advance_console_bytes(bc, n);
> + bcon_writeback(bc);
> + return NOTIFY_DONE;
> +}
> +
> +static int bcon_create(const char *devname)
> +{
> + const fmode_t mode = FMODE_READ | FMODE_WRITE;
> + struct blockconsole *bc;
> + int err;
> +
> + bc = kzalloc(sizeof(*bc), GFP_KERNEL);
> + if (!bc)
> + return -ENOMEM;
> + memset(bc->devname, ' ', sizeof(bc->devname));
> + strlcpy(bc->devname, devname, sizeof(bc->devname));
> + spin_lock_init(&bc->end_io_lock);
> + strcpy(bc->console.name, "bcon");
> + bc->console.flags = CON_PRINTBUFFER | CON_ENABLED;
> + bc->console.write = bcon_write;
> + bc->bdev = blkdev_get_by_path(devname, mode, NULL);
> +#ifndef MODULE
> + if (IS_ERR(bc->bdev)) {
> + dev_t devt = name_to_dev_t(devname);
> + if (devt)
> + bc->bdev = blkdev_get_by_dev(devt, mode, NULL);
> + }
> +#endif
> + if (IS_ERR(bc->bdev))
> + goto out;
> + bc->pages = alloc_pages(GFP_KERNEL, 8);
> + if (!bc->pages)
> + goto out;
> + bc->zero_page = alloc_pages(GFP_KERNEL, 0);
> + if (!bc->zero_page)
> + goto out1;
> + bcon_init_bios(bc);
> + bcon_init_zero_bio(bc);
> + setup_timer(&bc->pad_timer, bcon_pad, (unsigned long)bc);
> + bc->max_bytes = bc->bdev->bd_inode->i_size & ~CACHE_MASK;
> + err = bcon_find_end_of_log(bc);
> + if (err)
> + goto out2;
> + kref_init(&bc->kref); /* This reference gets freed on errors */
> + bc->writeback_thread = kthread_run(bcon_writeback, bc, "bcon_%s",
> + devname);
> + if (IS_ERR(bc->writeback_thread))
> + goto out2;

Should propagate the error, not assume ENOMEM. Minor.

> + INIT_WORK(&bc->unregister_work, bcon_unregister);
> + register_console(&bc->console);
> + bc->panic_block.notifier_call = blockconsole_panic;
> + bc->panic_block.priority = INT_MAX;
> + atomic_notifier_chain_register(&panic_notifier_list, &bc->panic_block);
> + pr_info("now logging to %s at %llx\n", devname, bc->console_bytes >> 20);
> +
> + return 0;
> +
> +out2:
> + __free_pages(bc->zero_page, 0);
> +out1:
> + __free_pages(bc->pages, 8);
> +out:
> + kfree(bc);
> + /* Not strictly correct, be the caller doesn't care */
> + return -ENOMEM;
> +}
> +
> +static void bcon_create_fuzzy(const char *name)

comments, comments, comments. What's this do and why does it do it and
why does it make assumptions about userspace namespace configuration.

> +{
> + char *longname;
> + int err;
> +
> + err = bcon_create(name);
> + if (err) {
> + longname = kzalloc(strlen(name) + 6, GFP_KERNEL);
> + if (!longname)
> + return;
> + strcpy(longname, "/dev/");
> + strcat(longname, name);

kasprintf()?

> + bcon_create(longname);
> + kfree(longname);
> + }
> +}
> +
> +static DEFINE_SPINLOCK(device_lock);
> +static char scanned_devices[80];
> +
> +static void bcon_do_add(struct work_struct *work)
> +{
> + char local_devices[80], *name, *remainder = local_devices;
> +
> + spin_lock(&device_lock);
> + memcpy(local_devices, scanned_devices, sizeof(local_devices));
> + memset(scanned_devices, 0, sizeof(scanned_devices));
> + spin_unlock(&device_lock);
> +
> + while (remainder && remainder[0]) {
> + name = strsep(&remainder, ",");
> + bcon_create_fuzzy(name);
> + }
> +}
> +
> +DECLARE_WORK(bcon_add_work, bcon_do_add);
> +
> +void bcon_add(const char *name)
> +{
> + /*
> + * We add each name to a small static buffer and ask for a workqueue
> + * to go pick it up asap. Once it is picked up, the buffer is empty
> + * again, so hopefully it will suffice for all sane users.
> + */
> + spin_lock(&device_lock);
> + if (scanned_devices[0])
> + strncat(scanned_devices, ",", sizeof(scanned_devices));
> + strncat(scanned_devices, name, sizeof(scanned_devices));
> + spin_unlock(&device_lock);
> + schedule_work(&bcon_add_work);
> +}

What's all this do?

>
> ...
>

2013-05-22 20:48:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 06/14] bcon: add a release work struct

On Thu, 9 May 2013 16:43:04 -0400 Joern Engel <[email protected]> wrote:

> The final bcon_put() can be called from atomic context, by way of
> bio_endio(). In that case we would sleep in invalidate_mapping_pages(),
> with the usual unhappy results.
>
> In nearly a year of production use, I have only seen a matching
> backtrace once. There was a second known issue that could be reproduced
> by "yes h > /proc/sysrq-trigger" and concurrently pulling and replugging
> the blockconsole device. It took be somewhere around 30 pulls and sore
> thumbs to reproduce and I never found the time to get to the bottom of
> it. Quite likely the two issues are identical.

This fix should be folded into the base patch?

2013-05-22 20:48:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/14] bcon: check for hdparm in bcon_tail

On Thu, 9 May 2013 16:43:05 -0400 Joern Engel <[email protected]> wrote:

> ...
>
> --- a/Documentation/block/blockconsole/bcon_tail
> +++ b/Documentation/block/blockconsole/bcon_tail
> @@ -14,6 +14,11 @@ if [ -z "$(which lsscsi)" ]; then
> exit 1
> fi
>
> +if [ -z "$(which hdparm)" ]; then
> + echo "You need to install the hdparm package on your distro."
> + exit 1
> +fi
> +
> end_of_log() {
> DEV=$1
> UUID=`head -c40 $DEV|tail -c8`

Fold this one as well?

2013-05-22 20:48:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/14] bcon: remove version 1.0 support

On Thu, 9 May 2013 16:43:07 -0400 Joern Engel <[email protected]> wrote:

> Very few machines ever ran with 1.0 format and by now I doubt whether a
> single one still does. No need to carry that code along.

aw geeze. Can we please clean up the patch series?

2013-05-22 21:19:31

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 09/14] bcon: remove version 1.0 support

On Wed, 22 May 2013 13:48:37 -0700, Andrew Morton wrote:
> On Thu, 9 May 2013 16:43:07 -0400 Joern Engel <[email protected]> wrote:
>
> > Very few machines ever ran with 1.0 format and by now I doubt whether a
> > single one still does. No need to carry that code along.
>
> aw geeze. Can we please clean up the patch series?

People started using my git tree, so I decided to keep it stable and
not fold fixes into the main patch. That admittedly sucks for review.

Would you like me to create a single patch for review? And after
that, do you want to carry a single patch in your series or should I
keep my tree and try to get that merged? Either way works for me.

Jörn

--
Ninety percent of everything is crap.
-- Sturgeon's Law

2013-05-22 21:21:10

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] mpt2sas/mpt3sas: prevent double free on error path

On Wed, 22 May 2013 13:48:17 -0700, Andrew Morton wrote:
>
> It's regrettable that this bugfix is joined at the hip with a new
> kernel feature.

Not merged at the hip, I merely ended up reusing an old temp directory
for the new patch series (hips for brains?). Ignore this patch.

Jörn

--
No art, however minor, demands less than total dedication if you want
to excel in it.
-- Leon Battista Alberti

2013-05-22 21:25:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/14] bcon: remove version 1.0 support

On Wed, 22 May 2013 15:51:09 -0400 J__rn Engel <[email protected]> wrote:

> On Wed, 22 May 2013 13:48:37 -0700, Andrew Morton wrote:
> > On Thu, 9 May 2013 16:43:07 -0400 Joern Engel <[email protected]> wrote:
> >
> > > Very few machines ever ran with 1.0 format and by now I doubt whether a
> > > single one still does. No need to carry that code along.
> >
> > aw geeze. Can we please clean up the patch series?
>
> People started using my git tree, so I decided to keep it stable and
> not fold fixes into the main patch. That admittedly sucks for review.
>
> Would you like me to create a single patch for review? And after
> that, do you want to carry a single patch in your series or should I
> keep my tree and try to get that merged? Either way works for me.
>

Well I do think a significant amount of change is needed, in particular
some more use of the /* operator.

Can you set up a new tree for that and for linux-next and leave the
existing one in place?

2013-05-22 21:26:46

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 01/14] do_mounts: constify name_to_dev_t parameter

On Wed, 22 May 2013 13:48:12 -0700, Andrew Morton wrote:
> On Thu, 9 May 2013 16:42:57 -0400 Joern Engel <[email protected]> wrote:
>
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -74,6 +74,6 @@ extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
> > extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
> > extern void mark_mounts_for_expiry(struct list_head *mounts);
> >
> > -extern dev_t name_to_dev_t(char *name);
> > +extern dev_t name_to_dev_t(const char *name);
> >
> > #endif /* _LINUX_MOUNT_H */
> > diff --git a/init/do_mounts.c b/init/do_mounts.c
> > index 1d1b634..da96f85 100644
> > --- a/init/do_mounts.c
> > +++ b/init/do_mounts.c
> > @@ -202,7 +202,7 @@ done:
> > * bangs.
> > */
> >
> > -dev_t name_to_dev_t(char *name)
> > +dev_t name_to_dev_t(const char *name)
> > {
> > char s[32];
> > char *p;
>
> A changelog would be nice. There are 1.1 billion places in the kernel
> where we could make such a change - why was this one made?

I happened to hit this, got a compiler warning and decided that the
proper fix was the patch above. The alternative would have been to
cast away the const in users of name_to_dev_t(). I can add some
comment, sure.

Jörn

--
It is better to die of hunger having lived without grief and fear,
than to live with a troubled spirit amid abundance.
-- Epictetus

2013-05-22 21:29:05

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 09/14] bcon: remove version 1.0 support

On Wed, 22 May 2013 14:25:32 -0700, Andrew Morton wrote:
>
> Well I do think a significant amount of change is needed, in particular
> some more use of the /* operator.

Ack.

> Can you set up a new tree for that and for linux-next and leave the
> existing one in place?

Will do. May take a few days, this is a busy time.

Jörn

--
Fancy algorithms are slow when n is small, and n is usually small.
Fancy algorithms have big constants. Until you know that n is
frequently going to be big, don't get fancy.
-- Rob Pike

2013-05-22 22:32:41

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 02/14] add blockconsole version 1.1

On Wed, 22 May 2013 13:48:22 -0700, Andrew Morton wrote:
> On Thu, 9 May 2013 16:43:00 -0400 Joern Engel <[email protected]> wrote:
>
> > Console driver similar to netconsole, except it writes to a block
> > device. Can be useful in a setup where netconsole, for whatever
> > reasons, is impractical.
>
> It would be useful to provide a description of how the code works. How
> it avoids sleeping memory allocations on the disk-writing path. What
> the kernel thread does. General overview of data flow. How we avoid
> recursion when it's called from within block/driver/etc code paths.
> What's all that special-case handling of panic() for.

Documentation/block/blockconsole.txt has some of this, but not all.
Will add more inline documentation.

> > Documentation/block/blockconsole.txt | 94 ++++
> > Documentation/block/blockconsole/bcon_tail | 62 +++
> > Documentation/block/blockconsole/mkblockconsole | 29 ++
>
> We really need somewhere better to put userspace tools.

Agreed. It started as "use this horrible code as a bad example" and
turned into a repository for userspace code, which it should have
been.

Do you have an opinion on tools/ vs. a standalong git tree for the
tools?

> > +#define CACHE_MASK (CACHE_SIZE - 1)
> > +#define SECTOR_SHIFT (9)
> > +#define SECTOR_SIZE (1 << SECTOR_SHIFT)
> > +#define SECTOR_MASK (~(SECTOR_SIZE-1))
> > +#define PG_SECTOR_MASK ((PAGE_SIZE >> 9) - 1)
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> This normally goes before the #includes, making the undef unnecessary.

Will do.

> It wouldn't hurt to have a few explanatory comments over these functions.

Sure.

> > +static int bcon_convert_old_format(struct blockconsole *bc)
> > +{
> > + bc->uuid = get_random_int();
> > + bc->round = 0;
> > + bc->console_bytes = bc->write_bytes = 0;
> > + bcon_advance_console_bytes(bc, 0); /* To skip the header */
> > + bcon_advance_write_bytes(bc, 0); /* To wrap around, if necessary */
> > + bcon_erase_segment(bc);
> > + pr_info("converted %s from old format\n", bc->devname);
> > + return 0;
> > +}
>
> It's strange to have an upgrade-from-old-format feature in something
> which hasn't even been merged yet. Can we just ditch all this stuff?

Done in a later patch, as you noticed.

> > +#define BCON_MAX_ERRORS 10
> > +static void bcon_end_io(struct bio *bio, int err)
> > +{
> > + struct bcon_bio *bcon_bio = container_of(bio, struct bcon_bio, bio);
> > + struct blockconsole *bc = bio->bi_private;
> > + unsigned long flags;
> > +
> > + /*
> > + * We want to assume the device broken and free this console if
> > + * we accumulate too many errors. But if errors are transient,
> > + * we also want to forget about them once writes succeed again.
> > + * Oh, and we only want to reset the counter if it hasn't reached
> > + * the limit yet, so we don't bcon_put() twice from here.
> > + */
> > + spin_lock_irqsave(&bc->end_io_lock, flags);
> > + if (err) {
> > + if (bc->error_count++ == BCON_MAX_ERRORS) {
> > + pr_info("no longer logging to %s\n", bc->devname);
>
> pr_info() may be too low a severity level?

There is no distinction between "someone pulled the usb device" and
"broken cable". Blockconsole will continue writing and retire the
device if errors accumulate. The code is deliberately stupid, because
stupid code tends to be more robust and devoid of strange
corner-cases.

> How does the code handle the obvious recursion concern here?

I think your next comment answers that.

> > +static int bcon_writeback(void *_bc)
>
> So this is actually a kernel thread service loop. The poorly-chosen
> name and absence of code comments make this unobvious.
>
> > +{
> > + struct blockconsole *bc = _bc;
> > + struct sched_param(sp);
> > +
> > + sp.sched_priority = MAX_RT_PRIO - 1; /* Highest realtime prio */
> > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> > + for (;;) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + schedule();
> > + if (kthread_should_stop())
> > + break;
>
> This looks wrong. The sequence should be
>
> set_current_state(TASK_INTERRUPTIBLE);
> if (!kthread_should_stop())
> schedule();
>
> to avoid missed-wakeup races.

Ack. I may have copied that from somewhere else. Will see if there
are any other bad examples in the kernel.

> > +static void bcon_pad(unsigned long data)
> > +{
> > + struct blockconsole *bc = (void *)data;
> > + unsigned int n;
> > +
> > + /*
> > + * We deliberately race against bcon_write here. If we lose the race,
> > + * our padding is no longer where we expected it to be, i.e. it is
> > + * no longer a bunch of spaces with a newline at the end. There could
> > + * not be a newline at all or it could be somewhere in the middle.
> > + * Either way, the log corruption is fairly obvious to spot and ignore
> > + * for human readers.
> > + */
> > + n = SECTOR_SIZE - bcon_console_ofs(bc);
> > + if (n != SECTOR_SIZE) {
> > + bcon_advance_console_bytes(bc, n);
> > + wake_up_process(bc->writeback_thread);
> > + }
> > +}
>
> wake_up_process() from within printk is problematic - it'll deadlock if
> the caller is holding certain scheduler locks. That's why
> wake_up_klogd() does weird things.

Will take a look. My lame answer is "but I am running this on X
machines and it seems to work." Would be nice if deadlocks were not
just rare in practice, but impossible - to the extend one can prove
anything in a codebase the size of the kernel.

> > +static void bcon_write(struct console *console, const char *msg,
> > + unsigned int len)
> > +{
> > + struct blockconsole *bc = container_of(console, struct blockconsole,
> > + console);
>
> struct blockconsole *bc;
>
> bc = container_of(console, struct blockconsole, console);

Ack.

> > + unsigned int n;
> > + u64 console_bytes;
> > + int i;
> > +
> > + while (len) {
> > + console_bytes = bc->console_bytes;
> > + i = __bcon_console_sector(console_bytes);
> > + rmb();
> > + if (bc->bio_array[i].in_flight)
> > + break;
> > + n = min_t(int, len, SECTOR_SIZE -
> > + __bcon_console_ofs(console_bytes));
>
> Yes, the types are a bit random in all this code. A mix of int,
> unsigned, u64 in various places, often where one would expect a size_t.

Will take a look.

> > + memcpy(bc->bio_array[i].sector +
> > + __bcon_console_ofs(console_bytes), msg, n);
> > + len -= n;
> > + msg += n;
> > + bcon_advance_console_bytes(bc, n);
> > + }
> > + wake_up_process(bc->writeback_thread);
> > + mod_timer(&bc->pad_timer, jiffies + HZ);
> > +}
> > +
> > +static void bcon_init_bios(struct blockconsole *bc)
>
> This code really needs some comments :(
>
> Oh I see. It's BIOs, not BIOS. Had me fooled there.

But, but, but isn't this obvious to someone who has just written said
function? It has been a year now, so my own definition of obvious may
have changed as well.

> > + if (IS_ERR(bc->writeback_thread))
> > + goto out2;
>
> Should propagate the error, not assume ENOMEM. Minor.

Agreed. The error to propagate is actually ENOMEM - until someone
changes the code in one place and fails to notice the other.

> > +static void bcon_create_fuzzy(const char *name)
>
> comments, comments, comments. What's this do and why does it do it and
> why does it make assumptions about userspace namespace configuration.

Ack.

> > +{
> > + char *longname;
> > + int err;
> > +
> > + err = bcon_create(name);
> > + if (err) {
> > + longname = kzalloc(strlen(name) + 6, GFP_KERNEL);
> > + if (!longname)
> > + return;
> > + strcpy(longname, "/dev/");
> > + strcat(longname, name);
>
> kasprintf()?

Much nicer! Will do.

> > +void bcon_add(const char *name)
> > +{
> > + /*
> > + * We add each name to a small static buffer and ask for a workqueue
> > + * to go pick it up asap. Once it is picked up, the buffer is empty
> > + * again, so hopefully it will suffice for all sane users.
> > + */
> > + spin_lock(&device_lock);
> > + if (scanned_devices[0])
> > + strncat(scanned_devices, ",", sizeof(scanned_devices));
> > + strncat(scanned_devices, name, sizeof(scanned_devices));
> > + spin_unlock(&device_lock);
> > + schedule_work(&bcon_add_work);
> > +}
>
> What's all this do?

Work around init ordering problems. Quite likely there is a much
nicer way to this that I should know about and don't.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall

2013-05-22 22:43:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/14] add blockconsole version 1.1

On Wed, 22 May 2013 17:04:16 -0400 J__rn Engel <[email protected]> wrote:

> > > Documentation/block/blockconsole.txt | 94 ++++
> > > Documentation/block/blockconsole/bcon_tail | 62 +++
> > > Documentation/block/blockconsole/mkblockconsole | 29 ++
> >
> > We really need somewhere better to put userspace tools.
>
> Agreed. It started as "use this horrible code as a bad example" and
> turned into a repository for userspace code, which it should have
> been.
>
> Do you have an opinion on tools/ vs. a standalong git tree for the
> tools?

./tools/blockconsole/ sounds nice. I like maintaining simple userspace
utils in the main kernel tree - it's very easy and efficient for us.

> > > +#define BCON_MAX_ERRORS 10
> > > +static void bcon_end_io(struct bio *bio, int err)
> > > +{
> > > + struct bcon_bio *bcon_bio = container_of(bio, struct bcon_bio, bio);
> > > + struct blockconsole *bc = bio->bi_private;
> > > + unsigned long flags;
> > > +
> > > + /*
> > > + * We want to assume the device broken and free this console if
> > > + * we accumulate too many errors. But if errors are transient,
> > > + * we also want to forget about them once writes succeed again.
> > > + * Oh, and we only want to reset the counter if it hasn't reached
> > > + * the limit yet, so we don't bcon_put() twice from here.
> > > + */
> > > + spin_lock_irqsave(&bc->end_io_lock, flags);
> > > + if (err) {
> > > + if (bc->error_count++ == BCON_MAX_ERRORS) {
> > > + pr_info("no longer logging to %s\n", bc->devname);
> >
> > pr_info() may be too low a severity level?
>
> There is no distinction between "someone pulled the usb device" and
> "broken cable". Blockconsole will continue writing and retire the
> device if errors accumulate. The code is deliberately stupid, because
> stupid code tends to be more robust and devoid of strange
> corner-cases.

I meant should we use pr_err() or pr_emerg() rather than pr_info().
Chances are that pr_info() is being hidden.

> > How does the code handle the obvious recursion concern here?
>
> I think your next comment answers that.

hm, that was clever of me. I meant more generally, if we do a printk()
from within this code how do we prevent arbitrarily deep recursion?

> > > + memcpy(bc->bio_array[i].sector +
> > > + __bcon_console_ofs(console_bytes), msg, n);
> > > + len -= n;
> > > + msg += n;
> > > + bcon_advance_console_bytes(bc, n);
> > > + }
> > > + wake_up_process(bc->writeback_thread);
> > > + mod_timer(&bc->pad_timer, jiffies + HZ);
> > > +}
> > > +
> > > +static void bcon_init_bios(struct blockconsole *bc)
> >
> > This code really needs some comments :(
> >
> > Oh I see. It's BIOs, not BIOS. Had me fooled there.
>
> But, but, but isn't this obvious to someone who has just written said
> function? It has been a year now, so my own definition of obvious may
> have changed as well.

heh. I blame Jens for calling it a "bio".

> > > +void bcon_add(const char *name)
> > > +{
> > > + /*
> > > + * We add each name to a small static buffer and ask for a workqueue
> > > + * to go pick it up asap. Once it is picked up, the buffer is empty
> > > + * again, so hopefully it will suffice for all sane users.
> > > + */
> > > + spin_lock(&device_lock);
> > > + if (scanned_devices[0])
> > > + strncat(scanned_devices, ",", sizeof(scanned_devices));
> > > + strncat(scanned_devices, name, sizeof(scanned_devices));
> > > + spin_unlock(&device_lock);
> > > + schedule_work(&bcon_add_work);
> > > +}
> >
> > What's all this do?
>
> Work around init ordering problems. Quite likely there is a much
> nicer way to this that I should know about and don't.

Well, without adequate description of the problem, nobody can help
solve it!

2013-05-22 23:02:30

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 02/14] add blockconsole version 1.1

On Wed, 22 May 2013 15:43:22 -0700, Andrew Morton wrote:
> On Wed, 22 May 2013 17:04:16 -0400 J__rn Engel <[email protected]> wrote:
>
> > > > Documentation/block/blockconsole.txt | 94 ++++
> > > > Documentation/block/blockconsole/bcon_tail | 62 +++
> > > > Documentation/block/blockconsole/mkblockconsole | 29 ++
> > >
> > > We really need somewhere better to put userspace tools.
> >
> > Agreed. It started as "use this horrible code as a bad example" and
> > turned into a repository for userspace code, which it should have
> > been.
> >
> > Do you have an opinion on tools/ vs. a standalong git tree for the
> > tools?
>
> ./tools/blockconsole/ sounds nice. I like maintaining simple userspace
> utils in the main kernel tree - it's very easy and efficient for us.

tools/blockconsole/ is shall be!

> > > > +#define BCON_MAX_ERRORS 10
> > > > +static void bcon_end_io(struct bio *bio, int err)
> > > > +{
> > > > + struct bcon_bio *bcon_bio = container_of(bio, struct bcon_bio, bio);
> > > > + struct blockconsole *bc = bio->bi_private;
> > > > + unsigned long flags;
> > > > +
> > > > + /*
> > > > + * We want to assume the device broken and free this console if
> > > > + * we accumulate too many errors. But if errors are transient,
> > > > + * we also want to forget about them once writes succeed again.
> > > > + * Oh, and we only want to reset the counter if it hasn't reached
> > > > + * the limit yet, so we don't bcon_put() twice from here.
> > > > + */
> > > > + spin_lock_irqsave(&bc->end_io_lock, flags);
> > > > + if (err) {
> > > > + if (bc->error_count++ == BCON_MAX_ERRORS) {
> > > > + pr_info("no longer logging to %s\n", bc->devname);
> > >
> > > pr_info() may be too low a severity level?
> >
> > There is no distinction between "someone pulled the usb device" and
> > "broken cable". Blockconsole will continue writing and retire the
> > device if errors accumulate. The code is deliberately stupid, because
> > stupid code tends to be more robust and devoid of strange
> > corner-cases.
>
> I meant should we use pr_err() or pr_emerg() rather than pr_info().
> Chances are that pr_info() is being hidden.

Agreed. Trouble is that pr_info() is about right for "someone pulled
the usb device", which pr_err() or pr_emerg() is about right for
"broken hardware". So either choice will be wrong some of the time.

I guess in the end very few people care about debugging blockconsole
failures. Those that run blockconsole tend to care about debugging
something else already. So if you lean one way or the other, you have
a stronger opinion than me.

> > > How does the code handle the obvious recursion concern here?
> >
> > I think your next comment answers that.
>
> hm, that was clever of me. I meant more generally, if we do a printk()
> from within this code how do we prevent arbitrarily deep recursion?

I think in general you have to be careful and think twice before
adding a printk. Which is why there are only two instances so far.
The more interesting case is in functions called from blockconsole (or
netconsole or serial console or...). And here the small set of
functions when writing to the buffer should be safe and the much
larger and complex set of functions for writeback is in the context of
a different thread.

I have some fuzzy memory of this coming up in the past for netconsole.
Wasn't hard to fix, which explains the lack of detailed memory.

> > > > + memcpy(bc->bio_array[i].sector +
> > > > + __bcon_console_ofs(console_bytes), msg, n);
> > > > + len -= n;
> > > > + msg += n;
> > > > + bcon_advance_console_bytes(bc, n);
> > > > + }
> > > > + wake_up_process(bc->writeback_thread);
> > > > + mod_timer(&bc->pad_timer, jiffies + HZ);
> > > > +}
> > > > +
> > > > +static void bcon_init_bios(struct blockconsole *bc)
> > >
> > > This code really needs some comments :(
> > >
> > > Oh I see. It's BIOs, not BIOS. Had me fooled there.
> >
> > But, but, but isn't this obvious to someone who has just written said
> > function? It has been a year now, so my own definition of obvious may
> > have changed as well.
>
> heh. I blame Jens for calling it a "bio".

I wonder what his response is when a recruiter asks him for a recent
bio.

> > > > +void bcon_add(const char *name)
> > > > +{
> > > > + /*
> > > > + * We add each name to a small static buffer and ask for a workqueue
> > > > + * to go pick it up asap. Once it is picked up, the buffer is empty
> > > > + * again, so hopefully it will suffice for all sane users.
> > > > + */
> > > > + spin_lock(&device_lock);
> > > > + if (scanned_devices[0])
> > > > + strncat(scanned_devices, ",", sizeof(scanned_devices));
> > > > + strncat(scanned_devices, name, sizeof(scanned_devices));
> > > > + spin_unlock(&device_lock);
> > > > + schedule_work(&bcon_add_work);
> > > > +}
> > >
> > > What's all this do?
> >
> > Work around init ordering problems. Quite likely there is a much
> > nicer way to this that I should know about and don't.
>
> Well, without adequate description of the problem, nobody can help
> solve it!

Will dig through the dark corners of my memory or reproduce it,
whichever is easier.

Jörn

--
When in doubt, use brute force.
-- Ken Thompson