2022-10-13 19:17:30

by David Hildenbrand

Subject: [PATCH v1] kernel/module: allocate module vmap space after making sure the module is unique

We already make sure to allocate percpu data only after we verified that
the module we're loading hasn't already been loaded and isn't
concurrently getting loaded -- that it's unique.

On big systems (> 400 CPUs and many devices) with KASAN enabled, we're now
facing a similar issue with the module vmap space.

With KASAN_INLINE enabled (resulting in a large module size), plenty of
devices that udev wants to probe, and plenty (> 400) of CPUs that can
carry out that probing concurrently, we can actually run out of module
vmap space and trigger vmap allocation errors:

[ 165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
[ 165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
[ 165.947343] Call Trace:
[ 165.950075] <TASK>
[ 165.952425] dump_stack_lvl+0x57/0x81
[ 165.956532] warn_alloc.cold+0x95/0x18a
[ 165.960836] ? zone_watermark_ok_safe+0x240/0x240
[ 165.966100] ? slab_free_freelist_hook+0x11d/0x1d0
[ 165.971461] ? __get_vm_area_node+0x2af/0x360
[ 165.976341] ? __get_vm_area_node+0x2af/0x360
[ 165.981219] __vmalloc_node_range+0x291/0x560
[ 165.986087] ? __mutex_unlock_slowpath+0x161/0x5e0
[ 165.991447] ? move_module+0x4c/0x630
[ 165.995547] ? vfree_atomic+0xa0/0xa0
[ 165.999647] ? move_module+0x4c/0x630
[ 166.003741] module_alloc+0xe7/0x170
[ 166.007747] ? move_module+0x4c/0x630
[ 166.011840] move_module+0x4c/0x630
[ 166.015751] layout_and_allocate+0x32c/0x560
[ 166.020519] load_module+0x8e0/0x25c0
[ 166.024623] ? layout_and_allocate+0x560/0x560
[ 166.029586] ? kernel_read_file+0x286/0x6b0
[ 166.034269] ? __x64_sys_fspick+0x290/0x290
[ 166.038946] ? userfaultfd_unmap_prep+0x430/0x430
[ 166.044203] ? lock_downgrade+0x130/0x130
[ 166.048698] ? __do_sys_finit_module+0x11a/0x1c0
[ 166.053854] __do_sys_finit_module+0x11a/0x1c0
[ 166.058818] ? __ia32_sys_init_module+0xa0/0xa0
[ 166.063882] ? __seccomp_filter+0x92/0x930
[ 166.068494] do_syscall_64+0x59/0x90
[ 166.072492] ? do_syscall_64+0x69/0x90
[ 166.076679] ? do_syscall_64+0x69/0x90
[ 166.080864] ? do_syscall_64+0x69/0x90
[ 166.085047] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 166.090984] ? lockdep_hardirqs_on+0x79/0x100
[ 166.095855] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Interestingly, when reducing the number of CPUs (nosmt), it works as
expected.

The underlying issue is that we first allocate memory (including module
vmap space) in layout_and_allocate(), and then verify whether the module
is unique in add_unformed_module(). So we end up allocating module vmap
space even though we might not need it -- which is a problem when modules
are big and we can have a lot of concurrent probing of the same set of
modules as on the big system at hand.
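
To make the ordering concrete, here's a heavily condensed, hypothetical
sketch of the pre-patch flow (the real code lives in kernel/module/main.c;
do_remaining_work() is a made-up placeholder for everything that follows
the uniqueness check):

/*
 * Condensed, hypothetical view of the pre-patch load_module() flow.
 * Step 1 already allocates the module vmap space (via move_module() ->
 * module_alloc()); step 2 is the first point where a duplicate load can
 * be detected -- too late to avoid the allocation.
 */
static int load_module_sketch(struct load_info *info, int flags)
{
	struct module *mod;
	int err;

	mod = layout_and_allocate(info, flags);	/* 1. allocates vmap space */
	if (IS_ERR(mod))
		return PTR_ERR(mod);

	err = add_unformed_module(mod);		/* 2. uniqueness check */
	if (err)
		goto free_module;	/* duplicate: the vmap space was wasted */

	return do_remaining_work(mod, info);	/* made-up placeholder */

free_module:
	module_deallocate(mod, info);
	return err;
}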

Unfortunately, we cannot simply add the module to the list earlier, because
move_module() -- which allocates the module vmap space -- essentially
brings the module to life from a temporary one. Adding the temporary
module and replacing it later is also sub-optimal (replacing it would
require synchronizing against RCU) and feels dangerous, given that we
end up copying it.

So instead, add a second list (pending_load_infos) that tracks the modules
(via their load_info) that are unique and are still getting loaded
("pending"), but haven't made it to the actual module list yet. This
shouldn't have a notable runtime overhead when concurrently loading
modules: the new list is expected to usually either be empty or contain
very few entries for a short time.
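
The pattern itself is generic: reserve the name under the lock before
doing any expensive allocation, and make duplicates either wait or bail
out. As a minimal, self-contained user-space sketch of the idea (pthreads
stand in for module_mutex/module_wq, plain string arrays for the two
lists; all names below are invented for illustration):

#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NAMES 16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq = PTHREAD_COND_INITIALIZER;
static const char *pending[MAX_NAMES];	/* names currently being loaded */
static const char *loaded[MAX_NAMES];	/* names that finished loading */

static bool on_list(const char **list, const char *name)
{
	for (int i = 0; i < MAX_NAMES; i++)
		if (list[i] && !strcmp(list[i], name))
			return true;
	return false;
}

static void put_name(const char **list, const char *name)
{
	for (int i = 0; i < MAX_NAMES; i++)
		if (!list[i]) {
			list[i] = name;
			return;
		}
}

static void del_name(const char **list, const char *name)
{
	for (int i = 0; i < MAX_NAMES; i++)
		if (list[i] && !strcmp(list[i], name))
			list[i] = NULL;
}

/* Reserve @name before any expensive allocation; -1 means "duplicate". */
static int add_pending(const char *name)
{
	pthread_mutex_lock(&lock);
	while (on_list(pending, name))	/* someone else is loading it */
		pthread_cond_wait(&wq, &lock);
	if (on_list(loaded, name)) {
		pthread_mutex_unlock(&lock);
		return -1;		/* already loaded, nothing allocated */
	}
	put_name(pending, name);	/* the name is now reserved */
	pthread_mutex_unlock(&lock);
	return 0;
}

/* Drop the reservation; on success the name moves to the loaded list. */
static void finish_pending(const char *name, bool success)
{
	pthread_mutex_lock(&lock);
	del_name(pending, name);
	if (success)
		put_name(loaded, name);
	pthread_cond_broadcast(&wq);	/* wake waiters in add_pending() */
	pthread_mutex_unlock(&lock);
}

A loader calls add_pending() first, performs the expensive allocation
only on success, and ends with finish_pending(); a failed load simply
drops the reservation and wakes any waiters, which is what
remove_pending_load_info() does in the patch below.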

Thanks to Uladzislau for his help verifying that this is not actually a
vmap code issue.

Reported-by: Lin Liu <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
kernel/module/internal.h | 2 +
kernel/module/main.c | 122 +++++++++++++++++++++++++++------------
2 files changed, 88 insertions(+), 36 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 680d980a4fb2..9d5cc9b1d56a 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -76,6 +76,8 @@ struct load_info {
struct {
unsigned int sym, str, mod, vers, info, pcpu;
} index;
+
+ struct list_head next;
};

enum mod_license {
diff --git a/kernel/module/main.c b/kernel/module/main.c
index a4e4d84b6f4e..b473228136eb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -65,10 +65,20 @@
* 2) module_use links,
* 3) mod_tree.addr_min/mod_tree.addr_max.
* (delete and add uses RCU list operations).
+ *
+ * 4) List of pending load infos
*/
DEFINE_MUTEX(module_mutex);
LIST_HEAD(modules);

+/*
+ * Modules (via load_info) that are currently being loaded but cannot be added
+ * to the module list yet are kept in a separate list. This list, combined with
+ * the module list makes sure that modules are unique: a module name has to be
+ * unique across both lists, protected by the module_mutex.
+ */
+LIST_HEAD(pending_load_infos);
+
/* Work queue for freeing init sections in success case */
static void do_free_init(struct work_struct *w);
static DECLARE_WORK(init_free_wq, do_free_init);
@@ -762,7 +772,7 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));

free_module(mod);
- /* someone could wait for the module in add_unformed_module() */
+ /* someone could wait for the module in add_pending_load_info() */
wake_up_all(&module_wq);
return 0;
out:
@@ -2374,6 +2384,16 @@ static int post_relocation(struct module *mod, const struct load_info *info)
return module_finalize(info->hdr, info->sechdrs, mod);
}

+static bool __is_pending_load_info_name(const char *name)
+{
+ struct load_info *info;
+
+ list_for_each_entry(info, &pending_load_infos, next)
+ if (!strcmp(info->name, name))
+ return true;
+ return false;
+}
+
/* Is this module of this name done loading? No locks held. */
static bool finished_loading(const char *name)
{
@@ -2388,7 +2408,11 @@ static bool finished_loading(const char *name)
sched_annotate_sleep();
mutex_lock(&module_mutex);
mod = find_module_all(name, strlen(name), true);
- ret = !mod || mod->state == MODULE_STATE_LIVE;
+ if (!mod)
+ /* It might still be in the early process of loading. */
+ ret = !__is_pending_load_info_name(name);
+ else
+ ret = mod->state == MODULE_STATE_LIVE;
mutex_unlock(&module_mutex);

return ret;
@@ -2552,43 +2576,58 @@ static int may_init_module(void)
return 0;
}

-/*
- * We try to place it in the list now to make sure it's unique before
- * we dedicate too many resources. In particular, temporary percpu
- * memory exhaustion.
- */
-static int add_unformed_module(struct module *mod)
+static int add_pending_load_info(struct load_info *info)
{
+ struct module *mod;
int err;
- struct module *old;
-
- mod->state = MODULE_STATE_UNFORMED;

-again:
- mutex_lock(&module_mutex);
- old = find_module_all(mod->name, strlen(mod->name), true);
- if (old != NULL) {
- if (old->state != MODULE_STATE_LIVE) {
- /* Wait in case it fails to load. */
+ while (true) {
+ mutex_lock(&module_mutex);
+ mod = find_module_all(info->name, strlen(info->name), true);
+ if (!mod && !__is_pending_load_info_name(info->name))
+ break;
+ if (mod && mod->state == MODULE_STATE_LIVE) {
mutex_unlock(&module_mutex);
- err = wait_event_interruptible(module_wq,
- finished_loading(mod->name));
- if (err)
- goto out_unlocked;
- goto again;
+ return -EEXIST;
}
- err = -EEXIST;
- goto out;
+
+ /*
+ * The module is in some phase of getting loaded/unloaded;
+ * wait and retry.
+ */
+ mutex_unlock(&module_mutex);
+ err = wait_event_interruptible(module_wq,
+ finished_loading(info->name));
+ if (err)
+ return err;
}
+
+ INIT_LIST_HEAD(&info->next);
+ list_add(&info->next, &pending_load_infos);
+ mutex_unlock(&module_mutex);
+ return 0;
+}
+
+static void remove_pending_load_info(struct load_info *info)
+{
+ mutex_lock(&module_mutex);
+ list_del(&info->next);
+ /* someone could wait for the module name in finished_loading(). */
+ wake_up_all(&module_wq);
+ mutex_unlock(&module_mutex);
+}
+
+static void add_unformed_module(struct load_info *info, struct module *mod)
+{
+ mod->state = MODULE_STATE_UNFORMED;
+
+ mutex_lock(&module_mutex);
mod_update_bounds(mod);
list_add_rcu(&mod->list, &modules);
+ /* The module is on the module list now. */
+ list_del(&info->next);
mod_tree_insert(mod);
- err = 0;
-
-out:
mutex_unlock(&module_mutex);
-out_unlocked:
- return err;
}

static int complete_formation(struct module *mod, struct load_info *info)
@@ -2720,12 +2759,24 @@ static int load_module(struct load_info *info, const char __user *uargs,
goto free_copy;
}

- err = rewrite_section_headers(info, flags);
+ /*
+ * We make sure the module name is unique before we dedicate too many
+ * resources. In particular, avoid temporary percpu memory and module
+ * vmap space exhaustion.
+ */
+ err = add_pending_load_info(info);
if (err)
goto free_copy;

+ err = rewrite_section_headers(info, flags);
+ if (err) {
+ remove_pending_load_info(info);
+ goto free_copy;
+ }
+
/* Check module struct version now, before we try to use module. */
if (!check_modstruct_version(info, info->mod)) {
+ remove_pending_load_info(info);
err = -ENOEXEC;
goto free_copy;
}
@@ -2739,10 +2790,11 @@ static int load_module(struct load_info *info, const char __user *uargs,

audit_log_kern_module(mod->name);

- /* Reserve our place in the list. */
- err = add_unformed_module(mod);
- if (err)
- goto free_module;
+ /*
+ * Add the module to the module list as unformed. This will remove the
+ * load_info from the pending load_info list.
+ */
+ add_unformed_module(info, mod);

#ifdef CONFIG_MODULE_SIG
mod->sig_ok = info->sig_ok;
@@ -2754,7 +2806,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
}
#endif

- /* To avoid stressing percpu allocator, do this once we're unique. */
err = percpu_modalloc(mod, info);
if (err)
goto unlink_mod;
@@ -2890,7 +2941,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Wait for RCU-sched synchronizing before releasing mod->list. */
synchronize_rcu();
mutex_unlock(&module_mutex);
- free_module:
/* Free lock-classes; relies on the preceding sync_rcu() */
lockdep_free_key_range(mod->data_layout.base, mod->data_layout.size);

--
2.37.3


2022-10-14 05:08:29

by kernel test robot

Subject: Re: [PATCH v1] kernel/module: allocate module vmap space after making sure the module is unique

Hi David,

I love your patch! Perhaps something to improve:

[auto build test WARNING on mcgrof/modules-next]
[also build test WARNING on linus/master v6.0 next-20221014]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/David-Hildenbrand/kernel-module-allocate-module-vmap-space-after-making-sure-the-module-is-unique/20221014-020756
base: https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git modules-next
config: x86_64-randconfig-s022
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-39-gce1a6720-dirty
# https://github.com/intel-lab-lkp/linux/commit/8b7bfecba8ad77c19c0c857314df6f8e675f6f61
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review David-Hildenbrand/kernel-module-allocate-module-vmap-space-after-making-sure-the-module-is-unique/20221014-020756
git checkout 8b7bfecba8ad77c19c0c857314df6f8e675f6f61
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/module/

If you fix the issue, kindly add the following tag where applicable
| Reported-by: kernel test robot <[email protected]>

sparse warnings: (new ones prefixed by >>)
>> kernel/module/main.c:80:1: sparse: sparse: symbol 'pending_load_infos' was not declared. Should it be static?
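
The warning points at the new pending_load_infos list. Assuming nothing
outside kernel/module/main.c needs to reference the list (in this patch
nothing does), the likely fix is to make the definition static:

static LIST_HEAD(pending_load_infos);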

--
0-DAY CI Kernel Test Service
https://01.org/lkp



2022-10-14 06:17:42

by Miroslav Benes

Subject: Re: [PATCH v1] kernel/module: allocate module vmap space after making sure the module is unique

Hi,

On Thu, 13 Oct 2022, David Hildenbrand wrote:

> We already make sure to allocate percpu data only after we verified that
> the module we're loading hasn't already been loaded and isn't
> concurrently getting loaded -- that it's unique.
>
> On big systems (> 400 CPUs and many devices) with KASAN enabled, we're now
> facing a similar issue with the module vmap space.
>
> With KASAN_INLINE enabled (resulting in a large module size), plenty of
> devices that udev wants to probe, and plenty (> 400) of CPUs that can
> carry out that probing concurrently, we can actually run out of module
> vmap space and trigger vmap allocation errors:
>
> [... vmap allocation failure log and call trace snipped ...]
>
> Interestingly, when reducing the number of CPUs (nosmt), it works as
> expected.
>
> The underlying issue is that we first allocate memory (including module
> vmap space) in layout_and_allocate(), and then verify whether the module
> is unique in add_unformed_module(). So we end up allocating module vmap
> space even though we might not need it -- which is a problem when modules
> are big and we can have a lot of concurrent probing of the same set of
> modules as on the big system at hand.
>
> Unfortunately, we cannot simply add the module to the list earlier, because
> move_module() -- which allocates the module vmap space -- essentially
> brings the module to life from a temporary one. Adding the temporary
> module and replacing it later is also sub-optimal (replacing it would
> require synchronizing against RCU) and feels dangerous, given that we
> end up copying it.
>
> So instead, add a second list (pending_load_infos) that tracks the modules
> (via their load_info) that are unique and are still getting loaded
> ("pending"), but haven't made it to the actual module list yet. This
> shouldn't have a notable runtime overhead when concurrently loading
> modules: the new list is expected to usually either be empty or contain
> very few entries for a short time.
>
> Thanks to Uladzislau for his help verifying that this is not actually a
> vmap code issue.

this seems to be related to what
https://lore.kernel.org/all/[email protected]/
tries to solve; just the symptoms are different. Does the patch set fix
your issue too?

Regards
Miroslav

2022-10-14 07:40:42

by David Hildenbrand

Subject: Re: [PATCH v1] kernel/module: allocate module vmap space after making sure the module is unique

On 14.10.22 08:09, Miroslav Benes wrote:
> Hi,
>
> On Thu, 13 Oct 2022, David Hildenbrand wrote:
>
>> We already make sure to allocate percpu data only after we verified that
>> the module we're loading hasn't already been loaded and isn't
>> concurrently getting loaded -- that it's unique.
>>
>> On big systems (> 400 CPUs and many devices) with KASAN enabled, we're now
>> facing a similar issue with the module vmap space.
>>
>> With KASAN_INLINE enabled (resulting in a large module size), plenty of
>> devices that udev wants to probe, and plenty (> 400) of CPUs that can
>> carry out that probing concurrently, we can actually run out of module
>> vmap space and trigger vmap allocation errors:
>>
>> [... vmap allocation failure log and call trace snipped ...]
>>
>> Interestingly, when reducing the number of CPUs (nosmt), it works as
>> expected.
>>
>> The underlying issue is that we first allocate memory (including module
>> vmap space) in layout_and_allocate(), and then verify whether the module
>> is unique in add_unformed_module(). So we end up allocating module vmap
>> space even though we might not need it -- which is a problem when modules
>> are big and we can have a lot of concurrent probing of the same set of
>> modules as on the big system at hand.
>>
>> Unfortunately, we cannot simply add the module to the list earlier, because
>> move_module() -- which allocates the module vmap space -- essentially
>> brings the module to life from a temporary one. Adding the temporary
>> module and replacing it later is also sub-optimal (replacing it would
>> require synchronizing against RCU) and feels dangerous, given that we
>> end up copying it.
>>
>> So instead, add a second list (pending_load_infos) that tracks the modules
>> (via their load_info) that are unique and are still getting loaded
>> ("pending"), but haven't made it to the actual module list yet. This
>> shouldn't have a notable runtime overhead when concurrently loading
>> modules: the new list is expected to usually either be empty or contain
>> very few entries for a short time.
>>
>> Thanks to Uladzislau for his help verifying that this is not actually a
>> vmap code issue.
>
> this seems to be related to what
> https://lore.kernel.org/all/[email protected]/
> tries to solve. Just your symptoms are different. Does the patch set fix
> your issue too?

Hi Miroslav,

the underlying approach with a load_info list is similar (which is nice
to see), so I assume it will similarly fix the issue.

I'm not sure if merging the requests (adding the refcount logic and the
-EBUSY change) is really required/wanted, though. It looks like some of
these changes could have been factored out into separate patches.

Not my call to make. I'll give the set a churn on the machine where I
can reproduce the issue.

--
Thanks,

David / dhildenb