Kernel panic - after drop caches  #8

@msharmavikram

Hi @pakmarkthub

When I run the vectorAdd program repeatedly (manually, not using a run script), I end up getting a kernel panic. I upgraded the kernel to 5.6.3 and am using Nvidia driver 440.82 on CentOS 8, and this time I ensured the filesystem is ext4 :)

I am trying to understand what is causing this issue and have been unable to figure it out. Any thoughts on what might be going wrong?

Here is exactly what I did, step by step.

  1. Generate the data (1000K entries) on the ext4 disk, then load the dragon driver and activate it.
  2. Run the nvmgpu vectorAdd program with the following arguments:
    ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  3. Step 2 completes and generates the correct output.
  4. sync
  5. Drop caches.
  6. Run the nvmgpu vectorAdd program again with the same arguments:
    ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  7. KERNEL PANIC with the error below:
[  +0.513839] BUG: Bad page state in process vectorAdd  pfn:3f5f5d0
[  +0.000035] page:ffffede0fd7d7400 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000043] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000013] flags: 0x17ffffc0000000()
[  +0.000011] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000020] raw: 0000000000000001 ffff908f69aee068 00000000ffffffff ffff909170526000
[  +0.000019] page dumped because: page still charged to cgroup
[  +0.000014] page->mem_cgroup:ffff909170526000
[  +0.000011] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000028]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.000275] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P           O      5.6.3.dragon #5
[  +0.000020] Hardware name: ******
[  +0.000018] Call Trace:
[  +0.000015]  dump_stack+0x66/0x90
[  +0.000014]  bad_page.cold.125+0x7f/0xb2
[  +0.000012]  free_pcppages_bulk+0x178/0x660
[  +0.000013]  free_unref_page_list+0x101/0x180
[  +0.000015]  release_pages+0x382/0x400
[  +0.000013]  tlb_flush_mmu+0x44/0x150
[  +0.000012]  unmap_page_range+0x87f/0xde0
[  +0.000838]  unmap_vmas+0x91/0xf0
[  +0.000783]  exit_mmap+0xaa/0x180
[  +0.000779]  mmput+0x52/0x120
[  +0.000778]  do_exit+0x337/0xae0
[  +0.000769]  do_group_exit+0x3a/0xa0
[  +0.000762]  __x64_sys_exit_group+0x14/0x20
[  +0.000751]  do_syscall_64+0x5b/0x1e0
[  +0.000738]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000736] RIP: 0033:0x7f58bfbec7f6
[  +0.000741] Code: Bad RIP value.
[  +0.000733] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000745] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000755] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000753] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000747] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000743] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
[  +0.000753] BUG: Bad page state in process vectorAdd  pfn:3f5f5d1
[  +0.000749] page:ffffede0fd7d7440 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000778] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000760] flags: 0x17ffffc0000000()
[  +0.000757] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000774] raw: 0000000000000001 ffff908f69aeeea0 00000000ffffffff ffff909170526000
[  +0.000784] page dumped because: page still charged to cgroup
[  +0.000792] page->mem_cgroup:ffff909170526000
[  +0.000787] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000023]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.009132] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P    B      O      5.6.3.dragon #5
[  +0.001018] Hardware name: ******
[  +0.001019] Call Trace:
[  +0.001012]  dump_stack+0x66/0x90
[  +0.001004]  bad_page.cold.125+0x7f/0xb2
[  +0.001003]  free_pcppages_bulk+0x178/0x660
[  +0.000996]  free_unref_page_list+0x101/0x180
[  +0.000994]  release_pages+0x382/0x400
[  +0.000985]  tlb_flush_mmu+0x44/0x150
[  +0.000980]  unmap_page_range+0x87f/0xde0
[  +0.000962]  unmap_vmas+0x91/0xf0
[  +0.000935]  exit_mmap+0xaa/0x180
[  +0.000913]  mmput+0x52/0x120
[  +0.000887]  do_exit+0x337/0xae0
[  +0.000864]  do_group_exit+0x3a/0xa0
[  +0.000840]  __x64_sys_exit_group+0x14/0x20
[  +0.000820]  do_syscall_64+0x5b/0x1e0
[  +0.000795]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000777] RIP: 0033:0x7f58bfbec7f6
[  +0.000754] Code: Bad RIP value.
[  +0.000745] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000753] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000753] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000758] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000758] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000761] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
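
For reference, steps 2 through 6 above can be sketched as a small script. This is only a rough sketch: the helper names `run` and `repro_steps` and the `DRY_RUN` switch are made up for illustration, and writing `3` to `/proc/sys/vm/drop_caches` is an assumption, since the steps above only say "drop caches". Step 1 (data generation and loading the dragon driver) is left out because those commands are not listed here.

```shell
#!/bin/sh
# Sketch of steps 2-6. DRY_RUN defaults to 1, so the script only prints the
# commands; set DRY_RUN=0 (as root, on a test machine) to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Command line copied from steps 2 and 6.
VECTORADD="./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd"

# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$*"
    fi
}

# Hypothetical helper: the reproduction sequence in order.
repro_steps() {
    run "$VECTORADD"    # step 2: first run, completes with correct output
    run "sync"          # step 4: flush dirty pages to disk
    # step 5: ASSUMPTION, the value written to drop_caches is not stated above
    run "echo 3 > /proc/sys/vm/drop_caches"
    run "$VECTORADD"    # step 6: second run, after which the bug is reported
}

repro_steps
```

Running it as-is only prints the four commands; with `DRY_RUN=0` the last step is where the "Bad page state" reports above would be expected to show up in `dmesg`.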
