[Kernel-meetup] Blog posts and slides for past meetups
Nitin Bansal
nb at randomsearch.org
Wed Apr 19 07:23:49 PDT 2017
Hi Siddhesh,
On Wed, Apr 19, 2017 at 11:29 AM, Siddhesh Poyarekar
<sid at reserved-bit.com> wrote:
> Hi,
>
> If you have written blog posts or uploaded slides of your sessions at
> the kernel meetup in the past months, please don't forget to add a link
> to it on the relevant wiki page.
>
> https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017
> https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017
>
I am attaching a file which contains write-up about vmcore analysis
session that I presented in last meetup,
I don't maintain any blog etc so don't really have any place to
publish it, but feel free to post/attach it to
reserved-bit wiki if you like.
-------------- next part --------------
I attended the Pune Kernel Meetup on 8th April and presented a session on
Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a
few things about the vmcore analysis, hope it helps.
Before starting with the vmcore analysis, we need following three things:
o crash utility [https://people.redhat.com/anderson/crash_whitepaper/]
o vmlinux
o vmcore
vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux
(RHEL) at least). One can either install kernel-debuginfo or simply extract
vmlinux file using rpm2cpio:
# yum install kernel-debuginfo-2.6.32-642.13.1.el6.x86_64
Or
# rpm2cpio kernel-debuginfo-2.6.32-642.13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo
package of same version is required.
Now let's use "crash" utility to open this vmcore, which after printing a few
details i.e. copyright information, GDB version information etc drops us to
"crash>" prompt:
# crash /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux vmcore
crash 7.1.0-6.el6
..
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 2
DATE: Fri Mar 31 11:15:47 2017
UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
TASKS: 232
NODENAME: rhel6
RELEASE: 2.6.32-642.13.1.el6.x86_64
VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
MACHINE: x86_64 (2808 Mhz)
MEMORY: 2 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
PID: 46
COMMAND: "khungtaskd"
TASK: ffff880078d16ab0 [THREAD_INFO: ffff880078d20000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash>
"sys" command can be used to print most of the information that crash prints
when invoked e.g.:
crash> sys
KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 2
DATE: Fri Mar 31 11:15:47 2017
UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
TASKS: 232
NODENAME: rhel6
RELEASE: 2.6.32-642.13.1.el6.x86_64
VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
MACHINE: x86_64 (2808 Mhz)
MEMORY: 2 GB
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
To list all the available crash commands, type "?" and hit enter:
crash> ?
* files mach repeat timer
alias foreach mod runq tree
ascii fuser mount search union
bt gdb net set vm
btop help p sig vtop
dev ipcs ps struct waitq
dis irq pte swap whatis
eval kmem ptob sym wr
exit list ptov sys q
extend log rd task
Use "help" to get details about a command and parameters it accepts e.g.:
crash>help sys
Now, let's examine kernel ring buffer (that's where we find panic message) with
the help of "log" command.
And following is the panic message, which tells us that khungtaskd crashed the
system:
Call Trace:
<#DB[1]> <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff81009c47>] ? show_regs+0x27/0x30
[<ffffffff8154d909>] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0
[<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80
[<ffffffff8154f20a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff810acc8e>] ? notify_die+0x2e/0x30
[<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350
[<ffffffff8154c6e3>] ? nmi+0x83/0x90
[<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10
<<EOE>> [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0
[<ffffffff81039b9a>] ? arch_trigger_all_cpu_backtrace+0x27a/0x2c0
[<ffffffff810f2740>] ? watchdog+0x250/0x2a0
[<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
[<ffffffff810a640e>] ? kthread+0x9e/0xc0
[<ffffffff8100c28a>] ? child_rip+0xa/0x20
[<ffffffff810a6370>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
[<ffffffff81548301>] ? panic+0xa7/0x179
[<ffffffff81039aa5>] ? arch_trigger_all_cpu_backtrace+0x185/0x2c0
[<ffffffff810f274e>] ? watchdog+0x25e/0x2a0
[<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
[<ffffffff810a640e>] ? kthread+0x9e/0xc0
[<ffffffff8100c28a>] ? child_rip+0xa/0x20
[<ffffffff810a6370>] ? kthread+0x0/0xc0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
khungtaskd is a kernel thread which checks for uninterruptible tasks, few
relevant snips from kernel/hung_task.c:
static int __init hung_task_init(void)
{
atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");
return 0;
}
..
static int watchdog(void *dummy)
{
set_user_nice(current, 0);
for ( ; ; ) {
unsigned long timeout = sysctl_hung_task_timeout_secs;
while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
timeout = sysctl_hung_task_timeout_secs;
check_hung_uninterruptible_tasks(timeout);
}
return 0;
}
By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be
seen along with the following backtrace:
1 > INFO: task dd:7224 blocked for more than 10 seconds.
2 > Not tainted 2.6.32-642.13.1.el6.x86_64 #1
3 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
4> dd D 0000000000000001 0 7224 6702 0x00000080
ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8
5> Call Trace:
[<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80
[<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120
[<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff811bcc14>] mnt_want_write+0x24/0x50
[<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190
[<ffffffff811aca99>] do_filp_open+0x2b9/0xd20
[<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90
[<ffffffff811ba072>] ? alloc_fd+0x92/0x160
[<ffffffff811969f7>] do_sys_open+0x67/0x130
[<ffffffff81196b00>] sys_open+0x20/0x30
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Let's look at it line by line:
Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.
Line 2 tells us the kernel version and that it is not tainted (i.e. no third
party modules are loaded etc.).
Line 3 is just telling us how to disable this message, not a great idea IMO.
Line 4 contains command name, state, PID etc.
Line 5 shows the associated call trace which suggests that dd became
uninterruptible while trying to write to a filesystem.
Process details can be checked using "ps":
crash> ps -m | grep UN
[0 00:00:14.117] [UN] PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd"
^^^^^^^^^^^^^^^^
|_____This is the duration of dd in uninterruptible state (14.117 seconds)
-m switch displays timestamp in days, hours, minutes, seconds and milliseconds.
Next step is to examine dd process, in order to do that, we first need to
change the process context (currently it is set to khungtaskd and needs to be
switched to dd), which can be done using "set":
crash> set 7224
PID: 7224
COMMAND: "dd"
TASK: ffff880037b70ab0 [THREAD_INFO: ffff88007c470000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
Let's print backtrace along with kernel source line numbers (source line
numbers probably wouldn't be displayed if kernel-debuginfo is not installed):
crash> bt -l
PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd"
#0 [ffff88007c473c18] schedule at ffffffff815489d0
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 3119
#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 677
#2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
#3 [ffff88007c473db0] do_filp_open at ffffffff811aca99
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namei.c: 2267
#4 [ffff88007c473f20] do_sys_open at ffffffff811969f7
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 908
#5 [ffff88007c473f70] sys_open at ffffffff81196b00
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 932
#6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/arch/x86/kernel/entry_64.S: 492
RIP: 000000358a8db480 RSP: 00007ffcd7d37ca8 RFLAGS: 00010246
RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 0000000000000000
RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007ffcd7d3870b
RBP: 0000000000000240 R8: 000000358ab8ee40 R9: 00007ffaa33c368c
R10: 1999999999999999 R11: 0000000000000246 R12: ffffffff81196b00
R13: ffff88007c473f78 R14: ffffffffffffffff R15: 0000000000000064
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b
This backtrace needs to be read from bottom to top, following chain shows how
these functions were called:
921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
922 {
923 long ret;
924
925 if (force_o_largefile())
926 flags |= O_LARGEFILE;
927
928 ret = do_sys_open(AT_FDCWD, filename, flags, mode);--------------------
|
----------------------------------------------------------------
|
899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
900 {
901 struct filename *tmp = getname(filename);
902 int fd = PTR_ERR(tmp);
903
904 if (!IS_ERR(tmp)) {
905 fd = get_unused_fd_flags(flags);
906 if (fd >= 0) {
907 struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------
|
---------------------------------------------------------
|
2166 struct file *do_filp_open(int dfd, struct filename *filename,
2167 int open_flag, int mode, int acc_mode)
2168 {
2169 struct file *filp;
..
..
2256 /*
2257 * This write is needed to ensure that a
2258 * ro->rw transition does not occur between
2259 * the time when the file is created and when
2260 * a permanent write count is taken through
2261 * the 'struct file' in nameidata_to_filp().
2262 */
2263 error = mnt_want_write(nd.path.mnt); ----------------------------------
|
-----------------------------------------------------------------
|
295 int mnt_want_write(struct vfsmount *m)
296 {
297 int ret;
298
299 sb_start_write(m->mnt_sb); -------------------------------------------
|
-------------------------------------------------
|
1569 static inline void sb_start_write(struct super_block *sb)
1570 {
1571 __sb_start_write(sb, SB_FREEZE_WRITE, true); --------------------------
|
----------------------------------------------------------------
|
668 int __sb_start_write(struct super_block *sb, int level, bool wait)
669 {
670 /* Out of tree modules don't use this mechanism */
671 if (unlikely(!sb_has_new_freeze(sb)))
672 return 1;
673 retry:
674 if (unlikely(sb->s_writers.frozen >= level)) {
675 if (!wait)
676 return 0;
677 wait_event(sb->s_writers.wait_unfrozen, <<<<<<<<<<<<<<< dd was here
678 sb->s_writers.frozen < level);
679 }
In short, mnt_want_write() called sb_start_write() which then called
__sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when
this vmcore was collected.
From this code it is clear that dd was trying to write to a filesystem but
somehow it couldn't and became uninterruptible, now we need to find out which
filesystem it was and why dd couldn't write to it. Can we do it? Yes,
absolutely, if we can determine the first argument i.e. super_block that is
passed to __sb_start_write(), it will tell us about the filesystem/mount-point.
If we go back to backtrace, we can see that mnt_want_write() called
__sb_start_write() and passed it three parameters, struct super_block (sb),
level (SB_FREEZE_WRITE) and a Boolean called wait (true):
#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
#2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
Note: "whatis command can be used to check the function prototype (simply
entering the function name or px also does the job):
crash> whatis __sb_start_write
int __sb_start_write(struct super_block *, int, bool);
Before we go any further, it is important to know how arguments are passed to
the called function. First argument goes to %rdi, second to %rsi, third to
%rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed
with this knowledge now we can turn our attention to disassembly. Let's
disassemble mnt_start_write to find out arguments passed to __sb_start_write:
crash> dis -lr ffffffff811bcc14
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcbf0 <mnt_want_write>: push %rbp
0xffffffff811bcbf1 <mnt_want_write+1>: mov %rsp,%rbp
0xffffffff811bcbf4 <mnt_want_write+4>: push %rbx
0xffffffff811bcbf5 <mnt_want_write+5>: sub $0x18,%rsp
0xffffffff811bcbf9 <mnt_want_write+9>: nopl 0x0(%rax,%rax,1)
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcbfe <mnt_want_write+14>: mov $0x1,%edx <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 3rd argument i.e. wait
0xffffffff811bcc03 <mnt_want_write+19>: mov $0x1,%esi <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd argument i.e. SB_FREEZE_WRITE
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcc08 <mnt_want_write+24>: mov %rdi,%rbx
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcc0b <mnt_want_write+27>: mov 0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)
0xffffffff811bcc0f <mnt_want_write+31>: callq 0xffffffff8119bd20 <__sb_start_write>
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
0xffffffff811bcc14 <mnt_want_write+36>: mov %rbx,%rdi
The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are
pretty straightforward, both are set to 1.
But to determine super_block which is %rdi we have to back-peddle a little bit,
28th offset of %rdi was copied to %rdi:
mov 0x28(%rdi),%rdi
we need to determine what's inside %rdi and one line above that tells us that
%rdi was copied to %rbx.
mov %rdi,%rbx
If we can determine value inside %rbx, it will give us %rdi and then 28th
offset of %rdi can be calculated.
To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved
somewhere:
crash> dis -lr ffffffff8119bdfc | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 669
0xffffffff8119bd20 <__sb_start_write>: push %rbp
0xffffffff8119bd21 <__sb_start_write+1>: mov %rsp,%rbp
0xffffffff8119bd24 <__sb_start_write+4>: add $0xffffffffffffff80,%rsp
0xffffffff8119bd28 <__sb_start_write+8>: mov %rbx,-0x28(%rbp) <<<<<<< rbx
0xffffffff8119bd2c <__sb_start_write+12>: mov %r12,-0x20(%rbp)
0xffffffff8119bd30 <__sb_start_write+16>: mov %r13,-0x18(%rbp)
0xffffffff8119bd34 <__sb_start_write+20>: mov %r14,-0x10(%rbp)
0xffffffff8119bd38 <__sb_start_write+24>: mov %r15,-0x8(%rbp)
0xffffffff8119bd3c <__sb_start_write+28>: nopl 0x0(%rax,%rax,1)
%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the
stack data of __sb_start_write or schedule, I am going to use stack data of
schedule for now to make it simpler. Disassembly of schedule (next function
after __sb_start_write) suggests that %rbp is pushed to the stack:
crash> dis -lr ffffffff815489d0 | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 6133
0xffffffff81548630 <schedule>: push %rbp <<<<<<<<<<<<<<<<<<< 1st push
0xffffffff81548631 <schedule+1>: mov %rsp,%rbp
0xffffffff81548634 <schedule+4>: push %r15
0xffffffff81548636 <schedule+6>: push %r14
0xffffffff81548638 <schedule+8>: push %r13
0xffffffff8154863a <schedule+10>: push %r12
0xffffffff8154863c <schedule+12>: push %rbx
0xffffffff8154863d <schedule+13>: sub $0x98,%rsp
0xffffffff81548644 <schedule+20>: nopl 0x0(%rax,%rax,1)
first push at stack will give us %rbp, and bt -f can be used to print stack:
crash> bt -f | grep -A15 schedule
#0 [ffff88007c473c18] schedule at ffffffff815489d0
ffff88007c473c20: 0000000000000082 0000000000000000
ffff88007c473c30: 0000000000000000 0000000000000000
ffff88007c473c40: 0000000000000000 0000000000000000
ffff88007c473c50: 0000000000000000 0000000000000000
ffff88007c473c60: 0000000000000000 ffff880037b71068
ffff88007c473c70: ffff88007c473fd8 0000000000010068
ffff88007c473c80: ffff880037b71068 ffff88007e527520
ffff88007c473c90: ffff880037b70ab0 0000000000000002
ffff88007c473ca0: 0000000000000246 ffff88007c473ce8
ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0
ffff88007c473cc0: ffff8800379e7400 0000000000000001
ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8
ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp
ffff88007c473cf0: ffffffff8119bdfc
Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:
crash> eval (0xffff88007c473d78 - 0x28)
hexadecimal: ffff88007c473d50
decimal: 18446612134399262032 (-131939310289584)
octal: 1777774200017421636520
binary: 1111111111111111100010000000000001111100010001110011110101010000
crash> rd ffff88007c473d50
ffff88007c473d50: ffff88007ef64380 .C.~....
%rbx == ffff88007ef64380
Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we
calculate 28th offset to %rdi we will get super_block:
crash> eval (0xffff88007ef64380 + 0x28)
hexadecimal: ffff88007ef643a8
decimal: 18446612134444286888 (-131939265264728)
octal: 1777774200017675441650
binary: 1111111111111111100010000000000001111110111101100100001110101000
crash> rd ffff88007ef643a8
ffff88007ef643a8: ffff8800379e7400 .t.7....
super_block == ffff8800379e7400.
super_block can be passed to the "mount" command to get the mount point:
crash> mount ffff8800379e7400
VFSMOUNT SUPERBLK TYPE DEVNAME DIRNAME
ffff88007ef64380 ffff8800379e7400 ext4 /dev/vda1 /boot
This tells us that that dd was trying to write to /boot. Next part of our job
is to find out why couldn't it write to /boot. In order to answer that, let's
take a step back and revisit the kernel code at line 677 (where dd was):
677 wait_event(sb->s_writers.wait_unfrozen,
678 sb->s_writers.frozen < level);
Following wait_event() definition tells us that dd was put to sleep
(uninterruptible) and will be woken up when "sb->s_writers.frozen < level"
condition becomes true:
/**
* wait_event - sleep until a condition gets true
* @wq: the waitqueue to wait on
* @condition: a C expression for the event to wait for
*
* The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
* @condition evaluates to true. The @condition is checked each time
* the waitqueue @wq is woken up.
*
* wake_up() has to be called after changing any variable that could
* change the result of the wait condition.
*/
#define wait_event(wq, condition) \
do { \
if (condition) \
break; \
__wait_event(wq, condition); \
} while (0)
We know that "level" is 1 (second argument SB_FREEZE_WRITE), and
sb->s_writers.frozen needs to be less than 1 for this condition to become true.
Let's see what exactly is the current value of sb->s_writers.frozen, since we
already have sb, it is pretty straightforward to obtain this value:
crash> pd ((struct super_block *)0xffff8800379e7400)->s_writers.frozen
$5 = 4
Or do this (I prefer the 1st one):
crash> super_block.s_writers ffff8800379e7400 | grep frozen
frozen = 4,
sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is
uninterruptible, but what does this value 4 mean? This is defined in
include/linux/fs.h ("cscope" can be used to examine kernel source):
/* Possible states of 'frozen' field */
enum {
SB_UNFROZEN = 0, /* FS is unfrozen */
SB_FREEZE_WRITE = 1, /* Writes, dir ops, ioctls frozen */
SB_FREEZE_TRANS = 2,
SB_FREEZE_PAGEFAULT = 2, /* Page faults stopped as well */
SB_FREEZE_FS = 3, /* For internal FS use (e.g. to stop
* internal threads if needed) */
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */ <<<<<<<
};
4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So,
now we can say that dd became uninterruptible because /boot filesystem was
frozen, this triggered hung_task_timeout_secs which was caught by
hung_task_panic and this caused the crash.
Next step should be to check what exactly froze the filesystem, in this
particular case I did it using fsfreeze to collect this vmcore so I can
demonstrate a few things about vmcore analysis.
In order to collect this demo vmcore I used a sysctl parameter named
hung_task_panic. This parameter when enabled will trigger a crash if a process
becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs
sysctl parameter controls this duration and is set to 120 seconds by default in
RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to
10 seconds:
# echo 1 > /proc/sys/kernel/hung_task_panic
# echo 10 > /proc/sys/kernel/hung_task_timeout_secs
I simply executed following dd command on /boot:
# dd if=/dev/zero of=/boot/test
And in the second terminal this:
# fsfreeze -f /boot
This made dd uninterruptible, hung_task_panic kicked in and kdump did its job
by saving a vmcore under /var/crash.
We can print these values from vmcore:
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $1 = 10
crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $2 = 1
Please send an email to the kernel meetup list for any queries and if this
write-up needs any corrections.
More information about the Kernel-meetup
mailing list