[Kernel-meetup] Blog posts and slides for past meetups
    Nitin Bansal 
    nb at randomsearch.org
       
    Wed Apr 19 07:23:49 PDT 2017
    
    
  
Hi Siddhesh,
On Wed, Apr 19, 2017 at 11:29 AM, Siddhesh Poyarekar
<sid at reserved-bit.com> wrote:
> Hi,
>
> If you have written blog posts or uploaded slides of your sessions at
> the kernel meetup in the past months, please don't forget to add a link
> to it on the relevant wiki page.
>
> https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017
> https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017
>
I am attaching a file which contains write-up about vmcore analysis
session that I presented in last meetup,
I don't maintain any blog etc so don't really have any place to
publish it, but feel free to post/attach it to
reserved-bit wiki if you like.
-------------- next part --------------
I attended the Pune Kernel Meetup on 8th April and presented a session on
Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a
few things about the vmcore analysis, hope it helps. 
Before starting with the vmcore analysis, we need following three things:
o crash utility [https://people.redhat.com/anderson/crash_whitepaper/]
o vmlinux
o vmcore
 
vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux
(RHEL) at least). One can either install kernel-debuginfo or simply extract
vmlinux file using rpm2cpio:
# yum install kernel-debuginfo-2.6.32-642.13.1.el6.x86_64
Or
# rpm2cpio kernel-debuginfo-2.6.32-642.13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo
package of same version is required. 
Now let's use "crash" utility to open this vmcore, which after printing a few
details i.e. copyright information, GDB version information etc drops us to
"crash>" prompt:
# crash /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux  vmcore
crash 7.1.0-6.el6
..
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Mar 31 11:15:47 2017
      UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
       TASKS: 232
    NODENAME: rhel6
     RELEASE: 2.6.32-642.13.1.el6.x86_64
     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
     MACHINE: x86_64  (2808 Mhz)
      MEMORY: 2 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 46
     COMMAND: "khungtaskd"
        TASK: ffff880078d16ab0  [THREAD_INFO: ffff880078d20000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
crash> 
"sys" command can be used to print most of the information that crash prints
when invoked e.g.:
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Mar 31 11:15:47 2017
      UPTIME: 00:06:01
LOAD AVERAGE: 0.40, 0.30, 0.13
       TASKS: 232
    NODENAME: rhel6
     RELEASE: 2.6.32-642.13.1.el6.x86_64
     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
     MACHINE: x86_64  (2808 Mhz)
      MEMORY: 2 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
To list all the available crash commands, type "?" and hit enter:
crash> ?
*              files          mach           repeat         timer          
alias          foreach        mod            runq           tree           
ascii          fuser          mount          search         union          
bt             gdb            net            set            vm             
btop           help           p              sig            vtop           
dev            ipcs           ps             struct         waitq          
dis            irq            pte            swap           whatis         
eval           kmem           ptob           sym            wr             
exit           list           ptov           sys            q              
extend         log            rd             task           
Use "help" to get details about a command and parameters it accepts e.g.:
crash>help sys
Now, let's examine kernel ring buffer (that's where we find panic message) with
the help of "log" command. 
And following is the panic message, which tells us that khungtaskd crashed the
system:
Call Trace:
 <#DB[1]>  <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 <NMI>  [<ffffffff81009c47>] ? show_regs+0x27/0x30
 [<ffffffff8154d909>] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0
 [<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8154f20a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<ffffffff810acc8e>] ? notify_die+0x2e/0x30
 [<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350
 [<ffffffff8154c6e3>] ? nmi+0x83/0x90
 [<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10
 <<EOE>>  [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0
 [<ffffffff81039b9a>] ? arch_trigger_all_cpu_backtrace+0x27a/0x2c0
 [<ffffffff810f2740>] ? watchdog+0x250/0x2a0
 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
 [<ffffffff810a640e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a6370>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81548301>] ? panic+0xa7/0x179
 [<ffffffff81039aa5>] ? arch_trigger_all_cpu_backtrace+0x185/0x2c0
 [<ffffffff810f274e>] ? watchdog+0x25e/0x2a0
 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
 [<ffffffff810a640e>] ? kthread+0x9e/0xc0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a6370>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
khungtaskd is a kernel thread which checks for uninterruptible tasks, few
relevant snips from kernel/hung_task.c:
static int __init hung_task_init(void)
{
        atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
        watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");
        return 0;
}
..
static int watchdog(void *dummy)
{
        set_user_nice(current, 0); 
 
        for ( ; ; ) {
                unsigned long timeout = sysctl_hung_task_timeout_secs;     
                while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
                        timeout = sysctl_hung_task_timeout_secs;
                check_hung_uninterruptible_tasks(timeout);
        }
        return 0;
}
By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be
seen along with the following backtrace:
1 > INFO: task dd:7224 blocked for more than 10 seconds.
2 >      Not tainted 2.6.32-642.13.1.el6.x86_64 #1
3 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
4> dd            D 0000000000000001     0  7224   6702 0x00000080
 ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8
5> Call Trace:
 [<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80
 [<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120
 [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff811bcc14>] mnt_want_write+0x24/0x50
 [<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190
 [<ffffffff811aca99>] do_filp_open+0x2b9/0xd20
 [<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811ba072>] ? alloc_fd+0x92/0x160
 [<ffffffff811969f7>] do_sys_open+0x67/0x130
 [<ffffffff81196b00>] sys_open+0x20/0x30
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Let's look at it line by line:
Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.
Line 2 tells us the kernel version and that it is not tainted (i.e. no third
party modules are loaded etc.).
Line 3 is just telling us how to disable this message, not a great idea IMO.
Line 4 contains command name, state, PID etc.
Line 5 shows the associated call trace which suggests that dd became
uninterruptible while trying to write to a filesystem.
Process details can be checked using "ps":
crash> ps -m | grep UN
[0 00:00:14.117] [UN]  PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"
^^^^^^^^^^^^^^^^
     |_____This is the duration of dd in uninterruptible state (14.117 seconds)
-m switch displays timestamp in days, hours, minutes, seconds and milliseconds. 
Next step is to examine dd process, in order to do that, we first need to
change the process context (currently it is set to khungtaskd and needs to be
switched to dd), which can be done using "set":
crash> set 7224
    PID: 7224
COMMAND: "dd"
   TASK: ffff880037b70ab0  [THREAD_INFO: ffff88007c470000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE 
Let's print backtrace along with kernel source line numbers (source line
numbers probably wouldn't be displayed if kernel-debuginfo is not installed):
crash> bt -l
PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"
 #0 [ffff88007c473c18] schedule at ffffffff815489d0
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 3119
 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 677
 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
 #3 [ffff88007c473db0] do_filp_open at ffffffff811aca99
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namei.c: 2267
 #4 [ffff88007c473f20] do_sys_open at ffffffff811969f7
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 908
 #5 [ffff88007c473f70] sys_open at ffffffff81196b00
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c: 932
 #6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2
    /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/arch/x86/kernel/entry_64.S: 492
    RIP: 000000358a8db480  RSP: 00007ffcd7d37ca8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000000
    RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 00007ffcd7d3870b
    RBP: 0000000000000240   R8: 000000358ab8ee40   R9: 00007ffaa33c368c
    R10: 1999999999999999  R11: 0000000000000246  R12: ffffffff81196b00
    R13: ffff88007c473f78  R14: ffffffffffffffff  R15: 0000000000000064
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
This backtrace needs to be read from bottom to top, following chain shows how
these functions were called:
 921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
 922 {
 923     long ret;
 924 
 925     if (force_o_largefile())
 926         flags |= O_LARGEFILE;
 927 
 928     ret = do_sys_open(AT_FDCWD, filename, flags, mode);--------------------
                                                                                | 
                ----------------------------------------------------------------
                |                                         
 899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
 900 {
 901     struct filename *tmp = getname(filename);
 902     int fd = PTR_ERR(tmp);
 903 
 904     if (!IS_ERR(tmp)) {
 905         fd = get_unused_fd_flags(flags);
 906         if (fd >= 0) {
 907             struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------
                                                                                |
                       ---------------------------------------------------------
                      |
2166 struct file *do_filp_open(int dfd, struct filename *filename,
2167         int open_flag, int mode, int acc_mode)
2168 {
2169     struct file *filp;
..
..
2256     /*
2257      * This write is needed to ensure that a
2258      * ro->rw transition does not occur between
2259      * the time when the file is created and when
2260      * a permanent write count is taken through
2261      * the 'struct file' in nameidata_to_filp().
2262      */
2263     error = mnt_want_write(nd.path.mnt); ----------------------------------
                                                                                |
               -----------------------------------------------------------------
              |
 295 int mnt_want_write(struct vfsmount *m)
 296 {
 297     int ret;
 298 
 299     sb_start_write(m->mnt_sb);  -------------------------------------------
                                                                                |
                               -------------------------------------------------
                              |
1569 static inline void sb_start_write(struct super_block *sb)
1570 {
1571     __sb_start_write(sb, SB_FREEZE_WRITE, true); --------------------------
                                                                                |
                ----------------------------------------------------------------
               |
668 int __sb_start_write(struct super_block *sb, int level, bool wait)
669 {   
670     /* Out of tree modules don't use this mechanism */
671     if (unlikely(!sb_has_new_freeze(sb)))
672         return 1;
673 retry:
674     if (unlikely(sb->s_writers.frozen >= level)) {
675         if (!wait)
676             return 0;
677         wait_event(sb->s_writers.wait_unfrozen,  <<<<<<<<<<<<<<< dd was here
678                sb->s_writers.frozen < level);
679     }
In short, mnt_want_write() called sb_start_write() which then called
__sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when
this vmcore was collected.
From this code it is clear that dd was trying to write to a filesystem but
somehow it couldn't and became uninterruptible, now we need to find out which
filesystem it was and why dd couldn't write to it. Can we do it? Yes,
absolutely, if we can determine the first argument i.e. super_block that is
passed to __sb_start_write(), it will tell us about the filesystem/mount-point. 
If we go back to backtrace, we can see that mnt_want_write() called
__sb_start_write() and passed it three parameters, struct super_block (sb),
level (SB_FREEZE_WRITE) and a Boolean called wait (true):
 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
Note: "whatis command can be used to check the function prototype (simply
entering the function name or px also does the job):
crash> whatis __sb_start_write
int __sb_start_write(struct super_block *, int, bool);
Before we go any further, it is important to know how arguments are passed to
the called function. First argument goes to %rdi, second to %rsi, third to
%rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed
with this knowledge now we can turn our attention to disassembly. Let's
disassemble mnt_start_write to find out arguments passed to __sb_start_write:
crash> dis -lr ffffffff811bcc14 
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcbf0 <mnt_want_write>:    push   %rbp
0xffffffff811bcbf1 <mnt_want_write+1>:  mov    %rsp,%rbp
0xffffffff811bcbf4 <mnt_want_write+4>:  push   %rbx
0xffffffff811bcbf5 <mnt_want_write+5>:  sub    $0x18,%rsp
0xffffffff811bcbf9 <mnt_want_write+9>:  nopl   0x0(%rax,%rax,1)
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcbfe <mnt_want_write+14>: mov    $0x1,%edx  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 3rd argument i.e. wait
0xffffffff811bcc03 <mnt_want_write+19>: mov    $0x1,%esi  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd argument i.e. SB_FREEZE_WRITE
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 296
0xffffffff811bcc08 <mnt_want_write+24>: mov    %rdi,%rbx 
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/include/linux/fs.h: 1571
0xffffffff811bcc0b <mnt_want_write+27>: mov    0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)
0xffffffff811bcc0f <mnt_want_write+31>: callq  0xffffffff8119bd20 <__sb_start_write>
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namespace.c: 300
0xffffffff811bcc14 <mnt_want_write+36>: mov    %rbx,%rdi
The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are
pretty straightforward, both are set to 1. 
But to determine super_block which is %rdi we have to back-peddle a little bit,
28th offset of %rdi was copied to %rdi:
mov    0x28(%rdi),%rdi
we need to determine what's inside %rdi and one line above that tells us that
%rdi was copied to %rbx. 
mov    %rdi,%rbx
If we can determine value inside %rbx, it will give us %rdi and then 28th
offset of %rdi can be calculated.
To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved
somewhere:
crash> dis -lr ffffffff8119bdfc | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c: 669
0xffffffff8119bd20 <__sb_start_write>:	push   %rbp
0xffffffff8119bd21 <__sb_start_write+1>:	mov    %rsp,%rbp
0xffffffff8119bd24 <__sb_start_write+4>:	add    $0xffffffffffffff80,%rsp
0xffffffff8119bd28 <__sb_start_write+8>:	mov    %rbx,-0x28(%rbp)   <<<<<<< rbx
0xffffffff8119bd2c <__sb_start_write+12>:	mov    %r12,-0x20(%rbp)
0xffffffff8119bd30 <__sb_start_write+16>:	mov    %r13,-0x18(%rbp)
0xffffffff8119bd34 <__sb_start_write+20>:	mov    %r14,-0x10(%rbp)
0xffffffff8119bd38 <__sb_start_write+24>:	mov    %r15,-0x8(%rbp)
0xffffffff8119bd3c <__sb_start_write+28>:	nopl   0x0(%rax,%rax,1)
%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the
stack data of __sb_start_write or schedule, I am going to use stack data of
schedule for now to make it simpler. Disassembly of schedule (next function
after __sb_start_write) suggests that %rbp is pushed to the stack:
crash> dis -lr ffffffff815489d0 | head
/usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/kernel/sched.c: 6133
0xffffffff81548630 <schedule>:	push   %rbp  <<<<<<<<<<<<<<<<<<< 1st push
0xffffffff81548631 <schedule+1>:	mov    %rsp,%rbp
0xffffffff81548634 <schedule+4>:	push   %r15
0xffffffff81548636 <schedule+6>:	push   %r14
0xffffffff81548638 <schedule+8>:	push   %r13
0xffffffff8154863a <schedule+10>:	push   %r12
0xffffffff8154863c <schedule+12>:	push   %rbx
0xffffffff8154863d <schedule+13>:	sub    $0x98,%rsp
0xffffffff81548644 <schedule+20>:	nopl   0x0(%rax,%rax,1)
first push at stack will give us %rbp, and bt -f can be used to print stack:
crash> bt -f | grep -A15 schedule
 #0 [ffff88007c473c18] schedule at ffffffff815489d0
    ffff88007c473c20: 0000000000000082 0000000000000000 
    ffff88007c473c30: 0000000000000000 0000000000000000 
    ffff88007c473c40: 0000000000000000 0000000000000000 
    ffff88007c473c50: 0000000000000000 0000000000000000 
    ffff88007c473c60: 0000000000000000 ffff880037b71068 
    ffff88007c473c70: ffff88007c473fd8 0000000000010068 
    ffff88007c473c80: ffff880037b71068 ffff88007e527520 
    ffff88007c473c90: ffff880037b70ab0 0000000000000002 
    ffff88007c473ca0: 0000000000000246 ffff88007c473ce8 
    ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0 
    ffff88007c473cc0: ffff8800379e7400 0000000000000001 
    ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8 
    ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp
    ffff88007c473cf0: ffffffff8119bdfc 
Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:
crash> eval (0xffff88007c473d78 - 0x28)
hexadecimal: ffff88007c473d50  
    decimal: 18446612134399262032  (-131939310289584)
      octal: 1777774200017421636520
     binary: 1111111111111111100010000000000001111100010001110011110101010000
crash> rd ffff88007c473d50
ffff88007c473d50:  ffff88007ef64380                    .C.~....
%rbx == ffff88007ef64380
Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we
calculate 28th offset to %rdi we will get super_block:
crash> eval (0xffff88007ef64380 + 0x28)
hexadecimal: ffff88007ef643a8  
    decimal: 18446612134444286888  (-131939265264728)
      octal: 1777774200017675441650
     binary: 1111111111111111100010000000000001111110111101100100001110101000
crash> rd ffff88007ef643a8
ffff88007ef643a8:  ffff8800379e7400                    .t.7....
super_block == ffff8800379e7400.
super_block can be passed to the "mount" command to get the mount point:
crash> mount ffff8800379e7400
    VFSMOUNT         SUPERBLK     TYPE   DEVNAME   DIRNAME
ffff88007ef64380 ffff8800379e7400 ext4   /dev/vda1 /boot   
This tells us that that dd was trying to write to /boot. Next part of our job
is to find out why couldn't it write to /boot. In order to answer that, let's
take a step back and revisit the kernel code at line 677 (where dd was):
677         wait_event(sb->s_writers.wait_unfrozen, 
678                sb->s_writers.frozen < level);
Following wait_event() definition tells us that dd was put to sleep
(uninterruptible) and will be woken up when "sb->s_writers.frozen < level"
condition becomes true:
/**
 * wait_event - sleep until a condition gets true
 * @wq: the waitqueue to wait on
 * @condition: a C expression for the event to wait for
 *
 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
 * @condition evaluates to true. The @condition is checked each time
 * the waitqueue @wq is woken up.
 *
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.
 */
#define wait_event(wq, condition)                   \
do {                                    \
    if (condition)                          \
        break;                          \
    __wait_event(wq, condition);                    \
} while (0)
We know that "level" is 1 (second argument SB_FREEZE_WRITE), and
sb->s_writers.frozen needs to be less than 1 for this condition to become true.
Let's see what exactly is the current value of sb->s_writers.frozen, since we
already have sb, it is pretty straightforward to obtain this value:
crash> pd ((struct super_block *)0xffff8800379e7400)->s_writers.frozen
$5 = 4
Or do this (I prefer the 1st one):
crash> super_block.s_writers ffff8800379e7400 | grep frozen
    frozen = 4, 
sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is
uninterruptible, but what does this value 4 mean? This is defined in
include/linux/fs.h ("cscope" can be used to examine kernel source):
/* Possible states of 'frozen' field */
enum {
    SB_UNFROZEN = 0,        /* FS is unfrozen */
    SB_FREEZE_WRITE = 1,        /* Writes, dir ops, ioctls frozen */
    SB_FREEZE_TRANS = 2,
    SB_FREEZE_PAGEFAULT = 2,    /* Page faults stopped as well */
    SB_FREEZE_FS = 3,       /* For internal FS use (e.g. to stop
                     * internal threads if needed) */
    SB_FREEZE_COMPLETE = 4,     /* ->freeze_fs finished successfully */  <<<<<<<
};
4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So,
now we can say that dd became uninterruptible because /boot filesystem was
frozen, this triggered hung_task_timeout_secs which was caught by
hung_task_panic and this caused the crash. 
Next step should be to check what exactly froze the filesystem, in this
particular case I did it using fsfreeze to collect this vmcore so I can
demonstrate a few things about vmcore analysis. 
In order to collect this demo vmcore I used a sysctl parameter named
hung_task_panic. This parameter when enabled will trigger a crash if a process
becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs
sysctl parameter controls this duration and is set to 120 seconds by default in
RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to
10 seconds:
# echo 1 > /proc/sys/kernel/hung_task_panic
# echo 10 > /proc/sys/kernel/hung_task_timeout_secs
I simply executed following dd command on /boot:
# dd if=/dev/zero of=/boot/test
And in the second terminal this:
# fsfreeze -f /boot
This made dd uninterruptible, hung_task_panic kicked in and kdump did its job
by saving a vmcore under /var/crash.
We can print these values from vmcore:
crash> p sysctl_hung_task_timeout_secs
sysctl_hung_task_timeout_secs = $1 = 10
crash> p sysctl_hung_task_panic
sysctl_hung_task_panic = $2 = 1
Please send an email to the kernel meetup list for any queries and if this
write-up needs any corrections. 
    
    
More information about the Kernel-meetup
mailing list