[Kernel-meetup] Kernel-meetup Digest, Vol 4, Issue 8

Wed Apr 19 07:30:24 PDT 2017

Hi,

Shall we create Kernel-meetup official blog, to update by some maintainer
after each meetup.

Thanks,
Amit G
On Apr 19, 2017 7:53 PM, <kernel-meetup-request at lists.reserved-bit.com>
wrote:

> Send Kernel-meetup mailing list submissions to
>         kernel-meetup at lists.reserved-bit.com
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-
> reserved-bit.com
>
> or, via email, send a message with subject or body 'help' to
>         kernel-meetup-request at lists.reserved-bit.com
>
> You can reach the person managing the list at
>         kernel-meetup-owner at lists.reserved-bit.com
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Kernel-meetup digest..."
>
>
> Today's Topics:
>
>    1. Blog posts and slides for past meetups (Siddhesh Poyarekar)
>    2. Re: Blog posts and slides for past meetups (Nitin Bansal)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 19 Apr 2017 11:29:59 +0530
> From: Siddhesh Poyarekar <sid at reserved-bit.com>
> To: Kernel Meetup Pune <kernel-meetup at lists.reserved-bit.com>
> Subject: [Kernel-meetup] Blog posts and slides for past meetups
> Message-ID: <264d86b2-6a56-4d41-9477-f68299adfc2b at reserved-bit.com>
> Content-Type: text/plain; charset=utf-8
>
> Hi,
>
> If you have written blog posts or uploaded slides of your sessions at
> the kernel meetup in the past months, please don't forget to add a link
> to it on the relevant wiki page.
>
> https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017
> https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017
>
> Siddhesh
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 19 Apr 2017 19:53:49 +0530
> From: Nitin Bansal <nb at randomsearch.org>
> To: Kernel Meetup Pune <kernel-meetup at lists.reserved-bit.com>
> Subject: Re: [Kernel-meetup] Blog posts and slides for past meetups
> Message-ID:
>         <CAPx4u31Z3vFFCqr1We2zopihTNOxWnPCuX2c9+9SVpSZ8taQ3g at mail.
> gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Siddhesh,
>
> On Wed, Apr 19, 2017 at 11:29 AM, Siddhesh Poyarekar
> <sid at reserved-bit.com> wrote:
> > Hi,
> >
> > If you have written blog posts or uploaded slides of your sessions at
> > the kernel meetup in the past months, please don't forget to add a link
> > to it on the relevant wiki page.
> >
> > https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017
> > https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017
> >
>
> I am attaching a file which contains write-up about vmcore analysis
> session that I presented in last meetup,
> I don't maintain any blog etc so don't really have any place to
> publish it, but feel free to post/attach it to
> reserved-bit wiki if you like.
> -------------- next part --------------
> I attended the Pune Kernel Meetup on 8th April and presented a session on
> Kernel crash dump (vmcore) analysis. In this write-up I have tried to
> explain a
> few things about the vmcore analysis, hope it helps.
>
> Before starting with the vmcore analysis, we need following three things:
>
> o crash utility [https://people.redhat.com/anderson/crash_whitepaper/]
> o vmlinux
> o vmcore
>
> vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux
> (RHEL) at least). One can either install kernel-debuginfo or simply extract
> vmlinux file using rpm2cpio:
>
> # yum install kernel-debuginfo-2.6.32-642.13.1.el6.x86_64
>
> Or
>
> # rpm2cpio kernel-debuginfo-2.6.32-642.13.1.el6.x86_64.rpm | cpio -idv
> ./usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
>
> vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from
> debuginfo
> package of same version is required.
>
> Now let's use "crash" utility to open this vmcore, which after printing a
> few
> details i.e. copyright information, GDB version information etc drops us to
> "crash>" prompt:
>
> # crash /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/vmlinux
> vmcore
>
> crash 7.1.0-6.el6
> ..
> GNU gdb (GDB) 7.6
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.
> html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
>
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/
> vmlinux
>     DUMPFILE: vmcore  [PARTIAL DUMP]
>         CPUS: 2
>         DATE: Fri Mar 31 11:15:47 2017
>       UPTIME: 00:06:01
> LOAD AVERAGE: 0.40, 0.30, 0.13
>        TASKS: 232
>     NODENAME: rhel6
>      RELEASE: 2.6.32-642.13.1.el6.x86_64
>      VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
>      MACHINE: x86_64  (2808 Mhz)
>       MEMORY: 2 GB
>        PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
>          PID: 46
>      COMMAND: "khungtaskd"
>         TASK: ffff880078d16ab0  [THREAD_INFO: ffff880078d20000]
>          CPU: 1
>        STATE: TASK_RUNNING (PANIC)
>
> crash>
>
>
> "sys" command can be used to print most of the information that crash
> prints
> when invoked e.g.:
>
> crash> sys
>       KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.13.1.el6.x86_64/
> vmlinux
>     DUMPFILE: vmcore  [PARTIAL DUMP]
>         CPUS: 2
>         DATE: Fri Mar 31 11:15:47 2017
>       UPTIME: 00:06:01
> LOAD AVERAGE: 0.40, 0.30, 0.13
>        TASKS: 232
>     NODENAME: rhel6
>      RELEASE: 2.6.32-642.13.1.el6.x86_64
>      VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016
>      MACHINE: x86_64  (2808 Mhz)
>       MEMORY: 2 GB
>        PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
>
>
> To list all the available crash commands, type "?" and hit enter:
>
> crash> ?
>
> *              files          mach           repeat         timer
> alias          foreach        mod            runq           tree
> ascii          fuser          mount          search         union
> bt             gdb            net            set            vm
> btop           help           p              sig            vtop
> dev            ipcs           ps             struct         waitq
> dis            irq            pte            swap           whatis
> eval           kmem           ptob           sym            wr
> exit           list           ptov           sys            q
> extend         log            rd             task
>
> Use "help" to get details about a command and parameters it accepts e.g.:
>
> crash>help sys
>
> Now, let's examine kernel ring buffer (that's where we find panic message)
> with
> the help of "log" command.
>
> And following is the panic message, which tells us that khungtaskd crashed
> the
> system:
>
> Call Trace:
>  <#DB[1]>  <<EOE>> Pid: 46, comm: khungtaskd Not tainted
> 2.6.32-642.13.1.el6.x86_64 #1
> Call Trace:
>  <NMI>  [<ffffffff81009c47>] ? show_regs+0x27/0x30
>  [<ffffffff8154d909>] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0
>  [<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80
>  [<ffffffff8154f20a>] ? atomic_notifier_call_chain+0x1a/0x20
>  [<ffffffff810acc8e>] ? notify_die+0x2e/0x30
>  [<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350
>  [<ffffffff8154c6e3>] ? nmi+0x83/0x90
>  [<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10
>  <<EOE>>  [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0
>  [<ffffffff81039b9a>] ? arch_trigger_all_cpu_backtrace+0x27a/0x2c0
>  [<ffffffff810f2740>] ? watchdog+0x250/0x2a0
>  [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
>  [<ffffffff810a640e>] ? kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
>  [<ffffffff810a6370>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> Kernel panic - not syncing: hung_task: blocked tasks
> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1
> Call Trace:
>  [<ffffffff81548301>] ? panic+0xa7/0x179
>  [<ffffffff81039aa5>] ? arch_trigger_all_cpu_backtrace+0x185/0x2c0
>  [<ffffffff810f274e>] ? watchdog+0x25e/0x2a0
>  [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0
>  [<ffffffff810a640e>] ? kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] ? child_rip+0xa/0x20
>  [<ffffffff810a6370>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
>
> khungtaskd is a kernel thread which checks for uninterruptible tasks, few
> relevant snips from kernel/hung_task.c:
>
> static int __init hung_task_init(void)
> {
>         atomic_notifier_chain_register(&panic_notifier_list,
> &panic_block);
>         watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");
>
>         return 0;
> }
>
> ..
>
> static int watchdog(void *dummy)
> {
>         set_user_nice(current, 0);
>
>         for ( ; ; ) {
>                 unsigned long timeout = sysctl_hung_task_timeout_secs;
>
>                 while (schedule_timeout_interruptible(timeout_jiffies(
> timeout)))
>                         timeout = sysctl_hung_task_timeout_secs;
>
>                 check_hung_uninterruptible_tasks(timeout);
>         }
>
>         return 0;
> }
>
> By scrolling back a bit in ring buffer a hung_task_timeout_secs warning
> can be
> seen along with the following backtrace:
>
> 1 > INFO: task dd:7224 blocked for more than 10 seconds.
> 2 >      Not tainted 2.6.32-642.13.1.el6.x86_64 #1
> 3 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> 4> dd            D 0000000000000001     0  7224   6702 0x00000080
>  ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000
>  0000000000000000 0000000000000000 0000000000000000 0000000000000000
>  0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8
> 5> Call Trace:
>  [<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80
>  [<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120
>  [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff811bcc14>] mnt_want_write+0x24/0x50
>  [<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190
>  [<ffffffff811aca99>] do_filp_open+0x2b9/0xd20
>  [<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90
>  [<ffffffff811ba072>] ? alloc_fd+0x92/0x160
>  [<ffffffff811969f7>] do_sys_open+0x67/0x130
>  [<ffffffff81196b00>] sys_open+0x20/0x30
>  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
>
> Let's look at it line by line:
>
> Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.
>
> Line 2 tells us the kernel version and that it is not tainted (i.e. no
> third
> party modules are loaded etc.).
>
> Line 3 is just telling us how to disable this message, not a great idea
> IMO.
>
> Line 4 contains command name, state, PID etc.
>
> Line 5 shows the associated call trace which suggests that dd became
> uninterruptible while trying to write to a filesystem.
>
> Process details can be checked using "ps":
>
> crash> ps -m | grep UN
> [0 00:00:14.117] [UN]  PID: 7224   TASK: ffff880037b70ab0  CPU: 1
>  COMMAND: "dd"
> ^^^^^^^^^^^^^^^^
>      |_____This is the duration of dd in uninterruptible state (14.117
> seconds)
>
> -m switch displays timestamp in days, hours, minutes, seconds and
> milliseconds.
>
> Next step is to examine dd process, in order to do that, we first need to
> change the process context (currently it is set to khungtaskd and needs to
> be
> switched to dd), which can be done using "set":
>
> crash> set 7224
>     PID: 7224
> COMMAND: "dd"
>    TASK: ffff880037b70ab0  [THREAD_INFO: ffff88007c470000]
>     CPU: 1
>   STATE: TASK_UNINTERRUPTIBLE
>
> Let's print backtrace along with kernel source line numbers (source line
> numbers probably wouldn't be displayed if kernel-debuginfo is not
> installed):
>
> crash> bt -l
> PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"
>  #0 [ffff88007c473c18] schedule at ffffffff815489d0
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/kernel/sched.c: 3119
>  #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c:
> 677
>  #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/fs/namespace.c: 300
>  #3 [ffff88007c473db0] do_filp_open at ffffffff811aca99
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/namei.c:
> 2267
>  #4 [ffff88007c473f20] do_sys_open at ffffffff811969f7
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c:
> 908
>  #5 [ffff88007c473f70] sys_open at ffffffff81196b00
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/open.c:
> 932
>  #6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2
>     /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/arch/x86/kernel/entry_64.S: 492
>     RIP: 000000358a8db480  RSP: 00007ffcd7d37ca8  RFLAGS: 00010246
>     RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000000
>     RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 00007ffcd7d3870b
>     RBP: 0000000000000240   R8: 000000358ab8ee40   R9: 00007ffaa33c368c
>     R10: 1999999999999999  R11: 0000000000000246  R12: ffffffff81196b00
>     R13: ffff88007c473f78  R14: ffffffffffffffff  R15: 0000000000000064
>     ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
>
> This backtrace needs to be read from bottom to top, following chain shows
> how
> these functions were called:
>
>  921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int,
> mode)
>  922 {
>  923     long ret;
>  924
>  925     if (force_o_largefile())
>  926         flags |= O_LARGEFILE;
>  927
>  928     ret = do_sys_open(AT_FDCWD, filename, flags,
> mode);--------------------
>
>       |
>                 ------------------------------
> ----------------------------------
>                 |
>  899 long do_sys_open(int dfd, const char __user *filename, int flags, int
> mode)
>  900 {
>  901     struct filename *tmp = getname(filename);
>  902     int fd = PTR_ERR(tmp);
>  903
>  904     if (!IS_ERR(tmp)) {
>  905         fd = get_unused_fd_flags(flags);
>  906         if (fd >= 0) {
>  907             struct file *f = do_filp_open(dfd, tmp, flags, mode, 0);
> ------
>
>       |
>                        ------------------------------
> ---------------------------
>                       |
> 2166 struct file *do_filp_open(int dfd, struct filename *filename,
> 2167         int open_flag, int mode, int acc_mode)
> 2168 {
> 2169     struct file *filp;
> ..
> ..
> 2256     /*
> 2257      * This write is needed to ensure that a
> 2258      * ro->rw transition does not occur between
> 2259      * the time when the file is created and when
> 2260      * a permanent write count is taken through
> 2261      * the 'struct file' in nameidata_to_filp().
> 2262      */
> 2263     error = mnt_want_write(nd.path.mnt);
> ----------------------------------
>
>       |
>                ------------------------------
> -----------------------------------
>               |
>  295 int mnt_want_write(struct vfsmount *m)
>  296 {
>  297     int ret;
>  298
>  299     sb_start_write(m->mnt_sb);  ------------------------------
> -------------
>
>       |
>                                ------------------------------
> -------------------
>                               |
> 1569 static inline void sb_start_write(struct super_block *sb)
> 1570 {
> 1571     __sb_start_write(sb, SB_FREEZE_WRITE, true);
> --------------------------
>
>       |
>                 ------------------------------
> ----------------------------------
>                |
> 668 int __sb_start_write(struct super_block *sb, int level, bool wait)
> 669 {
> 670     /* Out of tree modules don't use this mechanism */
> 671     if (unlikely(!sb_has_new_freeze(sb)))
> 672         return 1;
> 673 retry:
> 674     if (unlikely(sb->s_writers.frozen >= level)) {
> 675         if (!wait)
> 676             return 0;
> 677         wait_event(sb->s_writers.wait_unfrozen,  <<<<<<<<<<<<<<< dd
> was here
> 678                sb->s_writers.frozen < level);
> 679     }
>
>
> In short, mnt_want_write() called sb_start_write() which then called
> __sb_start_write(), and dd was waiting at line 677 in __sb_start_write()
> when
> this vmcore was collected.
>
> From this code it is clear that dd was trying to write to a filesystem but
> somehow it couldn't and became uninterruptible, now we need to find out
> which
> filesystem it was and why dd couldn't write to it. Can we do it? Yes,
> absolutely, if we can determine the first argument i.e. super_block that is
> passed to __sb_start_write(), it will tell us about the
> filesystem/mount-point.
>
> If we go back to backtrace, we can see that mnt_want_write() called
> __sb_start_write() and passed it three parameters, struct super_block (sb),
> level (SB_FREEZE_WRITE) and a Boolean called wait (true):
>
>  #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc
>  #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14
>
> Note: "whatis command can be used to check the function prototype (simply
> entering the function name or px also does the job):
>
> crash> whatis __sb_start_write
> int __sb_start_write(struct super_block *, int, bool);
>
> Before we go any further, it is important to know how arguments are passed
> to
> the called function. First argument goes to %rdi, second to %rsi, third to
> %rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed
> with this knowledge now we can turn our attention to disassembly. Let's
> disassemble mnt_start_write to find out arguments passed to
> __sb_start_write:
>
> crash> dis -lr ffffffff811bcc14
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/fs/namespace.c: 296
> 0xffffffff811bcbf0 <mnt_want_write>:    push   %rbp
> 0xffffffff811bcbf1 <mnt_want_write+1>:  mov    %rsp,%rbp
> 0xffffffff811bcbf4 <mnt_want_write+4>:  push   %rbx
> 0xffffffff811bcbf5 <mnt_want_write+5>:  sub    $0x18,%rsp
> 0xffffffff811bcbf9 <mnt_want_write+9>:  nopl   0x0(%rax,%rax,1)
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/include/linux/fs.h: 1571
> 0xffffffff811bcbfe <mnt_want_write+14>: mov    $0x1,%edx
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 3rd argument i.e. wait
> 0xffffffff811bcc03 <mnt_want_write+19>: mov    $0x1,%esi
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 2nd argument i.e. SB_FREEZE_WRITE
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/fs/namespace.c: 296
> 0xffffffff811bcc08 <mnt_want_write+24>: mov    %rdi,%rbx
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/include/linux/fs.h: 1571
> 0xffffffff811bcc0b <mnt_want_write+27>: mov    0x28(%rdi),%rdi
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)
> 0xffffffff811bcc0f <mnt_want_write+31>: callq  0xffffffff8119bd20
> <__sb_start_write>
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/fs/namespace.c: 300
> 0xffffffff811bcc14 <mnt_want_write+36>: mov    %rbx,%rdi
>
> The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are
> pretty straightforward, both are set to 1.
>
> But to determine super_block which is %rdi we have to back-peddle a little
> bit,
> 28th offset of %rdi was copied to %rdi:
>
> mov    0x28(%rdi),%rdi
>
> we need to determine what's inside %rdi and one line above that tells us
> that
> %rdi was copied to %rbx.
>
> mov    %rdi,%rbx
>
> If we can determine value inside %rbx, it will give us %rdi and then 28th
> offset of %rdi can be calculated.
>
> To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved
> somewhere:
>
> crash> dis -lr ffffffff8119bdfc | head
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.13.1.el6.x86_64/fs/super.c:
> 669
> 0xffffffff8119bd20 <__sb_start_write>:  push   %rbp
> 0xffffffff8119bd21 <__sb_start_write+1>:        mov    %rsp,%rbp
> 0xffffffff8119bd24 <__sb_start_write+4>:        add
> $0xffffffffffffff80,%rsp
> 0xffffffff8119bd28 <__sb_start_write+8>:        mov    %rbx,-0x28(%rbp)
>  <<<<<<< rbx
> 0xffffffff8119bd2c <__sb_start_write+12>:       mov    %r12,-0x20(%rbp)
> 0xffffffff8119bd30 <__sb_start_write+16>:       mov    %r13,-0x18(%rbp)
> 0xffffffff8119bd34 <__sb_start_write+20>:       mov    %r14,-0x10(%rbp)
> 0xffffffff8119bd38 <__sb_start_write+24>:       mov    %r15,-0x8(%rbp)
> 0xffffffff8119bd3c <__sb_start_write+28>:       nopl   0x0(%rax,%rax,1)
>
> %rbx is copied to -28th offset of %rbp, and %rbp can be found either in the
> stack data of __sb_start_write or schedule, I am going to use stack data of
> schedule for now to make it simpler. Disassembly of schedule (next function
> after __sb_start_write) suggests that %rbp is pushed to the stack:
>
> crash> dis -lr ffffffff815489d0 | head
> /usr/src/debug/kernel-2.6.32-642.13.1.el6/linux-2.6.32-642.
> 13.1.el6.x86_64/kernel/sched.c: 6133
> 0xffffffff81548630 <schedule>:  push   %rbp  <<<<<<<<<<<<<<<<<<< 1st push
> 0xffffffff81548631 <schedule+1>:        mov    %rsp,%rbp
> 0xffffffff81548634 <schedule+4>:        push   %r15
> 0xffffffff81548636 <schedule+6>:        push   %r14
> 0xffffffff81548638 <schedule+8>:        push   %r13
> 0xffffffff8154863a <schedule+10>:       push   %r12
> 0xffffffff8154863c <schedule+12>:       push   %rbx
> 0xffffffff8154863d <schedule+13>:       sub    $0x98,%rsp
> 0xffffffff81548644 <schedule+20>:       nopl   0x0(%rax,%rax,1)
>
> first push at stack will give us %rbp, and bt -f can be used to print
> stack:
>
> crash> bt -f | grep -A15 schedule
>  #0 [ffff88007c473c18] schedule at ffffffff815489d0
>     ffff88007c473c20: 0000000000000082 0000000000000000
>     ffff88007c473c30: 0000000000000000 0000000000000000
>     ffff88007c473c40: 0000000000000000 0000000000000000
>     ffff88007c473c50: 0000000000000000 0000000000000000
>     ffff88007c473c60: 0000000000000000 ffff880037b71068
>     ffff88007c473c70: ffff88007c473fd8 0000000000010068
>     ffff88007c473c80: ffff880037b71068 ffff88007e527520
>     ffff88007c473c90: ffff880037b70ab0 0000000000000002
>     ffff88007c473ca0: 0000000000000246 ffff88007c473ce8
>     ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0
>     ffff88007c473cc0: ffff8800379e7400 0000000000000001
>     ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8
>     ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<<
> ffff88007c473d78 is first push == %rbp
>     ffff88007c473cf0: ffffffff8119bdfc
>
> Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:
>
> crash> eval (0xffff88007c473d78 - 0x28)
> hexadecimal: ffff88007c473d50
>     decimal: 18446612134399262032  (-131939310289584)
>       octal: 1777774200017421636520
>      binary: 111111111111111110001000000000000111110001000111001111010101
> 0000
> crash> rd ffff88007c473d50
> ffff88007c473d50:  ffff88007ef64380                    .C.~....
>
> %rbx == ffff88007ef64380
>
> Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we
> calculate 28th offset to %rdi we will get super_block:
>
> crash> eval (0xffff88007ef64380 + 0x28)
> hexadecimal: ffff88007ef643a8
>     decimal: 18446612134444286888  (-131939265264728)
>       octal: 1777774200017675441650
>      binary: 111111111111111110001000000000000111111011110110010000111010
> 1000
> crash> rd ffff88007ef643a8
> ffff88007ef643a8:  ffff8800379e7400                    .t.7....
>
> super_block == ffff8800379e7400.
>
> super_block can be passed to the "mount" command to get the mount point:
>
> crash> mount ffff8800379e7400
>     VFSMOUNT         SUPERBLK     TYPE   DEVNAME   DIRNAME
> ffff88007ef64380 ffff8800379e7400 ext4   /dev/vda1 /boot
>
> This tells us that that dd was trying to write to /boot. Next part of our
> job
> is to find out why couldn't it write to /boot. In order to answer that,
> let's
> take a step back and revisit the kernel code at line 677 (where dd was):
>
> 677         wait_event(sb->s_writers.wait_unfrozen,
> 678                sb->s_writers.frozen < level);
>
> Following wait_event() definition tells us that dd was put to sleep
> (uninterruptible) and will be woken up when "sb->s_writers.frozen < level"
> condition becomes true:
>
> /**
>  * wait_event - sleep until a condition gets true
>  * @wq: the waitqueue to wait on
>  * @condition: a C expression for the event to wait for
>  *
>  * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
>  * @condition evaluates to true. The @condition is checked each time
>  * the waitqueue @wq is woken up.
>  *
>  * wake_up() has to be called after changing any variable that could
>  * change the result of the wait condition.
>  */
> #define wait_event(wq, condition)                   \
> do {                                    \
>     if (condition)                          \
>         break;                          \
>     __wait_event(wq, condition);                    \
> } while (0)
>
> We know that "level" is 1 (second argument SB_FREEZE_WRITE), and
> sb->s_writers.frozen needs to be less than 1 for this condition to become
> true.
> Let's see what exactly is the current value of sb->s_writers.frozen, since
> we
> already have sb, it is pretty straightforward to obtain this value:
>
> crash> pd ((struct super_block *)0xffff8800379e7400)->s_writers.frozen
> $5 = 4
>
> Or do this (I prefer the 1st one):
>
> crash> super_block.s_writers ffff8800379e7400 | grep frozen
>     frozen = 4,
>
> sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why
> dd is
> uninterruptible, but what does this value 4 mean? This is defined in
> include/linux/fs.h ("cscope" can be used to examine kernel source):
>
> /* Possible states of 'frozen' field */
> enum {
>     SB_UNFROZEN = 0,        /* FS is unfrozen */
>     SB_FREEZE_WRITE = 1,        /* Writes, dir ops, ioctls frozen */
>     SB_FREEZE_TRANS = 2,
>     SB_FREEZE_PAGEFAULT = 2,    /* Page faults stopped as well */
>     SB_FREEZE_FS = 3,       /* For internal FS use (e.g. to stop
>                      * internal threads if needed) */
>     SB_FREEZE_COMPLETE = 4,     /* ->freeze_fs finished successfully */
> <<<<<<<
> };
>
> 4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen.
> So,
> now we can say that dd became uninterruptible because /boot filesystem was
> frozen, this triggered hung_task_timeout_secs which was caught by
> hung_task_panic and this caused the crash.
>
> Next step should be to check what exactly froze the filesystem, in this
> particular case I did it using fsfreeze to collect this vmcore so I can
> demonstrate a few things about vmcore analysis.
>
> In order to collect this demo vmcore I used a sysctl parameter named
> hung_task_panic. This parameter when enabled will trigger a crash if a
> process
> becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs
> sysctl parameter controls this duration and is set to 120 seconds by
> default in
> RHEL 6/7. After enabling hung_task_panic and reducing
> hung_task_timeout_secs to
> 10 seconds:
>
> # echo 1 > /proc/sys/kernel/hung_task_panic
> # echo 10 > /proc/sys/kernel/hung_task_timeout_secs
>
> I simply executed following dd command on /boot:
>
> # dd if=/dev/zero of=/boot/test
>
> And in the second terminal this:
>
> # fsfreeze -f /boot
>
> This made dd uninterruptible, hung_task_panic kicked in and kdump did its
> job
> by saving a vmcore under /var/crash.
>
> We can print these values from vmcore:
>
> crash> p sysctl_hung_task_timeout_secs
> sysctl_hung_task_timeout_secs = $1 = 10
>
> crash> p sysctl_hung_task_panic
> sysctl_hung_task_panic = $2 = 1
>
> Please send an email to the kernel meetup list for any queries and if this
> write-up needs any corrections.
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Kernel-meetup mailing list
> Kernel-meetup at lists.reserved-bit.com
> http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-reserved-bit.com
>
>
> ------------------------------
>
> End of Kernel-meetup Digest, Vol 4, Issue 8
> *******************************************
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.reserved-bit.com/pipermail/kernel-meetup-reserved-bit.com/attachments/20170419/116c2848/attachment-0001.htm>