<p dir="ltr">Hi,</p>

<p dir="ltr">Shall we create Kernel-meetup official blog, to update by some maintainer after each meetup. <br></p>

<p dir="ltr">Thanks,<br>

Amit G</p>

<div class="gmail_quote">On Apr 19, 2017 7:53 PM,  <<a href="mailto:kernel-meetup-request@lists.reserved-bit.com">kernel-meetup-request@lists.reserved-bit.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Kernel-meetup mailing list submissions to<br>

        <a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-reserved-bit.com" rel="noreferrer" target="_blank">http://lists.reserved-bit.com/<wbr>listinfo.cgi/kernel-meetup-<wbr>reserved-bit.com</a><br>

<br>

or, via email, send a message with subject or body 'help' to<br>

        <a href="mailto:kernel-meetup-request@lists.reserved-bit.com">kernel-meetup-request@lists.<wbr>reserved-bit.com</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:kernel-meetup-owner@lists.reserved-bit.com">kernel-meetup-owner@lists.<wbr>reserved-bit.com</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than "Re: Contents of Kernel-meetup digest..."<br>

<br>

<br>

Today's Topics:<br>

<br>

   1. Blog posts and slides for past meetups (Siddhesh Poyarekar)<br>

   2. Re: Blog posts and slides for past meetups (Nitin Bansal)<br>

<br>

<br>

------------------------------<wbr>------------------------------<wbr>----------<br>

<br>

Message: 1<br>

Date: Wed, 19 Apr 2017 11:29:59 +0530<br>

From: Siddhesh Poyarekar <<a href="mailto:sid@reserved-bit.com">sid@reserved-bit.com</a>><br>

To: Kernel Meetup Pune <<a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a>><br>

Subject: [Kernel-meetup] Blog posts and slides for past meetups<br>

Message-ID: <<a href="mailto:264d86b2-6a56-4d41-9477-f68299adfc2b@reserved-bit.com">264d86b2-6a56-4d41-9477-<wbr>f68299adfc2b@reserved-bit.com</a>><br>

Content-Type: text/plain; charset=utf-8<br>

<br>

Hi,<br>

<br>

If you have written blog posts or uploaded slides of your sessions at<br>

the kernel meetup in the past months, please don't forget to add a link<br>

to it on the relevant wiki page.<br>

<br>

<a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_February_-<wbr>_2017</a><br>

<a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_April_-_<wbr>2017</a><br>

<br>

Siddhesh<br>

<br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Wed, 19 Apr 2017 19:53:49 +0530<br>

From: Nitin Bansal <<a href="mailto:nb@randomsearch.org">nb@randomsearch.org</a>><br>

To: Kernel Meetup Pune <<a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a>><br>

Subject: Re: [Kernel-meetup] Blog posts and slides for past meetups<br>

Message-ID:<br>

        <<a href="mailto:CAPx4u31Z3vFFCqr1We2zopihTNOxWnPCuX2c9%2B9SVpSZ8taQ3g@mail.gmail.com">CAPx4u31Z3vFFCqr1We2zopihTNOx<wbr>WnPCuX2c9+9SVpSZ8taQ3g@mail.<wbr>gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Hi Siddhesh,<br>

<br>

On Wed, Apr 19, 2017 at 11:29 AM, Siddhesh Poyarekar<br>

<<a href="mailto:sid@reserved-bit.com">sid@reserved-bit.com</a>> wrote:<br>

> Hi,<br>

><br>

> If you have written blog posts or uploaded slides of your sessions at<br>

> the kernel meetup in the past months, please don't forget to add a link<br>

> to it on the relevant wiki page.<br>

><br>

> <a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_February_-<wbr>_2017</a><br>

> <a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_April_-_<wbr>2017</a><br>

><br>

<br>

I am attaching a file which contains write-up about vmcore analysis<br>

session that I presented in last meetup,<br>

I don't maintain any blog etc so don't really have any place to<br>

publish it, but feel free to post/attach it to<br>

reserved-bit wiki if you like.<br>

-------------- next part --------------<br>

I attended the Pune Kernel Meetup on 8th April and presented a session on<br>

Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a<br>

few things about the vmcore analysis, hope it helps.<br>

<br>

Before starting with the vmcore analysis, we need following three things:<br>

<br>

o crash utility [<a href="https://people.redhat.com/anderson/crash_whitepaper/" rel="noreferrer" target="_blank">https://people.redhat.com/<wbr>anderson/crash_whitepaper/</a>]<br>

o vmlinux<br>

o vmcore<br>

<br>

vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux<br>

(RHEL) at least). One can either install kernel-debuginfo or simply extract<br>

vmlinux file using rpm2cpio:<br>

<br>

# yum install kernel-debuginfo-2.6.32-642.<wbr>13.1.el6.x86_64<br>

<br>

Or<br>

<br>

# rpm2cpio kernel-debuginfo-2.6.32-642.<wbr>13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>

<br>

vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo<br>

package of same version is required.<br>

<br>

Now let's use "crash" utility to open this vmcore, which after printing a few<br>

details i.e. copyright information, GDB version information etc drops us to<br>

"crash>" prompt:<br>

<br>

# crash /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux  vmcore<br>

<br>

crash 7.1.0-6.el6<br>

..<br>

GNU gdb (GDB) 7.6<br>

Copyright (C) 2013 Free Software Foundation, Inc.<br>

License GPLv3+: GNU GPL version 3 or later <<a href="http://gnu.org/licenses/gpl.html" rel="noreferrer" target="_blank">http://gnu.org/licenses/gpl.<wbr>html</a>><br>

This is free software: you are free to change and redistribute it.<br>

There is NO WARRANTY, to the extent permitted by law.  Type "show copying"<br>

and "show warranty" for details.<br>

This GDB was configured as "x86_64-unknown-linux-gnu"...<br>

<br>

      KERNEL: /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>

    DUMPFILE: vmcore  [PARTIAL DUMP]<br>

        CPUS: 2<br>

        DATE: Fri Mar 31 11:15:47 2017<br>

      UPTIME: 00:06:01<br>

LOAD AVERAGE: 0.40, 0.30, 0.13<br>

       TASKS: 232<br>

    NODENAME: rhel6<br>

     RELEASE: 2.6.32-642.13.1.el6.x86_64<br>

     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016<br>

     MACHINE: x86_64  (2808 Mhz)<br>

      MEMORY: 2 GB<br>

       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"<br>

         PID: 46<br>

     COMMAND: "khungtaskd"<br>

        TASK: ffff880078d16ab0  [THREAD_INFO: ffff880078d20000]<br>

         CPU: 1<br>

       STATE: TASK_RUNNING (PANIC)<br>

<br>

crash><br>

<br>

<br>

"sys" command can be used to print most of the information that crash prints<br>

when invoked e.g.:<br>

<br>

crash> sys<br>

      KERNEL: /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>

    DUMPFILE: vmcore  [PARTIAL DUMP]<br>

        CPUS: 2<br>

        DATE: Fri Mar 31 11:15:47 2017<br>

      UPTIME: 00:06:01<br>

LOAD AVERAGE: 0.40, 0.30, 0.13<br>

       TASKS: 232<br>

    NODENAME: rhel6<br>

     RELEASE: 2.6.32-642.13.1.el6.x86_64<br>

     VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016<br>

     MACHINE: x86_64  (2808 Mhz)<br>

      MEMORY: 2 GB<br>

       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"<br>

<br>

<br>

To list all the available crash commands, type "?" and hit enter:<br>

<br>

crash> ?<br>

<br>

*              files          mach           repeat         timer<br>

alias          foreach        mod            runq           tree<br>

ascii          fuser          mount          search         union<br>

bt             gdb            net            set            vm<br>

btop           help           p              sig            vtop<br>

dev            ipcs           ps             struct         waitq<br>

dis            irq            pte            swap           whatis<br>

eval           kmem           ptob           sym            wr<br>

exit           list           ptov           sys            q<br>

extend         log            rd             task<br>

<br>

Use "help" to get details about a command and parameters it accepts e.g.:<br>

<br>

crash>help sys<br>

<br>

Now, let's examine kernel ring buffer (that's where we find panic message) with<br>

the help of "log" command.<br>

<br>

And following is the panic message, which tells us that khungtaskd crashed the<br>

system:<br>

<br>

Call Trace:<br>

 <#DB[1]>  <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>

Call Trace:<br>

 <NMI>  [<ffffffff81009c47>] ? show_regs+0x27/0x30<br>

 [<ffffffff8154d909>] ? arch_trigger_all_cpu_<wbr>backtrace_handler+0x99/0xc0<br>

 [<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80<br>

 [<ffffffff8154f20a>] ? atomic_notifier_call_chain+<wbr>0x1a/0x20<br>

 [<ffffffff810acc8e>] ? notify_die+0x2e/0x30<br>

 [<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350<br>

 [<ffffffff8154c6e3>] ? nmi+0x83/0x90<br>

 [<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10<br>

 <<EOE>>  [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0<br>

 [<ffffffff81039b9a>] ? arch_trigger_all_cpu_<wbr>backtrace+0x27a/0x2c0<br>

 [<ffffffff810f2740>] ? watchdog+0x250/0x2a0<br>

 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0<br>

 [<ffffffff810a640e>] ? kthread+0x9e/0xc0<br>

 [<ffffffff8100c28a>] ? child_rip+0xa/0x20<br>

 [<ffffffff810a6370>] ? kthread+0x0/0xc0<br>

 [<ffffffff8100c280>] ? child_rip+0x0/0x20<br>

Kernel panic - not syncing: hung_task: blocked tasks<br>

Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>

Call Trace:<br>

 [<ffffffff81548301>] ? panic+0xa7/0x179<br>

 [<ffffffff81039aa5>] ? arch_trigger_all_cpu_<wbr>backtrace+0x185/0x2c0<br>

 [<ffffffff810f274e>] ? watchdog+0x25e/0x2a0<br>

 [<ffffffff810f24f0>] ? watchdog+0x0/0x2a0<br>

 [<ffffffff810a640e>] ? kthread+0x9e/0xc0<br>

 [<ffffffff8100c28a>] ? child_rip+0xa/0x20<br>

 [<ffffffff810a6370>] ? kthread+0x0/0xc0<br>

 [<ffffffff8100c280>] ? child_rip+0x0/0x20<br>

<br>

khungtaskd is a kernel thread which checks for uninterruptible tasks, few<br>

relevant snips from kernel/hung_task.c:<br>

<br>

static int __init hung_task_init(void)<br>

{<br>

        atomic_notifier_chain_<wbr>register(&panic_notifier_list, &panic_block);<br>

        watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");<br>

<br>

        return 0;<br>

}<br>

<br>

..<br>

<br>

static int watchdog(void *dummy)<br>

{<br>

        set_user_nice(current, 0);<br>

<br>

        for ( ; ; ) {<br>

                unsigned long timeout = sysctl_hung_task_timeout_secs;<br>

<br>

                while (schedule_timeout_<wbr>interruptible(timeout_jiffies(<wbr>timeout)))<br>

                        timeout = sysctl_hung_task_timeout_secs;<br>

<br>

                check_hung_uninterruptible_<wbr>tasks(timeout);<br>

        }<br>

<br>

        return 0;<br>

}<br>

<br>

By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be<br>

seen along with the following backtrace:<br>

<br>

1 > INFO: task dd:7224 blocked for more than 10 seconds.<br>

2 >      Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>

3 > "echo 0 > /proc/sys/kernel/hung_task_<wbr>timeout_secs" disables this message.<br>

4> dd            D 0000000000000001     0  7224   6702 0x00000080<br>

 ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000<br>

 0000000000000000 0000000000000000 0000000000000000 0000000000000000<br>

 0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8<br>

5> Call Trace:<br>

 [<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80<br>

 [<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120<br>

 [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/<wbr>0x40<br>

 [<ffffffff811bcc14>] mnt_want_write+0x24/0x50<br>

 [<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190<br>

 [<ffffffff811aca99>] do_filp_open+0x2b9/0xd20<br>

 [<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90<br>

 [<ffffffff811ba072>] ? alloc_fd+0x92/0x160<br>

 [<ffffffff811969f7>] do_sys_open+0x67/0x130<br>

 [<ffffffff81196b00>] sys_open+0x20/0x30<br>

 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b<br>

<br>

Let's look at it line by line:<br>

<br>

Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.<br>

<br>

Line 2 tells us the kernel version and that it is not tainted (i.e. no third<br>

party modules are loaded etc.).<br>

<br>

Line 3 is just telling us how to disable this message, not a great idea IMO.<br>

<br>

Line 4 contains command name, state, PID etc.<br>

<br>

Line 5 shows the associated call trace which suggests that dd became<br>

uninterruptible while trying to write to a filesystem.<br>

<br>

Process details can be checked using "ps":<br>

<br>

crash> ps -m | grep UN<br>

[0 00:00:14.117] [UN]  PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"<br>

^^^^^^^^^^^^^^^^<br>

     |_____This is the duration of dd in uninterruptible state (14.117 seconds)<br>

<br>

-m switch displays timestamp in days, hours, minutes, seconds and milliseconds.<br>

<br>

Next step is to examine dd process, in order to do that, we first need to<br>

change the process context (currently it is set to khungtaskd and needs to be<br>

switched to dd), which can be done using "set":<br>

<br>

crash> set 7224<br>

    PID: 7224<br>

COMMAND: "dd"<br>

   TASK: ffff880037b70ab0  [THREAD_INFO: ffff88007c470000]<br>

    CPU: 1<br>

  STATE: TASK_UNINTERRUPTIBLE<br>

<br>

Let's print backtrace along with kernel source line numbers (source line<br>

numbers probably wouldn't be displayed if kernel-debuginfo is not installed):<br>

<br>

crash> bt -l<br>

PID: 7224   TASK: ffff880037b70ab0  CPU: 1   COMMAND: "dd"<br>

 #0 [ffff88007c473c18] schedule at ffffffff815489d0<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/kernel/sched.<wbr>c: 3119<br>

 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/super.c: 677<br>

 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 300<br>

 #3 [ffff88007c473db0] do_filp_open at ffffffff811aca99<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namei.c: 2267<br>

 #4 [ffff88007c473f20] do_sys_open at ffffffff811969f7<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/open.c: 908<br>

 #5 [ffff88007c473f70] sys_open at ffffffff81196b00<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/open.c: 932<br>

 #6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2<br>

    /usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/arch/x86/<wbr>kernel/entry_64.S: 492<br>

    RIP: 000000358a8db480  RSP: 00007ffcd7d37ca8  RFLAGS: 00010246<br>

    RAX: 0000000000000002  RBX: ffffffff8100b0d2  RCX: 0000000000000000<br>

    RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 00007ffcd7d3870b<br>

    RBP: 0000000000000240   R8: 000000358ab8ee40   R9: 00007ffaa33c368c<br>

    R10: 1999999999999999  R11: 0000000000000246  R12: ffffffff81196b00<br>

    R13: ffff88007c473f78  R14: ffffffffffffffff  R15: 0000000000000064<br>

    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b<br>

<br>

This backtrace needs to be read from bottom to top, following chain shows how<br>

these functions were called:<br>

<br>

 921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)<br>

 922 {<br>

 923     long ret;<br>

 924<br>

 925     if (force_o_largefile())<br>

 926         flags |= O_LARGEFILE;<br>

 927<br>

 928     ret = do_sys_open(AT_FDCWD, filename, flags, mode);--------------------<br>

                                                                                |<br>

                ------------------------------<wbr>------------------------------<wbr>----<br>

                |<br>

 899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode)<br>

 900 {<br>

 901     struct filename *tmp = getname(filename);<br>

 902     int fd = PTR_ERR(tmp);<br>

 903<br>

 904     if (!IS_ERR(tmp)) {<br>

 905         fd = get_unused_fd_flags(flags);<br>

 906         if (fd >= 0) {<br>

 907             struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------<br>

                                                                                |<br>

                       ------------------------------<wbr>---------------------------<br>

                      |<br>

2166 struct file *do_filp_open(int dfd, struct filename *filename,<br>

2167         int open_flag, int mode, int acc_mode)<br>

2168 {<br>

2169     struct file *filp;<br>

..<br>

..<br>

2256     /*<br>

2257      * This write is needed to ensure that a<br>

2258      * ro->rw transition does not occur between<br>

2259      * the time when the file is created and when<br>

2260      * a permanent write count is taken through<br>

2261      * the 'struct file' in nameidata_to_filp().<br>

2262      */<br>

2263     error = mnt_want_write(nd.path.mnt); ------------------------------<wbr>----<br>

                                                                                |<br>

               ------------------------------<wbr>------------------------------<wbr>-----<br>

              |<br>

 295 int mnt_want_write(struct vfsmount *m)<br>

 296 {<br>

 297     int ret;<br>

 298<br>

 299     sb_start_write(m->mnt_sb);  ------------------------------<wbr>-------------<br>

                                                                                |<br>

                               ------------------------------<wbr>-------------------<br>

                              |<br>

1569 static inline void sb_start_write(struct super_block *sb)<br>

1570 {<br>

1571     __sb_start_write(sb, SB_FREEZE_WRITE, true); --------------------------<br>

                                                                                |<br>

                ------------------------------<wbr>------------------------------<wbr>----<br>

               |<br>

668 int __sb_start_write(struct super_block *sb, int level, bool wait)<br>

669 {<br>

670     /* Out of tree modules don't use this mechanism */<br>

671     if (unlikely(!sb_has_new_freeze(<wbr>sb)))<br>

672         return 1;<br>

673 retry:<br>

674     if (unlikely(sb->s_writers.frozen >= level)) {<br>

675         if (!wait)<br>

676             return 0;<br>

677         wait_event(sb->s_writers.wait_<wbr>unfrozen,  <<<<<<<<<<<<<<< dd was here<br>

678                sb->s_writers.frozen < level);<br>

679     }<br>

<br>

<br>

In short, mnt_want_write() called sb_start_write() which then called<br>

__sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when<br>

this vmcore was collected.<br>

<br>

>From this code it is clear that dd was trying to write to a filesystem but<br>

somehow it couldn't and became uninterruptible, now we need to find out which<br>

filesystem it was and why dd couldn't write to it. Can we do it? Yes,<br>

absolutely, if we can determine the first argument i.e. super_block that is<br>

passed to __sb_start_write(), it will tell us about the filesystem/mount-point.<br>

<br>

If we go back to backtrace, we can see that mnt_want_write() called<br>

__sb_start_write() and passed it three parameters, struct super_block (sb),<br>

level (SB_FREEZE_WRITE) and a Boolean called wait (true):<br>

<br>

 #1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc<br>

 #2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14<br>

<br>

Note: "whatis command can be used to check the function prototype (simply<br>

entering the function name or px also does the job):<br>

<br>

crash> whatis __sb_start_write<br>

int __sb_start_write(struct super_block *, int, bool);<br>

<br>

Before we go any further, it is important to know how arguments are passed to<br>

the called function. First argument goes to %rdi, second to %rsi, third to<br>

%rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed<br>

with this knowledge now we can turn our attention to disassembly. Let's<br>

disassemble mnt_start_write to find out arguments passed to __sb_start_write:<br>

<br>

crash> dis -lr ffffffff811bcc14<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 296<br>

0xffffffff811bcbf0 <mnt_want_write>:    push   %rbp<br>

0xffffffff811bcbf1 <mnt_want_write+1>:  mov    %rsp,%rbp<br>

0xffffffff811bcbf4 <mnt_want_write+4>:  push   %rbx<br>

0xffffffff811bcbf5 <mnt_want_write+5>:  sub    $0x18,%rsp<br>

0xffffffff811bcbf9 <mnt_want_write+9>:  nopl   0x0(%rax,%rax,1)<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/include/linux/<wbr>fs.h: 1571<br>

0xffffffff811bcbfe <mnt_want_write+14>: mov    $0x1,%edx  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<wbr><<<< 3rd argument i.e. wait<br>

0xffffffff811bcc03 <mnt_want_write+19>: mov    $0x1,%esi  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<wbr><<<< 2nd argument i.e. SB_FREEZE_WRITE<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 296<br>

0xffffffff811bcc08 <mnt_want_write+24>: mov    %rdi,%rbx<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/include/linux/<wbr>fs.h: 1571<br>

0xffffffff811bcc0b <mnt_want_write+27>: mov    0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)<br>

0xffffffff811bcc0f <mnt_want_write+31>: callq  0xffffffff8119bd20 <__sb_start_write><br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 300<br>

0xffffffff811bcc14 <mnt_want_write+36>: mov    %rbx,%rdi<br>

<br>

The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are<br>

pretty straightforward, both are set to 1.<br>

<br>

But to determine super_block which is %rdi we have to back-peddle a little bit,<br>

28th offset of %rdi was copied to %rdi:<br>

<br>

mov    0x28(%rdi),%rdi<br>

<br>

we need to determine what's inside %rdi and one line above that tells us that<br>

%rdi was copied to %rbx.<br>

<br>

mov    %rdi,%rbx<br>

<br>

If we can determine value inside %rbx, it will give us %rdi and then 28th<br>

offset of %rdi can be calculated.<br>

<br>

To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved<br>

somewhere:<br>

<br>

crash> dis -lr ffffffff8119bdfc | head<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/super.c: 669<br>

0xffffffff8119bd20 <__sb_start_write>:  push   %rbp<br>

0xffffffff8119bd21 <__sb_start_write+1>:        mov    %rsp,%rbp<br>

0xffffffff8119bd24 <__sb_start_write+4>:        add    $0xffffffffffffff80,%rsp<br>

0xffffffff8119bd28 <__sb_start_write+8>:        mov    %rbx,-0x28(%rbp)   <<<<<<< rbx<br>

0xffffffff8119bd2c <__sb_start_write+12>:       mov    %r12,-0x20(%rbp)<br>

0xffffffff8119bd30 <__sb_start_write+16>:       mov    %r13,-0x18(%rbp)<br>

0xffffffff8119bd34 <__sb_start_write+20>:       mov    %r14,-0x10(%rbp)<br>

0xffffffff8119bd38 <__sb_start_write+24>:       mov    %r15,-0x8(%rbp)<br>

0xffffffff8119bd3c <__sb_start_write+28>:       nopl   0x0(%rax,%rax,1)<br>

<br>

%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the<br>

stack data of __sb_start_write or schedule, I am going to use stack data of<br>

schedule for now to make it simpler. Disassembly of schedule (next function<br>

after __sb_start_write) suggests that %rbp is pushed to the stack:<br>

<br>

crash> dis -lr ffffffff815489d0 | head<br>

/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/kernel/sched.<wbr>c: 6133<br>

0xffffffff81548630 <schedule>:  push   %rbp  <<<<<<<<<<<<<<<<<<< 1st push<br>

0xffffffff81548631 <schedule+1>:        mov    %rsp,%rbp<br>

0xffffffff81548634 <schedule+4>:        push   %r15<br>

0xffffffff81548636 <schedule+6>:        push   %r14<br>

0xffffffff81548638 <schedule+8>:        push   %r13<br>

0xffffffff8154863a <schedule+10>:       push   %r12<br>

0xffffffff8154863c <schedule+12>:       push   %rbx<br>

0xffffffff8154863d <schedule+13>:       sub    $0x98,%rsp<br>

0xffffffff81548644 <schedule+20>:       nopl   0x0(%rax,%rax,1)<br>

<br>

first push at stack will give us %rbp, and bt -f can be used to print stack:<br>

<br>

crash> bt -f | grep -A15 schedule<br>

 #0 [ffff88007c473c18] schedule at ffffffff815489d0<br>

    ffff88007c473c20: 0000000000000082 0000000000000000<br>

    ffff88007c473c30: 0000000000000000 0000000000000000<br>

    ffff88007c473c40: 0000000000000000 0000000000000000<br>

    ffff88007c473c50: 0000000000000000 0000000000000000<br>

    ffff88007c473c60: 0000000000000000 ffff880037b71068<br>

    ffff88007c473c70: ffff88007c473fd8 0000000000010068<br>

    ffff88007c473c80: ffff880037b71068 ffff88007e527520<br>

    ffff88007c473c90: ffff880037b70ab0 0000000000000002<br>

    ffff88007c473ca0: 0000000000000246 ffff88007c473ce8<br>

    ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0<br>

    ffff88007c473cc0: ffff8800379e7400 0000000000000001<br>

    ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8<br>

    ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp<br>

    ffff88007c473cf0: ffffffff8119bdfc<br>

<br>

Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:<br>

<br>

crash> eval (0xffff88007c473d78 - 0x28)<br>

hexadecimal: ffff88007c473d50<br>

    decimal: 18446612134399262032  (-131939310289584)<br>

      octal: 1777774200017421636520<br>

     binary: 111111111111111110001000000000<wbr>000111110001000111001111010101<wbr>0000<br>

crash> rd ffff88007c473d50<br>

ffff88007c473d50:  ffff88007ef64380                    .C.~....<br>

<br>

%rbx == ffff88007ef64380<br>

<br>

Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we<br>

calculate 28th offset to %rdi we will get super_block:<br>

<br>

crash> eval (0xffff88007ef64380 + 0x28)<br>

hexadecimal: ffff88007ef643a8<br>

    decimal: 18446612134444286888  (-131939265264728)<br>

      octal: 1777774200017675441650<br>

     binary: 111111111111111110001000000000<wbr>000111111011110110010000111010<wbr>1000<br>

crash> rd ffff88007ef643a8<br>

ffff88007ef643a8:  ffff8800379e7400                    .t.7....<br>

<br>

super_block == ffff8800379e7400.<br>

<br>

super_block can be passed to the "mount" command to get the mount point:<br>

<br>

crash> mount ffff8800379e7400<br>

    VFSMOUNT         SUPERBLK     TYPE   DEVNAME   DIRNAME<br>

ffff88007ef64380 ffff8800379e7400 ext4   /dev/vda1 /boot<br>

<br>

This tells us that that dd was trying to write to /boot. Next part of our job<br>

is to find out why couldn't it write to /boot. In order to answer that, let's<br>

take a step back and revisit the kernel code at line 677 (where dd was):<br>

<br>

677         wait_event(sb->s_writers.wait_<wbr>unfrozen,<br>

678                sb->s_writers.frozen < level);<br>

<br>

Following wait_event() definition tells us that dd was put to sleep<br>

(uninterruptible) and will be woken up when "sb->s_writers.frozen < level"<br>

condition becomes true:<br>

<br>

/**<br>

 * wait_event - sleep until a condition gets true<br>

 * @wq: the waitqueue to wait on<br>

 * @condition: a C expression for the event to wait for<br>

 *<br>

 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the<br>

 * @condition evaluates to true. The @condition is checked each time<br>

 * the waitqueue @wq is woken up.<br>

 *<br>

 * wake_up() has to be called after changing any variable that could<br>

 * change the result of the wait condition.<br>

 */<br>

#define wait_event(wq, condition)                   \<br>

do {                                    \<br>

    if (condition)                          \<br>

        break;                          \<br>

    __wait_event(wq, condition);                    \<br>

} while (0)<br>

<br>

We know that "level" is 1 (second argument SB_FREEZE_WRITE), and<br>

sb->s_writers.frozen needs to be less than 1 for this condition to become true.<br>

Let's see what exactly is the current value of sb->s_writers.frozen, since we<br>

already have sb, it is pretty straightforward to obtain this value:<br>

<br>

crash> pd ((struct super_block *)0xffff8800379e7400)->s_<wbr>writers.frozen<br>

$5 = 4<br>

<br>

Or do this (I prefer the 1st one):<br>

<br>

crash> super_block.s_writers ffff8800379e7400 | grep frozen<br>

    frozen = 4,<br>

<br>

sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is<br>

uninterruptible, but what does this value 4 mean? This is defined in<br>

include/linux/fs.h ("cscope" can be used to examine kernel source):<br>

<br>

/* Possible states of 'frozen' field */<br>

enum {<br>

    SB_UNFROZEN = 0,        /* FS is unfrozen */<br>

    SB_FREEZE_WRITE = 1,        /* Writes, dir ops, ioctls frozen */<br>

    SB_FREEZE_TRANS = 2,<br>

    SB_FREEZE_PAGEFAULT = 2,    /* Page faults stopped as well */<br>

    SB_FREEZE_FS = 3,       /* For internal FS use (e.g. to stop<br>

                     * internal threads if needed) */<br>

    SB_FREEZE_COMPLETE = 4,     /* ->freeze_fs finished successfully */  <<<<<<<<br>

};<br>

<br>

4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So,<br>

now we can say that dd became uninterruptible because /boot filesystem was<br>

frozen, this triggered hung_task_timeout_secs which was caught by<br>

hung_task_panic and this caused the crash.<br>

<br>

Next step should be to check what exactly froze the filesystem, in this<br>

particular case I did it using fsfreeze to collect this vmcore so I can<br>

demonstrate a few things about vmcore analysis.<br>

<br>

In order to collect this demo vmcore I used a sysctl parameter named<br>

hung_task_panic. This parameter when enabled will trigger a crash if a process<br>

becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs<br>

sysctl parameter controls this duration and is set to 120 seconds by default in<br>

RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to<br>

10 seconds:<br>

<br>

# echo 1 > /proc/sys/kernel/hung_task_<wbr>panic<br>

# echo 10 > /proc/sys/kernel/hung_task_<wbr>timeout_secs<br>

<br>

I simply executed following dd command on /boot:<br>

<br>

# dd if=/dev/zero of=/boot/test<br>

<br>

And in the second terminal this:<br>

<br>

# fsfreeze -f /boot<br>

<br>

This made dd uninterruptible, hung_task_panic kicked in and kdump did its job<br>

by saving a vmcore under /var/crash.<br>

<br>

We can print these values from vmcore:<br>

<br>

crash> p sysctl_hung_task_timeout_secs<br>

sysctl_hung_task_timeout_secs = $1 = 10<br>

<br>

crash> p sysctl_hung_task_panic<br>

sysctl_hung_task_panic = $2 = 1<br>

<br>

Please send an email to the kernel meetup list for any queries and if this<br>

write-up needs any corrections.<br>

<br>

<br>

------------------------------<br>

<br>

Subject: Digest Footer<br>

<br>

______________________________<wbr>_________________<br>

Kernel-meetup mailing list<br>

<a href="mailto:Kernel-meetup@lists.reserved-bit.com">Kernel-meetup@lists.reserved-<wbr>bit.com</a><br>

<a href="http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-reserved-bit.com" rel="noreferrer" target="_blank">http://lists.reserved-bit.com/<wbr>listinfo.cgi/kernel-meetup-<wbr>reserved-bit.com</a><br>

<br>

<br>

------------------------------<br>

<br>

End of Kernel-meetup Digest, Vol 4, Issue 8<br>

******************************<wbr>*************<br>

</blockquote></div>