<p dir="ltr">Hi,</p>
<p dir="ltr">Shall we create Kernel-meetup official blog, to update by some maintainer after each meetup. <br></p>
<p dir="ltr">Thanks,<br>
Amit G</p>
<div class="gmail_quote">On Apr 19, 2017 7:53 PM, <<a href="mailto:kernel-meetup-request@lists.reserved-bit.com">kernel-meetup-request@lists.reserved-bit.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Kernel-meetup mailing list submissions to<br>
<a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-reserved-bit.com" rel="noreferrer" target="_blank">http://lists.reserved-bit.com/<wbr>listinfo.cgi/kernel-meetup-<wbr>reserved-bit.com</a><br>
<br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:kernel-meetup-request@lists.reserved-bit.com">kernel-meetup-request@lists.<wbr>reserved-bit.com</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:kernel-meetup-owner@lists.reserved-bit.com">kernel-meetup-owner@lists.<wbr>reserved-bit.com</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Kernel-meetup digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. Blog posts and slides for past meetups (Siddhesh Poyarekar)<br>
2. Re: Blog posts and slides for past meetups (Nitin Bansal)<br>
<br>
<br>
------------------------------<wbr>------------------------------<wbr>----------<br>
<br>
Message: 1<br>
Date: Wed, 19 Apr 2017 11:29:59 +0530<br>
From: Siddhesh Poyarekar <<a href="mailto:sid@reserved-bit.com">sid@reserved-bit.com</a>><br>
To: Kernel Meetup Pune <<a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a>><br>
Subject: [Kernel-meetup] Blog posts and slides for past meetups<br>
Message-ID: <<a href="mailto:264d86b2-6a56-4d41-9477-f68299adfc2b@reserved-bit.com">264d86b2-6a56-4d41-9477-<wbr>f68299adfc2b@reserved-bit.com</a>><br>
Content-Type: text/plain; charset=utf-8<br>
<br>
Hi,<br>
<br>
If you have written blog posts or uploaded slides of your sessions at<br>
the kernel meetup in the past months, please don't forget to add a link<br>
to it on the relevant wiki page.<br>
<br>
<a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_February_-<wbr>_2017</a><br>
<a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_April_-_<wbr>2017</a><br>
<br>
Siddhesh<br>
<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Wed, 19 Apr 2017 19:53:49 +0530<br>
From: Nitin Bansal <<a href="mailto:nb@randomsearch.org">nb@randomsearch.org</a>><br>
To: Kernel Meetup Pune <<a href="mailto:kernel-meetup@lists.reserved-bit.com">kernel-meetup@lists.reserved-<wbr>bit.com</a>><br>
Subject: Re: [Kernel-meetup] Blog posts and slides for past meetups<br>
Message-ID:<br>
<<a href="mailto:CAPx4u31Z3vFFCqr1We2zopihTNOxWnPCuX2c9%2B9SVpSZ8taQ3g@mail.gmail.com">CAPx4u31Z3vFFCqr1We2zopihTNOx<wbr>WnPCuX2c9+9SVpSZ8taQ3g@mail.<wbr>gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hi Siddhesh,<br>
<br>
On Wed, Apr 19, 2017 at 11:29 AM, Siddhesh Poyarekar<br>
<<a href="mailto:sid@reserved-bit.com">sid@reserved-bit.com</a>> wrote:<br>
> Hi,<br>
><br>
> If you have written blog posts or uploaded slides of your sessions at<br>
> the kernel meetup in the past months, please don't forget to add a link<br>
> to it on the relevant wiki page.<br>
><br>
> <a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_February_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_February_-<wbr>_2017</a><br>
> <a href="https://ideas.reserved-bit.com/w/Kernel_Meetup_April_-_2017" rel="noreferrer" target="_blank">https://ideas.reserved-bit.<wbr>com/w/Kernel_Meetup_April_-_<wbr>2017</a><br>
><br>
<br>
I am attaching a file which contains write-up about vmcore analysis<br>
session that I presented in last meetup,<br>
I don't maintain any blog etc so don't really have any place to<br>
publish it, but feel free to post/attach it to<br>
reserved-bit wiki if you like.<br>
-------------- next part --------------<br>
I attended the Pune Kernel Meetup on 8th April and presented a session on<br>
Kernel crash dump (vmcore) analysis. In this write-up I have tried to explain a<br>
few things about the vmcore analysis, hope it helps.<br>
<br>
Before starting with the vmcore analysis, we need following three things:<br>
<br>
o crash utility [<a href="https://people.redhat.com/anderson/crash_whitepaper/" rel="noreferrer" target="_blank">https://people.redhat.com/<wbr>anderson/crash_whitepaper/</a>]<br>
o vmlinux<br>
o vmcore<br>
<br>
vmlinux can be obtained from kernel-debuginfo (for Red Hat Enterprise Linux<br>
(RHEL) at least). One can either install kernel-debuginfo or simply extract<br>
vmlinux file using rpm2cpio:<br>
<br>
# yum install kernel-debuginfo-2.6.32-642.<wbr>13.1.el6.x86_64<br>
<br>
Or<br>
<br>
# rpm2cpio kernel-debuginfo-2.6.32-642.<wbr>13.1.el6.x86_64.rpm | cpio -idv ./usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>
<br>
vmcore was collected from kernel-2.6.32-642.13.1.el6 so vmlinux from debuginfo<br>
package of same version is required.<br>
<br>
Now let's use "crash" utility to open this vmcore, which after printing a few<br>
details i.e. copyright information, GDB version information etc drops us to<br>
"crash>" prompt:<br>
<br>
# crash /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux vmcore<br>
<br>
crash 7.1.0-6.el6<br>
..<br>
GNU gdb (GDB) 7.6<br>
Copyright (C) 2013 Free Software Foundation, Inc.<br>
License GPLv3+: GNU GPL version 3 or later <<a href="http://gnu.org/licenses/gpl.html" rel="noreferrer" target="_blank">http://gnu.org/licenses/gpl.<wbr>html</a>><br>
This is free software: you are free to change and redistribute it.<br>
There is NO WARRANTY, to the extent permitted by law. Type "show copying"<br>
and "show warranty" for details.<br>
This GDB was configured as "x86_64-unknown-linux-gnu"...<br>
<br>
KERNEL: /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>
DUMPFILE: vmcore [PARTIAL DUMP]<br>
CPUS: 2<br>
DATE: Fri Mar 31 11:15:47 2017<br>
UPTIME: 00:06:01<br>
LOAD AVERAGE: 0.40, 0.30, 0.13<br>
TASKS: 232<br>
NODENAME: rhel6<br>
RELEASE: 2.6.32-642.13.1.el6.x86_64<br>
VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016<br>
MACHINE: x86_64 (2808 Mhz)<br>
MEMORY: 2 GB<br>
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"<br>
PID: 46<br>
COMMAND: "khungtaskd"<br>
TASK: ffff880078d16ab0 [THREAD_INFO: ffff880078d20000]<br>
CPU: 1<br>
STATE: TASK_RUNNING (PANIC)<br>
<br>
crash><br>
<br>
<br>
"sys" command can be used to print most of the information that crash prints<br>
when invoked e.g.:<br>
<br>
crash> sys<br>
KERNEL: /usr/lib/debug/lib/modules/2.<wbr>6.32-642.13.1.el6.x86_64/<wbr>vmlinux<br>
DUMPFILE: vmcore [PARTIAL DUMP]<br>
CPUS: 2<br>
DATE: Fri Mar 31 11:15:47 2017<br>
UPTIME: 00:06:01<br>
LOAD AVERAGE: 0.40, 0.30, 0.13<br>
TASKS: 232<br>
NODENAME: rhel6<br>
RELEASE: 2.6.32-642.13.1.el6.x86_64<br>
VERSION: #1 SMP Wed Nov 23 16:03:01 EST 2016<br>
MACHINE: x86_64 (2808 Mhz)<br>
MEMORY: 2 GB<br>
PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"<br>
<br>
<br>
To list all the available crash commands, type "?" and hit enter:<br>
<br>
crash> ?<br>
<br>
* files mach repeat timer<br>
alias foreach mod runq tree<br>
ascii fuser mount search union<br>
bt gdb net set vm<br>
btop help p sig vtop<br>
dev ipcs ps struct waitq<br>
dis irq pte swap whatis<br>
eval kmem ptob sym wr<br>
exit list ptov sys q<br>
extend log rd task<br>
<br>
Use "help" to get details about a command and parameters it accepts e.g.:<br>
<br>
crash>help sys<br>
<br>
Now, let's examine kernel ring buffer (that's where we find panic message) with<br>
the help of "log" command.<br>
<br>
And following is the panic message, which tells us that khungtaskd crashed the<br>
system:<br>
<br>
Call Trace:<br>
<#DB[1]> <<EOE>> Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>
Call Trace:<br>
<NMI> [<ffffffff81009c47>] ? show_regs+0x27/0x30<br>
[<ffffffff8154d909>] ? arch_trigger_all_cpu_<wbr>backtrace_handler+0x99/0xc0<br>
[<ffffffff8154f1a5>] ? notifier_call_chain+0x55/0x80<br>
[<ffffffff8154f20a>] ? atomic_notifier_call_chain+<wbr>0x1a/0x20<br>
[<ffffffff810acc8e>] ? notify_die+0x2e/0x30<br>
[<ffffffff8154ce23>] ? do_nmi+0x1c3/0x350<br>
[<ffffffff8154c6e3>] ? nmi+0x83/0x90<br>
[<ffffffff8104615a>] ? native_write_msr_safe+0xa/0x10<br>
<<EOE>> [<ffffffff8103e92c>] ? x2apic_send_IPI_mask+0x6c/0xb0<br>
[<ffffffff81039b9a>] ? arch_trigger_all_cpu_<wbr>backtrace+0x27a/0x2c0<br>
[<ffffffff810f2740>] ? watchdog+0x250/0x2a0<br>
[<ffffffff810f24f0>] ? watchdog+0x0/0x2a0<br>
[<ffffffff810a640e>] ? kthread+0x9e/0xc0<br>
[<ffffffff8100c28a>] ? child_rip+0xa/0x20<br>
[<ffffffff810a6370>] ? kthread+0x0/0xc0<br>
[<ffffffff8100c280>] ? child_rip+0x0/0x20<br>
Kernel panic - not syncing: hung_task: blocked tasks<br>
Pid: 46, comm: khungtaskd Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>
Call Trace:<br>
[<ffffffff81548301>] ? panic+0xa7/0x179<br>
[<ffffffff81039aa5>] ? arch_trigger_all_cpu_<wbr>backtrace+0x185/0x2c0<br>
[<ffffffff810f274e>] ? watchdog+0x25e/0x2a0<br>
[<ffffffff810f24f0>] ? watchdog+0x0/0x2a0<br>
[<ffffffff810a640e>] ? kthread+0x9e/0xc0<br>
[<ffffffff8100c28a>] ? child_rip+0xa/0x20<br>
[<ffffffff810a6370>] ? kthread+0x0/0xc0<br>
[<ffffffff8100c280>] ? child_rip+0x0/0x20<br>
<br>
khungtaskd is a kernel thread which checks for uninterruptible tasks, few<br>
relevant snips from kernel/hung_task.c:<br>
<br>
static int __init hung_task_init(void)<br>
{<br>
atomic_notifier_chain_<wbr>register(&panic_notifier_list, &panic_block);<br>
watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");<br>
<br>
return 0;<br>
}<br>
<br>
..<br>
<br>
static int watchdog(void *dummy)<br>
{<br>
set_user_nice(current, 0);<br>
<br>
for ( ; ; ) {<br>
unsigned long timeout = sysctl_hung_task_timeout_secs;<br>
<br>
while (schedule_timeout_<wbr>interruptible(timeout_jiffies(<wbr>timeout)))<br>
timeout = sysctl_hung_task_timeout_secs;<br>
<br>
check_hung_uninterruptible_<wbr>tasks(timeout);<br>
}<br>
<br>
return 0;<br>
}<br>
<br>
By scrolling back a bit in ring buffer a hung_task_timeout_secs warning can be<br>
seen along with the following backtrace:<br>
<br>
1 > INFO: task dd:7224 blocked for more than 10 seconds.<br>
2 > Not tainted 2.6.32-642.13.1.el6.x86_64 #1<br>
3 > "echo 0 > /proc/sys/kernel/hung_task_<wbr>timeout_secs" disables this message.<br>
4> dd D 0000000000000001 0 7224 6702 0x00000080<br>
ffff88007c473ce8 0000000000000082 0000000000000000 0000000000000000<br>
0000000000000000 0000000000000000 0000000000000000 0000000000000000<br>
0000000000000000 0000000000000000 ffff880037b71068 ffff88007c473fd8<br>
5> Call Trace:<br>
[<ffffffff810a6bce>] ? prepare_to_wait+0x4e/0x80<br>
[<ffffffff8119bdfc>] __sb_start_write+0xdc/0x120<br>
[<ffffffff810a68a0>] ? autoremove_wake_function+0x0/<wbr>0x40<br>
[<ffffffff811bcc14>] mnt_want_write+0x24/0x50<br>
[<ffffffff8119b348>] ? get_empty_filp+0xe8/0x190<br>
[<ffffffff811aca99>] do_filp_open+0x2b9/0xd20<br>
[<ffffffff812a885a>] ? strncpy_from_user+0x4a/0x90<br>
[<ffffffff811ba072>] ? alloc_fd+0x92/0x160<br>
[<ffffffff811969f7>] do_sys_open+0x67/0x130<br>
[<ffffffff81196b00>] sys_open+0x20/0x30<br>
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b<br>
<br>
Let's look at it line by line:<br>
<br>
Line 1 shows that dd PID 7224 was blocked for more than 10 seconds.<br>
<br>
Line 2 tells us the kernel version and that it is not tainted (i.e. no third<br>
party modules are loaded etc.).<br>
<br>
Line 3 is just telling us how to disable this message, not a great idea IMO.<br>
<br>
Line 4 contains command name, state, PID etc.<br>
<br>
Line 5 shows the associated call trace which suggests that dd became<br>
uninterruptible while trying to write to a filesystem.<br>
<br>
Process details can be checked using "ps":<br>
<br>
crash> ps -m | grep UN<br>
[0 00:00:14.117] [UN] PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd"<br>
^^^^^^^^^^^^^^^^<br>
|_____This is the duration of dd in uninterruptible state (14.117 seconds)<br>
<br>
-m switch displays timestamp in days, hours, minutes, seconds and milliseconds.<br>
<br>
Next step is to examine dd process, in order to do that, we first need to<br>
change the process context (currently it is set to khungtaskd and needs to be<br>
switched to dd), which can be done using "set":<br>
<br>
crash> set 7224<br>
PID: 7224<br>
COMMAND: "dd"<br>
TASK: ffff880037b70ab0 [THREAD_INFO: ffff88007c470000]<br>
CPU: 1<br>
STATE: TASK_UNINTERRUPTIBLE<br>
<br>
Let's print backtrace along with kernel source line numbers (source line<br>
numbers probably wouldn't be displayed if kernel-debuginfo is not installed):<br>
<br>
crash> bt -l<br>
PID: 7224 TASK: ffff880037b70ab0 CPU: 1 COMMAND: "dd"<br>
#0 [ffff88007c473c18] schedule at ffffffff815489d0<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/kernel/sched.<wbr>c: 3119<br>
#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/super.c: 677<br>
#2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 300<br>
#3 [ffff88007c473db0] do_filp_open at ffffffff811aca99<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namei.c: 2267<br>
#4 [ffff88007c473f20] do_sys_open at ffffffff811969f7<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/open.c: 908<br>
#5 [ffff88007c473f70] sys_open at ffffffff81196b00<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/open.c: 932<br>
#6 [ffff88007c473f80] system_call_fastpath at ffffffff8100b0d2<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/arch/x86/<wbr>kernel/entry_64.S: 492<br>
RIP: 000000358a8db480 RSP: 00007ffcd7d37ca8 RFLAGS: 00010246<br>
RAX: 0000000000000002 RBX: ffffffff8100b0d2 RCX: 0000000000000000<br>
RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007ffcd7d3870b<br>
RBP: 0000000000000240 R8: 000000358ab8ee40 R9: 00007ffaa33c368c<br>
R10: 1999999999999999 R11: 0000000000000246 R12: ffffffff81196b00<br>
R13: ffff88007c473f78 R14: ffffffffffffffff R15: 0000000000000064<br>
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b<br>
<br>
This backtrace needs to be read from bottom to top, following chain shows how<br>
these functions were called:<br>
<br>
921 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)<br>
922 {<br>
923 long ret;<br>
924<br>
925 if (force_o_largefile())<br>
926 flags |= O_LARGEFILE;<br>
927<br>
928 ret = do_sys_open(AT_FDCWD, filename, flags, mode);--------------------<br>
|<br>
------------------------------<wbr>------------------------------<wbr>----<br>
|<br>
899 long do_sys_open(int dfd, const char __user *filename, int flags, int mode)<br>
900 {<br>
901 struct filename *tmp = getname(filename);<br>
902 int fd = PTR_ERR(tmp);<br>
903<br>
904 if (!IS_ERR(tmp)) {<br>
905 fd = get_unused_fd_flags(flags);<br>
906 if (fd >= 0) {<br>
907 struct file *f = do_filp_open(dfd, tmp, flags, mode, 0); ------<br>
|<br>
------------------------------<wbr>---------------------------<br>
|<br>
2166 struct file *do_filp_open(int dfd, struct filename *filename,<br>
2167 int open_flag, int mode, int acc_mode)<br>
2168 {<br>
2169 struct file *filp;<br>
..<br>
..<br>
2256 /*<br>
2257 * This write is needed to ensure that a<br>
2258 * ro->rw transition does not occur between<br>
2259 * the time when the file is created and when<br>
2260 * a permanent write count is taken through<br>
2261 * the 'struct file' in nameidata_to_filp().<br>
2262 */<br>
2263 error = mnt_want_write(nd.path.mnt); ------------------------------<wbr>----<br>
|<br>
------------------------------<wbr>------------------------------<wbr>-----<br>
|<br>
295 int mnt_want_write(struct vfsmount *m)<br>
296 {<br>
297 int ret;<br>
298<br>
299 sb_start_write(m->mnt_sb); ------------------------------<wbr>-------------<br>
|<br>
------------------------------<wbr>-------------------<br>
|<br>
1569 static inline void sb_start_write(struct super_block *sb)<br>
1570 {<br>
1571 __sb_start_write(sb, SB_FREEZE_WRITE, true); --------------------------<br>
|<br>
------------------------------<wbr>------------------------------<wbr>----<br>
|<br>
668 int __sb_start_write(struct super_block *sb, int level, bool wait)<br>
669 {<br>
670 /* Out of tree modules don't use this mechanism */<br>
671 if (unlikely(!sb_has_new_freeze(<wbr>sb)))<br>
672 return 1;<br>
673 retry:<br>
674 if (unlikely(sb->s_writers.frozen >= level)) {<br>
675 if (!wait)<br>
676 return 0;<br>
677 wait_event(sb->s_writers.wait_<wbr>unfrozen, <<<<<<<<<<<<<<< dd was here<br>
678 sb->s_writers.frozen < level);<br>
679 }<br>
<br>
<br>
In short, mnt_want_write() called sb_start_write() which then called<br>
__sb_start_write(), and dd was waiting at line 677 in __sb_start_write() when<br>
this vmcore was collected.<br>
<br>
>From this code it is clear that dd was trying to write to a filesystem but<br>
somehow it couldn't and became uninterruptible, now we need to find out which<br>
filesystem it was and why dd couldn't write to it. Can we do it? Yes,<br>
absolutely, if we can determine the first argument i.e. super_block that is<br>
passed to __sb_start_write(), it will tell us about the filesystem/mount-point.<br>
<br>
If we go back to backtrace, we can see that mnt_want_write() called<br>
__sb_start_write() and passed it three parameters, struct super_block (sb),<br>
level (SB_FREEZE_WRITE) and a Boolean called wait (true):<br>
<br>
#1 [ffff88007c473cf0] __sb_start_write at ffffffff8119bdfc<br>
#2 [ffff88007c473d80] mnt_want_write at ffffffff811bcc14<br>
<br>
Note: "whatis command can be used to check the function prototype (simply<br>
entering the function name or px also does the job):<br>
<br>
crash> whatis __sb_start_write<br>
int __sb_start_write(struct super_block *, int, bool);<br>
<br>
Before we go any further, it is important to know how arguments are passed to<br>
the called function. First argument goes to %rdi, second to %rsi, third to<br>
%rdx, fourth to %rcx and fifth and sixth to %r8 and %r9 respectively. Armed<br>
with this knowledge now we can turn our attention to disassembly. Let's<br>
disassemble mnt_start_write to find out arguments passed to __sb_start_write:<br>
<br>
crash> dis -lr ffffffff811bcc14<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 296<br>
0xffffffff811bcbf0 <mnt_want_write>: push %rbp<br>
0xffffffff811bcbf1 <mnt_want_write+1>: mov %rsp,%rbp<br>
0xffffffff811bcbf4 <mnt_want_write+4>: push %rbx<br>
0xffffffff811bcbf5 <mnt_want_write+5>: sub $0x18,%rsp<br>
0xffffffff811bcbf9 <mnt_want_write+9>: nopl 0x0(%rax,%rax,1)<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/include/linux/<wbr>fs.h: 1571<br>
0xffffffff811bcbfe <mnt_want_write+14>: mov $0x1,%edx <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<wbr><<<< 3rd argument i.e. wait<br>
0xffffffff811bcc03 <mnt_want_write+19>: mov $0x1,%esi <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<wbr><<<< 2nd argument i.e. SB_FREEZE_WRITE<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 296<br>
0xffffffff811bcc08 <mnt_want_write+24>: mov %rdi,%rbx<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/include/linux/<wbr>fs.h: 1571<br>
0xffffffff811bcc0b <mnt_want_write+27>: mov 0x28(%rdi),%rdi <<<<<<<<<<<<<<<<<<<<<<<<<<<<< first argument i.e. super_block (sb)<br>
0xffffffff811bcc0f <mnt_want_write+31>: callq 0xffffffff8119bd20 <__sb_start_write><br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/namespace.<wbr>c: 300<br>
0xffffffff811bcc14 <mnt_want_write+36>: mov %rbx,%rdi<br>
<br>
The values of second and third arguments i.e. SB_FREEZE_WRITE and wait are<br>
pretty straightforward, both are set to 1.<br>
<br>
But to determine super_block which is %rdi we have to back-peddle a little bit,<br>
28th offset of %rdi was copied to %rdi:<br>
<br>
mov 0x28(%rdi),%rdi<br>
<br>
we need to determine what's inside %rdi and one line above that tells us that<br>
%rdi was copied to %rbx.<br>
<br>
mov %rdi,%rbx<br>
<br>
If we can determine value inside %rbx, it will give us %rdi and then 28th<br>
offset of %rdi can be calculated.<br>
<br>
To obtain %rdi let's disassemble __sb_start_write and see if %rbx is saved<br>
somewhere:<br>
<br>
crash> dis -lr ffffffff8119bdfc | head<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/fs/super.c: 669<br>
0xffffffff8119bd20 <__sb_start_write>: push %rbp<br>
0xffffffff8119bd21 <__sb_start_write+1>: mov %rsp,%rbp<br>
0xffffffff8119bd24 <__sb_start_write+4>: add $0xffffffffffffff80,%rsp<br>
0xffffffff8119bd28 <__sb_start_write+8>: mov %rbx,-0x28(%rbp) <<<<<<< rbx<br>
0xffffffff8119bd2c <__sb_start_write+12>: mov %r12,-0x20(%rbp)<br>
0xffffffff8119bd30 <__sb_start_write+16>: mov %r13,-0x18(%rbp)<br>
0xffffffff8119bd34 <__sb_start_write+20>: mov %r14,-0x10(%rbp)<br>
0xffffffff8119bd38 <__sb_start_write+24>: mov %r15,-0x8(%rbp)<br>
0xffffffff8119bd3c <__sb_start_write+28>: nopl 0x0(%rax,%rax,1)<br>
<br>
%rbx is copied to -28th offset of %rbp, and %rbp can be found either in the<br>
stack data of __sb_start_write or schedule, I am going to use stack data of<br>
schedule for now to make it simpler. Disassembly of schedule (next function<br>
after __sb_start_write) suggests that %rbp is pushed to the stack:<br>
<br>
crash> dis -lr ffffffff815489d0 | head<br>
/usr/src/debug/kernel-2.6.32-<wbr>642.13.1.el6/linux-2.6.32-642.<wbr>13.1.el6.x86_64/kernel/sched.<wbr>c: 6133<br>
0xffffffff81548630 <schedule>: push %rbp <<<<<<<<<<<<<<<<<<< 1st push<br>
0xffffffff81548631 <schedule+1>: mov %rsp,%rbp<br>
0xffffffff81548634 <schedule+4>: push %r15<br>
0xffffffff81548636 <schedule+6>: push %r14<br>
0xffffffff81548638 <schedule+8>: push %r13<br>
0xffffffff8154863a <schedule+10>: push %r12<br>
0xffffffff8154863c <schedule+12>: push %rbx<br>
0xffffffff8154863d <schedule+13>: sub $0x98,%rsp<br>
0xffffffff81548644 <schedule+20>: nopl 0x0(%rax,%rax,1)<br>
<br>
first push at stack will give us %rbp, and bt -f can be used to print stack:<br>
<br>
crash> bt -f | grep -A15 schedule<br>
#0 [ffff88007c473c18] schedule at ffffffff815489d0<br>
ffff88007c473c20: 0000000000000082 0000000000000000<br>
ffff88007c473c30: 0000000000000000 0000000000000000<br>
ffff88007c473c40: 0000000000000000 0000000000000000<br>
ffff88007c473c50: 0000000000000000 0000000000000000<br>
ffff88007c473c60: 0000000000000000 ffff880037b71068<br>
ffff88007c473c70: ffff88007c473fd8 0000000000010068<br>
ffff88007c473c80: ffff880037b71068 ffff88007e527520<br>
ffff88007c473c90: ffff880037b70ab0 0000000000000002<br>
ffff88007c473ca0: 0000000000000246 ffff88007c473ce8<br>
ffff88007c473cb0: ffffffff810a6bce ffff880037b70ab0<br>
ffff88007c473cc0: ffff8800379e7400 0000000000000001<br>
ffff88007c473cd0: ffff88007c473d18 ffff8800379e76d8<br>
ffff88007c473ce0: ffff8800379e7770 ffff88007c473d78 <<<< ffff88007c473d78 is first push == %rbp<br>
ffff88007c473cf0: ffffffff8119bdfc<br>
<br>
Now that we have %rbp, %rbx can be calculated i.e. -0x28th offset of %rbp:<br>
<br>
crash> eval (0xffff88007c473d78 - 0x28)<br>
hexadecimal: ffff88007c473d50<br>
decimal: 18446612134399262032 (-131939310289584)<br>
octal: 1777774200017421636520<br>
binary: 111111111111111110001000000000<wbr>000111110001000111001111010101<wbr>0000<br>
crash> rd ffff88007c473d50<br>
ffff88007c473d50: ffff88007ef64380 .C.~....<br>
<br>
%rbx == ffff88007ef64380<br>
<br>
Since %rdi was copied to %rbx, %rdi also is ffff88007ef64380, now if we<br>
calculate 28th offset to %rdi we will get super_block:<br>
<br>
crash> eval (0xffff88007ef64380 + 0x28)<br>
hexadecimal: ffff88007ef643a8<br>
decimal: 18446612134444286888 (-131939265264728)<br>
octal: 1777774200017675441650<br>
binary: 111111111111111110001000000000<wbr>000111111011110110010000111010<wbr>1000<br>
crash> rd ffff88007ef643a8<br>
ffff88007ef643a8: ffff8800379e7400 .t.7....<br>
<br>
super_block == ffff8800379e7400.<br>
<br>
super_block can be passed to the "mount" command to get the mount point:<br>
<br>
crash> mount ffff8800379e7400<br>
VFSMOUNT SUPERBLK TYPE DEVNAME DIRNAME<br>
ffff88007ef64380 ffff8800379e7400 ext4 /dev/vda1 /boot<br>
<br>
This tells us that that dd was trying to write to /boot. Next part of our job<br>
is to find out why couldn't it write to /boot. In order to answer that, let's<br>
take a step back and revisit the kernel code at line 677 (where dd was):<br>
<br>
677 wait_event(sb->s_writers.wait_<wbr>unfrozen,<br>
678 sb->s_writers.frozen < level);<br>
<br>
Following wait_event() definition tells us that dd was put to sleep<br>
(uninterruptible) and will be woken up when "sb->s_writers.frozen < level"<br>
condition becomes true:<br>
<br>
/**<br>
* wait_event - sleep until a condition gets true<br>
* @wq: the waitqueue to wait on<br>
* @condition: a C expression for the event to wait for<br>
*<br>
* The process is put to sleep (TASK_UNINTERRUPTIBLE) until the<br>
* @condition evaluates to true. The @condition is checked each time<br>
* the waitqueue @wq is woken up.<br>
*<br>
* wake_up() has to be called after changing any variable that could<br>
* change the result of the wait condition.<br>
*/<br>
#define wait_event(wq, condition) \<br>
do { \<br>
if (condition) \<br>
break; \<br>
__wait_event(wq, condition); \<br>
} while (0)<br>
<br>
We know that "level" is 1 (second argument SB_FREEZE_WRITE), and<br>
sb->s_writers.frozen needs to be less than 1 for this condition to become true.<br>
Let's see what exactly is the current value of sb->s_writers.frozen, since we<br>
already have sb, it is pretty straightforward to obtain this value:<br>
<br>
crash> pd ((struct super_block *)0xffff8800379e7400)->s_<wbr>writers.frozen<br>
$5 = 4<br>
<br>
Or do this (I prefer the 1st one):<br>
<br>
crash> super_block.s_writers ffff8800379e7400 | grep frozen<br>
frozen = 4,<br>
<br>
sb->s_writers.frozen is 4 which obviously is lower than 1 and this is why dd is<br>
uninterruptible, but what does this value 4 mean? This is defined in<br>
include/linux/fs.h ("cscope" can be used to examine kernel source):<br>
<br>
/* Possible states of 'frozen' field */<br>
enum {<br>
SB_UNFROZEN = 0, /* FS is unfrozen */<br>
SB_FREEZE_WRITE = 1, /* Writes, dir ops, ioctls frozen */<br>
SB_FREEZE_TRANS = 2,<br>
SB_FREEZE_PAGEFAULT = 2, /* Page faults stopped as well */<br>
SB_FREEZE_FS = 3, /* For internal FS use (e.g. to stop<br>
* internal threads if needed) */<br>
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */ <<<<<<<<br>
};<br>
<br>
4 is SB_FREEZE_COMPLETE, which means that filesystem i.e. /boot is frozen. So,<br>
now we can say that dd became uninterruptible because /boot filesystem was<br>
frozen, this triggered hung_task_timeout_secs which was caught by<br>
hung_task_panic and this caused the crash.<br>
<br>
Next step should be to check what exactly froze the filesystem, in this<br>
particular case I did it using fsfreeze to collect this vmcore so I can<br>
demonstrate a few things about vmcore analysis.<br>
<br>
In order to collect this demo vmcore I used a sysctl parameter named<br>
hung_task_panic. This parameter when enabled will trigger a crash if a process<br>
becomes uninterruptible for more than 120 seconds. hung_task_timeout_secs<br>
sysctl parameter controls this duration and is set to 120 seconds by default in<br>
RHEL 6/7. After enabling hung_task_panic and reducing hung_task_timeout_secs to<br>
10 seconds:<br>
<br>
# echo 1 > /proc/sys/kernel/hung_task_<wbr>panic<br>
# echo 10 > /proc/sys/kernel/hung_task_<wbr>timeout_secs<br>
<br>
I simply executed following dd command on /boot:<br>
<br>
# dd if=/dev/zero of=/boot/test<br>
<br>
And in the second terminal this:<br>
<br>
# fsfreeze -f /boot<br>
<br>
This made dd uninterruptible, hung_task_panic kicked in and kdump did its job<br>
by saving a vmcore under /var/crash.<br>
<br>
We can print these values from vmcore:<br>
<br>
crash> p sysctl_hung_task_timeout_secs<br>
sysctl_hung_task_timeout_secs = $1 = 10<br>
<br>
crash> p sysctl_hung_task_panic<br>
sysctl_hung_task_panic = $2 = 1<br>
<br>
Please send an email to the kernel meetup list for any queries and if this<br>
write-up needs any corrections.<br>
<br>
<br>
------------------------------<br>
<br>
Subject: Digest Footer<br>
<br>
______________________________<wbr>_________________<br>
Kernel-meetup mailing list<br>
<a href="mailto:Kernel-meetup@lists.reserved-bit.com">Kernel-meetup@lists.reserved-<wbr>bit.com</a><br>
<a href="http://lists.reserved-bit.com/listinfo.cgi/kernel-meetup-reserved-bit.com" rel="noreferrer" target="_blank">http://lists.reserved-bit.com/<wbr>listinfo.cgi/kernel-meetup-<wbr>reserved-bit.com</a><br>
<br>
<br>
------------------------------<br>
<br>
End of Kernel-meetup Digest, Vol 4, Issue 8<br>
******************************<wbr>*************<br>
</blockquote></div>