Thoughts of DS: 2009

22 May 2009

E2fsck hanging at 70% on Large Partition

We had a disk with a 1 TB partition that started giving problems.

We ran fsck, but it got stuck at 70% after about 4-5 hours.

The version on our system was e2fsprogs-1.35-7.1.
We hit the issue on Friday, 13-Feb-2009 [Friday the 13th :)].

On researching further, it seemed to be due to a bug in e2fsprogs.

We finally concluded that the bug was a floating-point precision error that could cause e2fsck to loop forever on very large filesystems with a large inode count.

Searching with our friend Google, we found a bug report and a related forum thread.
Read more about it at:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=411838
http://www.linuxquestions.org/questions/linux-software-2/e2fsck-is-running-for-3-days-checking-800-gb-ext3-lvm-volume-411715/

We prepared a Patch to address the issue. Here is the patch:

--- e2fsprogs-1.35/lib/ext2fs/icount.c 2003-12-07 22:41:38.000000000 +0530
+++ e2fsprogs-1.35/lib/ext2fs/icount.c.new 2009-03-16 17:39:11.000000000 +0530
@@ -251,6 +251,10 @@
range = ((float) (ino - lowval)) /
(highval - lowval);
mid = low + ((int) (range * (high-low)));
+ if (mid > high)
+ mid = high;
+ if (mid < low)
+ mid = low;
}
#endif
if (ino == icount->list[mid].ino) {
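
To make the failure mode concrete, here is a minimal Python sketch of how single-precision rounding can push the interpolated index past the end of the search window, and how a clamp like the one in the patch keeps it in range. It assumes the arithmetic is done strictly in 32-bit floats (a real compiler may keep extra precision), and the list size and inode values are hypothetical numbers chosen to trigger the rounding. It is an illustration, not a trace of what happened on our disk.

```python
import struct


def f32(x):
    """Round x to the nearest single-precision (C float) value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def interpolated_mid(low, high, lowval, highval, ino, clamp):
    """Mimic the interpolation step of the search using float32 arithmetic."""
    rng = f32(f32(ino - lowval) / f32(highval - lowval))
    mid = low + int(f32(rng * f32(high - low)))
    if clamp:
        # Same idea as the patch: never let mid leave [low, high].
        mid = max(low, min(mid, high))
    return mid


# Hypothetical search window: indices 0..16777219, looking for the last inode.
low, high = 0, 16_777_219
lowval, highval = 1, 100_000_000
ino = highval

unclamped = interpolated_mid(low, high, lowval, highval, ino, clamp=False)
clamped = interpolated_mid(low, high, lowval, highval, ino, clamp=True)
print(unclamped, clamped)
```

Here high - low is odd and larger than 2^24, so it cannot be stored exactly in a float and rounds up by one. Without the clamp the computed index is one past high; with the clamp it is pulled back to high.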

We applied the patch to the source RPM and rebuilt it to get a patched e2fsck.
Then we ran fsck on the partition again and waited for about 4-6 hours. Wow, it completed successfully. Here are the rebuild steps, followed by the fsck output:

rpm -ivh e2fsprogs-1.35-7.1.src.rpm

copied our patch "e2fsprogs-1.35-icount-floating-point-precision.patch" to /usr/src/redhat/SOURCES

vi /usr/src/redhat/SPECS/e2fsprogs.spec

Line15: Patch8: e2fsprogs-1.35-icount-floating-point-precision.patch
Line54: %patch8 -p1 -b .icount-float

cd /usr/src/redhat/SPECS/
rpmbuild -ba e2fsprogs.spec

...
...
...
Wrote: /usr/src/redhat/SRPMS/e2fsprogs-1.35-7.1.src.rpm
Wrote: /usr/src/redhat/RPMS/x86_64/e2fsprogs-1.35-7.1.x86_64.rpm
Wrote: /usr/src/redhat/RPMS/x86_64/e2fsprogs-devel-1.35-7.1.x86_64.rpm
Wrote: /usr/src/redhat/RPMS/x86_64/e2fsprogs-debuginfo-1.35-7.1.x86_64.rpm

We then installed e2fsprogs-1.35-7.1.x86_64.rpm with rpm -Uvh and ran the check:

[root@server root]# e2fsck -C 0 /dev/cciss/c1d0p1
e2fsck 1.35 (28-Feb-2004)
/dev/cciss/c1d0p1 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 101949715
Connect to /lost+found? yes

Inode 101949715 ref count is 65535, should be 1. Fix? yes

Unattached zero-length inode 101949716. Clear? yes

Unattached inode 101949717
Connect to /lost+found? yes

Inode 101949717 ref count is 65535, should be 1. Fix? yes

Unattached inode 101949718
Connect to /lost+found? yes

Inode 101949718 ref count is 65535, should be 1. Fix? yes

Unattached inode 101949719
Connect to /lost+found? yes

Inode 101949719 ref count is 65535, should be 1. Fix? yes

Unattached inode 101949720
Connect to /lost+found? yes

Inode 101949720 ref count is 65535, should be 1. Fix? yes

Unattached inode 101949721
Connect to /lost+found? yes

Inode 101949721 ref count is 65535, should be 1. Fix? yes

Pass 5: Checking group summary information
Block bitmap differences: -(204107626--204107632) -(204107680--204107685) -204107757
Fix? yes

Free blocks count wrong for group #6228 (5, counted=19).
Fix? yes

Free blocks count wrong (54192711, counted=54192725).
Fix? yes

Inode bitmap differences: -(101949713--101949714)
Fix? yes

Free inodes count wrong for group #6222 (7057, counted=7059).
Fix? yes

Free inodes count wrong (78687902, counted=78687904).
Fix? yes

/dev/cciss/c1d0p1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/cciss/c1d0p1: 46764384/125452288 files (0.-1% non-contiguous), 196710437/250903162 blocks
[root@server root]#

Maybe it helps someone else too.

24 Feb 2009

Debugging the notorious "Segmentation fault" in Apache Error Log?

Have you ever noticed errors like the one below in the Apache error log and scratched your head over which URL caused them?

child pid xxxxx exit signal Segmentation fault (11) --> xxxxx is a pid number.

The problem is that whenever an Apache thread or process crashes, the error log reports this line and nothing more.

That is, we get no idea what type of request led to the segfault.

To find the cause, one can start Apache in single-process mode (httpd -X), access the website, and watch the error log. Chances are that you will catch it.

But on a large site that is really difficult. And on a production site, taking this route is even less practical.

You can also go through a suggested solution at "Debugging intermittent crashes".

Solution

There is a module, mod_whatkilledus, by Apache developer Jeff Trawick, which helped me a lot. I hope it helps others too.

mod_whatkilledus keeps a little bit of state on each active connection, which allows it to know what a thread was handling in case the thread segfaults. If that happens, it writes a report to the error log.

Here is a short note from within the mod_whatkilledus code itself:

mod_whatkilledus is an experimental module for Apache httpd 2.x which tracks the current request and logs a report of the active request when a child process crashes. You should verify that it works reasonably on your system before putting it in production.

mod_whatkilledus is called during request processing to save information about the current request. It also implements a fatal exception hook that will be called when a child process crashes.

Apache httpd requirements for mod_whatkilledus:

Apache httpd >= 2.0.49 must be built with the --enable-exception-hook
configure option and mod_so enabled.

Compiling mod_whatkilledus:

apxs -ci -I/path/to/httpd-2.0/server mod_whatkilledus.c

Activating mod_whatkilledus:

1. Load it like any other DSO.

LoadModule whatkilledus_module modules/mod_whatkilledus.so

2. Enable exception hooks for modules like mod_whatkilledus:
EnableExceptionHook On

3. Choose where the report on current activity should be written. If you want it reported to some place other than the error log, use the WhatKilledUsLog directive to specify a fully-qualified filename for the log. Note that the web server user id (e.g., "nobody") must be able to create or append to this log file, as the log file is not opened until a crash occurs.

How I did it:

I recompiled my Apache with the extra --enable-exception-hook option to configure:

cd /usr/local/src/httpd-2.0.58    # I was using 2.0.58 at the time
./configure \
--prefix=/usr/local/apache \
--with-mpm=prefork \
--enable-so \
--enable-cgi \
--enable-info \
--enable-mod_status \
--enable-usertrack \
--enable-mime-magic \
--enable-rewrite=shared \
--enable-speling=shared \
--enable-nonportable-atomics=yes \
--enable-deflate \
--enable-expires \
--enable-headers \
--enable-setenvif \
--enable-env \
--enable-unique-id \
--enable-logio \
--enable-exception-hook
make
make install

Now Apache was ready for exception-hook based modules.

Let's compile mod_whatkilledus itself:

cd /usr/local/src
mkdir mod_whatkilledus
cd mod_whatkilledus
wget http://people.apache.org/~trawick/mod_whatkilledus.c
wget http://people.apache.org/~trawick/test_char.h

/usr/local/apache/bin/apxs -ci mod_whatkilledus.c

For me it produced the following output:

[root@maria mod_whatkilledus]# /usr/local/apache_back/bin/apxs -ci mod_whatkilledus.c
/usr/local/apache_back/build/libtool --silent --mode=compile gcc -prefer-pic -DAP_HAVE_DESIGNATED_INITIALIZER -DLINUX=2 -D_REENTRANT -D_GNU_SOURCE -g -O2 -pthread -I/usr/local/apache_back/include -I/usr/local/apache_back/include -I/usr/local/apache_back/include -c -o mod_whatkilledus.lo mod_whatkilledus.c && touch mod_whatkilledus.slo
/usr/local/apache_back/build/libtool --silent --mode=link gcc -o mod_whatkilledus.la -rpath /usr/local/apache_back/modules -module -avoid-version mod_whatkilledus.lo
/usr/local/apache_back/build/instdso.sh SH_LIBTOOL='/usr/local/apache_back/build/libtool' mod_whatkilledus.la /usr/local/apache_back/modules
/usr/local/apache_back/build/libtool --mode=install cp mod_whatkilledus.la /usr/local/apache_back/modules/
cp .libs/mod_whatkilledus.so /usr/local/apache_back/modules/mod_whatkilledus.so
cp .libs/mod_whatkilledus.lai /usr/local/apache_back/modules/mod_whatkilledus.la
cp .libs/mod_whatkilledus.a /usr/local/apache_back/modules/mod_whatkilledus.a
ranlib /usr/local/apache_back/modules/mod_whatkilledus.a
chmod 644 /usr/local/apache_back/modules/mod_whatkilledus.a
PATH="$PATH:/sbin" ldconfig -n /usr/local/apache_back/modules
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/apache_back/modules

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,--rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
chmod 755 /usr/local/apache_back/modules/mod_whatkilledus.so
[root@maria mod_whatkilledus]#

In my httpd.conf, I put the following:

LoadModule whatkilledus_module modules/mod_whatkilledus.so

EnableExceptionHook On

WhatKilledUsLog /var/log/whatkilledus.log

I restarted Apache (/etc/init.d/httpd restart).

Started watching Apache Error Log:

tail -f /var/log/apache_error_log

I just noticed a SegFault line:

[Mon Feb 23 16:24:56 2009] [notice] child pid 29412 exit signal Segmentation fault (11)

Went on to check the /var/log/whatkilledus.log and I found something similar to the following:

[Mon Feb 23 16:24:56 2009] pid 29412 mod_whatkilledus sig 11 crash
[Mon Feb 23 16:24:56 2009] pid 29412 mod_whatkilledus active connection: 9.65.120.97:2035->9.27.177.26:8080 (conn_rec 434b90)
[Mon Feb 23 16:24:56 2009] pid 29412 mod_whatkilledus active request (request_rec 438b48): GET /silly?fn=sigsegv HTTP/1.1|Host:aplinux.raleigh.ibm.com%3a8080|User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv%3a1.5)Gecko/20031007|Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1|Accept-Language:en-us,en;q=0.5|Accept-Encoding:gzip,deflate|Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7|Keep-Alive:300|Connection:keep-alive|Cache-Control:max-age=0
[Mon Feb 23 16:24:56 2009] pid 29412 mod_whatkilledus end of report

The logs above are not the actual ones (I am unable to show those, for obvious reasons). With the real report I was able to track down the cause, informed the development team, and had it resolved.
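
If you collect many such reports, a small script can match crashed pids to the requests they were serving. The following is a minimal Python sketch that assumes the two line formats shown in the samples above (the error log notice and the "active request" report line, with header fields separated by "|"). The function names are my own, and the patterns may need adjusting for other versions of the module.

```python
import re

SEGFAULT_RE = re.compile(r"child pid (\d+) exit signal Segmentation fault \(11\)")
REQUEST_RE = re.compile(
    r"pid (\d+) mod_whatkilledus active request \(request_rec \w+\): (.*)$"
)


def crashed_pids(error_log_lines):
    """Return the pids that the Apache error log reports as segfaulted."""
    pids = []
    for line in error_log_lines:
        match = SEGFAULT_RE.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids


def crashed_requests(report_lines):
    """Map pid -> (request line, {header: value}) from whatkilledus report lines."""
    found = {}
    for line in report_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        # The first "|"-separated field is the request line, the rest are headers.
        request_line, *header_fields = match.group(2).split("|")
        headers = dict(f.split(":", 1) for f in header_fields if ":" in f)
        found[int(match.group(1))] = (request_line, headers)
    return found
```

Feeding it the lines from the error log and from /var/log/whatkilledus.log gives, for each crashed pid, the request line and headers of the request that was active at the time.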

You can read more about it from Jeff Trawick's pages here and here.

13 Feb 2009

Remote Logging and simplest LogServer

I am not going to talk about Central Syslog Server or Syslog-ng.

There are situations where one would like to make some logs remotely viewable (say, to a developer who is not allowed to log in to the remote server).

You could use netcat. Something like this would work.

On the remote machine:

tail -f /some/log/file.log | nc Local-IP Port

On the Local machine:

nc -l -p Port

Netcat is a little 25K Swiss Army Knife that can be really useful.

But, once I had a different situation.

There were around 6 application servers (in a cluster, so a request could land on any of them), each generating its own access log. We wanted to gather the access logs on one central machine in real time to process and analyze them.

All of a sudden I remembered DJB's tcpserver and multilog.

I prepared a central log collection server with a daemontools run script similar to the one below:

#!/bin/sh
export PATH="/usr/local/bin:$PATH"
setuidgid remoteloguser tcpserver Ip-Addr-of-LogServer Listen-Port-on-Server multilog ./access_log

And on each of the individual log generating Servers I implemented something similar to:

multitail /var/log/local_access_log | tcpclient Ip-Addr-of-LogServer Listen-Port-on-Server sh -c "cat >&7"

Of course, you have to run this one in the background. Note that the multitail used here is the one from qlogtools.
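
For readers without daemontools and ucspi-tcp at hand, the same idea can be sketched in Python: a small threaded TCP server that accepts connections from several senders and hands every received line to a single sink. This is only a stand-in for the tcpserver and multilog pair, not a reimplementation of them: it does no rotation or timestamping, and the class and function names are my own.

```python
import socket
import socketserver


class LogHandler(socketserver.StreamRequestHandler):
    """Read newline-terminated log lines from one sender until it disconnects."""

    def handle(self):
        for raw in self.rfile:
            line = raw.decode("utf-8", "replace").rstrip("\n")
            # Prefix each line with the sender's address so sources stay distinguishable.
            self.server.sink(f"{self.client_address[0]} {line}")


class LogServer(socketserver.ThreadingTCPServer):
    """Central collector: one thread per connected application server."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address, sink):
        super().__init__(address, LogHandler)
        self.sink = sink  # any callable taking one string, e.g. print


def send_lines(host, port, lines):
    """Sender side: push log lines to the collector over one TCP connection."""
    with socket.create_connection((host, port)) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
```

On the log server one would create LogServer(("0.0.0.0", port), sink) and call serve_forever(). Each application server would then feed its new log lines to send_lines.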

A more interesting solution could be socklog.

12 Feb 2009

qmail-rspawn qmail-lspawn memory leak

Having trouble with qmail-rspawn (or qmail-lspawn) leaking memory?

One cause is the vfork implementation in glibc2.3 (redhat-4), and possibly glibc2.4, which leaks memory when execvp is called in a vfork context. It is described here (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=221187) and here (http://sources.redhat.com/ml/libc-hacker/2007-01/msg00000.html).

A workaround if you can't get the patch for your system: qmail is stable when compiled statically on a patched system (glibc-2.5.90-15 rawhide or older glibc2.2 system). Just add -static to conf-ld and run the resulting binaries on the target system.

Reproduced from http://www.thesmbexchange.com/eng/qmail-rspawn-spawn_memleak.html

See Also a similar report in Qmail Mailing List:
http://www.gossamer-threads.com/lists/qmail/users/131456

I experienced this myself and reached the same conclusion. I came across this post while searching for similar reports from others.

I hope others find this useful too.

Excel Formula to convert Epoch time to localtime (IST)

UNIX time is the number of seconds that have elapsed since 1/1/1970 [DATE(1970,1,1) = 25569].

Excel stores dates as the number of days that have elapsed since 1/1/1900.

Therefore you can convert from one to the other by converting seconds to days and then adding the 70-odd years' difference (25569 days), plus 5:30 hours for IST (330 mins = 19800 secs).

Thus the formula =((A1+19800)/86400)+25569, where A1 contains the UNIX time, converts it to an Excel date/time.

Make sure you format the cell as the required date/time format.
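
To sanity-check the formula outside Excel, here is a minimal Python sketch of the same arithmetic. The function and constant names are my own. The reverse conversion assumes serial 0 corresponds to 1899-12-30, which holds for any date after February 1900.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86400
EXCEL_SERIAL_OF_UNIX_EPOCH = 25569  # DATE(1970,1,1) in Excel
IST_OFFSET_SECONDS = 19800          # +5:30 hours


def unix_to_excel_serial(unix_time, offset_seconds=IST_OFFSET_SECONDS):
    """Same arithmetic as =((A1+19800)/86400)+25569."""
    return (unix_time + offset_seconds) / SECONDS_PER_DAY + EXCEL_SERIAL_OF_UNIX_EPOCH


def excel_serial_to_datetime(serial):
    """Turn an Excel serial back into a datetime, rounded to whole seconds."""
    return datetime(1899, 12, 30) + timedelta(seconds=round(serial * SECONDS_PER_DAY))


serial = unix_to_excel_serial(1234567890)
print(serial, excel_serial_to_datetime(serial))
```

For UNIX time 1234567890 (13 Feb 2009, 23:31:30 UTC) this gives a serial of about 39858.209375, which is 14 Feb 2009, 05:01:30 in IST.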

BTW, if you wish to convert a unix time using a perl one liner:

perl -e 'print scalar localtime(x),"\n"'

where x is the unix time.

Locate your CPAN.pm's systemwide configuration file

Want to locate CPAN.pm's systemwide configuration file on your Linux machine? Try this one-liner:

perl -le 'for (@INC) { $_ .= $ARGV[0]; print if -f }' /CPAN/Config.pm /System/Library/Perl/CPAN/Config.pm