Discussion:
thread leak
Emmanuel Dreyfus
2014-02-11 15:18:48 UTC
Hello everybody

Sometimes, milter-greylist gets overloaded, and stops answering requests
in time. sendmail says it goes "to error state", and I have to restart
milter-greylist to get it working again.

Today I looked closely at the problem and found that the frozen
milter-greylist had more than 1600 threads.

Anyone experienced thread leakage with libmilter?
--
Emmanuel Dreyfus
manu-S783fYmB3Ccdnm+***@public.gmane.org
Peter Bonivart
2014-02-11 15:34:25 UTC
Post by Emmanuel Dreyfus
Hello everybody
Sometimes, milter-greylist gets overloaded, and stops answering requests
in time. sendmail says it goes "to error state", and I have to restart
milter-greylist to get it working again.
Today I looked closely at the problem and found that the frozen
milter-greylist had more than 1600 threads.
Anyone experienced thread leakage with libmilter?
I've used milter-greylist for many years on both Solaris and RHEL and
both can sometimes go to error state combined with the process
consuming huge amounts of memory or simply crashing. A restart of the
process always works. Nowadays I don't see it often, though: I usually
get many months of uptime, by which point it's time to reboot the servers
for patching anyway.

/peter
Bruncsak, Attila
2014-02-12 10:06:47 UTC
Post by Emmanuel Dreyfus
Today I looked closely at the problem and found that the frozen
milter-greylist had more than 1600 threads.
How many sendmail (or postfix) processes
did you have on the system at that time?
Around 1600 or much less?
manu-S783fYmB3Ccdnm+
2014-02-12 12:30:38 UTC
Post by Bruncsak, Attila
How many sendmail (or postfix) processes
did you have on the system at that time?
Around 1600 or much less?
Much less, of course.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu-S783fYmB3Ccdnm+***@public.gmane.org
Bruncsak, Attila
2014-02-12 13:32:33 UTC
Post by manu-S783fYmB3Ccdnm+
Post by Bruncsak, Attila
How many sendmail (or postfix) processes
did you have on the system at that time?
Around 1600 or much less?
Much less, of course.
Are the threads normal working threads created in libmilter?
manu-S783fYmB3Ccdnm+
2014-02-12 19:07:55 UTC
Post by Bruncsak, Attila
Are the threads normal working threads created in libmilter?
It seems they are: 1600 sleeping threads. That's odd.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu-S783fYmB3Ccdnm+***@public.gmane.org
Johann Klasek
2014-02-12 16:58:46 UTC
Post by Emmanuel Dreyfus
Sometimes, milter-greylist gets overloaded, and stops answering requests
in time. sendmail says it goes "to error state", and I have to restart
milter-greylist to get it working again.
Today I looked closely at the problem and found that the frozen
milter-greylist had more than 1600 threads.
Is there any hint what these threads are doing?
What does
pstack PID_OF_MG_PROCESS
say?
manu-S783fYmB3Ccdnm+
2014-02-12 19:07:54 UTC
Post by Johann Klasek
Is there any hint what these threads are doing?
All sleeping. I suspect libmilter fails to track threads and leaves some
behind.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu-S783fYmB3Ccdnm+***@public.gmane.org
Johann Klasek
2014-02-12 19:16:50 UTC
Post by manu-S783fYmB3Ccdnm+
Post by Johann Klasek
Is there any hint what these threads are doing?
All sleeping. I suspect libmilter fails to track threads and leaves some
behind.
I think libmilter does not actually track its threads. They are
created/cloned to process an SMTP session and terminate themselves later ...

Do you have a sample backtrace of your threads? Are they all the
same (apart from the housekeeping ones)?

On Linux Fedora 16 most of the workers look like this:

Thread 2 (Thread 0x7f6f2cc13700 (LWP 21649)):
#0 0x00000036df0e8283 in select () from /lib64/libc.so.6
#1 0x000000000041fbd5 in mi_rd_cmd ()
#2 0x000000000041f3ec in mi_engine ()
#3 0x000000000041c478 in mi_handle_session ()
#4 0x000000000041b129 in mi_thread_handle_wrapper ()
#5 0x00000038a8807d90 in start_thread () from /lib64/libpthread.so.0
#6 0x00000036df0eeddd in clone () from /lib64/libc.so.6

The exceptions are the dumper, sync_master, sync_sender and signaling threads.
Bruncsak, Attila
2014-02-13 08:48:00 UTC
Post by manu-S783fYmB3Ccdnm+
Post by Johann Klasek
Is there any hint what these threads are doing?
All sleeping. I suspect libmilter fails to track threads and leaves some
behind.
Did you have the libmilter compile-time option "_FFR_WORKERS_POOL" defined?
(FFR: for future release)
By the way, which version of sendmail does your libmilter come from?
Emmanuel Dreyfus
2014-02-14 09:10:20 UTC
Post by Bruncsak, Attila
Did you have the libmilter compile-time option "_FFR_WORKERS_POOL" defined?
(FFR: for future release)
By the way, which version of sendmail does your libmilter come from?
No, and it is 8.14.7.

But I managed to track down the offending code. It was tricky because
once milter-greylist gets too many threads, gdb becomes unable to
explore them. Catching the process early enough (351 threads) gives me this:

#0 0x00007f7ff6875d6a in ___lwp_park50 () from /usr/lib/libc.so.12
#1 0x00007f7ff70088f1 in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7ff78245a1 in ldap_send_initial_request ()
    from /usr/pkg/lib/libldap_r-2.4.so.2
#3 0x00007f7ff7815668 in ldap_pvt_search ()
    from /usr/pkg/lib/libldap_r-2.4.so.2
#4 0x00007f7ff781576f in ldap_pvt_search_s ()
    from /usr/pkg/lib/libldap_r-2.4.so.2
#5 0x00007f7ff7815839 in ldap_search_ext_s ()
    from /usr/pkg/lib/libldap_r-2.4.so.2
#6 0x00000000004157d9 in ldapcheck_validate (ad=<optimized out>,
    stage=<optimized out>, ap=0x7f7fdffff4d0, priv=0x7f7ff5110800)
    at ldapcheck.c:502
#7 0x00000000004120e8 in acl_filter (stage=AS_RCPT, ctx=<optimized out>,
    priv=0x7f7ff5110800) at acl.c:2407
#8 0x0000000000408f53 in real_envrcpt (ctx=0x7f7ff7332220,
    envrcpt=0x7f7ff511b3d0) at milter-greylist.c:725
#9 0x000000000040928f in mlfi_envrcpt (ctx=0x7f7ff7332220,
    envrcpt=0x7f7ff511b3d0) at milter-greylist.c:230
#10 0x00000000004231b6 in st_rcpt ()
#11 0x000000000042301a in mi_engine ()
#12 0x0000000000420bbf in mi_handle_session ()
#13 0x000000000041faf9 in mi_thread_handle_wrapper ()
#14 0x00007f7ff700b2ce in ?? () from /usr/lib/libpthread.so.1
#15 0x00007f7ff6875d80 in ___lwp_park50 () from /usr/lib/libc.so.12

ldap_send_initial_request() uses two mutexes. I think one thread gets
stuck opening the connection or sending the request, and the other
threads wait for the mutex.

The timelimit option of ldap_search_ext_s() will not help: this is
a server-side timeout for the request.

I think the fix is to start a new LDAP connection when we detect the
deadlock. Detection could be triggered either when the number of threads
involved in LDAP operations reaches a threshold, or when the oldest
operation hits a timeout. I suspect the second approach is better.

There is still a problem with that approach: correctly handling
a misbehaving LDAP directory. We do not want to open an
infinite number of connections if they all get stuck.
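
One way to picture the two safeguards discussed here (reconnect when the oldest pending operation exceeds a timeout, but never beyond a fixed number of connections) is the small bookkeeping sketch below. It is only an illustration, not milter-greylist code: `struct ldap_watch`, the `watch_*` functions and the values of `OP_TIMEOUT_SEC`, `MAX_CONNS` and `MAX_PENDING` are all hypothetical, and the caller is assumed to hold whatever lock protects the structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical tuning values, not taken from milter-greylist. */
#define OP_TIMEOUT_SEC 10   /* a pending operation this old counts as stuck */
#define MAX_CONNS      4    /* hard cap on simultaneously open connections */
#define MAX_PENDING    64   /* maximum number of tracked operations */

/* Bookkeeping for pending LDAP operations; the caller provides locking. */
struct ldap_watch {
    time_t start[MAX_PENDING];  /* start time per operation, 0 = free slot */
    int    open_conns;          /* connections currently open */
};

/* Record an operation starting at `now`; returns its slot, or -1 if full. */
int watch_op_begin(struct ldap_watch *w, time_t now)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (w->start[i] == 0) {
            w->start[i] = now;
            return i;
        }
    }
    return -1;
}

/* Record that the operation in `slot` has completed. */
void watch_op_end(struct ldap_watch *w, int slot)
{
    assert(slot >= 0 && slot < MAX_PENDING);
    w->start[slot] = 0;
}

/* True if the oldest pending operation has timed out AND the connection
 * cap still leaves room for opening one more connection. */
bool watch_should_reconnect(const struct ldap_watch *w, time_t now)
{
    time_t oldest = 0;

    for (int i = 0; i < MAX_PENDING; i++) {
        if (w->start[i] != 0 && (oldest == 0 || w->start[i] < oldest))
            oldest = w->start[i];
    }
    if (oldest == 0)
        return false;               /* nothing pending */
    if (now - oldest < OP_TIMEOUT_SEC)
        return false;               /* oldest operation is still young */
    return w->open_conns < MAX_CONNS;
}
```

When `watch_should_reconnect()` returns false only because the cap is reached, the caller would presumably fail the LDAP check quickly instead of blocking, which keeps a misbehaving directory from piling up threads or connections.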
--
Emmanuel Dreyfus
manu-S783fYmB3Ccdnm+***@public.gmane.org
Bruncsak, Attila
2014-02-14 11:00:12 UTC
Post by Emmanuel Dreyfus
There is still a problem with that approach: correctly handling
a misbehaving LDAP directory. We do not want to open an
infinite number of connections if they all get stuck.
This pseudocode is just to show the concept of how I imagine implementing
the client-side timeout:

Worker thread:

if no ldap_connection_established
then
    get_the_lock
    if no ldap_opener_thread_working
    then
        spawn ldap_opener_thread (detached, etc.)
        mark ldap_opener_thread_working
    fi
    release_the_lock
    wait only very briefly (client-side timeout) and return an LDAP error if there is still no connection
fi

LDAP opener thread:

open LDAP connection
if no ldap_connection_established
then
    sleep for a while (to throttle the connection attempts)
fi
get_the_lock
unmark ldap_opener_thread_working
release_the_lock
exit
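
The concept above could look roughly like this with POSIX threads. This is a hedged sketch rather than a drop-in patch: the `open_fn` callback stands in for the real (possibly blocking) LDAP connect call, `open_ok` and `open_stuck` are dummy openers for demonstration, and all names and delays are hypothetical. It also adds a condition variable, which the pseudocode does not mention, so that workers wake up as soon as the opener finishes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define RETRY_DELAY_SEC 1           /* throttle after a failed attempt */

typedef bool (*open_fn)(void);      /* hypothetical connect hook */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static bool connected;              /* "ldap_connection_established" */
static bool opener_working;         /* "ldap_opener_thread_working" */
static open_fn opener_fn;           /* set under the lock before spawning */

/* Dummy openers, for illustration only. */
bool open_ok(void) { return true; }
bool open_stuck(void) { sleep(1); return false; }

/* Opener thread: the only place allowed to block on the directory. */
static void *opener_thread(void *unused)
{
    (void)unused;
    bool ok = opener_fn();
    if (!ok)
        sleep(RETRY_DELAY_SEC);     /* throttle the connection attempts */
    pthread_mutex_lock(&lock);
    connected = ok;
    opener_working = false;
    pthread_cond_broadcast(&changed);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Worker side: returns 0 once connected, ETIMEDOUT after wait_ms. */
int worker_get_connection(open_fn do_open, int wait_ms)
{
    struct timespec deadline;
    int rc = 0;

    assert(wait_ms >= 0);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += wait_ms / 1000;
    deadline.tv_nsec += (long)(wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&lock);
    if (!connected && !opener_working) {
        pthread_t tid;
        pthread_attr_t attr;

        opener_fn = do_open;
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (pthread_create(&tid, &attr, opener_thread, NULL) == 0)
            opener_working = true;
        pthread_attr_destroy(&attr);
    }
    /* Client-side timeout: give up at the deadline, never block forever. */
    while (!connected && rc == 0)
        rc = pthread_cond_timedwait(&changed, &lock, &deadline);
    if (connected)
        rc = 0;
    pthread_mutex_unlock(&lock);
    return rc;
}
```

A milter callback would call something like `worker_get_connection(real_open, 100)` and treat `ETIMEDOUT` as a temporary LDAP failure. However many workers arrive, at most one opener thread is ever blocked on the directory.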
manu-S783fYmB3Ccdnm+
2014-02-15 05:36:21 UTC
Post by Bruncsak, Attila
This pseudocode is just to show the concept of how I imagine implementing
I realized that a much simpler way of dealing with the issue is to use
LDAP asynchronous requests. I am testing it right now.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu-S783fYmB3Ccdnm+***@public.gmane.org