Post by Bruncsak, Attila
Did you have the libmilter compilation option "_FFR_WORKERS_POOL" defined?
(FFR: for future release)
By the way, which version of sendmail does your libmilter come from?
No, and it is 8.14.7.
But I managed to track down the offending code. It was tricky because
once milter-greylist gets too many threads, gdb becomes unable to
explore them. Catching the process soon enough (351 threads) gives me this:
#0 0x00007f7ff6875d6a in ___lwp_park50 () from /usr/lib/libc.so.12
#1 0x00007f7ff70088f1 in ?? () from /usr/lib/libpthread.so.1
#2 0x00007f7ff78245a1 in ldap_send_initial_request ()
from /usr/pkg/lib/libldap_r-2.4.so.2
#3 0x00007f7ff7815668 in ldap_pvt_search ()
from /usr/pkg/lib/libldap_r-2.4.so.2
#4 0x00007f7ff781576f in ldap_pvt_search_s ()
from /usr/pkg/lib/libldap_r-2.4.so.2
#5 0x00007f7ff7815839 in ldap_search_ext_s ()
from /usr/pkg/lib/libldap_r-2.4.so.2
#6 0x00000000004157d9 in ldapcheck_validate (ad=<optimized out>,
stage=<optimized out>, ap=0x7f7fdffff4d0, priv=0x7f7ff5110800)
at ldapcheck.c:502
#7 0x00000000004120e8 in acl_filter (stage=AS_RCPT, ctx=<optimized out>,
priv=0x7f7ff5110800) at acl.c:2407
#8 0x0000000000408f53 in real_envrcpt (ctx=0x7f7ff7332220,
envrcpt=0x7f7ff511b3d0) at milter-greylist.c:725
#9 0x000000000040928f in mlfi_envrcpt (ctx=0x7f7ff7332220,
envrcpt=0x7f7ff511b3d0) at milter-greylist.c:230
#10 0x00000000004231b6 in st_rcpt ()
#11 0x000000000042301a in mi_engine ()
#12 0x0000000000420bbf in mi_handle_session ()
#13 0x000000000041faf9 in mi_thread_handle_wrapper ()
#14 0x00007f7ff700b2ce in ?? () from /usr/lib/libpthread.so.1
#15 0x00007f7ff6875d80 in ___lwp_park50 () from /usr/lib/libc.so.12
ldap_send_initial_request() uses two mutexes. I think one thread gets
stuck opening the connection or sending the request while holding one
of them, and the other threads wait on it.
The timelimit option of ldap_search_ext_s() will not help: this is
a server-side timeout for the request.
I think the fix is to start a new LDAP connection when we detect the
deadlock. Detection could trigger because the number of threads involved
in LDAP operations reaches a threshold, or because the oldest operation
hits a timeout. I suspect the second approach is better.
There is still a problem with that approach: handling a misbehaving
LDAP directory correctly. We do not want to open an unbounded number
of connections if they all get stuck.
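
To make the idea concrete, here is a minimal sketch of the timeout-based
detection combined with a cap on the number of connections. It is not
milter-greylist code: the slot structure, the pool_decide() function and
both constants (LDAP_POOL_MAX, LDAP_STALL_SECS) are hypothetical names
and values chosen for illustration. It only decides what to do; it makes
no LDAP calls, and locking around the slot table is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical values, to be tuned: not taken from milter-greylist. */
#define LDAP_POOL_MAX   4   /* assumed cap on simultaneous connections */
#define LDAP_STALL_SECS 10  /* assumed age at which an operation counts as stuck */

/* One entry per possible LDAP connection (hypothetical structure). */
struct ldap_conn_slot {
	int in_use;          /* slot holds an open connection */
	int busy;            /* number of operations currently in flight */
	time_t oldest_start; /* start time of the oldest in-flight operation */
};

enum pool_action { POOL_USE_EXISTING, POOL_OPEN_NEW, POOL_FAIL };

/* A connection is considered stuck when its oldest operation is too old. */
static int
slot_is_stalled(const struct ldap_conn_slot *s, time_t now)
{
	return s->in_use && s->busy > 0 &&
	    difftime(now, s->oldest_start) >= LDAP_STALL_SECS;
}

/*
 * Pick a connection for a new request:
 * - reuse the first open connection that is not stuck;
 * - otherwise open a new one in the first free slot;
 * - otherwise (every slot is open and stuck) give up, so that a
 *   misbehaving directory cannot make us open connections forever.
 * On POOL_USE_EXISTING and POOL_OPEN_NEW, *idx is the slot to use.
 */
enum pool_action
pool_decide(const struct ldap_conn_slot *slots, size_t n, time_t now,
    size_t *idx)
{
	size_t i, free_idx = n;

	assert(slots != NULL && idx != NULL);

	for (i = 0; i < n; i++) {
		if (!slots[i].in_use) {
			if (free_idx == n)
				free_idx = i;
			continue;
		}
		if (!slot_is_stalled(&slots[i], now)) {
			*idx = i;
			return POOL_USE_EXISTING;
		}
	}

	if (free_idx < n) {
		*idx = free_idx;
		return POOL_OPEN_NEW;
	}
	return POOL_FAIL;
}
```

In this sketch the caller would treat POOL_FAIL like any other LDAP
failure (for instance by returning a temporary error for the recipient),
and stuck slots would have to be reclaimed once their operations
eventually return or the connection is torn down.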
--
Emmanuel Dreyfus
manu-S783fYmB3Ccdnm+***@public.gmane.org