Tuesday, July 28, 2015

LDAP Performance Troubleshooting - Isilon

Recently we received complaints about an authentication issue: some users were unable to log in to the Isilon cluster because their login requests were timing out. It was an odd issue, in that some users could log in to every individual Isilon node by IP address, by SSIP, and by the pool DNS addresses, while other users could log in to only a couple of nodes and not the rest.
No changes had been made on either the Isilon cluster or the LDAP server. Below are the troubleshooting steps we performed to isolate the location and root cause of the issue.

1) Logged in to each node individually by IP address, by SSIP, and by the pool DNS addresses.
2) Repeated all of step 1 with different LDAP users.
3) Checked with the LDAP team whether they were receiving the LDAP requests, verifying both the server logs and the Splunk log repositories.
4) Restarted the authentication services.
5) Verified whether any changes had been made on the Isilon cluster or the LDAP server around the time the issue started.
6) Verified whether the Isilon cluster configuration was consistent across all nodes, since logins to some nodes worked while others did not.
7) Listed all the physical components the request and response flow through on the network:
Isilon -> Nexus 5K -> Nexus 7K -> F5 load balancers -> Nexus 7K -> Fabric Interconnect -> ESXi hosts -> virtual LDAP machines, and the response path in reverse from the LDAP virtual machines back to the Isilon cluster.
8) Captured network traces on the Isilon cluster as well as on the Nexus 7K switch while running a couple of test logins to observe the flow of the LDAP requests.
The commands for taking TCP dumps on the Isilon cluster, and other helpful troubleshooting commands, are provided at the end.

9) Used Wireshark to examine the tcpdump pcap captures.

10) Once the pcaps were open in Wireshark, filtered the frames with Decode As -> LDAP to narrow the output to LDAP frames.

11) Decode As can be applied to any protocol (TCP, UDP, LDAP) to narrow the output to the preferred format for ease of troubleshooting.

Another way to filter is with display filters, for example matching HTTPS or LDAP traffic against a value such as "ABC", where ABC is a user ID or any other string of interest.

Right-click on any frame and use Follow TCP Stream to check the complete flow that happened during a particular session.

In the stream view, the red color code indicates the requests from the Isilon cluster, and green represents the responses from the server.
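For reference, display filters of the kind described above look like this (syntax as in current Wireshark releases; verify the field names against your version):

```text
ldap                          show only LDAP frames
tcp.analysis.retransmission   show retransmitted segments
frame contains "abc"          frames containing the string abc (e.g. a user ID)
tcp.stream == 5               all frames belonging to TCP stream number 5
```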

12) We opened the captures taken on the Isilon cluster and on the 7K and compared the same session in both pcaps.

The 7K capture showed that the switch received the responses from the LDAP servers, but with a lot of retransmissions (frames in red), whereas the Isilon pcaps were missing all of the responses. The cluster simply waited 100 seconds, sent an unbind request, received its response, and closed the connection cleanly.

13) An F5 engineer verified and confirmed that all packets were being placed on the wire toward the 7K switch.

14) That isolated the Isilon cluster, the LDAP server, and the F5 from the suspect list, since both ends were trying to communicate and the responses made it all the way back to the 7K. That left two devices on the network: the 7K and 5K switches.

15) Logged in to the 7K switch and started shutting down, one at a time, the ports connecting to the 5K switch, testing logins to the Isilon cluster after each.

16) After testing three ports, logins started working once the 4th port was shut down.

17) Verified the configuration and counters for that port interface and found CRC and other errors that were eating the packets.
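The check in step 17 can be scripted. The sketch below extracts the CRC error counter from saved "show interface" output and flags a nonzero count. The sample output line is illustrative only; the exact counter format varies by switch OS, so treat this as an assumption-laden sketch, not switch-specific tooling.

```shell
# Sketch: flag a switch port whose "show interface" output reports CRC errors.
# The sample line below is illustrative; real NX-OS/IOS output formats vary.

crc_errors() {
    # Print the number that immediately precedes the word "CRC"
    awk '{ for (i = 2; i <= NF; i++) if ($i == "CRC") print $(i-1) }'
}

sample='  129 input errors  57 CRC  0 frame  0 overrun'
count=$(printf '%s\n' "$sample" | crc_errors)

if [ "$count" -gt 0 ]; then
    echo "port reports $count CRC errors - suspect cable, SFP, or port"
fi
```

Running this against the saved output of each port in turn narrows the suspect list without shutting ports down blindly.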


Below are commands that help with the troubleshooting process. Use as required.

Collecting TCP dumps:

tcpdump -i vlan0 -s 0 -w /ifs/data/Isilon_Support/node2/ssh_login.pcap host 10.10.10.10 
tcpdump -i vlan0 -s 0 -w /ifs/data/Isilon_Support/node2/node2_ldap.pcap host 10.10.10.10
tcpdump -i vlan0 -s 0 -w /ifs/data/Isilon_Support/node2/node2_mapping.pcap &

isi_for_array 'tcpdump -s 0 -i lagg0 -w /ifs/data/Isilon_Support/$(date +%m%d%Y)/`hostname`.$(date +%m%d%Y_%H%M%S).lagg0.cap &'

Verify active connections on the Isilon cluster:

isi_for_array "netstat -an | grep 10.10.10.10"
isi_for_array "ifconfig | grep 10.10.10.10"


Other commands

tail -f /var/log/lsassd.log       Authentication log file
ps aux | grep lsass               Currently running processes
ifconfig -a
ls -lrth
isi auth ldap list                List LDAP servers configured on the Isilon cluster
isi auth mapping token --user=abcd --zone=1    Verify mapping information for an LDAP user
isi auth mapping token --user=abcd
isi auth mapping flush            Flush the mapping cache
isi auth mapping flush --all      Flush the entire mapping cache
isi_for_array -n3,4,5 isi auth mapping token --user=abcd
isi_for_array -n3 isi auth mapping token --user=abcd

ldapsearch -H ldap://10.10.10.10 -b 'ou=enterprise,o=abc,c=us'
ldapsearch -H ldap://10.10.10.10 -x -b "" -s base "(objectclass=*)" supportedControl
ldapsearch -H ldap://10.10.10.10 -x -b ou=enterprise,o=abc,c=xyz
ldapsearch -H ldap://10.10.10.10:2389 -x -b ou=enterprise,o=abc,c=xyz
/usr/likewise/bin/lwsm list
ldapsearch -x -LLL -H ldap://10.10.10.10:2389 -b 'ou=enterprise,o=abc,c=xyz' -D abcd

date; isi auth mapping token --user=abcd
ls -l /ifs/data/Isilon_Support/node2/node2_mapping.pcap
ls -lh /ifs/data/Isilon_Support/node2/node2_mapping.pcap

ping -c 1000 10.10.10.10 -W 1
ping -c 1000 -W 1 10.10.10.10 
ping -c 1000 10.10.10.10 
isi services -a
isi_for_array "ps auxww | grep lsass | grep -v grep"

ldapsearch -x -h abc.xyz.com -p 2389 -D "abcd" -W -b "" -s base "objectclass=*"
isi_for_array "isi_classic auth ads cache flush --all"    Flush the AD cache
isi_for_array "isi_classic auth mapping flush --all"      Flush the mapping cache
isi_for_array "killall -9 lsassd"                         Kill the lsassd authentication daemon; it will be restarted automatically by MCP (the master control process)


ldapsearch -h abc.xyz.ldap.com -p 2389 -D "uid=abc,ou=def,ou=enterprise,o=hij,c=abd" -W  -b "ou=enterprise,o=hij,c=abd" 
ldapsearch -h abc.xyz.ldapserver.com -p 2389 -D "uid=abc,ou=def,ou=enterprise,o=hij,c=abd" -W  -b "ou=enterprise,o=hij,c=abd"  
isi auth ldap list -v
ldapsearch -h abc.xyz.ldapserver.com -p 2389 -D "uid=abc,ou=def,ou=enterprise,o=hij,c=abd" -W -b "ou=enterprise,o=hij,c=abd" "(&(objectClass=posixAccount)(uidNumber=1234))"
ldapsearch -h abc.xyz.com -p 2389 -D "uid=abc,ou=def,ou=enterprise,o=hij,c=abd" -W -b "ou=enterprise,o=hij,c=abd" "(&(objectClass=posixAccount)(uidNumber=1234))"

isi auth mapping dump| less
isi auth mapping dump| wc -l
isi auth mapping token --user=abc
isi auth mapping token --uid 1234
less /var/log/messages
tail /var/log/messages

isi auth log-level                            Show the current log level
isi auth log-level --set=debug                Set the log level to debug
isi auth log-level --set=warning              Set the log level back to warning

ping -c 10 abc.xyz.ldap.com
traceroute abc.xyz.ldapserver.com
isi auth status
isi status
isi auth ldap view Primary
less /var/log/lsassd.log
isi auth mapping token abcdef
isi auth users view abcdef
less /var/log/lwiod.log
less /var/log/messages
less /var/log/lsassd.log

cd /etc/openldap
ls
less ldap.conf
less ldap.conf.default
less /ifs/.ifsvar/main_config_changes.log
less /var/log/lsassd.log
isi_for_array -s isi auth ldap.conf
isi auth status
isi auth ldap view --provider-name=Primary | egrep "Group Filter:|User Filter:"
isi auth ldap view --provider-name=Primary 
isi_for_array -s isi auth ldap view --provider-name=Primary | grep "User Filter:"
isi_for_array -s isi auth ldap view --provider-name=Primary | grep "Group Filter:"
isi_for_array -s isi auth ldap view --provider-name=Primary | grep "User Domain:"
isi_for_array -s /usr/likewise/bin/lwsm list 
isi_for_array -s "ps awux | grep lw"

ifconfig
isi zone zones list
isi zone zones view system
isi_for_array -s isi zone zones view system
isi networks list pools
isi networks list pools -v
exit
isi status
isi networks list pools
isi networks list pools --name=pool1
mkdir /ifs/data/Isilon_Support/$(date +%m%d%Y)
isi_for_array 'tcpdump -s 0 -i lagg0 -w /ifs/data/Isilon_Support/$(date +%m%d%Y)/`hostname`.$(date +%m%d%Y_%H%M%S).lagg0.cap &'

Friday, July 10, 2015

Isilon : Sync IQ scheduler memory leak issue

Current Isilon 7.* versions have a memory leak which causes the SyncIQ scheduler to run out of its allocated 512 MB maximum memory and go into a hung state. This state stops all jobs, whether incremental or full, from initializing. The current code doesn't trigger any alerts during this outage until someone checks manually.

To avoid running into this outage situation, follow the steps below:


Isilon has developed a script that monitors the sync scheduler's memory utilization and triggers email alerts once it reaches a certain threshold, so that the sync process can be restarted before it goes into the hung state.

Below is the command to verify the memory usage manually.



# isi_for_array -s "ps awxu | grep isi_migr_sched | grep -v grep | awk '{print \$1, \$6}'"    This command gives the current memory usage (user and RSS) across all nodes in the cluster

For example, if we want to be notified when memory reaches 470 MB: the script is available from EMC support; edit the threshold value in the script to 470 MB.

Once we receive the email alert, run the following commands to reset the memory:

isi sync settings modify --service=off
isi sync settings modify --service=on

These commands reset the memory usage to around 76 MB.
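The check the monitoring script performs can be sketched roughly as below. The field position (RSS in KB in column 6 of `ps awxu`) and the 470 MB threshold come from the text above; everything else is an assumption for illustration, not EMC's actual script.

```shell
# Rough sketch of the scheduler-memory threshold check (not EMC's script).
# Assumes `ps awxu` reports RSS in KB in field 6, per the command above.

THRESHOLD_KB=$((470 * 1024))   # 470 MB, the example threshold from the text

# Succeeds (exit 0) when the given RSS value exceeds the threshold.
over_threshold() {
    [ "$1" -gt "$THRESHOLD_KB" ]
}

# On a cluster, rss_kb would come from something like:
#   isi_for_array -s "ps awxu | grep isi_migr_sched | grep -v grep" | awk '{print $6}'
rss_kb=490000   # example value for illustration

if over_threshold "$rss_kb"; then
    echo "isi_migr_sched over threshold - restart the sync service:"
    echo "  isi sync settings modify --service=off"
    echo "  isi sync settings modify --service=on"
fi
```

Wiring the check into cron with a mail alert reproduces the behavior described above; remember it must be re-armed after a node reboot.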


Note: the script from Isilon has to be re-run every time a node gets rebooted.


** A permanent fix is expected in the Riptide release (8.0), due in Q4.