There are really good tools and documents available on troubleshooting VMware. I'm learning, and as I go I want to share what I learn with you all. Production scenarios are quite different, but the approach below might help you find bottlenecks in advance. I intend to show how to keep tabs on the values below over a period of time, say weekly or monthly. Any significant change or variation in these values means there is a problem in the system that needs to be addressed ASAP.
Three important counters for finding disk bottlenecks
- Kernel disk command latency: This is the VMkernel processing time for each SCSI command, so it shows how fast the VMkernel can process SCSI commands. Ideally it should be between 0 and 1 millisecond. If it crosses 4 milliseconds, you need to look at either increasing CPU or queue depth. Below is a real-time observation from one of my ESXi boxes. You should actually observe this counter over a week's time to plan for future expansion.
- Physical device command latency: This is the amount of time taken by the physical device to complete the SCSI command. Here the physical device means the HBA, switch fabric or ports. If this value is more than 15 milliseconds, it can indicate a problem with the storage array.
Remember to go by the average: here naa.6000eb316dd2aeb80000000000000020 peaks at 315 milliseconds at one point, but on average it comes to only 24 milliseconds. Remember that these are real-time values; we should actually be watching them over a weekly or monthly period to keep tabs on them.
- Queue command latency: How long a SCSI command spends in the VMkernel queue. Ideally it should be zero, as seen below. If it is not zero, then there is again a problem in the storage array.
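To make the "go by the average" advice concrete, here is a minimal Python sketch that checks a set of latency samples against the three thresholds mentioned above (4 ms kernel, 15 ms device, 0 ms queue). The counter labels `kernel`, `device` and `queue`, the function names, and the sample numbers are all made up for illustration; how you collect the samples from your own hosts is up to you and is not covered here.

```python
from statistics import mean

# Warning thresholds in milliseconds, taken from the three counters above.
# The dictionary keys are hypothetical labels, not official counter names.
THRESHOLDS_MS = {
    "kernel": 4.0,   # kernel disk command latency
    "device": 15.0,  # physical device command latency
    "queue": 0.0,    # queue command latency should ideally stay at zero
}


def summarize(samples_ms):
    """Return (average, peak) for a list of latency samples in milliseconds."""
    return mean(samples_ms), max(samples_ms)


def find_bottlenecks(readings):
    """Return the counters whose AVERAGE latency exceeds its threshold.

    readings maps a counter label to a list of samples in milliseconds.
    A single spike is ignored unless it drags the average over the limit.
    """
    flagged = []
    for counter, samples in readings.items():
        average, _peak = summarize(samples)
        if average > THRESHOLDS_MS[counter]:
            flagged.append(counter)
    return flagged


# Made-up samples: the device counter spikes once to 40 ms,
# but its average stays under the 15 ms threshold.
readings = {
    "kernel": [0.3, 0.5, 0.4],
    "device": [5.0, 40.0, 6.0, 5.0],
    "queue": [0.0, 0.0, 0.0],
}

print(find_bottlenecks(readings))
```

Feeding it a week or a month of samples instead of a handful of real-time readings is what gives the trend view described above.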