Recently we had an "interesting" situation arise that required a bit of work and a bit of thought. The issue was that some of our slave servers began "hanging". We noticed that a table change didn't get propagated down to one of the slaves, and investigation showed that the master binary log execution position (Exec_Master_Log_Pos in the output of SHOW SLAVE STATUS\G) was not changing. There were no errors in the error log. Stopping the slave and restarting it brought replication back online, but only for a few minutes. Then the exact same behaviour was exhibited -- no errors, but replication hung.
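The quickest way to confirm a stall like this is to compare two snapshots of Exec_Master_Log_Pos taken a few seconds apart: if the value hasn't moved, the SQL thread isn't executing anything. A minimal sketch of that comparison follows -- in real use each snapshot would come from mysql -e 'SHOW SLAVE STATUS\G'; here the snapshots are stubbed with sample output so the extraction logic stands on its own:

```shell
# Two SHOW SLAVE STATUS\G snapshots, taken a few seconds apart.
# (Stubbed sample lines; normally captured with: mysql -e 'SHOW SLAVE STATUS\G')
snap1="        Exec_Master_Log_Pos: 107365921"
snap2="        Exec_Master_Log_Pos: 107365921"

# Pull the numeric position out of each snapshot.
pos1=$(printf '%s\n' "$snap1" | awk -F': ' '/Exec_Master_Log_Pos/ {print $2}')
pos2=$(printf '%s\n' "$snap2" | awk -F': ' '/Exec_Master_Log_Pos/ {print $2}')

# Unchanged position between snapshots suggests a hung slave.
if [ "$pos1" = "$pos2" ]; then
    echo "Exec_Master_Log_Pos stuck at $pos1 -- replication may be hung"
else
    echo "Exec_Master_Log_Pos advanced from $pos1 to $pos2"
fi
```

With the sample data above both snapshots match, so the script reports the position as stuck -- exactly the symptom we were seeing.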
It was an odd situation and one in which I thought the core problem was a network issue. However, I had to prove the problem to other people. I thought it was a good time to pull out one of the many tools in the Percona Toolkit. In this case I combined the standard Unix tcpdump tool with pt-query-digest to easily verify that, in fact, the slave server was periodically just "losing contact" with the master -- it would send out a request, but the master would never receive it.
The following was run on the master server. Using the host option, I narrowed the data being logged down to the specific slave (host.ip.address), as this server had multiple slaves. That data was then piped into pt-query-digest, which gave me the results in real time.
tcpdump -s 65535 -x -nn -q -tttt -i any host host.ip.address | pt-query-digest --type tcpdump --print --noreport
The following was run on the slave:
tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 | pt-query-digest --type tcpdump --print --noreport
In this case I captured all data on port 3306 on any interface. After a few minutes of analyzing these two commands running simultaneously, it became very clear that the core issue was a network problem (since resolved, thankfully): the slave was sending requests that simply never arrived at the master.
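If you can't watch both terminals live, the same analysis works offline: save the capture to a file and feed it to pt-query-digest afterwards. This is the pattern the pt-query-digest documentation itself suggests; the packet count and filename here are just illustrative choices:

```shell
# Capture 1000 packets of MySQL traffic to a file, then digest it offline.
# -c 1000 bounds the capture; adjust (or drop) to taste.
tcpdump -s 65535 -x -nn -q -tttt -i any -c 1000 port 3306 > mysql.tcp.txt
pt-query-digest --type tcpdump mysql.tcp.txt
```

The offline file also makes it much easier to hand the evidence to other people -- which, in this case, was half the battle.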