Fun with Running a Cluster on Two Servers
In preparation for rolling out a production cluster I had to set up a development cluster. I set things up on two servers so that a SQL node and an ndb node ran on each server. What fun. Right.
To start off with, I set the two servers up as two virtual machine images. My initial problem came because I was running both of the VMs on the same physical server. Heartbeats were being missed and nodes were dropping off because both VMs would try to write to the same physical drive at the same time. This caused latency, which led to dropped heartbeats and then dropped nodes.
It took probably a day and a half to figure this out. We then moved one of the VMs to another physical server and this resolved the latency issue.
I thought I was home free. Should have known better. A couple of days later one of the developers asked if I was having a problem with the development cluster. I sat down and started looking around. The SQL node on one of the servers was not running. I looked in the .err file (and if you don’t have it set in your configuration files…shame on you). Hmmm… a nice little dump of the mysqld daemon. It seg faulted. Here is the information from a later dump (every dump looked very much the same):
071018 16:38:14 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=33
max_threads=60
threads_connected=29
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 147493 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
thd: 0x174f760
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xb, backtrace may not be correct.
Bogus stack limit or frame pointer, fp=0xb, stack_bottom=0x40bf0000, thread_stack=131072, aborting backtrace.
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x1759c50 = REPLACE INTO file (fid, dmid, dkey, length, classid, devcount) VALUES ('424939','4','Post_416_Content','15','0',0)
thd->thread_id=27
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.
071018 16:38:14 mysqld_safe Number of processes running now: 0
071018 16:38:14 mysqld_safe mysqld restarted
071018 16:38:14 [Note] Plugin 'InnoDB' disabled by command line option
071018 16:38:17 [Note] Starting MySQL Cluster Binlog Thread
071018 16:38:18 [Note] Event Scheduler: Loaded 0 events
071018 16:38:18 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.1.22-rc' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server (GPL)
The statement in the variable dump (REPLACE INTO) was in every dump. I still don’t know why that is the case, but there is no syntax error in the statement. The statement itself is executed as part of the Mogile Filesystem.
The mysqld daemon was crashing and dumping about three times in a twenty-four hour period. Every time, the dump looked almost identical to the one listed above.
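One way to put a number on how often mysqld is crashing is to count the crash lines in the .err file. What follows is only a minimal Python sketch, not something from my actual setup: it assumes every crash starts with a line shaped like the "mysqld got signal" line in the dump above, and the function name crashes_per_day and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches crash lines such as "071018 16:38:14 - mysqld got signal 11;".
# \s+ tolerates a space-padded hour (an assumption about the log format).
CRASH_LINE = re.compile(
    r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) - mysqld got signal (\d+)"
)


def crashes_per_day(lines):
    """Count 'mysqld got signal' lines per calendar day."""
    counts = Counter()
    for line in lines:
        match = CRASH_LINE.match(line)
        if match:
            # The log writes dates as YYMMDD, e.g. 071018 is 2007-10-18.
            day = datetime.strptime(match.group(1), "%y%m%d").date()
            counts[day] += 1
    return counts


# Illustrative sample; in practice you would pass an open .err file instead.
sample = [
    "071018 16:38:14 - mysqld got signal 11;",
    "This could be because you hit a bug.",
    "071018 16:38:14 mysqld_safe mysqld restarted",
]

for day, count in sorted(crashes_per_day(sample).items()):
    print(day.isoformat(), count)
```

Only the "got signal" line is counted, so the mysqld_safe restart line that follows each crash does not inflate the total.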
To make a long story a little shorter: because of the issues I had with latencies and nodes shutting down on the VMs, I decided to move the cluster from the two VMs to two physical computers. Memory was identical (one gig each) between the VMs and the physical servers. So, off we go. I honestly was very hopeful at this point that my problems were over.
It didn’t take long before the mysqld daemon seg faulted again. Wow, this was getting frustrating. I was monitoring the two servers around the clock. Time for some (more) configuration changes. I had configured the ndbd daemon so that it was locked into memory. This is a good thing and something I HIGHLY recommend. The ‘top’ command showed that it was steadily holding about 80% of the total memory. I lowered that to 71.3% and added some configuration parameters to my.cnf to try to keep the mysqld daemon smaller.
I restarted the mysql daemon yesterday at one p.m., so it has been running for thirty-four hours. Everything looks good. I have been running the developer unit tests repeatedly, including the one test that exercises the Mogile filesystem. I am starting to breathe a little easier that things might be going well this time. Total time spent on this one problem: a week. What a waste. Now, I was doing other things too, but this was consuming a major portion of my days (and nights).
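To see why going from 80% to 71.3% could matter on a one-gig box, here is a rough back-of-the-envelope sketch in Python. It assumes that "one gig" means 1024 MB, that the percentage from top is a share of total physical memory, and that the 147493 K figure from the crash dump is a fair worst case for mysqld (it is an upper bound printed by mysqld, not measured usage). The function name headroom_kb is hypothetical.

```python
TOTAL_RAM_KB = 1024 * 1024      # assumption: "one gig" per server = 1024 MB
MYSQLD_WORST_CASE_KB = 147493   # upper bound mysqld printed in the crash dump


def headroom_kb(ndbd_share, total_kb=TOTAL_RAM_KB, mysqld_kb=MYSQLD_WORST_CASE_KB):
    """RAM left for the OS and everything else once ndbd holds its locked
    share and mysqld grows to its printed worst case."""
    return total_kb * (1 - ndbd_share) - mysqld_kb


# Compare the original ndbd share with the lowered one.
for share in (0.80, 0.713):
    mb = headroom_kb(share) / 1024
    print(f"ndbd at {share:.1%}: about {mb:.0f} MB of headroom")
```

Under those assumptions, roughly 61 MB is left for the OS and everything else with ndbd at 80%, versus roughly 150 MB at 71.3%. This does not prove that memory pressure was behind the seg faults; it only shows how tight things were.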
Others might argue with this, but I would never put the SQL nodes on the same servers as the ndbd nodes for production. Some say you can run multiple ndbd nodes on the same server, and I am more comfortable with that since I can lock the ndbd daemon into memory and know it’s not going to change (my ndbd nodes on those two servers have been at exactly 71.3% since I started them up). If I had servers for the ndbd nodes that had 16+ gigs of RAM, I might start allocating 4 gigs of RAM to each ndbd daemon with 3+ daemons per server. My understanding is that this helps keep the transactional logs for the nodes under control. When you do an ndbd node restart it takes less time for a node to get up and running because there are smaller files to read. I might be mistaken and it’s too late for me to look it up :) Anyone got other reasons, or maybe (if I am right) someone can elaborate?
4 Comments so far
Running more than one ndbd on a machine…
…
For log sizes and recovery times see this article by Johan Andersson
http://johanandersson.blogspot.com/2007/03/benchmarking-node-restarts-times-in-51.html
Just a note on running SQL nodes on NDB nodes: apart from the issues you were having, the NDB nodes have no security, so the cluster nodes should be quarantined behind a secure network.
Have Fun
Paul