The code as it stands is not 100% robust to various types of failures.
For example, a compute node works like this (roughly):
- Take job description message from work queue
- If more than 1 iteration left, put new message in with iteration:=iteration-1
- Delete existing message
- Do experiment
- Send result when convenient (in separate thread)
This has a problem. If the node crashes or a queue disappears or the Internet dies between deleting the existing message and the result getting sent, that job is lost. This isn't usually terrible, because we run each experiment 40+ times, so if there are 40 of some, 39 of others, no big deal. But, currently I'm trying to do exactly 1 run of 6 million experiments. Can't afford to lose any.
There is a similar problem in the aggregator node. After a new result is aggregated, the result message is deleted. The result is only really locked in when the aggregator flushes. It is supposed to flush at regular intervals, and when it shuts down. But today I ran into a problem where the node's command queue disappeared, so I couldn't send a shutdown message to the aggregator. So, aside from trying to CTRL-C it right after a flush finishes, this is a problem.
Whatever we do, we might still lose some jobs, lose some results, or get an occasional duplicate job/result in the database. Realistically, we should be addressing ALL of these issues with the logbook. So, here is a multi-point plan.
- Don't delete work or results messages until we are DONE with them. In the compute node, this means not resubmitting the decremented job until the current result is successfully delivered. In the aggregator, it means not deleting the results messages until we flush the database back to the remote storage service. This will cut down on lost jobs.
- Use a larger visibility timeout in the work and results queues. This controls how long a message is invisible after it has been received but not deleted. This needs to be set high enough to make sure that a message won't be picked up a second time before handling it is completed. This will cut down on duplicated results and jobs.
- Update the queueing infrastructure to include a "recreate if missing" option on queues, and detect the "queue missing" exception inside the queue code. If we/Amazon/someone accidentally delete a queue for a running node, it would be great if we could just recreate the queue on the fly. We'll have to watch for the "can't recreate queue within 60 seconds" exception.
- Create a receipt of all submitted jobs. Keep it in a SQLite database with a similar structure to the results database. Later, we can check the receipts against the results to make sure everything is accounted for, and possibly resubmit the jobs whose results are missing.
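The first two points of the plan can be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryQueue` and `process_one` are hypothetical names, the queue is a toy in-memory stand-in rather than the real queue client, and time is passed in explicitly so the behaviour is easy to follow. The point is the ordering: deliver the result, resubmit the decremented job, and only then delete the original message.

```python
import itertools


class InMemoryQueue:
    """Toy stand-in for a queue with a visibility timeout (hypothetical, not a real client)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time at which it is visible again]
        self._ids = itertools.count()

    def send(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def receive(self, now):
        """Return (id, body) for one visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


def process_one(work, results, run_experiment, now):
    """Handle one job; the work message is deleted only after the result is delivered."""
    received = work.receive(now)
    if received is None:
        return False
    msg_id, job = received
    result = run_experiment(job)   # if this raises, the message stays in the queue
    results.send(result)           # deliver the result first
    if job["iteration"] > 1:       # then resubmit the decremented job
        work.send({**job, "iteration": job["iteration"] - 1})
    work.delete(msg_id)            # finally drop the original message
    return True
```

If the node dies anywhere before the final `delete`, the message becomes visible again once the timeout expires and another node picks it up. A crash after the result is sent but before the delete produces a duplicate rather than a loss, which is the trade-off described above, and it is why the visibility timeout has to be longer than the slowest experiment.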
With these 4 strategies together, I think I can solve this problem. Thankfully, they are all fairly easy to implement as well.
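As a rough idea of how small the receipt check could be, here is a minimal sketch using Python's standard `sqlite3` module. The table name `receipts`, the `job_id` column, and the function names are all assumptions for illustration; the real schema would mirror whatever the results database uses.

```python
import sqlite3


def open_receipts(path=":memory:"):
    """Open the receipts database and create the table if needed (schema is assumed)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS receipts ("
        "job_id TEXT PRIMARY KEY, "
        "submitted_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def record_submission(db, job_id):
    """Log a submitted job; INSERT OR IGNORE keeps a repeated submission harmless."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT OR IGNORE INTO receipts (job_id) VALUES (?)", (job_id,))


def missing_jobs(db, result_job_ids):
    """Return the receipt job ids that have no matching result, sorted by id."""
    done = set(result_job_ids)
    rows = db.execute("SELECT job_id FROM receipts ORDER BY job_id")
    return [job_id for (job_id,) in rows if job_id not in done]
```

Calling `record_submission` just before each job message is sent, and `missing_jobs` with the job ids read from the results database once a run finishes, would give the list of jobs to resubmit.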