The results aggregator will be listening at the end of a queue, receiving messages about results files that are waiting to be processed.
<Figure out what the message format is>
<Figure out where the intermediate files will be stored>
The aggregator will be a long-lived process that is always listening. It can run on any machine, but running it inside the cloud is preferable: data transfer between S3 and compute in the same region is free, and bandwidth to and from S3 is highest from within the cloud.
Only one aggregator process should be running at a time per queue.
The aggregator will:
- Receive a message from the queue
- Look up the <experiment,agent> pair (either by traversing S3 or, more likely, from its in-memory cache)
- Open the appropriate index file (either via S3 download or more likely cached in memory)
- Make an entry for this parameter configuration if necessary (optionally saving the index file back to S3) <-- might want to do this less often than every record
- Retrieve the data file (either via S3 download or cached in memory)
- Append the new data record (we may cache runkeys so we can skip duplicates, though duplicates should be rare)
- Write the updated data file back to S3 <-- might want to do this less often than every record
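The loop above can be sketched in Python. This is a minimal sketch, not a final design: the message format is still TBD (per the note above), so the JSON fields `experiment`, `agent`, `runkey`, and `record` are assumptions, as are the `FakeStore` class (a stand-in for S3), the file path layout, and the `FLUSH_EVERY` batching threshold. It does illustrate two ideas from the list: deduplicating on cached runkeys, and writing data files back less often than every record.

```python
import json
from collections import defaultdict

FLUSH_EVERY = 50  # write a data file back only every N records (assumed threshold)

class FakeStore:
    """Stand-in for S3: maps keys to bytes. Real system would use S3 GET/PUT."""
    def __init__(self):
        self.objects = {}
    def get(self, key, default=b"[]"):
        return self.objects.get(key, default)
    def put(self, key, data):
        self.objects[key] = data

class Aggregator:
    def __init__(self, store):
        self.store = store
        self.data_cache = defaultdict(list)  # (experiment, agent) -> records in memory
        self.seen_runkeys = set()            # cached runkeys for dedupe
        self.pending = defaultdict(int)      # unflushed record count per pair

    def handle(self, message):
        # Message format is TBD; assume a JSON body with these fields for now.
        msg = json.loads(message)
        key = (msg["experiment"], msg["agent"])
        if msg["runkey"] in self.seen_runkeys:
            return  # duplicate record; skip (expected to be rare)
        self.seen_runkeys.add(msg["runkey"])
        self.data_cache[key].append(msg["record"])
        self.pending[key] += 1
        if self.pending[key] >= FLUSH_EVERY:
            self.flush(key)

    def flush(self, key):
        # Write the updated data file back (an S3 PUT in the real system).
        path = f"{key[0]}/{key[1]}/data.json"
        self.store.put(path, json.dumps(self.data_cache[key]).encode())
        self.pending[key] = 0
```

A single-consumer loop would then just poll the queue and call `handle` on each message, calling `flush` on all pending keys at shutdown so no buffered records are lost.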