Status Blog‎ > ‎

Merging SQLite Databases

posted Jul 11, 2009, 10:55 AM by Brian Tanner   [ updated Sep 5, 2009, 10:24 AM ]
I have improved the speed of aggregation by allowing it to be done in parallel.

Note: The tricks and techniques here are currently in

In the old system, there could only be one aggregator node, and it would download the existing result summary, update it, then flush it back to the server occasionally.  As the file got larger, the flushes took longer.  Also, if this was too slow to keep up with the incoming results, too bad.

The new system allows us to run multiple aggregators.  Each one starts its own results summary (uniquely identified), and then flushes that to the server occasionally.  We have a new problem: when we want to look at results, we have to aggregate multiple summaries.  This is actually not a bad situation, because it means we can download and process results in parallel instead of eneding to download a single huge file.

To oversimplify the situation, we have N databases that look like:
Agent: <ROWID (Long), ParamSummary (String)>
Experiment: <ROWID (Long), ParamSummary (String)>
Result: <AGENTID (Long), EXPERIMENTID (Long), Result (Blob)>

We will think of combining databases by calling one of them the "main" database and one of them the "new" database.  We want to get all of the information from new into main.

The steps are something like:

  • Agent Records
    • Make sure all of the agent records from new exist in main
  • Do the same for experiments
  • Copy Result records from new to main.  In main, they need to have the *new* ROWIDs for the agent Agent and Experiment tables
I think the easiest way to do this might be to alter the Agent and Experiment tables to have a new column, "NEWROWID".  Then maybe we can do something as simple as "insert or replace" from new->main, then select the ROWIDs from main to new.Agent.NEWROWID, then insert from new to main for the records.

Now that I'm looking at this, I think we need to have an index on paramSummary!

Here is the plan as it is unfolding.

attach 'newdb.sqlite as new;
alter table new.Agent add column newIndex INTEGER;

sqlite> select count(*) from main.Agent;
sqlite> select count(*) from new.Agent;

Here is the total schema for the database.
CREATE TABLE 'Agent'('id' INTEGER PRIMARY KEY  AUTOINCREMENT,'agentShortName' VARCHAR(32),'agentName' VARCHAR(32),'agentJar' VARCHAR(256),'paramSummary' VARCHAR(1024));
CREATE TABLE 'Experiment'('id' INTEGER PRIMARY KEY AUTOINCREMENT,'experimentShortName' VARCHAR(32),'experimentClassName' VARCHAR(32),'paramSummary' VARCHAR(1024));
CREATE TABLE 'ResultsRecord'('id' INTEGER PRIMARY KEY AUTOINCREMENT,'agentId' INTEGER,'experimentId' INTEGER,'runDate' DATE,'runTimeInMS' INTEGER,'payload' BLOB, FOREIGN KEY ('agentId') REFERENCES 'Agent'('id'), FOREIGN KEY ('experimentId') REFERENCES 'Experiment'('id'));
CREATE UNIQUE INDEX all_agent_index ON 'Agent' ('agentShortName','agentName','agentJar','paramSummary');
CREATE UNIQUE INDEX all_experiment_index ON 'Experiment' ('experimentShortName','experimentClassName','paramSummary');

insert or ignore into main.Agent (agentShortName,agentName,agentJar,paramSummary) select agentShortName,agentName,agentJar,paramSummary from new.Agent;

Timing: This took several (10-20 seconds).  I wonder if it would be faster if we only had the unique key on the paramSummary.

sqlite> select count(*) from main.Agent;

This looks like there were no duplicates. Lets be sure.  I executed the command again and the count stayed the same.  We're in business.

Now we have to get those ROWIDs into new.Agent.newIndex!  This is really at the edge of my SQL knowledge, but maybe this will work?

sqlite> update new.Agent set newIndex=(select ROWID from main.Agent where new.Agent.paramSummary=main.Agent.paramSummary);

Note: It's taking a long time, more than a few minutes.  Added an index on paramSummary:
sqlite> create index if not exists  main.paramIndexAgent on Agent (paramSummary);

That took about 10-20 seconds.  Running the update again took 3 seconds.  So, we should really change all of the database definitions to have that index.

Lets do all the same for experiment.
sqlite> select count(*) from new.Experiment;
sqlite> select count(*) from main.Experiment;

create index if not exists  main.paramIndexExperiment on Experiment (paramSummary);
insert or ignore into main.Experiment (experimentShortName,experimentClassName,paramSummary) select experimentShortName,experimentClassName,paramSummary from new.Experiment;

sqlite> select count(*) from main.Experiment;

alter table new.Experiment add column newIndex INTEGER;
update new.Experiment set newIndex=(select ROWID from main.Experiment where new.Experiment.paramSummary=main.Experiment.paramSummary);

Ok, now time to do the big insert!

sqlite> select count(*) from new.ResultsRecord;
sqlite> select count(*) from main.ResultsRecord;

insert into main.ResultsRecord(agentId,experimentId,runDate,runTimeInMS,payload) select Agent.newIndex,Experiment.newIndex,runDate,runTimeInMS,payload from new.Agent,new.Experiment, new.ResultsRecord where and;

sqlite> select count(*) from main.ResultsRecord;

Yay!  And it was fast (.26 seconds)