stephenbrooks.org : Database format (hypothetical)

Stephen Brooks 2013-12-13 21:25:34	In theory I could store what are currently text files in a single database table with the columns: String lattice, String user, String genome, Double score, Double mpts, String version, Hex checksum, Optional String comment However, the lattice, user and version strings would be repeated many times: it might be more efficient to store them as IDs rather than names (and have ID->name mappings as three other tables). That would leave only the genome and optional comment as strings in the main table. Considering the genome is going to be the longest string, this won't save a huge amount of storage - but would it save time? I'd expect indexing/sorting by ID is faster than by string but I'm not a database expert. That's why I'm writing this on here in case anyone has an opinion. To quickly detect repeats I'd probably have to index by the genome strings anyway -- or actually the checksums would do. In the interests of speed I might want to maintain totals over every user, optimisation or combination of those (that could end up messy). The "ultimate" solution would be to have the web front end query this directly in which case I'd add a table for team mappings. It would also have the clients post to the database directly. This potentially puts a lot of load on a single server, though. The database would be about 75 million rows with ~1KB of data each, since its current size is >75GB. This would be a pretty sizeable task and it's not something I anticipate doing until it becomes necessary for the project to continue. But I thought I needed a plan figured out in advance.
Hannibal 2014-03-21 14:09:17	Little late answer, but I'm not very often here now... First of all you should develop binary version of genome data. AFAIR genome parameters are in <0,999> so it fit in 10 bit keeping 23 more values for special case (i.e. missing parameter). So in 4byte int you can fit 3 values. This make 100 param genome fit in ~130 bytes. For each lattice you should keep fixed order when it is not meaningful. If not add 7 more bytes for order. In that case you can fit 3 values in 8byte int. If you want more space save you can use bit packing without byte boundaries. IMHO this packing save LOT out space and don't make important processing time as genome probably won't be needed on database level in raw form. TABLES (I've omitted some fields you maybe want add, which are not important to whole case) TEAM: ID smallit, Desc: string USER: ID int, team_ID, Desc: string,... CLIENT: ID tinyint, Version: string,... LATTICE: ID tiny/smallint, Name: string., Dest: string, GenotypePackingDesc: string, blob/string RESULT: ID int, lattice_ID, user_ID, client_ID, PackedGenome: blob, Score: double, K(M)pts: int/decimal, Checksum hugeint(sufficient?) or BOOL correct - you can always count it if necessary. RESULT_COMMENT: ID int (same as in result), Comment sting. You may also split each lattice result in separate table. Or genome to additional table like I do with comment.
Stephen Brooks 2014-03-25 19:46:22	Thanks - I actually use a method (binary genomes) such as you describe for reducing bandwidth when sending a results file. That's what the results.bin format is, which is produced temporarily during sending. It has a header of all the parameter names for each lattice and then bit-packed (10 bits each) parameter values. However, for the full project I think the problem is the list of parameter names is quite variable (even within each lattice), since the accelerators can vary in length. Even worse, if a new parameter appeared, that would mean reformatting all the data already there. That's not a problem with a one-off compression of results before sending. I also do need to read the genomes out of the database when doing analysis. Anyway, this side seems OK for now. The place I've encountered scaling problems is the rawstats.txt file and rendering it to tables in PHP on my website! The summary scores including every user ever are megabytes now, so I think if anything goes in a database it'll be those first.
Hannibal 2014-03-25 22:13:48	> However, for the full project I think the problem is the list of parameter names is quite variable (even within each lattice), since the accelerators can vary in length. But probably you have some upper limit - this should be base for packing and storing genome. You also can store genome outside database, keeping in it only some reference (i.e. file name + offset within). > The place I've encountered scaling problems is the rawstats.txt file and rendering it to tables in PHP on my website! The summary scores including every user ever are megabytes now, so I think if anything goes in a database it'll be those first. So you should create some redundant statistics table: STAT_USER_TOTAL ID int, user_ID, lattice_ID, (optionally client_ID), BestResult: double, TotalWU int, TotalK(M)pts: int/decimal, LastSeen datetime, STAT_TEAM_TOTAL ID int, team_ID, rest like above... STAT_USER_PER_DAY ID int, user_ID, lattice_ID, (optionally client_ID), Date (date - without time, indexed!), WUThisDay int, K(M)ptsThisDay: int/decimal STAT_TEAM_PER_DAY ID int, team_ID, rest like above... Instead by day, you can also group per communication session - but this will be probably bigger You can fill this tables automagically by adding database triggers to RESULT table for USERS_STATS and to USER_STATS for TEAM_STATS i.e.: Assuming [user_ID, lattice_ID, (optionally client_ID)] are ONE UNIQUE INDEX (and date in per day table) CREATE TRIGGER `updateUserStats` AFTER INSERT ON `result` FOR EACH ROW BEGIN INSERT INTO stat_user_total SET user_ID=NEW.user_ID, lattice_ID=NEW.lattice_ID, client_ID=NEW.clientID, bestResult=Score, totalWU=1, totalMpts=Mpts, lastSeen=NOW() ON DUPLICATE KEY UPDATE bestResult=IF(bestReslut < Sore, Score, bestResult), totalWU=totalWU+1, totalMpts=totalMpts+Mpts, lastSeen=NOW(); INSERT INTO stat_user_per_day SET user_ID=NEW.user_ID, lattice_ID=NEW.lattice_ID, client_ID=NEW.clientID, date=NOW(), WUThisDay=1, MptsThisDay=Mpts ON DUPLICATE KEY UPDATE WUThisDay=totalWU+1,MptsThisDay=totalMpts+Mpts; END Team stats filling triggers will be little more complicated since you must fetch team_ID by user_ID first.