So far I have had a positive experience using MySQL for holding the results of aggregations. Firstly, it provides flexible and efficient indexing on the dimension columns. For me, this was a great improvement over Cassandra’s built-in counter tables (we were using v2.0.7), which do not offer much flexibility for querying dimensions independently due to their key-based partitioning. Secondly, MySQL provides the regular SQL aggregate functions, which are quite handy; in Cassandra, the aggregation must happen on the application side.
Let’s take a bookstore database as an example, where we want to keep a record of the store’s sales.
CREATE TABLE `bookstore_sales` (
  `date` DATE NOT NULL,
  `genre` VARCHAR(45) NOT NULL,
  `publish_year` YEAR(4) NOT NULL,
  `sold_count` INT DEFAULT NULL,
  `returned_count` INT DEFAULT NULL
) ENGINE=InnoDB;
Now let’s add a UNIQUE index on all of our dimensions:
ALTER TABLE `test`.`bookstore_sales`
  ADD UNIQUE INDEX `unique_idx_bookstore_sales` (`date` ASC, `genre` ASC, `publish_year` ASC);
Now we can update our counters with the very useful INSERT … ON DUPLICATE KEY UPDATE statement, which lets us not care about the possible prior existence of a row: if the row does not exist, the INSERT part is executed; otherwise, the UPDATE part is executed.
So upon a sale in the bookstore, we issue the following statement:
INSERT INTO bookstore_sales (`date`, genre, publish_year, sold_count)
VALUES ('2016-02-20', 'thriller', 2016, 1)
ON DUPLICATE KEY UPDATE sold_count = sold_count + 1;

Query OK, 1 row affected (0.07 sec)

+------------+----------+--------------+------------+----------------+
| date       | genre    | publish_year | sold_count | returned_count |
+------------+----------+--------------+------------+----------------+
| 2016-02-20 | thriller |         2016 |          1 |           NULL |
+------------+----------+--------------+------------+----------------+
If we execute this query again, we increment the appropriate counter:
Query OK, 2 rows affected (0.04 sec)

+------------+----------+--------------+------------+----------------+
| date       | genre    | publish_year | sold_count | returned_count |
+------------+----------+--------------+------------+----------------+
| 2016-02-20 | thriller |         2016 |          2 |           NULL |
+------------+----------+--------------+------------+----------------+
Now, let’s say one of the customers later comes back to return a bought book; then we issue the following query:
INSERT INTO bookstore_sales (`date`, genre, publish_year, returned_count)
VALUES ('2016-02-20', 'thriller', 2016, 1)
ON DUPLICATE KEY UPDATE returned_count = returned_count + 1;

Query OK, 0 rows affected (0.00 sec)

+------------+----------+--------------+------------+----------------+
| date       | genre    | publish_year | sold_count | returned_count |
+------------+----------+--------------+------------+----------------+
| 2016-02-20 | thriller |         2016 |          2 |           NULL |
+------------+----------+--------------+------------+----------------+
Wait a minute, what happened here? My update was reported successful, yet it had no effect! The reason is actually pretty simple: the query applied the increment NULL + 1, which always results in NULL, so no change was made to the existing row.
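You can confirm the NULL arithmetic directly in the MySQL client:

SELECT NULL + 1;
+----------+
| NULL + 1 |
+----------+
|     NULL |
+----------+
1 row in set (0.00 sec)

In principle, the sale statement could have guarded against this by listing every counter column explicitly and writing 0 into the ones it does not touch. A hypothetical variant, just to illustrate:

INSERT INTO bookstore_sales (`date`, genre, publish_year, sold_count, returned_count)
VALUES ('2016-02-20', 'thriller', 2016, 1, 0)
ON DUPLICATE KEY UPDATE sold_count = sold_count + 1;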
OK, so theoretically we could have avoided this issue by supplying the 0 values ourselves in every INSERT statement, as sketched above, but this is prone to errors, as more counters might be added to the table. The solution: initialize your counter columns with a DEFAULT of 0, so that your CREATE TABLE looks like this:
CREATE TABLE `bookstore_sales` (
  `date` DATE NOT NULL,
  `genre` VARCHAR(45) NOT NULL,
  `publish_year` YEAR(4) NOT NULL,
  `sold_count` INT DEFAULT 0,
  `returned_count` INT DEFAULT 0
) ENGINE=InnoDB;
If we issue the same series of commands as before on the freshly created table, we end up with:
Query OK, 1 row affected (0.03 sec)

+------------+----------+--------------+------------+----------------+
| date       | genre    | publish_year | sold_count | returned_count |
+------------+----------+--------------+------------+----------------+
| 2016-02-20 | thriller |         2016 |          2 |              1 |
+------------+----------+--------------+------------+----------------+
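To come back to the aggregate functions mentioned at the start: once the counters are in place, reporting is a plain GROUP BY away. A hypothetical report query (assuming more rows have accumulated over the month) could sum sales and returns per genre:

SELECT genre,
       SUM(sold_count)                  AS total_sold,
       SUM(returned_count)              AS total_returned,
       SUM(sold_count - returned_count) AS net_sold
FROM bookstore_sales
WHERE `date` BETWEEN '2016-02-01' AND '2016-02-29'
GROUP BY genre;

The range condition on `date` can use the leading column of our unique index, and the aggregation happens inside MySQL rather than in application code, which is exactly what Cassandra’s counter tables would not give us.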