Data Storage - Matrices, etc...

Kiopa_Matt

I'm not quite sure how to explain the issue, or even if it is an issue, but I'll try. OK, take a data set, say:

-- Dimensions --
Color
Location
Country
Region
City
Product
Manufacturer
Date (Y-M-D)

-- Measures --
Views
Sales
Amount

OK, good enough. Large dataset with a bunch of shit in it. What I need is to create various summaries of the data, such as the changes in views and sales week-to-week. However, it has to work for any combination of the dimensions. For example, just by a single color, or very in-depth, such as color red, location Dallas, TX, USA, product ABC from manufacturer XYZ on a certain date.

Obviously I could query the raw data for this, but I would prefer to generate summary table(s) to ease resource usage. Right now it's just creating a row for every possible combination of the dimensions, and, I don't know, that just seems bulky to me.

I'm by no means an expert, but I have been working through various things on linear algebra, statistics, machine learning, etc., so I have a decent grasp of matrices, vectors, and so on. And maybe I'm wrong, but it just seems there might be a way to store all this in a condensed matrix of some sort, then, via various mathematical algorithms, easily and quickly pull out the data I need for any combination.

Any tips?
 


I don't think you really need much more than a well-designed SQL database and queries to do what you want. A database with proper joins between tables is extremely fast so long as you aren't dealing with Amazon levels of data.

the changes in views and sales week-to-week
That can be handled with GROUP BY queries.

just by a single color, or very in-depth such as color red, location Dallas, TX, USA, product ABC from manufacturer XYZ on a certain date

That can be handled with chained WHERE conditions.
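
For instance, something like this (just a rough sketch; the sales and products tables and their columns are made up to match your example dimensions, not anything you actually have):

Code:
-- Week-by-week sales for one narrow slice of the dimensions.
-- Table and column names are hypothetical, just to show the shape of the query.
SELECT
    YEARWEEK(s.created_at) AS yr_week,
    COUNT(*)               AS sales,
    SUM(s.amount)          AS amount
FROM sales AS s
JOIN products AS p ON p.id = s.product_id
WHERE p.color = 'red'
  AND p.manufacturer = 'XYZ'
GROUP BY YEARWEEK(s.created_at)
ORDER BY yr_week;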

Right now, it's just creating a row for every possible combination of the dimensions

Don't do that.

Given your example data, the "dimensions" stuff should be a Products table, for instance. And then you can have, say, a Sales table with a column called product_id that stores the integer ID of the product it's related to. Same for the other "measures", as you call them.
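
In rough SQL terms, that layout might look something like this (again, just a sketch with made-up column names):

Code:
-- One table per dimension-ish entity, one table per set of measures.
CREATE TABLE products (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(255) NOT NULL,
    color        VARCHAR(64),
    manufacturer VARCHAR(255)
);

CREATE TABLE sales (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    product_id INT UNSIGNED NOT NULL,     -- integer ID of the related product
    amount     DECIMAL(10,2) NOT NULL,
    created_at DATETIME NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products (id)
);

Location, manufacturer and the rest would get the same treatment: their own table, plus an integer foreign key on the rows that reference them.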

(Rails ActiveRecord ORM code incoming)
Then you'd just do something like Product.find(123).sales.count to get the total sales, or Product.find(123).sales.where(created_at: 30.days.ago..Time.now).count to get the number of sales in the last 30 days. All of these calls can be optimized down to raw SQL to make them extremely fast.

So my tl;dr answer is that you can do this easily with a proper relational database design, and I think you're overthinking it with matrices, machine learning, etc.
 
Don't do that.

I need to, though. Yes, a standard SQL statement will do the job fine, but with a multidimensional database like this, it's a sloppy and taxing way to do it. Digging down into the raw data I handle via OLAP and MDX, and that works great. However, for calculating two or three different layers of algorithms, it's not going to be efficient to work from the raw data. Best to have summarized data that's already been through an algorithm or two.
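
Just to be clearer about what I mean by summarized data, roughly something like this (table and column names purely illustrative, not my actual schema):

Code:
-- Pre-aggregate the raw facts into a weekly summary table once,
-- then run the downstream calculations against the summary instead of the raw data.
CREATE TABLE summary_weekly (
    color      VARCHAR(64)   NOT NULL,
    country    VARCHAR(64)   NOT NULL,
    product_id INT UNSIGNED  NOT NULL,
    yr_week    INT           NOT NULL,   -- e.g. 201236 from YEARWEEK()
    views      BIGINT        NOT NULL,
    sales      BIGINT        NOT NULL,
    amount     DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (color, country, product_id, yr_week)
);

INSERT INTO summary_weekly (color, country, product_id, yr_week, views, sales, amount)
SELECT color, country, product_id, YEARWEEK(stats_date),
       SUM(views), SUM(sales), SUM(amount)
FROM raw_facts     -- stand-in name for the raw fact data
GROUP BY color, country, product_id, YEARWEEK(stats_date);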

Again, I'm not sure there's even an issue here, and maybe my existing approach is the best. I doubt it though, because I'm sure there are methodologies out there I don't know about yet. You know, maybe a neural network that starts with the raw data and processes it through Octave / MATLAB would be the best way, as neural networks are excellent for multi-layered algorithms. Or maybe OLTP is something I should look into.

Anyway, I'll shut up now. I have two machine learning experts on my Skype list, so it's probably best if I hit them up and hope they'll give me some time in exchange for cash.
 
Yeah, you guys were right, I was probably over-thinking things. Then again, I don't know, as I'm absolutely certain there's quite a bit I don't know.

Anyway, got it working fairly efficiently with an SQL query like:

Code:
SELECT 
    sindex.campaign_id AS campaign_id,
    location.country AS country,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.views_raw, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.views_raw, 0))) AS views_raw,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.views_unique, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.views_unique, 0))) AS views_unique,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.visitors, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.visitors, 0))) AS visitors,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.conversions, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.conversions, 0))) AS conversions,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.revenue, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.revenue, 0))) AS revenue,
    (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.expense, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.expense, 0))) AS expense,
    ((SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.revenue, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.revenue, 0))) - (SUM(IF(MONTH(sindex.stats_date)=9 AND YEAR(sindex.stats_date) = 2012, traffic.expense, 0)) - SUM(IF(MONTH(sindex.stats_date)=8 AND YEAR(sindex.stats_date) = 2012, traffic.expense, 0)))) AS profit
FROM xtrack_matrix_traffic AS traffic
    JOIN xtrack_matrix_index AS sindex ON traffic.index_id = sindex.id
    JOIN xtrack_location_index AS location ON sindex.location_id = location.id
GROUP BY campaign_id, country
Seems to push out the data like I need, but I'm still uncertain about it. There are only 500k records right now, and I need this to support a bare minimum of 50 million, so we'll see what happens when I increase the database size. That, and this is still just a fairly basic example of what I truly need, so I'm not sure if it'll hold up.
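
One thing I'll probably do before ramping the row count up is make sure the join columns are indexed, something along these lines (assuming they aren't indexed already):

Code:
-- Indexes on the join columns used in the query above, so it doesn't
-- fall apart once the traffic table hits tens of millions of rows.
ALTER TABLE xtrack_matrix_traffic ADD INDEX idx_traffic_index_id (index_id);
ALTER TABLE xtrack_matrix_index   ADD INDEX idx_sindex_location_id (location_id);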

There are still several more dimensions of this data that need to get filtered through, and that's where my issue is. I'm certain there's a better way to sift through all this data than what I'm doing; I just don't know that methodology yet.

Anyway, whatever. Will keep punching away, and get 'er figured. Thanks for all the help guys!
 
Nah, already got it figured, but thanks. I decided to go against standard design principles and moved a dimension (date) over to the fact tables. You're not supposed to do that, but fuck it, it worked great!

You're not supposed to put dimensions in fact tables, but at the same time, you're not supposed to have almost as many dimension rows as fact rows, so this worked. Before, it was creating a whole new set of dimension rows for every day, whereas now it doesn't. Now the number of dimension rows is much smaller and the fact tables are much larger, which is how an OLAP setup is supposed to be. Mondrian seems to like it quite a bit more too, as the queries are much faster now.
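
For anyone curious, the gist of the change is something like this (heavily simplified, not the actual schema):

Code:
-- Before: xtrack_matrix_index carried stats_date, so a fresh set of dimension
-- rows got created for every campaign/location/day combination.
-- After: the date lives on the fact rows instead, so dimension rows get reused.
CREATE TABLE xtrack_matrix_index (
    id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    campaign_id INT UNSIGNED NOT NULL,
    location_id INT UNSIGNED NOT NULL
    -- stats_date used to live here
);

CREATE TABLE xtrack_matrix_traffic (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    index_id   INT UNSIGNED NOT NULL,
    stats_date DATE NOT NULL,             -- the date "dimension", now on the facts
    views_raw  INT UNSIGNED NOT NULL DEFAULT 0,
    revenue    DECIMAL(12,2) NOT NULL DEFAULT 0
);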

Then just continued tweaking the code that compiled the aggregate data to make it as efficient as possible. Thanks anyway though!