Reasons to use Hadoop + Map/Reduce:
- It solves the "scaling problem" for you. Assuming your main bottleneck is data management, and not actually serving HTTP requests, Hadoop will let you throw more hardware at a problem and won't ask any questions. It lets you handle arbitrarily large problems just by booting more servers.
- You get to store your data in HDFS, which is a distributed and redundant file system.
- Once you write a problem to be handled by map/reduce, you can run it standalone on your dev machine, on a single server, or on a whole damn cluster.
- There's a project called Apache Whirr that you install on your dev machine, run a single command (passing it your map/reduce script and a large dataset), and it'll boot N servers, upload the data, start crunching, then shut down on completion -- Matt, you'd like this for all the same reasons you like PiCloud.
Example: Assume you've got a CSV of plot points with one million rows, each row consisting of (id, category_name, X, Y)
Say you wanna count all the rows, grouped by category, where X>0 -- Write a per-row function (mapper):
Code:
def mapper(id, row):
    # row is a dict like {'category_name': 'movies', 'x': 3.0, 'y': 1.0}
    if row['x'] > 0:
        yield (row['category_name'], 1)
This will run one million times (once per row), but e.g. if you have 10 machines booted, each machine will take 10% of the workload -- Moreover, because the paradigm is "Move the computation, not the data", when Hadoop splits the 1m rows into ~100k-row chunks, it assigns each chunk to a machine based on where that data is already stored in your cluster.
The pairs from the mapper are emitted as ('movies', 1), then they get sorted by key, grouped together, and passed to the reducer, which gets run one time per key emitted from the mapper, with an array of the values.
Code:
def reducer(category, array_of_ones):  # e.g. ("movies", [1, 1, 1])
    yield category, sum(array_of_ones)
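To see the whole pipeline without a cluster, you can simulate the map -> sort/group -> reduce flow in plain Python. This is just a sketch with made-up rows; the groupby step stands in for Hadoop's shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def mapper(id, row):
    if row['x'] > 0:
        yield (row['category_name'], 1)

def reducer(category, array_of_ones):
    yield category, sum(array_of_ones)

rows = [
    (1, {'category_name': 'movies', 'x': 3.0, 'y': 1.0}),
    (2, {'category_name': 'movies', 'x': -1.0, 'y': 2.0}),
    (3, {'category_name': 'books',  'x': 5.0, 'y': 0.0}),
    (4, {'category_name': 'movies', 'x': 2.0, 'y': 4.0}),
]

# Map phase: one mapper call per row
pairs = [pair for id, row in rows for pair in mapper(id, row)]
# Shuffle phase: sort by key so equal keys are adjacent
pairs.sort(key=itemgetter(0))
# Reduce phase: one reducer call per distinct key, with all its values
results = dict(
    out
    for key, group in groupby(pairs, key=itemgetter(0))
    for out in reducer(key, [v for _, v in group])
)
# results == {'books': 1, 'movies': 2}
```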
I recognize that this example is both useless and apparently unattractive -- Surely this is more complicated than "SELECT category, count(*) FROM points WHERE x > 0 GROUP BY category". But you wouldn't use map/reduce for a job you could do so quickly at runtime, it's meant for background processing. I use it for large scale denormalization, e.g. computing totals and finding trends from complex queries that would require lots of joins and slow down the database.
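For reference, here's that runtime query against a toy table using Python's built-in sqlite3, producing the same per-category counts as the map/reduce version (table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE points (id INTEGER, category TEXT, x REAL, y REAL)')
conn.executemany(
    'INSERT INTO points VALUES (?, ?, ?, ?)',
    [(1, 'movies', 3.0, 1.0),
     (2, 'movies', -1.0, 2.0),
     (3, 'books', 5.0, 0.0),
     (4, 'movies', 2.0, 4.0)],
)
counts = dict(conn.execute(
    'SELECT category, count(*) FROM points WHERE x > 0 GROUP BY category'
))
# counts == {'books': 1, 'movies': 2}
```

Trivially fast at this size -- the point is that this approach stops working when the table no longer fits on one machine, and the map/reduce version keeps working.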
Now think about harder problems --
Apache Mahout: a scalable machine-learning and data-mining library built on top of Hadoop. Imagine the above million-row table, but with FIFTY BILLION rows, and you want to "cluster" your points (find concentrations) --
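A toy single-machine sketch of k-means clustering, to show the idea (this is nothing like Mahout's distributed implementation, and the points are made up). Note how naturally it maps to the paradigm: assigning points to their nearest centroid is a mapper, averaging each cluster's points into a new centroid is a reducer:

```python
def kmeans(points, centroids, iterations=10):
    """Toy k-means: assign each point to its nearest centroid (the 'map'),
    then recompute each centroid as its cluster's mean (the 'reduce')."""
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centroids))}
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        centroids = [
            (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
            if pts else centroids[i]
            for i, pts in clusters.items()
        ]
    return centroids

# Two obvious concentrations: one near (0, 0), one near (10, 10)
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
```

At fifty billion rows, the assignment step is embarrassingly parallel, which is exactly why Mahout runs it as map/reduce jobs over HDFS.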
Data management is no longer about worrying "how to store the most data in the fewest rows", which is the primary problem with SQL, IMO. As 2010 winds down and 2011 approaches, I realize that the paradigm of trying to store "less" is inherently flawed. I fucking wanna store and analyze EVERYTHING. I need a big-boy data management system.
Why I'll use HBase over MySQL:
Consistency. If I'm solving my hardest problems with HBase, why not let it solve my easy problems too? "Most" programs I write never need more than a single MySQL server, sure, but for that matter, "most" programs I write don't really need an ACID-compliant RDBMS either. For runtime queries, HBase will let me do things like the "count... group by..." above without flinching, and unlike MySQL, I know that if I ever hit any sort of upper bound, I have two options for scaling out: 1) add another server, or 2) denormalize the query with map/reduce and store the results in HBase.
Also, the cell-level versioning is AWESOME. e.g. Table `users` has column `points` -- Wanna graph the user's points over time? Just query for user.points with revisions=all. I think that the multidimensional document store is better suited to representing most real-world problems than a row-based DB system, because it allows you to express complex relationships without joins. Since each column in HBase is also multidimensional, you can express complex relationships INSIDE the column, and are therefore able to get the same data by touching fewer tables.
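To make that concrete: HBase is conceptually a sparse, multidimensional sorted map -- roughly {row_key: {column: {timestamp: value}}} -- where each write to a cell adds a new timestamped version instead of overwriting. A plain-Python model of the idea (this is NOT real HBase client code; the `users`/`points` names are from the example above, and the column-family name `stats:` is made up):

```python
# Model HBase's map-of-maps: {row_key: {column: {timestamp: value}}}
users = {}

def put(table, row_key, column, value, timestamp):
    # Same cell, new timestamp -> new version; nothing is overwritten
    table.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

def get_versions(table, row_key, column):
    """Like querying with revisions=all: every (timestamp, value) for one cell."""
    versions = table.get(row_key, {}).get(column, {})
    return sorted(versions.items())

put(users, 'matt', 'stats:points', 100, timestamp=1)
put(users, 'matt', 'stats:points', 150, timestamp=2)
put(users, 'matt', 'stats:points', 240, timestamp=3)

history = get_versions(users, 'matt', 'stats:points')
# history == [(1, 100), (2, 150), (3, 240)] -- ready to graph over time
```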
EDIT -- what a fucking waste of post #500...