Hadoop

Anyone else here ride elephants?

I've been playing with Hadoop for about a year and a half now, but have only recently started to lean on it for heavy crunchin'. In a nutshell, it's the greatest thing that's ever happened to me, and if I could just get HBase to run correctly I'd never use another SQL database again.

Anyone use it? Wanna compare notes? The installation/setup barrier is substantial, so do you run your own servers, or use Amazon Map/Reduce?
 


I was going to put some stupid remark, but instead:

I think I get map/reduce, at least at a basic level; I've just never come across anything that could benefit from it. Could you give some examples?
 
NerdAlert.gif


if I could just get HBase to run correctly I'd never use another SQL database again.

Give me a use case for that one...the majority of "standard" web apps I build do not need anything more than a SQL DB...sell me on Hadoop. I'll give you my email for a free trial and cover shipping if you'd like.
 
Give me a use case for that one...the majority of "standard" web apps I build do not need anything more than a SQL DB...sell me on Hadoop. I'll give you my email for a free trial and cover shipping if you'd like.

76.jpg



Wanna invest $20 mil in my Twitter clone?
 
Reasons to use Hadoop + Map/Reduce:
It solves the "scaling problem" for you. Assuming your main bottleneck is data management, and not actually serving HTTP requests, Hadoop will let you throw more hardware at a problem and won't ask any questions; it lets you handle arbitrarily large problems just by booting more servers. You get to store your data in HDFS, which is a distributed and redundant file system. Once you write a problem as a map/reduce job, you can run it standalone on your dev machine, on a single server, or on a whole damn cluster. There's also a project called Whirr: you install it on your dev machine, run a single command (passing it your map/reduce script and a large dataset), and it'll boot N servers, upload the data, start crunching, then shut down on completion -- Matt, you'd like this for all the same reasons you like picloud.

Example: Assume you've got a CSV of plot points with one million rows, each row consisting of (id, category_name, X, Y)
Say you wanna count all the rows, grouped by category, where X>0 -- Write a per-row function (mapper):
Code:
def mapper(id, row):
    # row is parsed into a dict keyed by column name
    if row['x'] > 0:
        yield (row['category_name'], 1)
This will run one million times (once per row), but if you have 10 machines booted, each machine takes roughly 10% of the workload -- Moreover, because the paradigm is "move the computation, not the data", when Hadoop splits the 1M rows into 100K-row chunks, it assigns each chunk based on where that data is already stored in your cluster.

The pairs from the mapper are emitted as ('movies', 1), then they get sorted by key, grouped together, and passed to the reducer, which gets run once per distinct key with a list of that key's values.
Code:
def reducer(category, array_of_ones):  # e.g. ("movies", [1, 1, 1])
    yield (category, sum(array_of_ones))


I recognize that this example is both useless and apparently unattractive -- surely it's more complicated than "SELECT category, count(*) FROM points WHERE x > 0 GROUP BY category". But you wouldn't use map/reduce for a job you could do that quickly at runtime; it's meant for background processing. I use it for large-scale denormalization, e.g. computing totals and finding trends from complex queries that would require lots of joins and slow down the database.
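To make that flow concrete, here's a minimal local simulation of the map -> sort/shuffle -> reduce steps from the example above. It's plain Python with no Hadoop involved, and the tiny in-memory rows list is made up purely for illustration:
Code:
# local simulation of map -> sort/shuffle -> reduce; "rows" stands in for the CSV
from itertools import groupby
from operator import itemgetter

rows = [
    {'id': 1, 'category_name': 'movies', 'x': 3.2,  'y': 1.0},
    {'id': 2, 'category_name': 'movies', 'x': -0.5, 'y': 2.0},
    {'id': 3, 'category_name': 'books',  'x': 7.1,  'y': 0.3},
    {'id': 4, 'category_name': 'movies', 'x': 0.8,  'y': 4.4},
]

def mapper(row_id, row):
    # emit (category, 1) for every row with x > 0
    if row['x'] > 0:
        yield (row['category_name'], 1)

def reducer(category, array_of_ones):
    # e.g. ("movies", [1, 1]) -> ("movies", 2)
    yield (category, sum(array_of_ones))

# map phase: run the mapper once per row
mapped = [pair for row in rows for pair in mapper(row['id'], row)]

# shuffle/sort phase: Hadoop sorts and groups the emitted pairs by key for you
mapped.sort(key=itemgetter(0))

# reduce phase: one reducer call per distinct key
for category, group in groupby(mapped, key=itemgetter(0)):
    values = [value for _, value in group]
    for result in reducer(category, values):
        print(result)   # prints ('books', 1) then ('movies', 2)
A real Hadoop Streaming job has the same shape; the difference is that the mapper and reducer read and write lines on stdin/stdout and the sort/group step happens across machines.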

Now think about harder problems -- Mahout (http://mahout.apache.org/), a set of libraries built for doing machine learning on Hadoop. Imagine the above million-row table, but with FIFTY BILLION rows, and you want to "cluster" your points (find concentrations) --
2dFuzzyKMeans.png

Data management is no longer about worrying how to cram the most data into the fewest rows, which is the primary problem with SQL, IMO. As 2010 winds down and 2011 approaches, I've realized that the paradigm of trying to store "less" is inherently flawed. I fucking wanna store and analyze EVERYTHING. I need a big-boy data management system.

Why I'll use HBase over MySQL:
Consistency. If I'm solving my hardest problems with HBase, why not let it solve my easy problems too? "Most" programs I write never need more than a single MySQL server, sure, but for that matter, "most" programs I write don't really need an ACID-compliant RDBMS either. For runtime queries, HBase will let me do things like the "count... group by..." above without flinching, and unlike MySQL, I know that if I ever hit any sort of upper bound, I have two options for scaling up -- 1) add another server, 2) denormalize the query with map/reduce and store the results in HBase.
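For a sketch of what that runtime "count... group by..." could look like against HBase, here's a scan using the happybase Python client -- happybase, the points table, and the cf:category / cf:x columns are my own assumptions, not anything official:
Code:
import happybase
from collections import Counter

# go through the HBase Thrift gateway (host name is hypothetical)
conn = happybase.Connection('hbase-thrift-host')
points = conn.table('points')

# scan just the two columns we care about and tally categories where x > 0
counts = Counter()
for row_key, data in points.scan(columns=[b'cf:category', b'cf:x']):
    if float(data.get(b'cf:x', b'0').decode()) > 0:
        counts[data[b'cf:category'].decode()] += 1

print(counts.most_common())
Once a table outgrows a straight scan, this is exactly the kind of query you'd denormalize with map/reduce (option 2) and store the totals back in HBase.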

Also, the cell-level versioning is AWESOME. e.g. Table `users` has column `points` -- Wanna graph the user's points over time? Just query for user.points with revisions=all. I think that the multidimensional document store is better suited to representing most real-world problems than a row-based DB system, because it allows you to express complex relationships without joins. Since each column in HBase is also multidimensional, you can express complex relationships INSIDE the column, and are therefore able to get the same data by touching fewer tables.
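And a sketch of that versioning trick with the same happybase client (again, the table, column family and row key are assumptions on my part, and the column family has to be created with more than one version retained):
Code:
import happybase

conn = happybase.Connection('hbase-thrift-host')   # hypothetical Thrift host
users = conn.table('users')

# latest value of the points cell for one user
latest = users.row(b'user:42', columns=[b'stats:points'])
print(latest.get(b'stats:points'))

# every retained revision of that cell, newest first, with timestamps --
# i.e. the "graph the user's points over time" query
history = users.cells(b'user:42', b'stats:points',
                      versions=1000, include_timestamp=True)
for value, timestamp in history:
    print(timestamp, value)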

EDIT -- what a fucking waste of post #500...
 
Alternatively, you could use Cassandra rather than HDFS if you're not too concerned about locking.
 
I wonder how this compares to MongoDB? I don't know squat about map/reduce, but I know it's a feature in Mongo and, just like Hadoop, it's incredibly scalable.
 
I think that the multidimensional document store is better suited to representing most real-world problems than a row-based DB system, because it allows you to express complex relationships without joins. Since each column in HBase is also multidimensional, you can express complex relationships INSIDE the column, and are therefore able to get the same data by touching fewer tables.

you're my hero because of comments like this
 

That's all fine and dandy, but what applications do you plan to use it for?
 
That's all fine and dandy, but what applications do you plan to use it for?

Well, I've found the nirvana of scalable, practical data management, so I think initially, I'll store something simple and small in it, like an address book with only my own first name. Then, I plan to scale up.

Re: Mongo -- In my opinion, MongoDB is to Hadoop+HDFS+HBase as an 8-year-old with a Lego set is to a council of three grandmaster stonemasons. A single unbacked, underdeveloped tool that dreams about one day scaling big (but has been known to publicly piss its pants in tantrums, a la Digg and Foursquare), next to a suite of professional-grade solutions built by the Apache Foundation, modeled after Google's internal tools, and implemented by Yahoo and IBM, to name a few.
 
Anyone else here ride elephants?

I've been playing with Hadoop for about a year and a half now, but have only recently started to lean on it for heavy crunchin'. In a nutshell, it's the greatest thing that's ever happened to me, and if I could just get HBase to run correctly I'd never use another SQL database again.

Anyone use it? Wanna compare notes? The installation/setup barrier is substantial, so do you run your own servers, or use Amazon Map/Reduce?

IMO Amazon's MapReduce is crap...you have no control over the actual MapReduce jobs other than supplying your mapper & reducer code files and a couple of basic config settings, then having Amazon fire them off and bring you back the results (if it runs correctly)...so for simple jobs, most people are fine with this option...but to get to that point, your mapper, reducer and data set need to be pristine, or you'll just get back a bunch of pseudo-helpful errors from Amazon, which leaves you setting up a cluster for testing purposes and wondering why you even bothered in the first place...you're better off firing up a couple of VMs in Amazon's EC2 and building your own ad-hoc cluster from an AMI with Hadoop pre-installed.

but if you're already considering Hadoop, you're obviously worried about processing large amounts of data (i.e. in the GB, TB, PB arena), and at that scale Amazon's MapReduce is pointless for the reasons above...anything less than those sizes makes Hadoop itself worthless

but it could be the case that you just want to steer away from SQL-based systems and toward what most people call the NoSQL mentality, and that's where column-based & object-oriented storage (e.g. Cassandra, BigTable, etc.) comes into play...all of these are founded on the key/value model, with various I/O modifications to massively improve reads or writes to/from disk, but hardly ever the combo of the two

another, simpler map/reduce + distributed-file-system-esque option (I say esque because it's not 'traditionally' the most optimal distributed system) for just MBs or low GBs of data is Nokia's Disco...it's quick, easy and dirty in every sense of the word for map/reduce jobs...it supports Python code (which I love) and it's written in Erlang...so it's a win-win for smaller jobs

just my two cents...

and for those curious about the applications, here's a simple example:

  1. Email spamming
    • Say I have several text files (e.g. 100 files) full of email addresses (~10,000 per file) that I want to blast with ads
    • I could hit all those addresses the traditional way: going file by file, running the same code on one machine until I'm done...which could take somewhere on the order of several hours or maybe even days
    • or I could write a proper mapper and reducer and use Hadoop (or any other map/reduce-centric system) to partition the data across 10 VMs that cost me pennies to run, and crunch through the spamming far faster, cutting my time to minutes if not seconds (certain restrictions may apply) -- a rough sketch of the mapper side is below
but the possibilities are endless...IF you have large amounts of data that need the same type of processing...if not, none of these systems will give you much in return
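here's that sketch -- a Hadoop Streaming-style mapper, which reads one address per line on stdin and writes tab-separated key/value pairs on stdout so Hadoop can split the input files across machines and run a copy per split; send_ad() is a made-up placeholder:
Code:
# Hadoop Streaming-style mapper: one email address per stdin line in,
# "address\tstatus" lines out; Hadoop runs a copy of this per input split
import sys

def send_ad(address):
    # placeholder -- swap in whatever actually delivers the mail
    return 'sent'

for line in sys.stdin:
    address = line.strip()
    if not address:
        continue
    sys.stdout.write('%s\t%s\n' % (address, send_ad(address)))
a reducer on the other end could then just tally the sent/failed statuses per file or per domain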

hope this helps ;)