Should I Learn/use Mapreduce, Or Some Other Type Of Parallelization For This Task?

December 25, 2023 Post a Comment

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset. This dataset mirrors a 3rd party service's data, so,

Solution 1:

The work you describe is probably a good fit for either a queue, or a combination of a queue and job server. It certainly could work as a set of MapReduce steps as well.

For a job server, I recommend looking at Gearman. The documentation isn't awesome, but the presentations do a great job documenting it, and the Python module is fairly self-explanatory too.

Basically, you create functions in the job server, and these functions get called by clients via an API. The functions can be called either synchronously or asynchronously. In your example, you probably want to asynchronously add the "Start update" job. That will do whatever preparatory tasks, and then asynchronously call the "Get followed users" job. That job will fetch the users, and then call the "Update followed users" job. That will submit all the "Get Favourites for UserA" and friend jobs together in one go, and synchronously wait for the result of all of them. When they have all returned, it will call the "Calculate new queue" job.

This job-server-only approach will initially be a bit less robust, since ensuring that you handle errors and any down servers and persistence properly is going to be fun.

For a queue, SQS is an obvious choice. It is rock-solid, and very quick to access from EC2, and cheap. And way easier to set up and maintain than other queues when you're just getting started.

Basically, you will put a message onto the queue, much like you would submit a job to the job server above, except you probably won't do anything synchronously. Instead of making the "Get Favourites For UserA" and so forth calls synchronously, you will make them asynchronously, and then have a message that says to check whether all of them are finished. You'll need some sort of persistence (a SQL database you're familiar with, or Amazon's SimpleDB if you want to go fully AWS) to track whether the work is done - you can't check on the progress of a job in SQS (although you can in other queues). The message that checks whether they are all finished will do the check - if they're not all finished, don't do anything, and then the message will be retried in a few minutes (based on the visibility_timeout). Otherwise, you can put the next message on the queue.

This queue-only approach should be robust, assuming you don't consume queue messages by mistake without doing the work. Making a mistake like that is hard to do with SQS - you really have to try. Don't use auto-consuming queues or protocols - if you error out, you might not be able to ensure that you put a replacement message back on the queue.

A combination of queue and job server may be useful in this case. You can get away with not having a persistence store to check job progress - the job server will allow you to track job progress. Your "get favourites for users" message could place all the "get favourites for UserA/B/C" jobs into the job server. Then, put a "check all favourites fetching done" message on the queue with a list of tasks that need to be complete (and enough information to restart any jobs that mysteriously disappear).

For bonus points:

Doing this as a MapReduce should be fairly easy.

Your first job's input will be a list of all your users. The map will take each user, get the followed users, and output lines for each user and their followed user:

"UserX""UserA""UserX""UserB""UserX""UserC"

An identity reduce step will leave this unchanged. This will form the second job's input. The map for the second job will get the favourites for each line (you may want to use memcached to prevent fetching favourites for UserX/UserA combo and UserY/UserA via the API), and output a line for each favourite:

"UserX""UserA""Favourite1""UserX""UserA""Favourite2""UserX""UserA""Favourite3""UserX""UserB""Favourite4"

The reduce step for this job will convert this to:

"UserX" [("UserA", "Favourite1"), ("UserA", "Favourite2"), ("UserA", "Favourite3"), ("UserB", "Favourite4")]

At this point, you might have another MapReduce job to update your database for each user with these values, or you might be able to use some of the Hadoop-related tools like Pig, Hive, and HBase to manage your database for you.

I'd recommend using Cloudera's Distribution for Hadoop's ec2 management commands to create and tear down your Hadoop cluster on EC2 (their AMIs have Python set up on them), and use something like Dumbo (on PyPI) to create your MapReduce jobs, since it allows you to test your MapReduce jobs on your local/dev machine without access to Hadoop.

Good luck!

Solution 2:

Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stubb of the code, and now it's just a matter of filling out the code to hook into the right APIs.

Thanks for the answers, they were a lot of help finding the solution I was looking for.

Solution 3:

I am working with a similar problem that i need to solve. I was also looking at MapReduce and using the Elastic MapReduce service from Amazon.

I'm pretty convinced MapReduce will work for this problem. The implementation is where I'm getting hung up, becauase I'm not sure my reducer even needs to do anything.

I'll answer your questions as I understand your (and my) problem, and hopefully it helps.

Yes I think it'll be suited well. You could look at leveraging the Elastic MapReduce service's multiple steps option. You could use 1 Step to fetch a the people a user is following, and another step to compile a list of tracks for each of those followers, and the reducer for that 2nd step would probably be the one to build the cache.
Depends on how big your data-set is and how often you'll be running it. It's hard to say without knowing how big the data-set is (or is going to get) if it'll be cost effective or not. Initially, it'll probably be quite cost-effective, as you won't have to manage your own hadoop cluster, nor have to pay for EC2 instances (assuming that's what you use) to be up all the time. Once you reach the point where you're actually crunching this data for a long period of time, it probably will make less and less sense to use Amazon's MapReduce service, because you'll constantly have nodes online all the time.

A job is basically your MapReduce task. It can consist of multiple steps (each MapReduce task is a step). Once your data has been processed and all steps have been completed, your Job is done. So you're effectively paying for CPU time for each node in the Hadoop cluster. so, T*n where T is the Time (in hours) it takes to process your data, and n is the number of nodes you tell Amazon to spin up.

I hope this helps, good luck. I'd like to hear how you end up implementing your Mappers and Reducers, as I'm solving a very similar problem and I'm not sure my approach is really the best.

Python Freelancers

Should I Learn/use Mapreduce, Or Some Other Type Of Parallelization For This Task?

Solution 1:

Solution 2:

Solution 3:

Post a Comment for "Should I Learn/use Mapreduce, Or Some Other Type Of Parallelization For This Task?"