Today we released what I think has to be one of the coolest features we’ve ever released: personalized iPhone app recommendations based on tweets! It’s quite cool: we monitor tweets about iPhone apps, and then we recommend more iPhone apps based on your tweets. And if you’re a registered member of AppStoreHQ, we also include your loved apps, bookmarked apps, etc as a way to improve your recommendations.
As one can imagine, finding these recommendations requires some backend processing to establish relationships between users and apps. We made a rake task to handle this for us. The process itself isn’t too complicated: first we build relationships between users, and then, given two similar users (user A and user B), we see which apps user B has liked that user A has not, and we recommend those apps to user A.
In testing — on our development database which has just a couple hundred apps and just a handful of users — this whole process took just a few seconds. The first time we ran it in production, however, it was projected to take 5,500 hours to complete!
With a few hours worth of work, we were able to get this number down from 5,500 to 1.5 hours. Here’s how we did it (short version: it’s all about remembering not to let Rails do it’s normal thing):
When we started, the code to build relationships between users looked a little something like this:
build_relationship(User.all).each do |u1, u2|
logger.info "Finding score between #{u1.inspect} and #{u2.inspect}"
score = u1.apps_in_common_with(u2).size
logger.info "Found score of: #{score}"
score
end
The apps_in_common method grabbed apps from three different denormalized tables for both users and found the intersection.
The first thing that we did was stop instantiating all of these User objects! We didn’t really care about anything but their id’s, so we didn’t need that extra SELECT statement.
build_relationship(User.all(:select => :id)).each do |u1, u2|
logger.info "Finding score between #{u1.id} and #{u2.id}"
score = User.apps_in_common(u1.id, u2.id).size
logger.info "Found score of #{score}"
score
end
Then we refactored the apps_in_common method to do everything it needed with just the user_id. This helped fairly significantly — somewhere around 4x as fast. That that’s still over 1,000 hours to complete!
The next thing we did was to add some timing code in our logging, just to see how long the User.apps_in_common method was taking. Obviously, it was the bulk of our time. Finding the apps for each user was the big time waster. We knew we had to do something about that.
Luckily, for a related project, we needed a denormalized version of this data anyways, indexed by user_id. Perfect!
build_relationship(User.all(:select => :id)).each do |u1, u2|
logger.info "Finding score between #{u1.id} and #{u2.id}"
u1_apps = denormalized_apps(u1.id)
u2_apps = denormalized_apps(u2.id)
score = (u1_apps & u2_apps).size
logger.info "Found score of #{score}"
score
end
(I’m simplifying here: the real code stayed in User.apps_in_common, etc.)
This helped out A TON. I don’t remember the exact statistic at this point, but I do remember it being close to a day. There was still plenty of room to go.
Next we had the “duh, we’re idiots” moment where we realized that we were getting the apps many, many times for each user. We decided to cache that.
user_ids_to_apps = {}
DenormalizedApp.all.each do |da|
user_ids_to_apps[da.user_id] = [] if user_ids_to_apps[da.user_id].nil?
user_ids_to_apps[da.user_id] << da.app_id unless user_ids_to_apps[da.user_id].include? da.app_id
end
build_relationship(user_ids_to_apps.keys).each do |u1, u2|
logger.info "Finding score between #{u1} and #{u2}"
u1_apps = user_ids_to_apps[u1]
u2_apps = user_ids_to_apps[u2]
score = (u1_apps & u2_apps).size
logger.info "Found score of #{score}"
score
end
Amazing! This was the HUGE winner. With this change, we were down to about 3 hours or so. The DenormalizedApp.all.each block only takes a few seconds to complete, and we’ve then built a hash of exactly what we need.
Lastly, we stopped logging (in reality, we log every 10,000 steps just to ensure we’re still going okay):
user_ids_to_apps = {}
DenormalizedApp.all.each do |da|
user_ids_to_apps[da.user_id] = [] if user_ids_to_apps[da.user_id].nil?
user_ids_to_apps[da.user_id] << da.app_id unless user_ids_to_apps[da.user_id].include? da.app_id
end
build_relationship(user_ids_to_apps.keys).each do |u1, u2|
u1_apps = user_ids_to_apps[u1]
u2_apps = user_ids_to_apps[u2]
score = (u1_apps & u2_apps).size
end
Surprisingly, this cut out about 50% of our time. We were now down to 1.5 hours, which we think is quite fast for our data set.
To summarize, remember that (a) Ruby can be slow, (b) Rails can be slow, yet (c) it’s still likely your fault for not optimizing.