Problems with these approaches?
Idea: Recommend items to customer X that are similar to items that customer X rated highly
Creates a profile for each user or product
Advantages
Cold start
Determining appropriate features is difficult
Implicit information
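A minimal sketch of the content-based idea, assuming each item is described by a hand-built feature vector and that "rated highly" means a rating of 4 or more. The toy catalog, the feature names, and the helper functions are hypothetical and only illustrate the approach.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def user_profile(ratings, item_features, min_rating=4):
    """Average the feature vectors of the items the user rated highly."""
    liked = [item_features[i] for i, r in ratings.items() if r >= min_rating]
    if not liked:
        return None  # cold start: nothing to build a profile from
    dim = len(liked[0])
    return [sum(v[d] for v in liked) / len(liked) for d in range(dim)]

def recommend(ratings, item_features, top_n=2):
    """Rank unrated items by similarity to the user's profile."""
    profile = user_profile(ratings, item_features)
    if profile is None:
        return []
    scored = [
        (cosine(profile, feats), item)
        for item, feats in item_features.items()
        if item not in ratings
    ]
    scored.sort(reverse=True)
    return [item for _, item in scored[:top_n]]

# Hypothetical catalog; features are [action, comedy, romance].
item_features = {
    "A": [1, 0, 0],
    "B": [1, 1, 0],
    "C": [0, 0, 1],
    "D": [1, 0, 0],
    "E": [0, 1, 1],
}
ratings_x = {"A": 5, "C": 1}  # customer X liked A and disliked C
print(recommend(ratings_x, item_features))
```

The sketch also shows the problems listed above: someone has to choose the features, and a user with no high ratings gets no recommendations.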
Relies only on past user behavior (doesn’t need explicit profiles)
Domain free
Generally more accurate than content-based approaches
Cold start: If there is no previous behavior for that user and no explicit profile, how can you make a suggestion?
Alternatively, what about a new product?
Expensive to find the nearest neighbors!
Empirically, not as good as latent factor models
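A rough sketch of neighborhood-based collaborative filtering, assuming cosine similarity between users' rating vectors (missing ratings count as zero) and a similarity-weighted average over the k nearest neighbors. The users, ratings, and function names are made up for illustration.

```python
import math

def similarity(ra, rb):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    dot = sum(ra[i] * rb[i] for i in common)
    norm_a = math.sqrt(sum(r * r for r in ra.values()))
    norm_b = math.sqrt(sum(r * r for r in rb.values()))
    return dot / (norm_a * norm_b)

def predict(user, item, ratings, k=2):
    """Similarity-weighted average of the k nearest neighbors' ratings."""
    # Brute force: the target user is compared with every other user who
    # rated the item, which is the expensive nearest-neighbor step.
    candidates = [
        (similarity(ratings[user], ratings[v]), ratings[v][item])
        for v in ratings
        if v != user and item in ratings[v]
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    total = sum(s for s, _ in neighbors)
    if total == 0:
        return None  # cold start: no overlapping behavior to rely on
    return sum(s * r for s, r in neighbors) / total

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "x": {"A": 5, "B": 4},
    "y": {"A": 5, "B": 5, "C": 4},
    "z": {"A": 1, "C": 1},
    "new": {},  # a user with no history
}
print(predict("x", "C", ratings))
print(predict("new", "C", ratings))
```

No item features are needed here, only past ratings, which is why the approach is domain free. The new user gets no prediction at all.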
Characterizes both items and users by vectors of factors inferred from item rating patterns
Explicit feedback: sparse matrix
Good scalability
Low-rank assumption: a few factors characterize the users and items (k << n)
$\min_{q^*, p^*} \sum_{(u,i) \in R} \left(r_{ui} - q_i^T p_u\right)^2 + \lambda\left(\lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$
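To make the objective concrete, here is a small sketch that evaluates it for given factors. It assumes the regularization term is counted once per observed rating, as the free indices i and u in the formula suggest. The toy ratings and factor values are invented.

```python
def objective(ratings, p, q, lam):
    """Regularized squared error over the observed (user, item) pairs.

    ratings maps (u, i) -> r_ui; p[u] and q[i] are length-k factor vectors.
    """
    total = 0.0
    for (u, i), r_ui in ratings.items():
        pred = sum(a * b for a, b in zip(q[i], p[u]))  # q_i^T p_u
        total += (r_ui - pred) ** 2
        total += lam * (sum(a * a for a in q[i]) + sum(b * b for b in p[u]))
    return total

# Hypothetical sparse ratings: only 3 of the 4 possible entries are observed.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0}
p = [[1.0, 2.0], [1.0, 1.0]]  # user factors, k = 2
q = [[1.0, 2.0], [1.0, 1.0]]  # item factors, k = 2

print(objective(ratings, p, q, lam=0.0))
print(objective(ratings, p, q, lam=0.1))
```

Only the observed entries contribute, so the missing entries never have to be imputed.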
Answer: Too many missing entries, and imputation is expensive or inaccurate
Broadcast R, U, V
Problems?
R is a very large matrix and possibly won’t fit in main memory
Sends duplicate copies to each worker
Store R as an RDD/DataFrame, but broadcast U and V
Problems?
U and V might not fit in memory either
Sends duplicate copies to each worker
Store R, U, and V as an RDD/DataFrame
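A sketch of what the fully distributed layout implies, with plain Python lists of (key, value) pairs standing in for RDDs/DataFrames. This is not Spark code and not Spark's implementation. The helper names and toy data are hypothetical. The point is that each user's update needs its ratings joined with the matching item factors and then regrouped by user.

```python
from collections import defaultdict

def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]

def inputs_per_user(ratings, item_factors):
    """Collect, for every user, the (rating, item vector) pairs it needs."""
    # Key ratings by item so they can be joined with the item factors.
    by_item = [(i, (u, r)) for u, i, r in ratings]
    joined = join_by_key(by_item, item_factors)
    # Re-key by user; in a cluster this regrouping is a shuffle.
    grouped = defaultdict(list)
    for _item, ((u, r), vec) in joined:
        grouped[u].append((r, vec))
    return dict(grouped)

# Hypothetical data: (user, item, rating) triples and (item, factor) pairs.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]
item_factors = [(0, [1.0, 2.0]), (1, [1.0, 1.0])]

print(inputs_per_user(ratings, item_factors))
```

Nothing is broadcast in this layout, but every half-iteration pays for a join and a regrouping.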
Spark implements a smarter version of the join-based ALS to limit data shuffling
ALS is an example of a distributed model (i.e., the model itself is stored across executors)
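For reference, a single-machine sketch of the alternating updates themselves. It is written in plain Python and makes no claim to match Spark's implementation. It assumes the regularization is counted once per observed rating, so a user with n observed ratings gets a ridge term of lambda times n. All names and the toy ratings are illustrative.

```python
import random

def solve(a, b):
    """Solve a small k x k system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def update(fixed, observed, k, lam):
    """Least-squares update of one factor vector with the other side fixed."""
    if not observed:
        return [0.0] * k  # cold start: no ratings to fit
    a = [[lam * len(observed) if r == c else 0.0 for c in range(k)]
         for r in range(k)]
    b = [0.0] * k
    for j, rating in observed:
        f = fixed[j]
        for r in range(k):
            b[r] += rating * f[r]
            for c in range(k):
                a[r][c] += f[r] * f[c]
    return solve(a, b)

def als(ratings, n_users, n_items, k=2, lam=0.05, iters=10, seed=0):
    """Alternate between solving for user factors p and item factors q."""
    rng = random.Random(seed)
    p = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for (u, i), r in ratings.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(iters):
        p = [update(q, by_user[u], k, lam) for u in range(n_users)]
        q = [update(p, by_item[i], k, lam) for i in range(n_items)]
    return p, q

def objective(ratings, p, q, lam):
    """Regularized squared error over the observed ratings."""
    total = 0.0
    for (u, i), r in ratings.items():
        pred = sum(x * y for x, y in zip(q[i], p[u]))
        total += (r - pred) ** 2
        total += lam * (sum(x * x for x in q[i]) + sum(y * y for y in p[u]))
    return total

# Hypothetical 4 x 4 rating matrix with four entries missing.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0, (0, 2): 1.0,
    (1, 0): 4.0, (1, 1): 5.0, (1, 3): 1.0,
    (2, 1): 1.0, (2, 2): 5.0, (2, 3): 4.0,
    (3, 0): 1.0, (3, 2): 4.0, (3, 3): 5.0,
}
p, q = als(ratings, n_users=4, n_items=4)
print(objective(ratings, p, q, lam=0.05))
```

Each user update depends only on that user's ratings and the item factors it touches, and likewise for items. That independence is what lets the factors be partitioned across executors.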