Goal: Find line of best fit
$\hat{y} = w_{0} + w_{1}x$
$y = \hat{y} + \epsilon$
x: feature
y: label
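A minimal sketch of fitting this line with scikit-learn; the toy data values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: x is the feature, y is the label (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (n_samples, 1)
y = np.array([2.1, 3.9, 6.2, 8.1])

model = LinearRegression().fit(x, y)
w0, w1 = model.intercept_, model.coef_[0]    # y_hat = w0 + w1 * x
y_hat = model.predict(x)
```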
Measure "closeness" between label and prediction
Evaluation metrics:
$Error = (y_{i} - \hat{y_{i}})$
$SE = (y_{i} - \hat{y_{i}})^2$
$SSE = \sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$
$MSE = \frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2$
$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_{i} - \hat{y_{i}})^2}$
Which metric is most important? Why?
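A minimal sketch of these metrics in NumPy; the label and prediction arrays are illustrative assumptions:

```python
import numpy as np

# Illustrative labels and predictions (any arrays of the same length work)
y     = np.array([2.1, 3.9, 6.2, 8.1])
y_hat = np.array([2.0, 4.0, 6.0, 8.0])

errors = y - y_hat                  # per-example error
sse  = np.sum(errors ** 2)          # SSE: sum of squared errors
mse  = sse / len(y)                 # MSE: mean squared error
rmse = np.sqrt(mse)                 # RMSE: same units as the label
```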
Another measurement of "goodness of fit"
$SS_{\text{tot}}=\sum_{i=1}^n(y_{i}-{\bar {y}})^{2}$
$SS_{\text{res}}=\sum_{i=1}^n(y_{i}-\hat{y_{i}})^{2}$
$R^{2} = 1-{SS_{\rm {res}} \over SS_{\rm {tot}}}$
What is the range of $R^2$?
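A minimal sketch computing $R^2$ directly from the definition above, reusing the same illustrative arrays; `sklearn.metrics.r2_score(y, y_hat)` gives the same value:

```python
import numpy as np

y     = np.array([2.1, 3.9, 6.2, 8.1])   # labels (illustrative)
y_hat = np.array([2.0, 4.0, 6.0, 8.0])   # predictions (illustrative)

ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares (same as SSE)
r2 = 1 - ss_res / ss_tot
# R^2 <= 1; it can be negative if the model fits worse than predicting the mean
```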
We can train multiple scikit-learn models in parallel with Joblib or Pandas UDFs, but what if our data or model gets too big for a single machine?
Original Spark ML API (spark.mllib): based on RDDs; entered maintenance mode in Spark 2.0
Newer API (spark.ml): based on DataFrames; the supported API moving forward
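A minimal sketch of the DataFrame-based API (pyspark.ml); the column names and toy rows are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy DataFrame with feature column "x" and label column "y"
df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)], ["x", "y"])

# The DataFrame-based API expects a single vector-valued "features" column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
train_df = assembler.transform(df)

lr = LinearRegression(featuresCol="features", labelCol="y")
lr_model = lr.fit(train_df)
print(lr_model.intercept, lr_model.coefficients)   # w0 and [w1]
```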
Non-numeric features: categorical or ordinal
Option: create a single numerical feature to represent the non-numeric one (e.g., Dog => 1, Cat => 2, Fish => 3)
For categorical features this implies a spurious ordering: Cats are 2x Dogs!
Create a ‘dummy’ feature for each category
'Dog' => [1, 0, 0], 'Cat' => [0, 1, 0], 'Fish' => [0, 0, 1]
No spurious relationships!
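A minimal sketch of this dummy ("one-hot") encoding with the DataFrame-based API, assuming Spark 3.x and a hypothetical `animal` column; `dropLast=False` is set so the vectors match the [1, 0, 0] form above:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Dog",), ("Cat",), ("Fish",)], ["animal"])

# First map each category to an index, then expand the index to a dummy vector
indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_vec"],
                        dropLast=False)

indexed = indexer.fit(df).transform(df)
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
```

Spark stores the encoded column as sparse vectors, which leads into the next point.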
Ok, so that works if we only have a few animal types, but what if we had a zoo?
Sparse representation: size of the vector, indices of non-zero elements, values
DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])
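A minimal sketch constructing the equivalent dense and sparse vectors with pyspark.ml.linalg:

```python
from pyspark.ml.linalg import Vectors

dense  = Vectors.dense([0, 0, 0, 7, 0, 2, 0, 0, 0, 0])
sparse = Vectors.sparse(10, [3, 5], [7.0, 2.0])   # size, non-zero indices, values

# Both represent the same 10-element vector
assert dense.toArray().tolist() == sparse.toArray().tolist()
```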
Assumptions of linear regression:
Linear relationship between X and Y
Features not correlated with one another
Errors only in Y (X is measured without error)