How is the product backlog for machine learning systems different from a traditional product backlog?
I created this for Erdos Research - we are still looking for foundation members who want to learn AI in a hands-on manner.
Previously, I shared how product-market fit for an AI product differs from traditional product-market fit.
Recapping an MVP
To put this in context, let's recap the idea of an MVP.
Creating a Minimum Viable Product (MVP) involves several key stages, with specific deliverables at each stage:
1. Idea Validation and Market Research
2. Define the Core Functionality
3. Create User Personas and Use Cases
4. Design the User Experience (UX)
5. Build the MVP
6. Test the MVP
7. Gather Feedback and Analyze
8. Iterate and Improve
9. Launch the MVP
10. Post-Launch Activities
Product backlog as a key artefact
The product backlog is a key artefact of the MVP development process and fits across multiple stages.
Subsequently, user personas and use cases/user stories are added as backlog items, along with wireframes and mockups.
Thus, a traditional product backlog contains user stories and use cases, user personas, and the wireframes and mockups that support them.
Now the next question is: how is the product backlog for a machine learning product different from a traditional product backlog?
ML Product Backlog
In addition to the elements listed above, a product backlog for an ML system also contains ML-specific items. One way to express these items is through model evaluation metrics, described below.
Model evaluation metrics as a template to get started with the ML product backlog
One idea I am exploring is using model evaluation metrics as a template to get started with the ML product backlog, because model evaluation metrics can be easily expressed as scenarios.
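To make this concrete, here is a minimal sketch of how such a scenario could become an automated acceptance test. The data, the $10,000 threshold, and the test name are all hypothetical placeholders:

```python
# Hypothetical backlog item expressed as an acceptance test:
# "Given a trained house-price model, the MAE on the holdout set
#  should stay below $10,000."
import numpy as np
from sklearn.metrics import mean_absolute_error

def test_house_price_mae_under_threshold():
    # Placeholder holdout data; in practice, load the real holdout set
    y_true = np.array([250_000, 310_000, 190_000, 420_000])
    y_pred = np.array([245_000, 318_000, 200_000, 405_000])
    assert mean_absolute_error(y_true, y_pred) < 10_000
```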
Regression Evaluation Metrics
Scenario: Predicting house prices.
Mean Absolute Error (MAE)
Application: MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It is useful when you want to understand the average error in prediction in the same units as the target variable. For instance, if the predicted house price is off by an average of $10,000, MAE will be $10,000.
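As a quick sketch with made-up prices, MAE can be computed with scikit-learn's mean_absolute_error:

```python
from sklearn.metrics import mean_absolute_error

# Made-up actual vs. predicted house prices (in dollars)
y_true = [300_000, 250_000, 400_000]
y_pred = [310_000, 245_000, 385_000]

# Average absolute error, in the same units as the target
print(mean_absolute_error(y_true, y_pred))  # 10000.0
```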
Scenario: Predicting student test scores.
Mean Squared Error (MSE)
Application: MSE measures the average of the squares of the errors. It gives a higher weight to larger errors, making it useful when large errors are particularly undesirable. For example, if predicting test scores, an MSE of 25 means that on average, the squared difference between the predicted and actual test scores is 25.
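A minimal sketch with made-up scores, using scikit-learn's mean_squared_error:

```python
from sklearn.metrics import mean_squared_error

# Made-up actual vs. predicted student test scores
y_true = [70, 85, 90]
y_pred = [75, 80, 93]

# Squared errors are 25, 25, and 9, so the mean is about 19.67
print(mean_squared_error(y_true, y_pred))
```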
Scenario: Predicting daily electricity consumption.
Root Mean Squared Error (RMSE)
Application: RMSE is the square root of MSE and provides a measure of the average magnitude of the error. It is particularly useful when you want to assess the standard deviation of the prediction errors. For instance, an RMSE of 50 kWh indicates that the typical error in predicted electricity consumption is around 50 kWh.
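A minimal sketch with made-up consumption figures; RMSE is simply the square root of MSE:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual vs. predicted daily electricity consumption (kWh)
y_true = [500, 620, 480]
y_pred = [540, 580, 510]

# Taking the square root brings the error back to the original units (kWh)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # ~36.97 kWh
```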
Scenario: Predicting car fuel efficiency based on engine characteristics.
R-squared (R²)
Application: R² indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² of 0.8, for example, suggests that 80% of the variability in car fuel efficiency can be explained by the model.
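A minimal sketch with made-up mileage figures, using scikit-learn's r2_score:

```python
from sklearn.metrics import r2_score

# Made-up actual vs. predicted fuel efficiency (miles per gallon)
y_true = [30, 25, 35, 28]
y_pred = [31, 24, 33, 29]

# Fraction of the variance in the target explained by the model
print(r2_score(y_true, y_pred))  # ~0.87
```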
Classification Evaluation Metrics
Scenario: Email spam detection.
Accuracy
Application: Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined. For spam detection, if 90 out of 100 emails are classified correctly (both spam and non-spam), the accuracy is 90%.
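A minimal sketch with made-up labels, using scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Made-up labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1]

# Proportion of emails classified correctly, spam and non-spam alike
print(accuracy_score(y_true, y_pred))  # 0.8
```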
Scenario: Fraud detection in credit card transactions.
Precision
Application: Precision is the ratio of true positive observations to the total predicted positives. It is useful in scenarios where the cost of false positives is high. For fraud detection, if 80 out of 100 flagged transactions are actually fraudulent, the precision is 0.8.
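A minimal sketch with made-up labels, using scikit-learn's precision_score:

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = fraudulent, 0 = legitimate
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# Of the 3 transactions flagged as fraud, 2 really were fraud
print(precision_score(y_true, y_pred))  # ~0.67
```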
Scenario: Diagnosing a disease.
Recall (Sensitivity)
Application: Recall measures the ratio of true positive observations to the actual positives. It is critical in medical diagnostics where missing a positive case can be very costly. If the model correctly identifies 90 out of 100 actual disease cases, the recall is 0.9.
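A minimal sketch with made-up labels, using scikit-learn's recall_score:

```python
from sklearn.metrics import recall_score

# Made-up labels: 1 = has the disease, 0 = healthy
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# The model catches 3 of the 4 actual disease cases
print(recall_score(y_true, y_pred))  # 0.75
```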
Scenario: Sentiment analysis in customer reviews.
F1 Score
Application: The F1 score is the harmonic mean of precision and recall and is useful when you need a balance between precision and recall. For instance, in sentiment analysis, where both false positives and false negatives are important, an F1 score provides a single metric to evaluate the model.
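A minimal sketch with made-up labels, using scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score

# Made-up labels: 1 = positive review, 0 = negative review
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

# Harmonic mean of precision (0.75) and recall (0.75)
print(f1_score(y_true, y_pred))  # 0.75
```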
Scenario: Credit scoring for loan approval.
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Application: AUC-ROC measures the model's ability to distinguish between classes. It is useful for understanding the trade-off between true positive rate and false positive rate. For credit scoring, an AUC-ROC of 0.85 indicates a high probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one.
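A minimal sketch with made-up labels and scores; note that roc_auc_score takes the model's scores rather than hard class labels:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels (1 = defaulted) and model risk scores
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Probability a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_score))  # ~0.89
```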
Scenario: Predicting customer churn.
Confusion Matrix
Application: A confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. It provides a comprehensive view of how the classification model performs. For customer churn, it helps understand the number of correctly and incorrectly predicted churns and non-churns.
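A minimal sketch with made-up labels, using scikit-learn's confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1], [1 3]]
```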