Save the resource and share with your network ♻️
If you are a student wanting to work on a portfolio project, I would recommend starting with these free/open datasets 👇 Here is how I would go about it:

1. Browse through the datasets, read the descriptions, and see if you find something interesting.

2. Download the data or look at the preview to dig deeper into what the data looks like (if it still interests you, continue to point 3; otherwise keep browsing more datasets).

3. Understand what the dataset is meant for. Sometimes the dataset's description is prescriptive about what data/ML problem needs to be solved; other times the problem statement is open-ended.

4. Open a document and brain-dump what you understand about the data in a first iteration, then go one level deeper to compute descriptive statistics, then another level deeper to do exploratory data analysis. Keep this Jupyter notebook separate.

5. Make notes on what data quality checks, data cleaning, and data processing you want to do, then translate that logic into code. Keep this as separate, modular code.

6. Next comes model experimentation. Don't start with the bulkiest model; start simple, say with linear models, so you get a benchmark on the accuracy/F1/whatever metric you want to optimize. Gradually increase model complexity while tracking that metric. You can use AutoML tools for a v1 of experimentation, but make sure you understand the hyperparameters while tuning the model. Keep this as separate, modular code as well.

7. Once you reach a decent model evaluation metric, focus on explainability and on translating your model output into a comprehensive narrative: how it performed compared to your starting benchmark, how much more efficient it became, or how fast it helped you reach decisions.

TL;DR: derive a business value proposition out of the technical ML pipeline.
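Step 6 above can be sketched in a few lines. This is a minimal illustration assuming scikit-learn; the synthetic dataset and the two model choices are placeholders, not prescriptions:

```python
# Illustrative sketch of step 6: fit a simple linear benchmark first,
# then a more complex model, tracking the same metric for both.
# make_classification stands in for whatever dataset you picked.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),  # simple benchmark
    ("random_forest", RandomForestClassifier(random_state=42)),  # more complex
]:
    model.fit(X_train, y_train)
    results[name] = f1_score(y_test, model.predict(X_test))

for name, score in results.items():
    print(f"{name}: F1 = {score:.3f}")
```

Only move up in complexity if the metric actually improves; the simple model is your floor.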
The reason I recommend keeping the code modular is that it will help you debug more easily, swap out methods faster, and build CI/CD/CM/CT more efficiently. This may not be necessary for a portfolio project, but it is a best practice to follow.

Understand and appreciate high-quality data, as it is the backbone of ML systems today. Without the high-quality data academics and industry leaders have put together, we wouldn't be living in a ChatGPT/Perplexity/Gemini era.

PS: Shoutout to Fei-Fei Li for pioneering data collection and labeling with ImageNet.

#ai #ml #data #datasets #datascience
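The "swap out methods faster" point can be sketched with scikit-learn's Pipeline (one way to keep code modular, assumed here for illustration): each step is a named, replaceable component.

```python
# Illustrative sketch of modular code: with a Pipeline, the scaler or the
# model can be swapped in one line without touching the rest of the code.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=200, random_state=0)  # placeholder data

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
baseline_acc = pipe.score(X, y)

# Swap both components by name -- the surrounding code is untouched.
pipe.set_params(scale=MinMaxScaler(),
                model=GradientBoostingClassifier(random_state=0))
pipe.fit(X, y)
swapped_acc = pipe.score(X, y)
print(f"baseline={baseline_acc:.3f}, swapped={swapped_acc:.3f}")
```

The same idea applies to your own modules: keep cleaning, features, and modeling behind small interfaces so each piece can be debugged or replaced in isolation.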