🔹 𝗠𝘂𝘀𝘁-𝗞𝗻𝗼𝘄 𝗣𝘆𝘁𝗵𝗼𝗻 𝗣𝗮𝗰𝗸𝗮𝗴𝗲𝘀 𝗳𝗼𝗿 𝗔𝗪𝗦 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀
Python is the go-to language for data engineers, especially when working on AWS. Here’s a list of essential Python packages that enhance data processing, automation, and machine learning on AWS:
1️⃣ 𝘽𝙤𝙩𝙤3: AWS’s official SDK for Python, allowing seamless access to AWS services like S3, DynamoDB, Lambda, and more. Essential for automating AWS operations.
2️⃣ 𝙋𝙖𝙣𝙙𝙖𝙨: Provides powerful data structures for efficient data analysis and manipulation. Perfect for preparing data before loading it into AWS services like Redshift.
3️⃣ 𝙋𝙮𝙎𝙥𝙖𝙧𝙠: Spark’s Python API, useful for big data processing on Amazon EMR. Scales data analysis across large datasets in distributed environments.
4️⃣ 𝙎𝙌𝙇𝘼𝙡𝙘𝙝𝙚𝙢𝙮: A SQL toolkit and ORM that integrates well with AWS RDS and Redshift, simplifying database interactions and data transformations.
5️⃣ 𝙨3𝙛𝙨: Simplifies file operations on S3, allowing direct file reading/writing from S3 buckets, which is invaluable for data preprocessing.
6️⃣ 𝘼𝙒𝙎 𝙇𝙖𝙢𝙗𝙙𝙖 𝙋𝙤𝙬𝙚𝙧𝙩𝙤𝙤𝙡𝙨 𝙛𝙤𝙧 𝙋𝙮𝙩𝙝𝙤𝙣: A set of utilities that make developing Lambda functions easier, with pre-built logging, tracing, and metrics collection.
7️⃣ 𝘿𝙖𝙨𝙠: A parallel computing library that scales well on AWS EC2 and EMR. Ideal for handling larger-than-memory datasets and distributed processing.
8️⃣ 𝙍𝙚𝙙𝙨𝙝𝙞𝙛𝙩-𝙎𝙌𝙇𝘼𝙡𝙘𝙝𝙚𝙢𝙮 (the sqlalchemy-redshift dialect): Extends SQLAlchemy to work specifically with Redshift, making it easier to query and load data directly into Redshift tables.
9️⃣ 𝘼𝙥𝙖𝙘𝙝𝙚 𝘼𝙞𝙧𝙛𝙡𝙤𝙬 𝙬𝙞𝙩𝙝 𝘼𝙒𝙎 𝙄𝙣𝙩𝙚𝙜𝙧𝙖𝙩𝙞𝙤𝙣𝙨: Airflow is widely used for orchestrating ETL workflows. AWS provides managed Airflow (Amazon MWAA) with built-in integrations for seamless scheduling and monitoring.
🔟 𝙎𝙘𝙧𝙖𝙥𝙮: A web scraping framework that can pull in data from external sources, ready to be processed and loaded into AWS databases or data lakes.
#AWSDataEngineering #PythonForData #DataEngineeringTools #CloudAutomation #BigData #ETLProcesses #S3 #DataPipeline #AWSLambda #Redshift #DataAnalysis #ServerlessPython #DataIntegration #Airflow #AWSAutomation #CloudComputing #DataPreparation #MachineLearning #PythonPackages #CloudArchitecture
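To make this concrete, here is a minimal sketch of how a few of these packages fit together: boto3 to list raw objects, pandas with s3fs to read and clean a CSV straight from S3, and Parquet output ready for Athena or Redshift Spectrum. The bucket, prefix, and column names are placeholders, and it assumes s3fs and pyarrow are installed alongside pandas and boto3.

```python
# Hedged sketch: bucket, prefix, and column names are placeholders.
# Assumes boto3, pandas, s3fs, and pyarrow are installed.
import boto3
import pandas as pd  # pandas delegates s3:// paths to s3fs

s3 = boto3.client("s3")

# 1) boto3: list the raw files sitting in a bucket
resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# 2) pandas + s3fs: read a CSV directly from S3 and clean it
df = pd.read_csv("s3://my-data-bucket/raw/orders.csv")
df = df.dropna(subset=["order_id"])

# 3) Write it back as Parquet, ready for Athena or Redshift Spectrum
df.to_parquet("s3://my-data-bucket/curated/orders.parquet", index=False)
```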
-
🚀 Struggling with managing table versions in AWS Glue?
When working with large datasets in AWS Glue, keeping things running smoothly can be challenging—especially with the accumulation of table versions. Each new data transformation creates a new table version, and before you know it, you’re facing crawler failures and access issues with Athena or other DBMS/Data Warehouse solutions. 😓
To tackle this, I’ve written a blog post where I share a Python script that can be executed in AWS Lambda to efficiently clean up older table versions. This solution ensures seamless operations and prevents those pesky crawler failures.
🔗 Check out the full article to learn more about implementing this solution: https://lnkd.in/dhTz6WMm
Let’s dive into this practical approach and make your data management a breeze! 💡
And follow DataTech Medium of Data Reply IT 🥸
#AWS #Glue #Python #DataEngineering #BigData #CloudComputing #DataManagement #AWSLambda
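For readers who want the gist before clicking through, here is a minimal sketch of the idea (not the script from the linked article): a Lambda handler that keeps only the most recent versions of a Glue table and deletes the rest. The event keys and the number of versions to retain are assumptions.

```python
# Hedged sketch, not the article's script: keeps the N most recent
# versions of a Glue table and deletes the older ones.
import boto3

glue = boto3.client("glue")
KEEP = 5  # number of recent versions to retain (assumption)

def lambda_handler(event, context):
    database, table = event["database"], event["table"]

    # Gather every version ID of the table via the paginator
    version_ids = []
    paginator = glue.get_paginator("get_table_versions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        version_ids += [v["VersionId"] for v in page["TableVersions"]]

    # Version IDs are numeric strings; sort newest-first and keep the top N
    version_ids.sort(key=int, reverse=True)
    stale = version_ids[KEEP:]

    # batch_delete_table_version accepts at most 100 IDs per call
    for i in range(0, len(stale), 100):
        glue.batch_delete_table_version(
            DatabaseName=database,
            TableName=table,
            VersionIds=stale[i:i + 100],
        )
    return {"deleted_versions": len(stale)}
```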
-
🚀 Unlock the Future of Data Processing with Amazon Web Services (AWS) #Fargate, Matillion Hybrid Agents, and #Python #Pandas! 📊
Dive into an exhilarating exploration of cutting-edge data integration and processing in our latest blog post at MRH Trowe: "Driving AWS Fargate to the Edge: Matillion Hybrid Agents and Python Pandas". Discover how the synergy of AWS Fargate, Matillion Hybrid Agents, and the powerful Python Pandas library is pushing the boundaries of data orchestration and analysis. Your Snowflake warehouse will also see performance improvements.
Learn how #serverless #container #orchestration can revolutionize your #data workflows, making them more efficient and flexible than ever before.
Read more: https://lnkd.in/eQGvFnw6
#dataops #devops #aws #mdp
Driving AWS Fargate to the Edge: Matillion Hybrid Agents and Python Pandas
dev.to
-
I learned a lot while executing this tiny data engineering project, which covers everything from data extraction to data loading. Thanks to Darshil Parmar 🤝
Steps 🎢
- From the YouTube trending dataset, extract the .json and .csv files.
- Use the AWS CLI to push these into a raw S3 bucket.
- Using AWS Glue, create a data crawler and catalog.
- Create an Extract-Transform-Load (ETL) job using a Python script on AWS Lambda to clean up and modify the data.
- Use AWS Athena to query the data and perform analysis (a rough sketch of this step follows the post).
*Tech Stack* ⚒
AWS - Glue, Lambda, S3, Athena
Python
SQL
***What Did I Learn?*** 📖
A basic end-to-end ETL pipeline architecture for any kind of dataset.
Basics of all the AWS technologies mentioned above.
LINK: https://lnkd.in/gpckKSrn
#dataengineering #aws #etl #python
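To illustrate that last step, here is a hedged sketch of querying the cleaned data with Athena from Python. The database, table, and results-bucket names are placeholders rather than the ones used in the project.

```python
# Hedged sketch of the Athena analysis step; all names are placeholders.
import time
import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString=(
        "SELECT category_id, COUNT(*) AS videos "
        "FROM cleaned_youtube_trending GROUP BY category_id ORDER BY videos DESC"
    ),
    QueryExecutionContext={"Database": "de_youtube_analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/youtube/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[:6]:  # header row + top 5 categories
        print([col.get("VarCharValue") for col in row["Data"]])
```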
-
𝐏𝐲𝐒𝐩𝐚𝐫𝐤 𝐢𝐬 𝐭𝐡𝐞 𝐏𝐲𝐭𝐡𝐨𝐧 𝐀𝐏𝐈 for Apache Spark, designed for large-scale data processing. It's known for:
𝐒𝐜𝐚𝐥𝐚𝐛𝐢𝐥𝐢𝐭𝐲: Handling massive datasets across distributed computing clusters.
𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞: Optimized for high-performance data processing with features like in-memory computing.
𝐁𝐢𝐠 𝐃𝐚𝐭𝐚 𝐈𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐢𝐨𝐧: Seamlessly integrates with big data frameworks and tools.
𝐁𝐞𝐬𝐭 𝐟𝐨𝐫: Large-scale data processing tasks that require distributed computing and efficient handling of vast amounts of data.
𝐏𝐚𝐧𝐝𝐚𝐬 is a go-to library for data manipulation and analysis in Python. It's widely used for:
𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧𝐬: Efficiently handling data in-memory with powerful data manipulation capabilities.
𝐑𝐢𝐜𝐡 𝐅𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐚𝐥𝐢𝐭𝐲: Comprehensive functions for data cleaning, transformation, and aggregation.
𝐄𝐚𝐬𝐞 𝐨𝐟 𝐔𝐬𝐞: Intuitive API and excellent support for smaller datasets.
Best for: Small to medium-sized datasets that fit into memory and require intricate data manipulation.
𝐖𝐡𝐞𝐧 𝐭𝐨 𝐂𝐡𝐨𝐨𝐬𝐞 𝐖𝐡𝐢𝐜𝐡?
For smaller, in-memory data tasks: use Pandas for its ease of use and rich functionality.
For large-scale data processing: opt for PySpark to leverage its distributed computing power and scalability.
P.S.
✅ Version 3 of the 𝐬𝐜𝐞𝐧𝐚𝐫𝐢𝐨-𝐛𝐚𝐬𝐞𝐝, 𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝, 𝐡𝐚𝐧𝐝𝐬-𝐨𝐧 𝐑𝐞𝐚𝐥-world AWS Data Engineering (RADE™) program, 𝐰𝐢𝐭𝐡 80+ 𝐚𝐦𝐚𝐳𝐢𝐧𝐠 𝐩𝐞𝐨𝐩𝐥𝐞, completed on Oct 5th. Onboarding for the next one begins in November.
✅ 𝐈𝐟 𝐲𝐨𝐮 𝐚𝐫𝐞 𝐚𝐜𝐭𝐢𝐨𝐧-𝐨𝐫𝐢𝐞𝐧𝐭𝐞𝐝 and want to get into the next batch, fill out the form below to get invited to the webinar for the RADE™ program V4: https://lnkd.in/ervbN-zt
𝐔𝐒𝐏: 𝐑𝐞𝐚𝐥-𝐰𝐨𝐫𝐥𝐝, 𝐬𝐜𝐞𝐧𝐚𝐫𝐢𝐨-𝐛𝐚𝐬𝐞𝐝, 𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝, 𝐡𝐚𝐧𝐝𝐬-𝐨𝐧 program!
doc credit - Shwetank Singh
#aws #dataengineering
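To make the trade-off tangible, here is a rough side-by-side of the same aggregation in both libraries. The sales.csv file and its region/amount columns are made up purely for illustration, not taken from the RADE program.

```python
# Same group-by, two engines. 'sales.csv' and its columns are illustrative.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Pandas: simplest choice when the data fits in one machine's memory
pdf = pd.read_csv("sales.csv")
pandas_result = pdf.groupby("region", as_index=False)["amount"].sum()
print(pandas_result.head())

# PySpark: identical logic, but the work is distributed across a cluster
spark = SparkSession.builder.appName("pandas-vs-pyspark").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
sdf.groupBy("region").agg(F.sum("amount").alias("amount")).show()
```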
-
🌟 Automate Azure Data Factory (ADF) development with Python and REST APIs 🌟
Would you like to automate repetitive Azure Data Factory tasks like creating datasets or pipelines, or create ADF components programmatically? If so, my latest Data Engineer Things blog is just what you need! 🎉 https://lnkd.in/dxJY_PtJ
With the Azure Data Factory REST APIs, you can easily automate ADF tasks and create components programmatically. In this blog, I explain how to call these APIs from Python, giving you a solid starting point.
Curious about other ways to call the APIs, such as curl and Postman? Check out my other blog here: https://lnkd.in/dxRuR94R
Start streamlining your workflows today! 🚀
Check out my other blogs for insightful content! 📚 https://lnkd.in/dkkcHVXq
Follow Rahul Madhani for more insights and updates. 🚀
If you found this post helpful, please help others by reposting it ♻️
Tagging Data Engineering experts to share this with a broader audience: Data Engineer Things, Towards Data Science, Deepak Goyal, Zach Wilson, Ankit Bansal, Shubham Wadekar, Diksha Chourasiya, Darshil Parmar, Sumit Mittal, Shashank Mishra 🇮🇳, SHAILJA MISHRA🟢, Shubhankit Sirvaiya. Thank you for your support!
#AzureDataFactory #Python #APIAutomation #DataEngineering #TechBlog #Automation #Programming #DataScience #DataEngineer
Execute Azure Data Factory REST APIs with Python
blog.det.life
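For a quick feel of what the blog covers, below is a minimal sketch of triggering an ADF pipeline run through the REST API from Python. The subscription, resource group, factory, and pipeline names are placeholders, and it assumes the azure-identity package is used to obtain a management token.

```python
# Hedged sketch: triggers an ADF pipeline run via the management REST API.
# All resource names are placeholders; requires requests and azure-identity.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default"
).token

url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
    "/factories/<factory-name>/pipelines/<pipeline-name>/createRun"
    "?api-version=2018-06-01"
)

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
resp.raise_for_status()
print("Pipeline run started:", resp.json()["runId"])
```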
-
𝗕𝘂𝗶𝗹𝗱, 𝗱𝗲𝗽𝗹𝗼𝘆 to 𝗔𝗪𝗦, 𝗜𝗮𝗖, and 𝗖𝗜/𝗖𝗗 for a 𝗱𝗮𝘁𝗮 𝗰𝗼𝗹𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 that 𝗰𝗿𝗮𝘄𝗹𝘀 your 𝗱𝗶𝗴𝗶𝘁𝗮𝗹 𝗱𝗮𝘁𝗮 → 𝗪𝗵𝗮𝘁 do you need 🤔
𝗧𝗵𝗲 𝗲𝗻𝗱 𝗴𝗼𝗮𝗹? 𝘈 𝘴𝘤𝘢𝘭𝘢𝘣𝘭𝘦 𝘥𝘢𝘵𝘢 𝘱𝘪𝘱𝘦𝘭𝘪𝘯𝘦 𝘵𝘩𝘢𝘵 𝘤𝘳𝘢𝘸𝘭𝘴, 𝘤𝘰𝘭𝘭𝘦𝘤𝘵𝘴, 𝘢𝘯𝘥 𝘴𝘵𝘰𝘳𝘦𝘴 𝘢𝘭𝘭 𝘺𝘰𝘶𝘳 𝘥𝘪𝘨𝘪𝘵𝘢𝘭 𝘥𝘢𝘵𝘢 𝘧𝘳𝘰𝘮:
- LinkedIn
- Medium
- Substack
- Github
𝗧𝗼 𝗯𝘂𝗶𝗹𝗱 𝗶𝘁 - 𝗵𝗲𝗿𝗲 𝗶𝘀 𝘄𝗵𝗮𝘁 𝘆𝗼𝘂 𝗻𝗲𝗲𝗱 ↓
𝟭. 𝗦𝗲𝗹𝗲𝗻𝗶𝘂𝗺: a Python tool for automating web browsers. It’s used here to interact with web pages programmatically (like logging into LinkedIn, navigating through profiles, etc.)
𝟮. 𝗕𝗲𝗮𝘂𝘁𝗶𝗳𝘂𝗹𝗦𝗼𝘂𝗽: a Python library for parsing HTML and XML documents. It creates parse trees that help us extract the data quickly.
𝟯. 𝗠𝗼𝗻𝗴𝗼𝗗𝗕 (𝗼𝗿 𝗮𝗻𝘆 𝗼𝘁𝗵𝗲𝗿 𝗡𝗼𝗦𝗤𝗟 𝗗𝗕): a NoSQL database fits like a glove on our unstructured text data
𝟰. 𝗔𝗻 𝗢𝗗𝗠: a technique that maps between an object model in an application and a document database
𝟱. 𝗗𝗼𝗰𝗸𝗲𝗿 & 𝗔𝗪𝗦 𝗘𝗖𝗥: to deploy our code, we have to containerize it, build an image for every change of the main branch, and push it to AWS ECR
𝟲. 𝗔𝗪𝗦 𝗟𝗮𝗺𝗯𝗱𝗮: we will deploy our Docker image to AWS Lambda - a serverless computing service that allows you to run code without provisioning or managing servers. It executes your code only when needed and scales automatically, from a few daily requests to thousands per second
𝟳. 𝗣𝘂𝗹𝘂𝗺𝗶: IaC tool used to programmatically create the AWS infrastructure: MongoDB instance, ECR, Lambdas and the VPC
𝟴. 𝗚𝗶𝘁𝗛𝘂𝗯 𝗔𝗰𝘁𝗶𝗼𝗻𝘀: used to build our CI/CD pipeline - on any merged PR to the main branch, it will build & push a new Docker image and deploy it to the AWS Lambda service
𝘾𝙪𝙧𝙞𝙤𝙪𝙨 𝙝𝙤𝙬 𝙩𝙝𝙚𝙨𝙚 𝙩𝙤𝙤𝙡𝙨 𝙬𝙤𝙧𝙠 𝙩𝙤𝙜𝙚𝙩𝙝𝙚𝙧? Then... ↓↓↓
Check out 𝗟𝗲𝘀𝘀𝗼𝗻 𝟮 from the FREE 𝗟𝗟𝗠 𝗧𝘄𝗶𝗻 𝗖𝗼𝘂𝗿𝘀𝗲 created by Decoding ML
...where we will walk you 𝘀𝘁𝗲𝗽-𝗯𝘆-𝘀𝘁𝗲𝗽 through the 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 and 𝗰𝗼𝗱𝗲 of the 𝗱𝗮𝘁𝗮 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲:
#machinelearning #mlops #datascience #ml #mongodb #aws #azure #python #ai #artificialinteligense
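For a rough idea of how the first three pieces fit together, here is a simplified crawl-parse-store sketch (not the course's actual crawler); the URL, the h2 selector, and the MongoDB connection string are placeholders.

```python
# Simplified crawl -> parse -> store sketch; URL, selector, and connection
# string are placeholders, not the course's real implementation.
from bs4 import BeautifulSoup
from pymongo import MongoClient
from selenium import webdriver

# Selenium drives a real browser so JavaScript-heavy pages render fully
driver = webdriver.Chrome()
driver.get("https://medium.com/@your-handle")
html = driver.page_source
driver.quit()

# BeautifulSoup turns the rendered HTML into a parse tree we can query
soup = BeautifulSoup(html, "html.parser")
articles = [
    {"title": h.get_text(strip=True), "platform": "medium"}
    for h in soup.find_all("h2")
]

# MongoDB stores the unstructured documents as-is
client = MongoClient("mongodb://localhost:27017")
if articles:
    client["llm_twin"]["raw_posts"].insert_many(articles)
print(f"Stored {len(articles)} documents")
```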
-
🚀 Implementing a Real-Time Data Pipeline with Python, PySpark, AWS, and Kafka 🌐
In today's data-driven world, the ability to process and analyze data in real time is crucial for making informed business decisions. I'm excited to share a recent project where I designed and implemented a robust real-time data pipeline using some powerful tools and technologies.
🔧 Tools and Technologies Used:
- Python: For scripting and orchestrating the data flow.
- PySpark: To handle large-scale data processing and real-time analytics.
- Apache Kafka: As a reliable and scalable data streaming platform.
- AWS: Leveraging services like S3, EMR, and Lambda for cloud-based storage, processing, and automation.
🛠️ Solution Overview:
1. Data Ingestion:
- Apache Kafka ingests streaming data from various sources, ensuring low-latency and high-throughput data delivery.
2. Processing Layer:
- PySpark on AWS EMR processes the ingested data in real time, performing the necessary transformations and aggregations (see the Structured Streaming sketch after this post).
- PySpark's seamless integration with AWS services enables efficient and scalable data processing.
3. Storage:
- Processed data is stored in AWS S3, providing a durable and scalable storage solution.
- For real-time querying, data is also loaded into Amazon Redshift or Amazon RDS.
4. Automation and Orchestration:
- AWS Lambda functions automate various parts of the pipeline, including triggering PySpark jobs and managing Kafka streams.
5. Monitoring and Alerts:
- Comprehensive monitoring with Amazon CloudWatch and custom alerting keeps the pipeline healthy and performant.
🔍 Key Benefits:
- Scalability: The pipeline handles high-velocity data streams efficiently.
- Real-Time Analytics: Immediate insights and data-driven decision-making.
- Reliability: AWS and Kafka ensure data reliability and fault tolerance.
- Flexibility: Python and PySpark provide the flexibility to implement complex transformations and analytics.
I'm thrilled with the results and the potential applications of this solution. Real-time data pipelines are the future of data analytics, and leveraging the right technologies can significantly enhance an organization's data capabilities.
Feel free to reach out if you'd like to know more about this implementation or discuss how real-time data solutions can benefit your business!
#DataEngineering #RealTimeData #Python #PySpark #AWS #Kafka #BigData #CloudComputing #DataPipeline
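As a taste of what the processing layer can look like, here is a hedged PySpark Structured Streaming sketch that reads JSON events from Kafka and appends them to S3 as Parquet. The broker address, topic, schema, and bucket paths are placeholders, and it assumes the spark-sql-kafka connector is available on the EMR/Spark cluster.

```python
# Hedged sketch of the Kafka -> PySpark -> S3 leg of such a pipeline.
# Broker, topic, schema, and S3 paths are placeholders; requires the
# spark-sql-kafka connector on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("realtime-events").getOrCreate()

# Read the raw event stream from Kafka
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload into typed columns
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])
events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Continuously append the parsed events to S3 as Parquet
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://my-pipeline-bucket/events/")
    .option("checkpointLocation", "s3a://my-pipeline-bucket/checkpoints/events/")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```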