# 🛠️ Batch Data Processing with Python, MySQL, Docker, Airflow, and Streamlit
## 📌 Project Overview
This project showcases an automated data pipeline that scrapes trending posts from technology-focused subreddits using the Reddit API and presents them in an interactive web application built with Streamlit. The backend is powered by Airflow for scheduling, MySQL for storage, and Docker for seamless deployment.
## 🔧 What the App Does
- 📰 Browse top daily posts from subreddits such as r/datascience, r/MachineLearning, r/deeplearning, and more.
- 💡 Understand complex tech jargon with Gemini-powered AI explanations.
- 🔍 Filter posts by date, score, and subreddit (a minimal sketch of the dashboard follows this list).
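The following sketch shows one way these features could fit together in a Streamlit app. The CSV file name, column names, `GEMINI_API_KEY` environment variable, and model name are illustrative assumptions, not the project's actual identifiers:

```python
import os

import pandas as pd
import streamlit as st
import google.generativeai as genai

# Hypothetical CSV export produced by the pipeline; column names are assumptions.
df = pd.read_csv("reddit_posts.csv", parse_dates=["created_date"])

# Sidebar filters over subreddit and score, per the feature list above.
subreddit = st.sidebar.selectbox("Subreddit", sorted(df["subreddit"].unique()))
min_score = st.sidebar.slider("Minimum score", 0, int(df["score"].max()), 0)

filtered = df[(df["subreddit"] == subreddit) & (df["score"] >= min_score)]
filtered = filtered.sort_values("score", ascending=False)

st.title(f"Top posts from r/{subreddit}")
for _, row in filtered.iterrows():
    st.subheader(row["title"])
    st.caption(f"Score: {row['score']} · {row['created_date']:%Y-%m-%d}")
    # Gemini-powered jargon explanation, fetched on demand per post.
    if st.button("Explain the jargon", key=row["post_id"]):
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        model = genai.GenerativeModel("gemini-1.5-flash")
        reply = model.generate_content(
            f"Explain any technical jargon in this Reddit post title: {row['title']}"
        )
        st.info(reply.text)
```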
## 🧰 Technologies Used
- 🐍 Python: Scripting and backend logic
- 🗄️ MySQL: Structured data storage
- 🐳 Docker: Containerized environment for scraper and frontend
- 🕹️ Airflow: DAG automation for scraping and exporting
- 📊 Streamlit: Interactive UI and visualization
- 🛠️ PRAW: Reddit API wrapper (scraping sketched below)
- 🧠 Gemini API: Contextual explanations of technical terms
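As a rough sketch of how PRAW and MySQL work together here: the scraper pulls each subreddit's top daily posts and upserts them into a table. The environment variable names, the `mysql` hostname (a Docker Compose service name), and the `posts` schema are assumptions for illustration:

```python
import os

import praw
import mysql.connector

# Reddit API client; credential env var names are assumptions.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="reddit-news-scraper",
)

# MySQL connection; host, database, and table names are hypothetical placeholders.
conn = mysql.connector.connect(
    host="mysql", user="root", password=os.environ["MYSQL_PASSWORD"], database="reddit"
)
cursor = conn.cursor()

for name in ["datascience", "MachineLearning", "deeplearning"]:
    # Top posts of the day, matching the daily schedule described below.
    for post in reddit.subreddit(name).top(time_filter="day", limit=25):
        cursor.execute(
            "INSERT IGNORE INTO posts (id, subreddit, title, score, created_utc) "
            "VALUES (%s, %s, %s, %s, %s)",
            (post.id, name, post.title, post.score, post.created_utc),
        )

conn.commit()
conn.close()
```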
## 🔄 Data Flow Summary
Reddit posts are scraped daily, stored in a MySQL database, and exported to CSV for use in the Streamlit dashboard. The scraper runs via Airflow DAGs, and Docker ensures environment consistency across systems. A separate container hosts the frontend app, offering a seamless experience for users exploring the latest in data science and technology.
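A minimal sketch of what the scheduling DAG could look like, assuming Airflow 2.x. The DAG ID, task names, and the `scraper` module with its `scrape_subreddits` and `export_to_csv` callables are hypothetical; the real scrape and export logic would live in the project's own modules:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task callables; the actual scrape/export logic lives elsewhere.
from scraper import scrape_subreddits, export_to_csv

with DAG(
    dag_id="reddit_batch_pipeline",
    schedule_interval="@daily",      # scrape once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_reddit", python_callable=scrape_subreddits)
    export = PythonOperator(task_id="export_csv", python_callable=export_to_csv)

    scrape >> export   # export the CSV only after scraping succeeds
```

The `scrape >> export` dependency mirrors the flow described above: the CSV consumed by the Streamlit dashboard is regenerated only after a successful scrape.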
## 🚀 Live Demo & Code
🔗 Live App: Reddit_News