The "COVID-19 Data Analysis & Visualization" project is a Spark application for exploring large COVID-19 datasets. Through an interactive Command Line Interface (CLI), users can query and analyze the data to uncover trends, patterns, and correlations in the global impact of the virus.
The primary goal of this project is to make COVID-19 data easier to understand. By identifying trends and patterns, we aim to give a clearer picture of the pandemic's progression and its impacts. Our team wrote 10 analytical queries, each covering a different aspect of the data.
- Agile Scrum: We followed the Agile Scrum methodology. A Scrum Master served as team lead, ran the daily scrum meetings, and reported blockers and completed tasks at the end of each day.
- Interactive CLI: A user-friendly interface to query and analyze COVID-19 data.
- User/Admin System: A user and admin system integrated within the Scala console that restricts access to the data and supports CRUD operations. Passwords are hashed with bcrypt before being stored.
- Visualization: Leveraging tools like Zeppelin (or Tableau), the project visualizes the analyzed data, making it easier to interpret and understand.
- Analytical Queries: 10 analytical queries, listed below, that each examine a different aspect of the data.
Our team developed 10 specific analytical queries to dive deep into the data. These queries aim to uncover meaningful insights into various aspects of the COVID-19 pandemic. You can explore each query in detail using the links below:
- Death Spread Speed
- Percentage Of Population Confirmed
- Average Confirmed, Death, and Recovery Rates
- Peak of Deaths
- Highest Death By Country
- General Disease Evolution
- Average Recovery Rate
- Confirmed Cases By Day
- Total Confirmed, Deaths, and Recoveries
- Confirmed Spread Speed
Each query provides a unique perspective on the data, offering insights that can aid in understanding the pandemic's progression and impact.
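As a rough illustration of the kind of computation behind a few of these queries, here is a minimal sketch in plain Scala collections rather than Spark SQL, so it runs without a cluster. The `DailyRecord` shape, the field names, and the definition of "spread speed" as average new confirmed cases per day are all assumptions for illustration; the project's actual schema and query definitions are not shown in this README.

```scala
// Hypothetical record shape: cumulative counts per country and day index.
final case class DailyRecord(country: String, day: Int, confirmed: Long, deaths: Long)

object QuerySketch {
  // Percentage of a population with a confirmed case, given a cumulative count.
  def percentConfirmed(confirmed: Long, population: Long): Double =
    if (population <= 0) 0.0 else confirmed * 100.0 / population

  // "Spread speed" read here (an assumed definition) as the average number of
  // new confirmed cases per day between a country's first and last record.
  def confirmedSpreadSpeed(records: Seq[DailyRecord]): Map[String, Double] =
    records.groupBy(_.country).map { case (country, rows) =>
      val sorted = rows.sortBy(_.day)
      val days = sorted.last.day - sorted.head.day
      val growth = sorted.last.confirmed - sorted.head.confirmed
      country -> (if (days == 0) 0.0 else growth.toDouble / days)
    }

  // Day on which the increase in cumulative deaths was largest for one country.
  // Returns None when fewer than two records exist for that country.
  def peakDeathDay(records: Seq[DailyRecord], country: String): Option[Int] = {
    val sorted = records.filter(_.country == country).sortBy(_.day)
    if (sorted.size < 2) None
    else {
      val increases = sorted.zip(sorted.tail).map { case (prev, cur) =>
        (cur.day, cur.deaths - prev.deaths)
      }
      Some(increases.maxBy(_._2)._1)
    }
  }
}
```

In a Spark version, the same logic would typically be expressed with `groupBy` and aggregate or window functions over a DataFrame; the collection-based form above only shows the arithmetic.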
- Data Cleaning: One of the significant challenges faced was cleaning the extensive dataset, which comprised over 200,000 rows. Ensuring accuracy and relevance was paramount to the project's success.
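
The sketch below shows one way such a cleaning pass could look, again in plain Scala so it runs standalone. It is not the project's actual cleaning code: the column order (country, date, confirmed, deaths, recovered), the `CleanRow` type, and the rules (blank counts become 0, negative or non-numeric counts drop the row, exact duplicates are removed) are assumptions chosen for illustration.

```scala
import scala.util.Try

// Hypothetical cleaned-row shape; the real dataset's columns may differ.
final case class CleanRow(country: String, date: String,
                          confirmed: Long, deaths: Long, recovered: Long)

object CleaningSketch {
  // Parse a count cell: blank is treated as 0 (an assumption);
  // non-numeric or negative values are rejected.
  private def parseCount(cell: String): Option[Long] = {
    val trimmed = cell.trim
    if (trimmed.isEmpty) Some(0L)
    else Try(trimmed.toLong).toOption.filter(_ >= 0)
  }

  // Parse one CSV line. Naive comma split: quoted fields containing commas
  // are not handled. Rows with the wrong column count are dropped.
  def parseLine(line: String): Option[CleanRow] =
    line.split(",", -1) match {
      case Array(country, date, c, d, r)
          if country.trim.nonEmpty && date.trim.nonEmpty =>
        for {
          confirmed <- parseCount(c)
          deaths    <- parseCount(d)
          recovered <- parseCount(r)
        } yield CleanRow(country.trim, date.trim, confirmed, deaths, recovered)
      case _ => None
    }

  // Keep only parseable rows and remove exact duplicates after trimming.
  def clean(lines: Seq[String]): Seq[CleanRow] =
    lines.flatMap(parseLine).distinct
}
```

Returning `Option` from `parseLine` keeps malformed rows from throwing mid-pass, which matters when a few bad lines are scattered through hundreds of thousands of rows.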
- Apache Spark
- Spark SQL
- YARN
- HDFS
- Scala 2.12.10
- Git + GitHub
- Zeppelin (or Tableau)
This project serves as a testament to the power of data analysis and visualization in understanding complex scenarios like a global pandemic. Whether you're a researcher, data analyst, or someone keen on understanding the nuances of COVID-19, this tool provides a comprehensive platform for exploration and discovery.
This project was made possible thanks to the dedicated efforts of the following contributors:
- Jaceguai De Magalhaes - Scrum Master / Data Visualization with Zeppelin
- Newyork Her - Analytical Queries
- Brandon Cho - Data Cleaning / Encryption
- Jack Nguyen - User/Admin System
- Aaron Schomer - Data Visualization with Tableau
We appreciate the hard work and collaboration of each team member in bringing this project to life.