CSE 6250: Big Data for Health Informatics

In the Fall 2024 semester, I took two courses. Here I summarize my experience in one of them, “Big Data for Health Informatics.”

Overall

This course focuses on big data analysis in the healthcare industry, particularly using patient data obtained from hospitals. I enrolled expecting to learn how to use Hadoop and Spark to efficiently handle big data such as patient diagnosis histories. In the end, I experienced the entire process, from data processing to prediction with machine learning. For the final project, I selected a previously published paper on data utilization in the healthcare industry, implemented the proposed methods, and reproduced its evaluation results.


The primary dataset used in this course is called MIMIC-III, which consists of real data collected from patients during their stay in the ICU. While it requires an application for access, it is available as open data at no cost.

MIMIC-III Clinical Database

Big Data in Healthcare, MIMIC

Content

Grading is based on four individual assignments, a project in which pairs of students reproduce a paper on data analysis in the healthcare industry, and a final exam. Opinions about group work in the OMSCS program seem to be mixed, but since studying at an overseas graduate school is an opportunity to challenge myself, I decided to take a course that included group work for the first time.

The course primarily consists of videos from past lectures and practical sessions known as Labs. The schedule had us work through one or two lecture videos and a Lab each week. In the Labs, we worked with tools like Hadoop and Spark running on Docker. The Labs mainly use Scala, but the assignments require PySpark, so although the concepts were helpful, I felt it was not necessary to spend much time interpreting the Scala code. Until recently, the assignments were also Scala-based, and there were as many as five of them, which made the course quite demanding; this seems to have changed in recent years. I knew that this course had a high workload, and since I had access to the Labs before classes started, I completed most of that work in advance. Thanks to this, I was able to focus on the assignments and lectures once classes began.

Learning Environment

HW01

For HW01, we first had to prepare to handle actual patient data, which included applying for access. To learn about the ethics of handling medical data, we had to read online materials and pass a test. The materials covered the existing rules and the incidents that led to their establishment in considerable detail, so this took quite a bit of time. The assignment itself involved using Python to handle data in a format similar to MIMIC-III, performing ETL (Extract, Transform, Load) processing, and running classification with scikit-learn.
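To give a feel for the ETL step, here is a minimal sketch in plain Python. It is not the actual assignment code: the event rows, the code names such as `DIAG_A`, and the helper functions are all made up for illustration, and summing values per event code is just one possible aggregation.

```python
from collections import defaultdict

# Hypothetical event records in a long format loosely resembling MIMIC-III:
# (patient_id, event_code, value). These rows are invented for illustration.
events = [
    (1, "DIAG_A", 1.0),
    (1, "LAB_X", 2.0),
    (1, "LAB_X", 4.0),
    (2, "DIAG_A", 1.0),
    (2, "DRUG_Y", 1.0),
]


def extract_features(rows):
    """Transform: sum values per (patient, event code) into sparse dicts."""
    features = defaultdict(lambda: defaultdict(float))
    for patient_id, code, value in rows:
        features[patient_id][code] += value
    return {pid: dict(codes) for pid, codes in features.items()}


def to_matrix(features):
    """Load: build a dense matrix with a fixed, sorted column order."""
    columns = sorted({code for codes in features.values() for code in codes})
    patient_ids = sorted(features)
    matrix = [
        [features[pid].get(col, 0.0) for col in columns]
        for pid in patient_ids
    ]
    return patient_ids, columns, matrix


patient_ids, columns, X = to_matrix(extract_features(events))
print(columns)
print(X)
```

A matrix like `X`, together with a label per patient, is the kind of input that a scikit-learn classifier such as `LogisticRegression` accepts through its `fit(X, y)` method.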

HW02

HW02 was similar to HW01 in that we processed MIMIC-III data to make predictions; the difference was that we used PySpark on Colab. This meant implementing our solutions in PySpark and getting used to its distinctive syntax. We also implemented gradient descent and stochastic gradient descent from their definitions and performed classification with logistic regression.
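As a rough illustration of the stochastic gradient descent part, the following sketch trains a logistic regression classifier in plain Python. It is not the assignment's PySpark code: the toy data, the learning rate, the epoch count, and the function names are assumptions chosen only to keep the example small.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sgd_logistic_regression(X, y, lr=0.1, epochs=200, seed=0):
    """Update the parameters after every single sample (stochastic updates)."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for i in indices:
            # Gradient of the log loss for one sample is (p - y) * x.
            error = predict_proba(weights, bias, X[i]) - y[i]
            weights = [w - lr * error * xi for w, xi in zip(weights, X[i])]
            bias -= lr * error
    return weights, bias


# Toy, linearly separable data (invented for illustration).
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

weights, bias = sgd_logistic_regression(X, y)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
print(predictions)
```

Batch gradient descent differs only in that it averages the per-sample gradients over the whole dataset before making one update, which is more stable per step but touches every sample for each update.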

HW03

As in HW02, we implemented our solutions in PySpark on Colab, this time focusing on patient phenotyping based on health data. “Phenotype” is a medical term and may not be familiar, but here it seems to refer to the classification of patients. The assignment involved comparing a theoretically constructed classification method based on patient observation data with a clustering method.

HW04

HW04 centered on neural networks: we used a portion of the MIMIC-III data to implement RNNs and predict patient conditions. As part of this assignment, a Kaggle competition open only to course participants was held, and students who submitted models with high prediction accuracy received extra credit (bonus points that count toward the course grade). Rankings were published continuously, so we could check our standings, but as the deadline approached, submissions surged and my rank dropped significantly. A last-minute rush seems to be common among students, and since I was close to the cutoff for the bonus, I also started rethinking my approach right before the deadline, which made the final moments quite hectic.

Exam

There was no midterm exam, only a final exam. The exam covered about ten weeks’ worth of material, so its scope was quite broad. With only 20 questions, many of them focused on concepts. I took the exam relatively soon after it became available, but the test monitoring system used in other courses had not been activated yet. I was concerned that I could not prove I had not cheated, and that my results might be invalidated. In the end, other students were in the same situation, and after I informed the TA, my results were accepted, which was a relief.

Project

For the project, students paired up and chose a paper from a list to reproduce. Since students had to form pairs themselves, I reached out to someone in a similar time zone. I was aware that the choice of paper would greatly affect the project’s workload, so I opted for something more manageable. My partner was very capable, and we completed the project without major issues. The last month of the course was relatively calm, so we could dedicate most of our time to the project. The final deliverables included a report and a presentation video, which made it important to plan our work carefully. The paper my team selected used MIMIC-III data to make predictions about ICU stays with LSTM and CNN models.

Reflection

I chose this course to gain experience with Hadoop and Spark for big data analysis, and I was satisfied with the overview of their operations and the experience of implementing them with PySpark. Group work was a new experience for me, but since I had an excellent partner, I was relieved that we were able to collaborate effectively.

I was able to gain experience in reproducing a research paper, and I can say that the successful completion of the project was thanks to my partner. At this point, I do not have the confidence to read a paper and reproduce its content on my own, but I would like to leverage this experience and challenge myself in other courses. Additionally, while I learned a lot from using deep learning models, I wish I could have focused more on the operations of data analysis frameworks.

This semester was the first time I took two courses at once, but I didn’t feel as overwhelmed as I had feared. I believe this was due to the summer break and relatively lighter workloads. Miraculously, the assignment deadlines for the two courses rarely overlapped and often alternated from one weekend to the next. As a result, I had an assignment due every week, but that was better than having them overlap. I also realized that prior preparation was extremely important. Since I could work on the BDH Lab in advance, I was able to successfully complete both courses.