During the fall semester of my senior year, I worked with two things for the first time, Python and data analysis, in an atmospheric sciences-focused computing and data analysis class. At the end of the semester, each of us chose our own dataset and project to work on. I took a dataset that tracked air pollutants within each district of Seoul, South Korea each hour over a week and visualized it. I did all of this with no prior experience in data analysis or Python, and with no domain knowledge of atmospheric sciences to boot.
Reminiscing on ATMS 305
My fall semester senior year at the University of Illinois was a bit of an experimental time. With my core classes done, it was time to go off the beaten path and give some different technologies the ol’ college try. One of these pursuits was data analysis. During the spring semester the year before, my probability class gave me a taste of data analysis as we took representative datasets from a hospital and analyzed them in MATLAB. It was quite the learning experience for me, and I really wanted to dig deeper. This led me to ATMS 305, a computing and data analysis class with a focus on atmospheric sciences.
There’s a little detail you should probably know though – I hadn’t used Python before this point. Granted, I’d seen Python in the few Codecademy lessons I did out of curiosity while trying to quickly learn it at a hackathon, but I’d never used it in anything. This was my whirlwind tour into the wild world of Python, as I learned its ins-and-outs, how to use Jupyter, and how to use some of the big libraries out there – SciPy, NumPy, Matplotlib, and pandas. I even got a taste of shell scripting while learning a bit more about Unix.
Each of us had to create our own final project using any data set we could find. This was definitely a bit out of my wheelhouse. I could easily handle the assignments in the class, but now I actually had to find my own data set and draw my own findings from it – and both the data and the findings had to be atmospheric sciences-related to boot. Oh dear.
The Final Project
It took a bit of time, searching, and talking with my teacher to get inspired, but I found the perfect data set to work on – an hourly log of various air pollutants in the districts of Seoul, South Korea. The data, a single .csv file, gives hourly average concentrations of pollutants over a seven-day period from November 18th, 2017 to November 24th, 2017.
Before I really got to play around with the data, I had to do a bit of work. The data had all its labels and non-numerical cells in Korean, so I spent an hour translating all the labels in the .csv to English. This is how I learned how comprehensive the data was – this wasn’t just data on air pollutants in Seoul, but data on air pollutants in each district of Seoul. With 196 hours of readings of six pollutants in all 25 “gu” of Seoul, I had plenty of data to look at (and some missing data to boot, yay!).
To look at the data, I needed a few libraries, in particular, pandas, Matplotlib, Cartopy, and Imageio. I started off by importing all the data into a pandas dataframe, indexing by time. To get an idea of how readings changed over time, I sliced the data to only include the district of Gangnam and generated line plots of the readings for each pollutant.
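For the curious, a minimal sketch of that load-and-slice step might look like the following. It is not my original notebook code: the column names (`time`, `district`, `NO2`, `O3`), the district label `Gangnam-gu`, and the inline sample values are all placeholders made up for illustration, so swap in whatever the translated headers in your own .csv turn out to be.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real (translated) file.
# One NO2 cell is left blank to mimic the missing readings in the data.
SAMPLE_CSV = """time,district,NO2,O3
2017-11-18 00:00,Gangnam-gu,0.030,0.010
2017-11-18 01:00,Gangnam-gu,0.028,0.012
2017-11-18 00:00,Jongno-gu,0.035,0.008
2017-11-18 01:00,Jongno-gu,,0.009
"""


def load_readings(source):
    """Read the hourly log into a DataFrame indexed by timestamp."""
    return pd.read_csv(source, parse_dates=["time"], index_col="time")


def district_series(readings, district):
    """Keep one district's rows and drop the now-constant label column."""
    return readings[readings["district"] == district].drop(columns="district")


def plot_district(frame, district):
    """Draw one line plot per pollutant column (needs Matplotlib installed)."""
    return frame.plot(subplots=True, figsize=(8, 6), title=district)


readings = load_readings(io.StringIO(SAMPLE_CSV))
gangnam = district_series(readings, "Gangnam-gu")
```

With a real file you would pass its path to `load_readings` instead of the `StringIO` wrapper, then call `plot_district(gangnam, "Gangnam-gu")` to get the stacked line plots.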
I noted how clearly nitrogen dioxide, carbon monoxide, and dust were on the rise between the 20th and 23rd, while ozone dropped over the same period. This visualization and commentary were all well and good (and if I did this 24 more times to cover every other district and do some comparisons, it would have been sufficient for the project’s requirements). But I couldn’t allow myself to go the easy route. I was determined to do my best. One of my favorite subreddits is /r/dataisbeautiful, where users present visualizations of data. I love how visualizations can present data compactly yet beautifully, and 25 graphs definitely isn’t compact. So, what could I do? How about plotting the data on a map of Seoul?
In class, we took a bit of time to look at Cartopy, but that bit of time went something like, “This exists; it can draw maps and data on it.” So, after some time, experimentation, and a lot of documentation reading, this is what I figured out. I could import my own maps as long as I had a shapefile (and thankfully, there’s this lovely shapefile of Seoul right here). To import the shapefile, I needed to use Cartopy’s shapereader. The shapereader gave me access to each individual district on the map in a list. With this, I could go through each district, add its shape to the map, and individually color each shape in the process. This is a snippet of me randomly coloring each district of Seoul.
Fantastic! Now, I could look at each pollutant, create a range of colors normalized with respect to the minimum and maximum concentrations of that pollutant in the dataset, and color each district based on its concentration. To show you what I mean, here is a picture of the nitrogen dioxide levels in all the districts at 11 pm on November 17th.
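The district-coloring idea described above could be sketched roughly like this. Treat it as an illustration under assumptions rather than my exact code: the helper names, the `viridis` colormap, the gray fill for missing readings, and the `name_field` attribute are all hypothetical, and the drawing function assumes the shapefile stores plain longitude/latitude coordinates, which may not be true of a given Seoul shapefile.

```python
import math

from matplotlib import cm, colors


def make_color_scale(vmin, vmax, cmap_name="viridis"):
    """Map concentrations in [vmin, vmax] linearly onto a colormap."""
    norm = colors.Normalize(vmin=vmin, vmax=vmax)
    return cm.ScalarMappable(norm=norm, cmap=cmap_name)


def district_color(scale, value, missing_color="lightgray"):
    """Return an RGBA color for a reading, or a flat gray if it is missing."""
    if value is None or math.isnan(value):
        return missing_color
    return scale.to_rgba(value)


def draw_districts(shapefile_path, readings_by_district, name_field, vmin, vmax):
    """Color every district shape by its reading and return the map axes."""
    # Imported here so the color helpers above work without Cartopy installed.
    import cartopy.crs as ccrs
    import cartopy.io.shapereader as shpreader
    import matplotlib.pyplot as plt

    scale = make_color_scale(vmin, vmax)
    crs = ccrs.PlateCarree()  # assumption: shapefile is in lon/lat degrees
    ax = plt.axes(projection=crs)
    records = list(shpreader.Reader(shapefile_path).records())
    for record in records:
        # name_field is whichever attribute holds the district name (assumed).
        name = record.attributes[name_field]
        color = district_color(scale, readings_by_district.get(name))
        ax.add_geometries([record.geometry], crs,
                          facecolor=color, edgecolor="black")
    # Zoom the map to the bounding box of all district shapes.
    bounds = [record.geometry.bounds for record in records]
    ax.set_extent([min(b[0] for b in bounds), max(b[2] for b in bounds),
                   min(b[1] for b in bounds), max(b[3] for b in bounds)],
                  crs=crs)
    return ax
```

Building the scale from the pollutant's minimum and maximum over the whole dataset, rather than per hour, keeps a given color meaning the same concentration in every frame.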
Now, this is where I really had fun. I went through each pollutant and batch-generated similar images for each time in the data. Then, using Imageio, I gathered all these images into a .gif. Without further ado, here are the animated visualizations I made for each pollutant.
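If you want to try the image-to-.gif step yourself, here is one way it might be sketched. The file naming scheme and helper names are made up for this example, and `build_gif` assumes a reasonably recent Imageio that ships the `imageio.v2` module; it is not necessarily how my original script was laid out.

```python
from pathlib import Path


def frame_name(pollutant, hour_index):
    """Zero-pad the hour so alphabetical order matches time order."""
    return f"{pollutant}_{hour_index:04d}.png"


def ordered_frames(frame_dir, pollutant):
    """List one pollutant's frame images in chronological order."""
    return sorted(Path(frame_dir).glob(f"{pollutant}_*.png"))


def build_gif(frame_dir, pollutant, out_path):
    """Stack a pollutant's frames into a single animated .gif."""
    # Imported here so the helpers above work without Imageio installed.
    import imageio.v2 as imageio

    frames = [imageio.imread(path)
              for path in ordered_frames(frame_dir, pollutant)]
    imageio.mimsave(out_path, frames)
    return out_path
```

The zero-padding matters more than it looks: without it, a frame numbered 10 would sort ahead of frame 2 and the animation would jump around in time.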
The Takeaways
This was an extremely fun project for me. I really got out of my comfort zone and got my feet wet in a new field, and I also got to show what I had learned by doing something I had never done before. With these visualizations, I got a clear picture of how these pollutants change over time, not only for a given district, but in relation to its surrounding districts as well. There’s a kind of motion you can perceive in some of these animations as pollutants move across the districts. This is something you definitely could not perceive with line graphs.
Granted, there are probably better ways to make visualizations like this. But for me, someone who had barely used Python before, I consider this an accomplishment. My Python skills are definitely a lot stronger now, and I have a much greater appreciation for data science. In the greater scheme of things, this was a stepping stone to figuring out what I want to do with computer engineering. I love trying technologies I haven’t used before, and I love applying my skills in fields and disciplines that aren’t purely engineering. I love working on projects that make me use my skills in completely different places, like using data science to help analyze air pollution. Even when I’m completely in the dark, not knowing the technologies or the subject matter, I find that I really want to learn. Wherever there’s a project for me, there’s a lot of learning and a lot of fun for me to have.