A Review of the 2021 DSSG Summer Research Program

2021-09-18

This past summer, I was fortunate to get a taste of data science research before officially ending my undergraduate journey. It was rewarding overall and, I must say, one of the most engaging and supportive programs I have taken part in. It broadened my horizons and shed light on my future career path towards practicing social good with technology. I have been meaning to write a review of the program, drawing on some personal experiences.

The program lasted three months in total over the summer. The first week was about onboarding and getting to know each other. We had two kick-off meetings: one with the cohort of fellows, alumni, and scientists, and the other with only the project team and the partner. The vibe was open and friendly. As a newcomer to data science research, I felt welcome and blessed to be surrounded by such talented and approachable people.

After that, there were a couple of workshops to help us get familiar with some fundamental statistical tools and methodologies, so that we would be prepared to tackle all sorts of data in the upcoming project. I personally found them quite helpful, since they covered major topics in data cleaning, statistical modeling, and analysis with several useful R libraries, as well as software engineering tools like GitLab and Shiny. The resources were compact and straightforward, and turned out to be a good refresher on R.

The project I got to work on was called the Unbiased Mobility Data project, in partnership with the Cedar Academy. Initially, the research question centered on potential “preferential sampling” in traffic camera installations in the City of Surrey, and the original goal was to test whether that claim held and to create a tangible solution, usually a Shiny app with data visualizations and covariate analysis. There were only three of us on the team, all undergraduate students. Not surprisingly, none of us had formally done any statistical research before. So at the beginning, we spent a week or so digging into papers to figure out what “preferential sampling” (PS) meant for geospatial data, and booked several meetings with domain experts to clarify questions about our special case: the data points were no longer randomly distributed but all lay along straight routes, the sampling locations did not change over time, and so on. So if the goal was to interpolate traffic counts in nearby areas from the sampled data points while eliminating any potential bias introduced by PS, the existing approaches would not apply, since their assumptions failed in the first place.

Simply put, after putting a lot of effort into proving the proposal “wrong”, we needed to pivot and re-establish our problem statement. However, more challenges followed. One teammate pointed out the biases introduced in the earlier object detection phase and proposed moving towards bias detection and analysis within the computer vision scope. Yet not all of us were familiar with the relevant topics. The other teammate and I leaned towards statistical analysis of the existing traffic count data to draw out other useful insights (e.g., business recovery, COVID positivity rate).

Normally, there were two weekly review meetings, one with the scientists and the other with our stakeholders, where we each presented our results and findings and heard feedback on next steps and on dimensions to drill into. While we struggled to commit to a single research direction, we were lucky to have their open support: they encouraged us to try out the different areas we were interested in and to focus more on the process than the results. I was then able to learn more about time series when analyzing the correlation between COVID data and traffic counts over time, image pre-processing methods when trying to mitigate overexposure from vehicle headlights, and paired t-tests and chi-squared tests when examining trends or covariates shared by nearby cameras. However, this setup did not give me much of a sense of teamwork, since we all focused on distinct aspects, which made it a bit hard to ask each other for help and easy to fall out of sync at times.
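To give a rough idea of what these analyses look like, here is a minimal sketch in plain Python. It is not the project's code (our work was done in R), and the function names and all numbers below are made up purely for illustration. It computes a Pearson correlation between two series, optionally shifting one of them by a lag, and the t statistic for a paired t-test between two cameras.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def lagged_pearson(x, y, lag):
    """Correlate x[t] with y[t + lag] for a non-negative lag."""
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must be >= 0 and leave at least 2 points")
    return pearson(x[: len(x) - lag], y[lag:])


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on matched observations a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must be paired and have at least 2 points")
    diffs = [p - q for p, q in zip(a, b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean_diff / math.sqrt(var_diff / n)


# Hypothetical weekly numbers, invented for this example only.
covid_cases = [10, 20, 40, 80, 60, 30]
traffic_counts = [900, 850, 700, 500, 600, 800]
camera_a = [120, 135, 150, 110, 142]
camera_b = [115, 130, 149, 100, 140]

print("same-week correlation:", round(pearson(covid_cases, traffic_counts), 3))
print("one-week-lag correlation:", round(lagged_pearson(covid_cases, traffic_counts, 1), 3))
print("paired t statistic:", round(paired_t_statistic(camera_a, camera_b), 3))
```

In practice you would turn the t statistic into a p-value with a statistics library, and real traffic series usually need detrending or differencing before a correlation like this means much.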

After the midterm presentation, with limited time and a shortage of (labelled) data, we were a bit lost about what else we could do with the data we had. It was also when we realized the urgent need to synthesize our preliminary work and come up with a solution. Eventually, we agreed on building a Shiny app, as it was one of the simplest ways to embed native data visualizations in a web application, and also because one of us had barely any app development experience. Once actual development kicked off, although we were still working on different features tied to our own research focus, we had a better feedback loop and better task tracking, through code reviews and ticket assignments respectively. As a result, the second half of the project felt more collaborative and efficient overall.

One of the most memorable highlights had to be our very first in-person event since the pandemic: the final presentation, which was also my very last one, since I was about to graduate. It was a hybrid live event hosted at the student’s Nest, and although I was thrilled to see everyone in person, it also felt a bit hectic to present to real humans rather than just a screen. As for the presentation itself, apart from some small technical issues during the demo, it went well overall in showcasing the problems we were tackling, along with the numerical and logical reasoning behind them.

To summarize, this program inspired me to tackle a research question through different lenses and let me dive into the world of geospatial and image data, giving me a peek into the intricacy of raw data processing and modeling, as well as the huge research potential in both domains. So if you’re a data geek curious about research but not sure which area you want to pursue further, this program is definitely a place to explore those topics and find your passion. There are also a variety of resources and connections you can take advantage of: you can chat with different DSI mentors to get to know their projects and potentially find the perfect team that you may want to join in the future!

Don’t miss out on this year’s DSSG summer research program. Get ready to take on real-life data challenges, gain more hands-on analytical and research skills, and most importantly, find and explore what you are passionate about!

For more information on the project, please feel free to check out the final documentation and our GitHub repo! 😉