Monday, June 15, 2015

Creating a dataset

We reflect here on the process we followed to come up with the data-set for the event. We hope these notes give you a framework for thinking through the nuances of selecting a data-set for such an exercise. Along the way, we distilled a clear idea of what the data-set ought to look like and what it ought not. Here are a couple of considerations – 
  • We wanted the kids to work on the full life cycle of data analysis – data collection; data entry and cleaning; feature identification; visualization; feature selection; a simple model-building exercise; and finally validating the model on some unseen data. This was important: we didn’t want a kid to just spend time staring at a bunch of videos or watching an instructor work her way through data. She had to get her hands dirty understanding the data for herself.
  • We wanted this data-set to be easily relatable to kids of this age group – something they would see immediate utility for. Not oil or stock prices!
Our initial realization was that for a hands-on demonstration, it would be best to work with a simple supervised learning example. Unsupervised learning was a no-no. Keeping this broad framework in mind, we came up with the following candidates – 
  • Grocery consumption prediction – though this seemed good from a relatable angle, we weren’t sure about how hands-on this could get.
  • Weather-related predictions – We thought of predicting the weather by looking at the clothes a person was wearing. We could give out images of people wearing various clothes and predict what the weather was at the time the picture was taken. This probably would’ve worked best given the simplicity of the features (shirt vs. jacket; trousers vs. shorts). But what if this were too simple? The gathering could simply brush this aside and say “hey, stop fooling us with jargon. We know how to predict the weather by looking at clothes. What’s new here?”
  • Movie/book prediction – the crowd weren’t bookworms, and we had a sense of their taste in books. We felt taste in movies could be fairly noisy. We tried a quick experiment and had Varun’s young nieces list their favorite movies. The result was fairly noisy and, worse, the movies they listed as not-good were ones they hadn’t seen. This was turning into a problem in one-class classification, where only the “liked” set was known – nope, we weren’t heading down that road.
  • A friend predictor - The idea was to show several images of boys’/girls’ faces, each with a brief description of their hobbies, and ask each kid in our gathering independently whether they would befriend the person in the picture. We would then build a model for every kid which would predict his/her “friend-identifier”. This promised to be a fun exercise where the kids could probably learn more about the others in the group through such an analysis. We could see that this was easily relatable and would be entertaining in addition to being a real social experiment!
We decided to go ahead with the friend predictor exercise [more information here]. While deciding this, we asked ourselves whether every kid in our camp should work on the same data-set or whether we should have different groups working on different exercises. We thought it would get hard to manage and stuck to working on one exercise.

So let’s clearly lay out the details -
  • Each kid gets to see N faces and rates each on a scale of 1-5 for how likely s/he is to befriend the person shown in the image. A rating of 1 means s/he won’t befriend the person at all and 5 means s/he will surely make the person a friend. Once N images have been rated on this “friendship propensity” scale, we create a classifier to predict the ratings. Our aim was to approximate a Naïve Bayes approach in training the classifier. We decided to go with discrete/categorical variables on purpose so that the exercises involved only counting and visualizing simple bar graphs instead of worrying about means, standard deviations and continuous valued distributions!
  • We focused on four features for this exercise – gender, names, faces and a hobby. Each feature could take two values – male/female; old sounding name/new sounding; smile/serious and sports/non-sports. We did a quick pilot again, thanks to Varun’s nieces, and saw that the hobby correlated the best with the final output (0.6 points). This gave us confidence in the features we should expect the kids to implicitly use, and we also felt it would be a good case for showing poorly performing features.
  • In our pilot, we also realized that one of the nieces’ responses looked different, and more in line with her known “friend-quotient”, when she repeated the exercise a second time. We hence added 8-10 additional images as a practice set, making it a 56-point data-set in total.
  • On finalizing the features, we had to think about the distribution of the features in our set. We decided it would be best if each feature took one of two values (0 or 1), and we had four features in all – 16 unique permutations in total in the data-set. This created a balanced set, which meant we could place more trust (why not complete trust? We leave that for you to think about) in an inference/visualization made with one feature at a time, since all the features were evenly distributed.
  • Because the data-set was balanced, each category of a feature made up 50% of the set (say 50% males and 50% females). As a result, when the kids looked at the “friend” class (those that they’d rated 4 or 5), they just had to see whether one category of a feature was overrepresented to tell whether they had a systematic preference. If we hadn’t done this, we would first have had to look at the distribution of a feature in the entire training set and then compare it with the distribution in the “friend” or the “non friend” class to understand whether the kid had a systematic inclination towards favoring a feature in deciding whom to befriend.
  • Another challenge was that we were taking a 5-valued output, but finally wanted to do a two-class classification. Would kids make this leap in understanding? It turned out to be pretty easy for them - it was intuitive for them to consider those they’d rated 4 and 5 as folks they’d befriend. (Each task here is a science in itself!)
  • We also had to have multiple points for each such permutation – we kept that number to three. We hence had 48 points in our data-set. We decided to train our model on 32 points and hold out the remaining 16 to validate it. We were cheating a little here in choosing our validation set, but that was to maximize our chances of success in building a good predictor!
  • When brainstorming with Prof. Una-May from the ALFA group at CSAIL, MIT about this, she suggested we use a digital platform to collect the kids’ responses. This would have been a very nice layer atop our existing framework wherein the kids could play a ‘game’ at length before arriving at the camp, and we’d have our data digitized and prepped for them to start playing with! We couldn’t really pull this off given the time constraint, but we plan to tinker with it going forward. One thing we would have missed out on by using such an app is that the kid would not have entered the information into an Excel sheet herself and so wouldn’t have gotten first-hand exposure to data-entry.
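To make the design above concrete, here is a minimal Python sketch of the balanced set, the 4-or-5 “friend” rule, the one-feature-at-a-time check and a counting-based Naïve Bayes. It is an illustration under stated assumptions, not the code used at the camp: all function names are hypothetical, the ratings come from an imaginary kid who befriends only sports lovers, the hold-out rule (one copy of each combination) is our guess at a reasonable split, and the add-one smoothing is a standard convenience that the exercise itself did not necessarily use.

```python
from collections import Counter
from itertools import product

# Feature names and values follow the post; all function names are hypothetical.
VALUES = {
    "gender": ("male", "female"),
    "name": ("old", "new"),
    "face": ("smile", "serious"),
    "hobby": ("sports", "non-sports"),
}
FEATURES = tuple(VALUES)


def build_balanced_set(copies=3):
    """Every combination of feature values, repeated `copies` times (16 x 3 = 48)."""
    combos = product(*(VALUES[f] for f in FEATURES))
    return [dict(zip(FEATURES, combo)) for combo in combos for _ in range(copies)]


def split_train_holdout(profiles, copies=3):
    """ASSUMPTION: hold out the last copy of each combination (32 train, 16 held out)."""
    train = [p for i, p in enumerate(profiles) if i % copies != copies - 1]
    held_out = [p for i, p in enumerate(profiles) if i % copies == copies - 1]
    return train, held_out


def is_friend(rating):
    """Collapse the 1-5 rating into two classes: 4 or 5 means 'friend'."""
    return rating >= 4


def friend_share(profiles, ratings, feature):
    """Share of each value of `feature` among the profiles rated 4 or 5."""
    counts = Counter(p[feature] for p, r in zip(profiles, ratings) if is_friend(r))
    total = sum(counts.values())
    return {v: (counts[v] / total if total else 0.0) for v in VALUES[feature]}


def train_naive_bayes(profiles, ratings):
    """Training is just counting: class sizes and feature values per class."""
    labels = [is_friend(r) for r in ratings]
    class_counts = Counter(labels)
    value_counts = Counter()
    for profile, label in zip(profiles, labels):
        for f in FEATURES:
            value_counts[(label, f, profile[f])] += 1
    return class_counts, value_counts


def predict_friend(model, profile):
    """Pick the class with the larger prior x product of per-feature likelihoods."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        # ASSUMPTION: add-one smoothing so an unseen value never zeroes the product.
        score = (class_counts[label] + 1) / (total + 2)
        for f in FEATURES:
            score *= (value_counts[(label, f, profile[f])] + 1) / (class_counts[label] + 2)
        scores[label] = score
    return scores[True] > scores[False]


def simulated_rating(profile):
    """HYPOTHETICAL kid who befriends only sports lovers; not data from the camp."""
    return 5 if profile["hobby"] == "sports" else 2


profiles = build_balanced_set()
train, held_out = split_train_holdout(profiles)
train_ratings = [simulated_rating(p) for p in train]
model = train_naive_bayes(train, train_ratings)

print(friend_share(train, train_ratings, "hobby"))
print(friend_share(train, train_ratings, "gender"))
hits = sum(predict_friend(model, p) == is_friend(simulated_rating(p)) for p in held_out)
print(f"held-out accuracy: {hits}/{len(held_out)}")
```

For this simulated kid, the “friend” class is 100% sports but exactly 50/50 on gender, which is the pattern a bar graph would reveal: because the set is balanced, any departure from 50% within the “friend” class points to a preference. Real ratings would be noisier than this toy rule.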
It turned out quite well in the end. We did get to see varied responses. The features did predict well. Male/female was a stark differentiator for some, while indoor/outdoor activities was the expected differentiator for the rest. In all, the data-set worked! Let us know if you can better this! And this is just the tip of the iceberg of what can be done! While we leave you here, we ask: what is the Scratch for data science? Anyone?
