Creating a data-set
We reflect here on the process we followed to come up with
the data-set for the event. We hope that the notes shared here will help give you
a framework to think about the nuances that need to be carefully thought
through while selecting a data-set for such an exercise. In the process of
coming up with the data-set, we were able to distill a clear idea of what it ought
to look like and what it ought not.
Here were a couple of considerations –
- We wanted the kids to work through the full life cycle of data analysis – data collection; data entry and cleaning; feature identification; visualization; feature selection; a simple model-building exercise; and finally validating the model on some unseen data. This was important; we didn’t want a kid to just spend time staring at a bunch of videos or watching an instructor work her way through data. It was important that she get her hands dirty understanding the data for herself.
- We wanted this data-set to be easily relatable to kids of this age group – something they would see immediate utility for. Not oil or stock prices!
Our initial realization was that for a hands-on
demonstration, it would be best to work with a simple supervised learning
example. Unsupervised learning was a no-no. Keeping this broad framework in
mind, we went ahead and came up with the following ideas –
Grocery consumption prediction – though this
seemed good from a relatable angle, we weren’t sure about how hands-on this
could get.
Weather-related predictions – We thought of predicting the weather
by looking at the clothes a person was wearing. We could give out images of
people wearing various clothes and predict what the weather was at the time the
picture was taken. This probably would’ve worked best given the simplicity of
the features (shirt vs. jacket; trousers vs. shorts). But what if this were too
simple? The gathering could simply brush this aside and say “hey, stop fooling
us with jargon. We know how to predict the weather by looking at clothes. What’s
new here?”
Movie/book prediction – the crowd weren’t
bookworms (we had a sense of their taste in books), and we felt taste in movies
could be fairly noisy. We tried out a quick experiment and had Varun’s young
nieces list their favorite movies. The lists were fairly noisy and, worse, the
movies they listed as not-good were ones they hadn’t seen. This was turning out
to be a problem in one-class classification where only the “liked” set was
known – nope, we weren’t heading down that road.
A friend predictor - The idea was to show
several images having faces of boys/girls with a brief description of their
hobbies and ask each kid in our gathering independently whether they would
befriend the person in the picture. We would then build a model for every kid
which would predict his/her “friend-identifier”. This promised to be a fun
exercise where the gathering could probably learn more about the rest in the
group through such an analysis. We could see that this was easily relatable and
would be entertaining in addition to being a real social experiment!
We decided to go ahead with the friend predictor exercise
[more information here]. While deciding this, we asked ourselves whether every
kid in our camp should work on the same data-set or whether we should have
different groups working on different exercises. We thought it would get hard to
manage and stuck to working on one exercise.
So let’s clearly lay out the
details -
Each kid gets to see N faces and rates each on a
scale of 1-5 for how likely s/he is to befriend the person shown in the image. A
rating of 1 means s/he won’t befriend the person at all and 5 means s/he will
surely make the person a friend. Once the N images are rated on this “friendship
propensity” scale, we build a classifier to predict these ratings. Our aim was
to approximate a Naïve Bayes approach in training the classifier. We decided to go with
discrete/categorical variables on purpose so that the exercises involved only
counting and visualizing simple bar graphs instead of worrying about means,
standard deviations and continuous valued distributions!
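To make the "only counting" idea concrete, here is a minimal sketch of a count-based Naïve Bayes classifier over categorical features. It is an illustration and not the exact procedure used at the camp: the function names, the add-one smoothing and the tiny data-set are all assumptions made for this example, and the ratings are assumed to be already reduced to two classes.

```python
from collections import Counter, defaultdict

def train_counts(rows, labels):
    """Count how often each (feature, value) pair occurs within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # class -> Counter of (feature, value)
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[label][(feature, value)] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts, n_values=2):
    """Return the class with the highest Naive Bayes score.

    Add-one smoothing (an assumption of this sketch) keeps an unseen
    feature value from forcing a score to zero. n_values is the number
    of values each feature can take (two in this exercise).
    """
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # class prior
        for feature, value in row.items():
            score *= (value_counts[label][(feature, value)] + 1) / (n + n_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical responses from one kid (made up for illustration).
rows = [
    {"hobby": "sports", "face": "smile"},
    {"hobby": "sports", "face": "serious"},
    {"hobby": "non-sports", "face": "smile"},
    {"hobby": "non-sports", "face": "serious"},
]
labels = ["friend", "friend", "not friend", "not friend"]

class_counts, value_counts = train_counts(rows, labels)
guess = predict({"hobby": "sports", "face": "smile"}, class_counts, value_counts)
print(guess)  # friend
```

Everything the model needs is a table of counts, which is the same thing a kid would read off a bar graph.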
We focused on four features for this exercise – gender,
name, face and hobby. Each feature could take two values – male/female; old-sounding
name/new-sounding name; smile/serious; and sports/non-sports. We did a quick
pilot again, thanks to Varun’s nieces, and saw that the hobby correlated the
best with the final output (0.6 points). This gave us confidence about the
features we should expect the kids to implicitly use, and we also felt this would
be a good case to show poorly performing features.
In our pilot, we also realized that one of the
nieces’ responses looked different, and more in line with her known “friend-quotient”,
when she repeated the exercise a second time. We hence added in 8-10 additional
images as a practice set, making it a 56-point data-set in total.
Having finalized the features, we had to think
about the distribution of the features in our set. We decided it would be best
if each feature could take two values (0 or 1); with four features in all, that
gives 2^4 = 16 unique permutations in the data-set. This created a balanced set,
which meant we could place more trust (why not complete trust? We will leave you
to think) in an inference or visualization made with one feature at a time (since
all the features were evenly distributed).
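The 16 profiles and the balance property can be checked with a few lines of code. This is a sketch: the feature values are taken from the post, but the dictionary layout and variable names are our own choice for illustration.

```python
from collections import Counter
from itertools import product

# Four two-valued features, as described above.
FEATURES = {
    "gender": ("male", "female"),
    "name": ("old-sounding", "new-sounding"),
    "face": ("smile", "serious"),
    "hobby": ("sports", "non-sports"),
}

# Every possible profile: one value chosen per feature.
profiles = [dict(zip(FEATURES, values)) for values in product(*FEATURES.values())]
print(len(profiles))  # 16

# In a balanced set each value of each feature appears equally often.
for feature in FEATURES:
    print(feature, dict(Counter(p[feature] for p in profiles)))
```

Each value shows up in 8 of the 16 profiles, which is what lets a single-feature bar graph be read at face value.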
Because the data-set was balanced, each category of a
feature made up 50% of the set (say, 50% males and 50% females). As a result, when
the kids looked at the “friend” class (those that they’d rated 4 or 5), they
just had to see whether one category of a feature was overrepresented to tell
whether they had a systematic preference. If we hadn’t done this, we would
first have had to look at the distribution of a feature in the entire training set
and then compare it with the
distribution in the “friend” or the “non-friend” class to understand whether
the kid had a systematic inclination towards favoring a feature in deciding
whom to befriend.
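Here is a small sketch of the extra comparison an unbalanced set would require. The helper names and the data are hypothetical, made up purely to illustrate the point: in this example 75% of the "friend" class likes sports, which looks like a preference until you notice that 75% of the whole set likes sports too.

```python
from collections import Counter

def category_shares(rows, feature):
    """Fraction of rows falling in each category of the given feature."""
    counts = Counter(row[feature] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def overrepresentation(all_rows, friend_rows, feature):
    """Share in the friend class minus share in the whole set, per category."""
    overall = category_shares(all_rows, feature)
    friends = category_shares(friend_rows, feature)
    return {value: friends.get(value, 0.0) - share for value, share in overall.items()}

# Hypothetical, deliberately unbalanced set: 6 sports profiles, 2 non-sports.
all_rows = [{"hobby": "sports"}] * 6 + [{"hobby": "non-sports"}] * 2
# Hypothetical friend class: 3 sports profiles, 1 non-sports.
friend_rows = [{"hobby": "sports"}] * 3 + [{"hobby": "non-sports"}]

diff = overrepresentation(all_rows, friend_rows, "hobby")
print(diff)  # {'sports': 0.0, 'non-sports': 0.0}
```

With a balanced set the "whole set" share is always 50%, so the kids can skip this subtraction and just look at the friend class.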
Another challenge was that we were collecting a
5-valued output, but finally wanted to do a two-class classification. Would
kids make this leap in understanding? It turned out to be pretty easy
for them - it was intuitive for them to consider those they’d rated 4 or 5 as
folks they’d befriend. (Each task here is a science in itself!)
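In code, the leap from five ratings to two classes is a single threshold. A minimal sketch, where the function and constant names are ours and the cut-off of 4 follows the "rated 4 or 5" rule described above:

```python
FRIEND_THRESHOLD = 4  # ratings of 4 and 5 count as "friend"

def to_class(rating):
    """Map a 1-5 rating to one of two classes."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return "friend" if rating >= FRIEND_THRESHOLD else "not friend"

ratings = [1, 2, 3, 4, 5]
classes = [to_class(r) for r in ratings]
print(classes)
# ['not friend', 'not friend', 'not friend', 'friend', 'friend']
```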
We also had to have multiple points for each
such permutation – we kept that number to three. We hence had 48 points in our
data-set. We decided to use 32 points to train our model and to hold out 16 to
validate our model. We were cheating a little here in choosing our validation
set, but that was to maximize our chances of success in building a good model.
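One way to arrive at a 32/16 split is to hold out one of the three copies of every permutation. This is only a sketch and an assumption on our part for illustration: the post does not spell out how the validation points were actually picked.

```python
from itertools import product

COPIES_PER_PERMUTATION = 3    # three data points per feature permutation
HELD_OUT_PER_PERMUTATION = 1  # assumed: one of the three goes to validation

# 16 permutations of four binary (0/1) features.
permutations = list(product((0, 1), repeat=4))

train, holdout = [], []
for features in permutations:
    # Each data point is identified by its features and a copy number.
    copies = [(features, copy_id) for copy_id in range(COPIES_PER_PERMUTATION)]
    holdout.extend(copies[:HELD_OUT_PER_PERMUTATION])
    train.extend(copies[HELD_OUT_PER_PERMUTATION:])

print(len(train), len(holdout))  # 32 16
```

A split like this keeps both the training and the validation sets balanced, so every permutation seen in validation has also been seen in training.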
When we brainstormed about this with Prof. Una-May from the
ALFA group at CSAIL, MIT, she suggested we use a digital platform to
collect the kids’ responses. This would have been a very nice layer atop our
existing framework wherein the kids could play a ‘game’ at length before
arriving at the camp, and we’d have our data digitized and prepped for them to
start playing with! We couldn’t really pull this off given the time constraint,
but we plan to tinker with it going forward. One thing we would have
missed out on by using such an app is that the kid would probably not have
entered information herself into an Excel sheet and wouldn’t have gotten first-hand
exposure to data entry.
The exercise turned out quite well in the
end. We did get to see varied responses, and the features did predict well. Gender
(male/female) was a stark differentiator for some, while indoor/outdoor activities
were the expected differentiator for the rest.
In all, the data-set worked! Let
us know if you can better this! And this is just the tip of the iceberg of what
can be done! As we leave you here, we ask: what is the Scratch for data science?
Anyone?