Monday, June 15, 2015

A data science camp for kids

We thought we would experiment with teaching class 6th-9th kids some data science! We think it is important to introduce students very early to thinking in a data-driven way, and kids are also way more fun to work with than college students!

So we sent out a form two weeks back and asked kids in classes 6th-8th to apply. We thought 5th was too young and that 9th/10th graders are more focused on school academics. We had 18 people apply, most of them interested in science and math and a few in history/arts (see pie chart).

Today, we had 14 people turn up (there was no selection, only self-selection). This included one 10th grade student and one sophomore undergraduate who tagged along with the group to learn!
Some kids came in early. We put on a YouTube video on Scratch for them. It was fun to discuss it with them and they related it to Lego. We asked the kids to install Scratch at home, make a dancing Aamir Khan (the famous Bollywood movie actor) on it and also have him jump from one building to another!

Once all the kids had assembled, we had a quick ice breaker. Parth and Abhishek, interns from IIT Kanpur, divided the students into a senior set and a junior set and picked one from each to form a group. This was to maximize group success. They had interesting ways to do this - read here [link – Parth, Abhishek]!

Then Samarth, our intern from Harvard, introduced the idea of data science to the kids [link - samarth]. He started with the famous John Snow cholera outbreak example. The kids were very quick; by a show of hands, everyone had seen a Google map. They understood that infected people were clustering around one pump while other pumps were vacant. A couple of questions came up: Why are some dots large and others small? Why did people not go to a pump that was farther away? We told them there were three lessons for them:
a. Don’t waste water: it wasn’t as easily available 100 years back, and it still isn’t for many;
b. Don’t run away from problems; try to solve them, else they will catch up with you (a couple of them said that their way of solving this was to just run away from the city!);
c. You can solve problems with data: here is a medical problem you solved by plotting infected families on a map; you did not need any exposure to biology or medicine to come up with a preliminary inference.

We moved on and started with the key data set and experiment. Our aim was to give the kids an idea of the whole cycle of data science – data collection, data entry/cleaning, feature extraction, visualization and model building (if we could get there; we had presumed we wouldn’t, for lack of time) – and also to sensitize them to data security/permissions concerns.

The exercise [link – Gursimran lab] we designed was this: every kid got a set of 48 faces with names and hobbies. The kids had to rate a person 5 if they would make that person a friend and 1 if not, and could choose the numbers in between. All 7 groups completed the exercise, with one mentor each. Out of these we pulled 16 samples as a validation set :) The ‘train’ data sets were then exchanged among groups.
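Pulling out a validation set can be sketched in a few lines of Python. This is only an illustration: the post does not say how the 16 samples were chosen, so the random pull, the face numbering from 1 to 48 and the function name `split_faces` are all assumptions.

```python
import random


def split_faces(face_ids, n_validation=16, seed=0):
    """Split face ids into a 'train' set and a held-out validation set.

    Assumption: the held-out faces are picked at random; the post does
    not describe how the 16 samples were actually chosen.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(face_ids)
    rng.shuffle(shuffled)
    validation = sorted(shuffled[:n_validation])
    train = sorted(shuffled[n_validation:])
    return train, validation


# Hypothetical numbering: faces 1..48, as in one kid's rating sheet.
train_ids, validation_ids = split_faces(range(1, 49))
print(len(train_ids), "train faces,", len(validation_ids), "validation faces")
```

The validation faces stay "in the envelope" and are only looked at once a trend has been read off the train faces.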

We then asked the groups: if we wanted to know from these sheets what kind of people Raghav (one of the kids) makes friends with, what would they look at? How would they come up with a solution? One of the kids suggested that we could look at what kind of games his friends played and tell accordingly. We asked what else, and then introduced the idea that some kids may make friends more with boys and some more with girls; we asked a boy whom he makes friends with more often and he said boys; a couple more said neutral. Then we discussed two more features: we had smiling and neutral faces – would some people make friends with smiling people more often? And we had old-style names and new names – would some people make friends with new-name folks more often? The kids seemed to have understood that people could possibly, though not necessarily, make choices on this basis. For the workshop we decided to go with three features: gender, hobby and name style.
The platform we were using was Excel. We had a sheet with the features already entered for the data set. The kids had to enter the ratings and check the features. The kids did find some features wrongly entered, and also some ambiguities: is squash indoor or outdoor, is Shilpy a new name or an old name? :)

The first task for the kids was to find out whether the person they were analyzing was a friendly person or not. To do this, they simply had to count how many people the kid had marked as 5, 4, 3, 2 and 1. Some of the kids used filters to do this and others counted manually. Finally, they made a graph. Here is the first graph we discussed with the kids, where the red bar depicted percentages and the blue bar depicted the actual number in each bin.

We made two inferences:

  • K (anonymous) is a friendly person: s/he makes friends more often than not.
  • K is clear-headed and a fast decision-maker. S/he doesn’t have many maybe/maybe-not cases. S/he either decides to make a person a friend or not.
Then we discussed a couple more graphs of other kids. We framed the statements positively :) V makes fewer friends, but that is because s/he likes to spend time studying. One group said of another kid that she is confused, since she had many maybe/maybe-not cases; we said not confused, she takes time to decide whom to make a friend, because she may be thinking deeply about it.
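For readers who would rather do the first task in code than with Excel filters, here is a minimal Python sketch of the same count. The ratings in `sample_ratings` are made up for illustration (they are not any kid’s data), and the function name `rating_distribution` is our own.

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {rating: (count, percentage)} for ratings 1 to 5.

    These are the two bars in the graph: the actual number in each
    bin and the percentage of all rated faces that it represents.
    """
    counts = Counter(ratings)
    total = len(ratings)
    distribution = {}
    for rating in range(1, 6):
        count = counts.get(rating, 0)
        percent = 100.0 * count / total if total else 0.0
        distribution[rating] = (count, percent)
    return distribution


# Made-up ratings for eight faces, for illustration only.
sample_ratings = [5, 5, 4, 1, 1, 1, 3, 5]
sample_distribution = rating_distribution(sample_ratings)
for rating in range(5, 0, -1):
    count, percent = sample_distribution[rating]
    print(f"rating {rating}: {count} faces ({percent:.1f}%)")
```

A kid with most of the mass at 5 and 1 and little at 3 would match the "clear-headed" reading above; lots of 3s would be the maybe/maybe-not case.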
This was fun! For the next exercise, they had to find whether, among the people the person chose to befriend, there were, say, more males than females, and similarly for the other features. (Footnote: we had created a balanced data set with a 50-50 split for each feature type; this meant we did not have to look at the non-friends group.) Again the kids used filters, counted the two values of each input variable and plotted graphs. We had already inserted a template in their Excel sheets for the kids to put their counts in; they then plotted the graphs themselves.
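The filter-and-count step for one feature might look like this in Python. It is a sketch under assumptions: we treat a rating of 4 or 5 as "chose to befriend" (the workshop did not fix a threshold), and the rows, column names and values below are invented for illustration.

```python
from collections import Counter

FRIEND_THRESHOLD = 4  # assumption: a rating of 4 or 5 counts as "friend"


def feature_counts_among_friends(rows, feature, threshold=FRIEND_THRESHOLD):
    """Count the values of one feature among the faces rated as friends."""
    return Counter(
        row[feature] for row in rows if row["rating"] >= threshold
    )


# Invented rows; the real sheet had gender, hobby and name style columns.
sample_rows = [
    {"gender": "male", "hobby": "outdoor", "name_style": "new", "rating": 5},
    {"gender": "female", "hobby": "outdoor", "name_style": "old", "rating": 4},
    {"gender": "male", "hobby": "indoor", "name_style": "old", "rating": 2},
    {"gender": "female", "hobby": "indoor", "name_style": "new", "rating": 1},
    {"gender": "male", "hobby": "outdoor", "name_style": "old", "rating": 5},
    {"gender": "female", "hobby": "indoor", "name_style": "new", "rating": 4},
]

for feature in ("gender", "hobby", "name_style"):
    print(feature, dict(feature_counts_among_friends(sample_rows, feature)))
```

Because the data set is balanced, comparing the two counts within the friends group is enough; with an unbalanced set one would have to compare rates against the non-friends group as well.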

Here is a set of graphs we discussed.

So, we learnt that the student for whom we’d made this graph definitely likes to make friends with people who play outdoor games – that is a clear trend. Next we talked about gender – the person makes male friends slightly more often, but this trend is still not completely clear, since the difference between males and females is too small. It needs further investigation. The same goes for the third feature.
The big takeaway was: we can find what kind of people each of us makes friends with! The kids seemed to understand and appreciate this. We told them that they could have done this differently, by interviewing the person and then trying to say whom he will make friends with – but we do it differently, ‘learning by example’: we see whom they make friends with, analyze it to figure out trends and are then able to predict!
Ideally we wanted the kids to make a predictor with a simple point-based system [link Shashank doc], but we didn’t get there. We did, however, go ahead and take the example of the above kid, who had shown outdoor games as the key deciding factor, and used that feature as the predictor – we took her ‘validation’ data out of the envelope and saw how well we did – it was only ok, honestly! But the kids got the concept.
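The single-feature check we did by hand can be sketched as follows. This is not the point-based system from the linked doc; it is a minimal illustration, assuming again that a rating of 4 or 5 means "friend", and every row below is invented.

```python
from collections import Counter


def learn_preferred_value(train_rows, feature, threshold=4):
    """Return the feature value most common among 'train' friends."""
    counts = Counter(
        row[feature] for row in train_rows if row["rating"] >= threshold
    )
    if not counts:
        return None  # no friends in the train data, nothing to learn
    return counts.most_common(1)[0][0]


def validation_accuracy(validation_rows, feature, preferred, threshold=4):
    """Fraction of held-out faces where 'has the preferred value'
    agrees with 'was actually rated as a friend'."""
    correct = 0
    for row in validation_rows:
        predicted_friend = row[feature] == preferred
        actual_friend = row["rating"] >= threshold
        if predicted_friend == actual_friend:
            correct += 1
    return correct / len(validation_rows)


# Invented data for illustration only.
train_rows = [
    {"hobby": "outdoor", "rating": 5},
    {"hobby": "outdoor", "rating": 4},
    {"hobby": "indoor", "rating": 2},
    {"hobby": "indoor", "rating": 5},
    {"hobby": "outdoor", "rating": 1},
    {"hobby": "indoor", "rating": 1},
]
validation_rows = [
    {"hobby": "outdoor", "rating": 5},
    {"hobby": "indoor", "rating": 2},
    {"hobby": "outdoor", "rating": 2},
    {"hobby": "indoor", "rating": 4},
]

preferred_hobby = learn_preferred_value(train_rows, "hobby")
accuracy = validation_accuracy(validation_rows, "hobby", preferred_hobby)
print("preferred hobby:", preferred_hobby, "| validation accuracy:", accuracy)
```

The important habit is the one the kids saw with the envelope: the trend is learnt from the train sheet only, and the held-out faces are used just once to see how well it holds.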

We then got a data release form [link – template PDF] signed by them and explained that they have the right not to have their data publicly disclosed, so we seek their permission and will anonymize their data. One girl opted out. The rest of the data can be found here [link - data].

When I asked with a wink how many from the gathering would like to come over for a part 2 of the data camp the following week, ten of them raised their hands :) A good test for us. See the kids’ blog entries here [link – kids] and mentor experiences here [link- mentor]! Harsh also suggested to them that they should start making data entries of their expenditure and pocket money!
Do note that we made a lot of simplifying assumptions here – correlation vs. causality, balanced sets, no significance testing, etc. Our aim was to lead them to a naiver naïve Bayes. We think this is a fine approach, like the famous Arundhati nyaya.

  • We need 5-6 hours to run this right; with that, we could have done the model too and explained things a lot better.
  • We didn’t have a ‘what next?’: a strong takeaway and continuation.
  • Kids need to know the concept of percentages – we think 7th to 9th might be a better target.
  • Currently, we have 1 mentor for every 2 kids. We need this to be more scalable. Should be possible.
  • We would want to emphasize data science vs. other ways of doing things through some examples: we give them a problem, they try it, and then we show them the data way of doing it.
  • More visualizations to share.
  • Need a resource sheet – we will be sending that to the group that attended.
  • Better sorting and group formation.
Great job, Shashank, for teaming up to initiate this – it was great fun. Thanks Harsh, Bhanu, Nishant, Gursimran (for the photos also!), Parth, Abhishek, Vishal, Samarth – good show.

