Data science for kids!

Two weeks back, we decided to experiment with teaching data science to kids in grades 6 to 9! We think it is important to introduce students to thinking in a data-driven way early on in their lives. Also, kids are way more fun than higher-ed students, so it was an easy choice for us to make!

We sent out a form asking kids to apply to our cool Kids Data Camp - the first in the world?! We thought kids in 5th grade would be too young and those in 10th would be more focused on school academics. We had 18 kids apply, most of them interested in science and math and a few in history/arts.

This weekend, 14 people turned up (there was no selection barring self-selection). This included one 10th grade student and one sophomore undergraduate who tagged along with the group to learn!
Some kids came in early. We put on a YouTube video on Scratch for them. It was fun to discuss it with them, and they related it to Lego right away. We asked the kids to install Scratch at home, make a dancing Shah Rukh Khan (the famous Bollywood actor) in it, and also have him jump from one building to another!

The Ice Breaker
Once all the kids had assembled, we had a quick ice breaker. Parth and Abhishek, interns from IIT Kanpur, divided the students into a senior set and a junior set, then picked one from each set to form a group. A simple way to maximize group success - read here to learn how they did it!



Then Samarth, our intern from Harvard, introduced the idea of data science to the kids. He started with the famous John Snow cholera outbreak example. The kids were very quick: by a show of hands, everyone had seen a Google map. They understood that infected people were clustered around one pump, while other pumps had few cases around them. A couple of questions came up: Why are some dots large and others small? Why did people not go to a pump that was farther away? We answered them. We told them there were three learnings for them:
a. Don't waste water: it wasn't as easily available 100 years back, and it still isn't for many;
b. Don't run away from problems; try to solve them, else they will catch up with you (a couple of them said that their way of solving the problem was to just run away from the city!);
c. You can solve problems with data – here is a medical problem you solved by plotting infected families on a map; you did not need any exposure to biology or medicine to come up with a preliminary inference.



The data exercise
We moved on to the key data set and experiment. Our aim was to give kids an idea of the whole cycle of data science – data collection, data entry/cleaning, feature extraction, visualization and model building (if we could get there; we had presumed we wouldn't, due to a paucity of time) – and also to sensitize them to data security/permission concerns.

We designed the following exercise: every kid gets a set of 48 faces with names and hobbies. The kids had to give a rating of 5 if they would make the person a friend, 1 if not, and could choose other numbers in between. All 7 groups completed the exercise, with one mentor each. Out of each set of 48, we pulled out 16 samples as a validation set :) The 'train' data sets (the remaining 32) were then exchanged among groups.
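
For readers who want to try the same split on a computer, here is a minimal Python sketch. It assumes the 16 validation cards are drawn at random, which the workshop description does not state; the card IDs and the `split_cards` helper are hypothetical names used only for illustration.

```python
import random


def split_cards(cards, n_validation=16, seed=0):
    """Shuffle the rated cards and hold out n_validation of them.

    Assumption: the held-out cards are picked at random.
    Returns (train, validation).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(cards)     # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


cards = list(range(1, 49))  # hypothetical card IDs 1..48
train, validation = split_cards(cards)
print(len(train), len(validation))  # 32 16
```

The validation cards go "into the envelope" and are not looked at until a predictor has been built from the other 32.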



Introducing Features
We then asked the following: if we wanted to know what kind of people Raghav (one of the kids) prefers to make friends with, how could they infer this by going through these sheets? One of the kids suggested that we could look at what kind of games his friends played and tell accordingly. We asked: what else? We then suggested that some kids might prefer making friends with boys and some with girls. We asked a boy whom he prefers to make friends with more often; he said boys. A couple more said neutral.

Then we discussed two more features: we had smiling and neutral faces – would some people befriend smiling people more often? We also had old-style names and new names – would some people prefer to befriend folks with new names? Kids seemed to understand that people could possibly, not necessarily, make choices on this basis. For the workshop we decided to go with three features: gender, hobby and name style.
We used Excel as the platform for all experiments. We had a sheet with features already entered for the data set. The kids had to enter the ratings and check the features. They did find some features wrongly entered and also some ambiguities: is squash indoor or outdoor, is Shilpy a new name or an old name? :)

Question 1: Is this kid a friendly person?
The first task for the kids was to find out whether the person they were analyzing was a friendly person or not: will s/he more often make friends than not? To get this right, kids simply had to count how many people were given each rating: 5, 4, 3, 2 and 1. Some of the kids used filters to do this and others counted manually. They finally made a graph. Here is the first graph we discussed with the whole group, where the red bar depicts percentages and the blue bar depicts the actual number in each bin.

* Original work by kids reproduced

We made two inferences:
  • K (anonymous) was a friendly person: s/he more often makes friends than not.
  • K is clear-headed and a fast decision-maker. S/he doesn't have many maybe/maybe-not cases. S/he either decides to make a person a friend or not.
Then we discussed a couple more graphs of other kids. We phrased our statements positively :) V makes fewer friends, but that is because s/he likes to spend time studying. One group said of another kid that she is confused, since she had many maybe/maybe-not cases. We corrected them: not confused; she takes time to decide whom to befriend, possibly because she thinks deeply about it.
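
The counting behind the Question 1 graph can be sketched in a few lines of Python. This is only an illustration: the ratings below are made up, and `rating_distribution` is a hypothetical helper, not something the kids used (they worked with Excel filters or counted by hand).

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {rating: (count, percent)} for ratings 1..5."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    total = len(ratings)
    return {r: (counts[r], 100.0 * counts[r] / total) for r in range(1, 6)}


# Made-up ratings for ten cards (not workshop data)
ratings = [5, 5, 4, 1, 1, 5, 2, 5, 1, 5]
dist = rating_distribution(ratings)
for r in range(5, 0, -1):
    count, pct = dist[r]
    print(f"rating {r}: {count} cards ({pct:.0f}%)")
```

Many cards at 5 and 1 with few in the middle is the "fast decision-maker" pattern; a bulge at 3 is the "takes time to decide" pattern.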

Question 2: What kind of people does s/he prefer to befriend?
This was fun! In the next exercise, they had to find whether, among the people the person chose to befriend, there were, say, more males than females, and similarly for the other features. (We had created a balanced data set with a 50-50 split for each feature; this gave us the simplification that we did not have to look at the non-friends group.) Again the kids used filters, counted the two values of each input variable and plotted graphs. We had already inserted a template in their Excel sheets for the kids to put in their counts; they then plotted the graphs themselves.

Here is a set of graphs we discussed.

* Original work by kids reproduced

So, we learnt: the student for whom we'd made this graph definitely likes to make friends with people who play outdoor games – this is a clear trend. Next we talked about gender: the person makes male friends slightly more often, but this trend was not completely clear, since the difference between males and females is too small. It needs further investigation. Same for the third feature.
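
Here is a minimal Python sketch of the same "filter and count" step. The rows are made up, the column names are hypothetical, and counting a rating of 4 or 5 as "friend" is an assumption made for this sketch, since the write-up does not say where the cut-off was.

```python
from collections import Counter


def feature_share_among_friends(rows, feature, friend_threshold=4):
    """Percent of befriended people having each value of `feature`.

    Assumption: a rating >= friend_threshold counts as 'friend'.
    """
    friends = [row for row in rows if row["rating"] >= friend_threshold]
    if not friends:
        return {}
    counts = Counter(row[feature] for row in friends)
    return {value: 100.0 * n / len(friends) for value, n in counts.items()}


# Made-up, balanced sample: 4 outdoor / 4 indoor, 4 male / 4 female
rows = [
    {"hobby": "Outdoor", "gender": "M", "rating": 5},
    {"hobby": "Outdoor", "gender": "F", "rating": 5},
    {"hobby": "Outdoor", "gender": "M", "rating": 4},
    {"hobby": "Outdoor", "gender": "F", "rating": 2},
    {"hobby": "Indoor", "gender": "M", "rating": 1},
    {"hobby": "Indoor", "gender": "F", "rating": 5},
    {"hobby": "Indoor", "gender": "M", "rating": 2},
    {"hobby": "Indoor", "gender": "F", "rating": 1},
]

print(feature_share_among_friends(rows, "hobby"))   # {'Outdoor': 75.0, 'Indoor': 25.0}
print(feature_share_among_friends(rows, "gender"))  # {'M': 50.0, 'F': 50.0}
```

In this toy sample the hobby split (75 vs. 25) is a clear trend, while the gender split (50 vs. 50) tells us nothing. With so few rows, even a clear-looking gap should be read with caution.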

The big takeaway was: we can find out what kind of people each of us makes friends with! Kids seemed to understand and appreciate this. We told them that they could have done this differently, by interviewing the person and then trying to say whom he will befriend. But we do it differently – 'learning by example': we see whom they befriend, analyze it to figure out trends and are then able to predict!

Making and validating a super simple predictor
Ideally we wanted kids to make a predictor with a simple point-based system, but we didn't get there. We did, however, go ahead and take the example of the kid just discussed, who had shown outdoor games as the key deciding factor, and used that feature as the predictor. We took her 'validation' data from the envelope and saw how well we did – it was only OK, honestly! But the kids got the concept: they could predict unseen data based on a set of seen data.

[Edit - June 27th: We decided to do a small follow-up session to close the loop and actually build the predictor. We worked with three kids this time (the rest were holidaying around the world, our ambassadors!). We re-did the whole exercise with the kids as a recap and came to the stage where they identified 'features': those that were overrepresented in the 'make friend' set (see table below, second column). For Gamma, there was a difference, but it wasn't as large.


Person   Features (who to make a friend?)   Train accuracy, % (32)   Test accuracy, % (16)
ALPHA    'Outdoor' people                   84.38                    81.25
BETA     'Indoor' people                    81.25                    81.25
GAMMA    With 'new names'                   62.50                    75.00
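
The percentages in the table are easier to picture as card counts. The short Python sketch below converts them back, using the 32 train and 16 test cards given in the column headers; the variable and function names are ours, chosen only for illustration.

```python
N_TRAIN = 32  # cards per kid in the train set
N_TEST = 16   # cards per kid in the envelope (test set)

# (train accuracy %, test accuracy %) as reported in the table
accuracy_pct = {
    "ALPHA": (84.38, 81.25),
    "BETA": (81.25, 81.25),
    "GAMMA": (62.50, 75.00),
}


def correct_count(pct, n_cards):
    """Number of correctly predicted cards behind a rounded percentage."""
    return round(pct * n_cards / 100)


hits = {
    name: (correct_count(train, N_TRAIN), correct_count(test, N_TEST))
    for name, (train, test) in accuracy_pct.items()
}
for name, (train_hits, test_hits) in hits.items():
    print(f"{name}: {train_hits}/{N_TRAIN} train, {test_hits}/{N_TEST} test")
```

This prints 27/32 and 13/16 for Alpha, 26/32 and 13/16 for Beta, and 20/32 and 12/16 for Gamma. With only 16 test cards, one card is worth 6.25 percentage points, so small differences in test accuracy should not be over-read.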


Then the kids made a simple predictor with one feature (such as =IF(C23="Indoor",5,1)). Each one wrote their own after we showed an example. Then they found the accuracy: in an adjoining column they simply typed 1 if the predictor and the actual rating matched and 0 otherwise, then counted the 1s and got the percentage (column 3 in the table above). One kid actually predicted the accuracy, saying it would be the same as the percentage they had calculated to draw the feature graph -- smarter than we think!
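
The Excel steps above translate almost directly into Python. This is a minimal sketch, not the kids' actual sheets: the rows are made up, the column names are hypothetical, and treating a rating of 4 or 5 as "friend" (so that it can be compared with the predictor's 5 or 1) is an assumption made for this sketch.

```python
def predict(hobby, liked_value="Indoor"):
    """Python counterpart of the Excel formula =IF(C23="Indoor",5,1)."""
    return 5 if hobby == liked_value else 1


def accuracy(rows, liked_value="Indoor", friend_threshold=4):
    """Percent of rows where the predictor matches the actual choice.

    Assumption: ratings >= friend_threshold are read as 5 (friend),
    everything else as 1 (not a friend).
    """
    matches = 0
    for row in rows:
        actual = 5 if row["rating"] >= friend_threshold else 1
        # Same as typing 1 for a match and 0 otherwise in the next column
        matches += int(predict(row["hobby"], liked_value) == actual)
    return 100.0 * matches / len(rows)


# Made-up sample: 4 outdoor and 4 indoor people
rows = [
    {"hobby": "Outdoor", "rating": 5},
    {"hobby": "Outdoor", "rating": 5},
    {"hobby": "Outdoor", "rating": 4},
    {"hobby": "Outdoor", "rating": 2},
    {"hobby": "Indoor", "rating": 1},
    {"hobby": "Indoor", "rating": 5},
    {"hobby": "Indoor", "rating": 2},
    {"hobby": "Indoor", "rating": 1},
]

print(accuracy(rows, liked_value="Outdoor"))  # 75.0
```

In this toy sample, 3 of the 4 friends play outdoor games (75%), and the accuracy is also 75%. That is the pattern one kid spotted; it holds here because the sample is balanced and has as many friends as non-friends.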

Then came the real test -- we took out the envelopes containing the unseen test data and manually marked each one we got right (ideally to be done in Excel!). The kids were so happy every time we got it right and low when we didn't :) - we were nervous! Finally, the kids were super-happy seeing the high accuracy each of their predictors had. We asked why Gamma had a low accuracy -- after some struggle one did say: because the difference for the 'new names' feature wasn't high.

Thus the kids' predictors shone, doing much better than the predictors we make all the time!!!
Edit end]

We then got a data release form signed by them and explained that they have the right to keep their data from being publicly disclosed, and that we were seeking their permission – we will anonymize their data. One girl opted out. The rest of the data can be found here.

When one of the authors asked with a wink how many from the gathering would like to come over for a part 2 of the data camp the following week, ten of them raised their hands :) A good test for us. See the kids’ blog entries here and mentor experiences here! Harsh also suggested to them that they should start making data entries of their expenditure and pocket money! Some really interesting suggestions came from kids regarding what they would do with this knowledge.

Do note that we made a lot of simplifying assumptions here – correlation vs. causality, balanced sets, no significance testing, small sample size, etc. Our aim was to lead them to a naiver Naive Bayes. We think this is a fine approach, like the famous Arundhati nyaya.
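
To give a flavour of what a simple point-based predictor could look like, here is a hypothetical Python sketch. It is not what was built in the workshop: the scoring rule (one point per feature that matches the person's preferred value, "friend" at two points or more), the preferences and the example cards are all assumptions made for illustration.

```python
def points(card, preferences):
    """One point for every feature whose value matches the preferred one."""
    return sum(1 for feature, liked in preferences.items() if card[feature] == liked)


def predict_friend(card, preferences, min_points=2):
    """Predict 5 (friend) if the card scores enough points, else 1."""
    return 5 if points(card, preferences) >= min_points else 1


# Hypothetical preferences read off one kid's three feature graphs
preferences = {"hobby": "Outdoor", "gender": "M", "name_style": "New"}

card_a = {"hobby": "Outdoor", "gender": "F", "name_style": "New"}
card_b = {"hobby": "Indoor", "gender": "F", "name_style": "New"}

print(points(card_a, preferences), predict_friend(card_a, preferences))  # 2 5
print(points(card_b, preferences), predict_friend(card_b, preferences))  # 1 1
```

A real Naive Bayes model would weight each feature by how strongly it separates friends from non-friends instead of giving every feature exactly one point.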



Learnings:
  • We need 5-6 hours to run this right; with that much time we would have done the model too and explained things a lot better.
  • We didn't have a what-next: a strong takeaway, a resource sheet and continuity.
  • Kids need to know the concept of percentages – we think 7th to 9th might be a better target.
  • Currently, we have 1 mentor for every 2 kids. We need this to be more scalable. Should be possible.
  • We would want to emphasize, through some examples, how data science differs from other ways of doing things: we give them a problem, they try it, and then we show them the data way of doing it.
  • More visualizations to share.
And of course, this is just the TIP OF THE ICEBERG!

Thanks Harsh, Bhanu, Nishant, Gursimran (for the photos also!), Parth, Abhishek, Vishal, Samarth – good show. Thanks Una-May for the encouragement and helpful ideas!



-Varun & Shashank
2015


18 comments:

  1. I am very impressed with this data camp for children. Appreciate all the efforts by the team members... Good Job.

  2. Great way to catch them early to think the data driven way. Good work.

  3. Nice initiative, would like to be part of it.

  4. This is awesome guys, what a clever way to share the learnings and pass the knowledge to the future of our world!! Encouraging them how to use data to help
    People and society - fantastic job!!!
    Thank you so much - from a fellow data scientist, I will do the similarly thing for my son's class ;)

    Replies
    1. Thanks a lot, Angela. Have you managed to run this in your son's class? Would be great if you could let us know how it went.

  5. i would like to conduct it too...Can you share your contact so that i can get a better understanding to conduct it ..my email is : ggaurika@gmail.com

  6. Hello, if you plan on conducting such a workshop in Delhi I would be interested

  7. Would love to send my kids when you do this next

  8. I'm interested in conducting this along with some programming exercises that i have planned. Could you share more details. Thanks in advance!

    Replies
    1. Sure, Soma. We'd be happy to get on a call. Will contact you through your email id.

  9. This is quite helpful for kids.
    I would like to teach these to a bunch of kids in our neighborhood.
    Could we collaborate, pls?
    I can be reached at Ramya.krishna@hotmail.com
    Looking forward to hearing from you.

    Replies
    1. Thanks, Ramya! We'll get in touch with you soon

  10. For those who want to get in touch with us regarding these tutorials, please write us an email on the ID mentioned here -
    http://www.datasciencekids.org/p/about-am-research.html

    We'll get back to you right away.

    Thanks!
