We thought we would experiment with teaching some data science to kids
in classes 6th-9th! We think it is important to introduce students very
early to thinking in a data-driven way – and besides, kids are way more fun to
deal with than college students!
So we sent out a form two weeks back and asked kids in classes 6-8th
to apply. We thought 5th was too young and 9th/10th
are more focused on school academics. We had 18 people apply, most of them
interested in science and math and a few in history/arts (see pie chart).
Today, 14 people turned up (there was no selection,
only self-selection). This included one 10th grade student and one sophomore
undergraduate who tagged along with the group to learn!
Some kids came in early. We put on a YouTube video on Scratch for them. It was fun to discuss it with them and they related
it to Lego. We asked the kids to install Scratch at home – and make a dancing Aamir Khan
(famous Bollywood movie actor) on it and also have him jump around from one building to
another!
Once all the kids had assembled, we had a quick ice breaker. Parth and
Abhishek, interns from IIT Kanpur, divided the senior and junior students into two
sets and picked one from each to form a group. This was to maximize group
success. They had interesting ways to do this - read here!
Then Samarth, our intern from Harvard, introduced the idea
of data science to the kids. He started with the famous John Snow
cholera outbreak example. The kids were very quick: by a show of hands, everyone
had seen a Google map. They understood that infected people were clustering
around one pump and that there were other, vacant pumps. They had a couple of questions – why
are some dots large and others small? Why did someone not go to a pump which was
farther away? We told them there were three learnings for them:
a. Don’t waste water - it wasn’t as easily available 100 years back and still isn’t for many;
b. Don’t run away from problems; try to solve them, else they will catch up with you (a couple of them said that their way of solving this was to just run away from the city!);
c. You can solve problems with data – here is a medical problem you solved by plotting infected families on a map; you did not need any exposure to biology or medicine to come up with a preliminary inference.
We moved on and started with the key data set and
experiment. Our aim was to give the kids an idea of the whole cycle of data science
– data collection, data entry/cleaning, feature extraction, visualization and
model building (if we could get there; we had presumed we wouldn’t due to a
paucity of time) – and also to sensitize them to data security/permissions concerns.
The exercise we designed was: every
kid gets a set of 48 faces with names and hobbies. The kids had to
rate a face 5 if they would make the person a friend, 1 if not, and could choose other
numbers in between. All 7 groups completed the exercise with one mentor
each. Out of these we pulled 16 samples out as a validation set :) The ‘train’ data sets
were then exchanged among groups.
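For readers who want to try the hold-out step on a computer, here is a minimal Python sketch of setting 16 of the 48 rated faces aside. In the workshop we did this by hand with paper sheets; the random draw, the seed and the function name `split_train_validation` are assumptions made for illustration only.

```python
import random


def split_train_validation(items, n_validation, seed=0):
    """Set aside n_validation items; the rest form the 'train' set.

    The random draw and the fixed seed are assumptions for this sketch;
    the original samples were pulled out by hand.
    """
    rng = random.Random(seed)
    validation_idx = set(rng.sample(range(len(items)), n_validation))
    train = [x for i, x in enumerate(items) if i not in validation_idx]
    validation = [x for i, x in enumerate(items) if i in validation_idx]
    return train, validation


# Hypothetical face ids 1..48 standing in for the printed face sheets.
faces = list(range(1, 49))
train, validation = split_train_validation(faces, n_validation=16)
print(len(train), "train faces and", len(validation), "validation faces")
```

The validation faces stay "in the envelope" until the very end, so the prediction is checked on ratings nobody has analyzed.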
We then asked the groups: from these sheets, if we wanted
to know what kind of people Raghav (one of the kids) makes friends with, what would
they look at? How would they come up with a solution? One of the kids suggested
that we could look at what kind of games his friends played and then tell
accordingly. We asked what else, and then introduced the idea that some
of the kids might make friends more with boys and some more with girls; we asked a boy whom he
makes friends with more often and he said boys; a couple more said neutral. And then
we discussed two more features: we had smiling and neutral faces – would some
people make friends with smiling people more often? And also, we had old-style names
and new names – would some people like to make friends with new-name folks more
often? The kids seemed to understand that people could possibly, not
necessarily, make choices on this basis. For the workshop we decided to go with
three features: gender, hobby and name style.
The platform we were using was Excel. We had a sheet with
features already entered for the data set. The kids had to enter the ratings
and check the features. The kids did find some features wrongly entered and
also some ambiguities: is squash indoor or outdoor, is Shilpy a new name or an
old name? :)
The first task for the kids was to find out whether the person they
were analyzing was a friendly person or not. To get this right, they
simply had to count how many people the kid had marked as 5, 4, 3, 2 and 1.
Some of the kids used filters to do this and others counted manually. They
finally made a graph. Here is the first graph we discussed with the kids, where
the red bar depicts percentages and the blue bar depicts the actual number in
each bin.
We made two inferences:
- K (anonymous) was a friendly person: s/he more often makes friends than not.
- K is clear-headed and a fast decision-maker. S/he doesn’t have many maybe/maybe not cases. S/he either decides to make a person a friend or not.
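The counting behind this first graph (done in Excel with filters or by hand) can also be sketched in a few lines of Python. This is only an illustration: the ratings below are made up, not workshop data, and `rating_distribution` is a hypothetical helper name.

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {rating: (count, percent)} for every rating from 1 to 5."""
    counts = Counter(ratings)
    total = len(ratings)
    distribution = {}
    for rating in range(1, 6):
        count = counts.get(rating, 0)
        percent = 100.0 * count / total if total else 0.0
        distribution[rating] = (count, percent)
    return distribution


# Made-up ratings for one kid (NOT real workshop data).
example_ratings = [5, 5, 4, 1, 1, 5, 2, 5, 1, 5]
for rating, (count, percent) in rating_distribution(example_ratings).items():
    print(f"rating {rating}: {count} faces ({percent:.0f}%)")
```

Many 5s and 1s with few 3s is the "fast decision-maker" pattern; a bulge at 3 is the "takes time to decide" pattern.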
Then we discussed a couple more graphs of other kids. We
phrased the statements positively :)
V makes fewer friends, but that is because s/he likes to spend time studying.
One group said she is confused since she had many maybe/maybe not cases: we said
not confused, she takes time to decide whom to make a friend, because she
may be thinking deeply about it.
This was fun! Our next exercise was for them to find whether, among
the people the person chose to befriend, there were, say, more males than
females, and similarly for the other features. (Footnote: we had created a
balanced data set with 50-50 of each feature type; thanks to this
simplification we did not have to look at the non-friends group.) Again the kids
used filters, counted the input variables of the two types and plotted
graphs. We had already inserted a template in their Excel sheets for the kids to put in their
counts; they then plotted the graphs themselves.
Here is a set of graphs we discussed.
So, we learnt – the student for whom we’d made this graph definitely
likes to make friends with people who play outdoor games – that is a clear
trend. Next we talked about gender – the person makes male friends slightly
more often; but this trend is still not completely clear, since the difference
between males and females is too small. It needs further investigation. Same
for the third feature.
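The filter-and-count step for features can be sketched the same way. The rows below are invented and balanced 50-50 on every feature, as in our sheets. The rule that a rating of 4 or 5 means "would befriend" is an assumption of this sketch, and the field names are hypothetical.

```python
from collections import Counter

FRIEND_THRESHOLD = 4  # assumption: a rating of 4 or 5 counts as "would befriend"


def feature_counts_among_friends(rows, feature):
    """Count each value of `feature` among the faces rated as friends."""
    friends = [row for row in rows if row["rating"] >= FRIEND_THRESHOLD]
    return Counter(row[feature] for row in friends)


# Invented rows (NOT workshop data), balanced 50-50 on every feature.
rows = [
    {"gender": "M", "hobby": "outdoor", "name_style": "new", "rating": 5},
    {"gender": "F", "hobby": "outdoor", "name_style": "old", "rating": 4},
    {"gender": "M", "hobby": "indoor", "name_style": "old", "rating": 2},
    {"gender": "F", "hobby": "indoor", "name_style": "new", "rating": 1},
    {"gender": "M", "hobby": "outdoor", "name_style": "old", "rating": 5},
    {"gender": "F", "hobby": "outdoor", "name_style": "new", "rating": 4},
    {"gender": "M", "hobby": "indoor", "name_style": "new", "rating": 3},
    {"gender": "F", "hobby": "indoor", "name_style": "old", "rating": 4},
]

for feature in ("gender", "hobby", "name_style"):
    print(feature, dict(feature_counts_among_friends(rows, feature)))
```

Because every feature is split 50-50 in the full set, a lopsided count among friends (like outdoor vs. indoor here) hints at a preference, while a near-even count (like gender here) does not tell us much.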
The big takeaway was: we can find what kind of people each
of us makes friends with! The kids seemed to understand and appreciate this. We told
them that they could have done this differently, by interviewing the person and
then trying to say who he will make friends with – but we do it differently –
‘learning by example’: we see who they make friends with, analyze it to figure out
trends and are then able to predict!
Ideally we wanted the kids to make a predictor with a simple
point-based system, but we didn’t get there. We did, however,
go ahead and take the example of the above kid, who had shown outdoor games as
the key deciding factor, and considered that feature as the predictor – we took
out her ‘validation’ data from the envelope and saw how well we did – it was
only ok, honestly! But the kids got the concept.
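Checking a one-feature predictor against the validation sheets boils down to counting agreements. Here is a small sketch under stated assumptions: a rating of 4 or 5 means "friend", the validation rows are invented, and the function names are hypothetical. The accuracy it prints says nothing about what we actually observed.

```python
def predict_friend(face, feature, liked_value):
    """Predict 'friend' whenever the face has the liked feature value."""
    return face[feature] == liked_value


def single_feature_accuracy(rows, feature, liked_value, friend_threshold=4):
    """Fraction of rows where the prediction matches the actual rating."""
    if not rows:
        return 0.0
    correct = sum(
        predict_friend(row, feature, liked_value)
        == (row["rating"] >= friend_threshold)
        for row in rows
    )
    return correct / len(rows)


# Invented validation rows (NOT workshop data).
validation_rows = [
    {"hobby": "outdoor", "rating": 5},  # predicted friend, is a friend
    {"hobby": "outdoor", "rating": 2},  # predicted friend, is not a friend
    {"hobby": "indoor", "rating": 1},   # predicted not a friend, correct
    {"hobby": "indoor", "rating": 2},   # predicted not a friend, correct
]

score = single_feature_accuracy(validation_rows, "hobby", "outdoor")
print(f"accuracy on validation rows: {score:.2f}")
```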
We then got a data release form signed
by them and explained that they have the right to keep their data from being
publicly disclosed, so we seek their permission – and we will anonymize their data.
One girl opted out. The rest of the data can be found here.
When I asked with a wink how many from the gathering would
like to come over for a part 2 of the data camp the following week, ten of them
raised their hands :) A good test for us. See the kids’ blog entries here and mentor
experiences here! Harsh also suggested to them that they should
start making data entries of their expenditure and pocket money!
Do note that we were using a lot of assumptions to simplify
this – correlation vs. causality, balanced sets, no significance testing, etc.
Our aim was to lead them to a naiver Naive Bayes. We think this is a fine
approach, like the famous Arundhati nyaya.
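We never got to build the point-based predictor in the workshop, so the following is only one possible reading of a "naiver Naive Bayes", not what the kids did. For each feature, the value seen more often among a kid's friends earns one point, and a face is predicted to be a friend if it collects at least 2 of 3 points. The threshold values, the data and all names are assumptions for illustration.

```python
from collections import Counter

FEATURES = ("gender", "hobby", "name_style")


def learn_preferred_values(train_rows, friend_threshold=4):
    """For each feature, find the value most common among befriended faces."""
    friends = [r for r in train_rows if r["rating"] >= friend_threshold]
    preferred = {}
    for feature in FEATURES:
        counts = Counter(r[feature] for r in friends)
        if counts:
            preferred[feature] = counts.most_common(1)[0][0]
    return preferred


def points(face, preferred):
    """One point for every feature where the face has the preferred value."""
    return sum(1 for feature, value in preferred.items() if face[feature] == value)


def predict_friend(face, preferred, min_points=2):
    """Assumed rule: predict 'friend' at 2 or more points out of 3."""
    return points(face, preferred) >= min_points


# Invented training rows (NOT workshop data).
train_rows = [
    {"gender": "M", "hobby": "outdoor", "name_style": "new", "rating": 5},
    {"gender": "F", "hobby": "outdoor", "name_style": "old", "rating": 4},
    {"gender": "M", "hobby": "indoor", "name_style": "old", "rating": 2},
    {"gender": "F", "hobby": "indoor", "name_style": "new", "rating": 1},
    {"gender": "M", "hobby": "outdoor", "name_style": "old", "rating": 5},
    {"gender": "F", "hobby": "outdoor", "name_style": "new", "rating": 4},
    {"gender": "M", "hobby": "indoor", "name_style": "new", "rating": 3},
    {"gender": "F", "hobby": "indoor", "name_style": "old", "rating": 4},
]

preferred = learn_preferred_values(train_rows)
new_face = {"gender": "F", "hobby": "outdoor", "name_style": "new"}
print(preferred, points(new_face, preferred), predict_friend(new_face, preferred))
```

Unlike a real Naive Bayes, this ignores how strong each trend is: a 4-to-1 split and a 3-to-2 split both earn exactly one point, which keeps it simple enough for pen and paper.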
Learnings:
- We need 5-6 hours to run this right; with that much time we would have done the model too and explained things a lot better.
- We didn’t have a what-next: a strong takeaway and continuation.
- Kids need to know the concept of percentages – we think 7th to 9th might be a better target.
- Currently, we have 1 mentor for every 2 kids. We need this to be more scalable. Should be possible.
- Would want to emphasize explaining data science vs. other ways of doing things through some examples. We give them a problem, they try it and then we give the data way of doing it.
- More visualizations to share.
- Need a resource sheet – we will be sending that to the group that attended.
- Better sorting and group formation.
-Varun