
Monday, June 15, 2015

Data


We've put together the resources we used and the data we collected at our camp in this GitHub repository.

There's also a YouTube video (click here) of us walking through the entire exercise. This will help in understanding some of the nuances and will give you a sense of the entire workflow of this exercise.

The folder contains the following documents - 
  • Consent Form.pdf : The form we used to demonstrate the idea of data rights and to obtain the participants' consent to anonymously share the event's data.
  • Scientist Invention.pdf : The list of cards used for the ice-breaking session.
  • ss_gf1.png, ss_gf2.png : Screenshots of the Google Form used for data camp registration. You can open and view the original here.
  • flash_cards.pdf : The list of 56 cards we used to gather the participants' rating on their 'friendship propensity'.
  • graded datasets.rar : Excel files for each of the 7 groups that participated, each containing the grades given by the participants. This is for all you data-crazy enthusiasts out there.
  • Sample.Worksheet.xlsx : A sample worksheet template we used.
  • Kids.Worksheet.xlsx : A sample worksheet with pre-filled data on the input features and formulas to calculate the final result of the classifier we had designed.

Creating a dataset

We reflect here on the process we followed to come up with the data-set for the event. We hope these notes give you a framework for thinking about the nuances that need to be carefully thought through while selecting a data-set for such an exercise. In the process, we were able to distill a clear idea of what it ought to look like and what it ought not to. Here were a couple of considerations –
  • We wanted the kids to work on the full life cycle of data analysis – data collection; data entry and cleaning; feature identification; visualization and feature selection; a simple model-building exercise; and finally validating the model on some unseen data. This was important; we didn't want the kid to just spend time staring at a bunch of videos or looking at an instructor working her way through data. It was important that she get her hands dirty in understanding the data for herself.
  • We wanted this data-set to be easily relatable to kids of this age group – something they would see immediate utility for. Not oil or stock prices!
Our initial realization was that for a hands-on demonstration, it would be best to work with some simple supervised learning example. Unsupervised learning was a no-no. Keeping this broad framework in mind, we went ahead and dished out the following – 
  • Grocery consumption prediction – though this seemed good from a relatable angle, we weren’t sure about how hands-on this could get.
  • Weather-related predictions – We thought of predicting the weather by looking at the clothes a person was wearing. We could give out images of people wearing various clothes and predict what the weather was at the time the picture was taken. This probably would’ve worked best given the simplicity of the features (shirt vs. jacket; trousers vs. shorts). But what if this were too simple? The gathering could simply brush this aside and say “hey, stop fooling us with jargon. We know how to predict the weather by looking at clothes. What’s new here?”
  • Movie/book prediction – the crowd weren't bookworms – we had a sense of their taste in books. We felt taste in movies could be fairly noisy. We tried a quick experiment and had Varun's young nieces list their favorite movies. The responses were fairly noisy and, worse, the movies they listed as not-good were ones they hadn't seen. This was turning into a one-class classification problem where only the "liked" set was known – nope, we weren't heading down that road.
  • A friend predictor - The idea was to show several images of boys'/girls' faces with a brief description of their hobbies, and ask each kid in our gathering independently whether they would befriend the person in the picture. We would then build a model for every kid which would predict his/her "friend-identifier". This promised to be a fun exercise where the gathering could probably learn more about the rest of the group through such an analysis. We could see that this was easily relatable and would be entertaining in addition to being a real social experiment!
We decided to go ahead with the friend predictor exercise [more information here]. While deciding this, we asked ourselves whether every kid in our camp should work on the same data-set or whether we should have different groups working on different exercises. We thought it would get hard to manage and stuck to working on one exercise.

So let’s clearly lay out the details -
  • Each kid gets to see N faces and rates them on a scale of 1-5 based on how likely s/he is to befriend the people shown in the images. A rating of 1 means s/he won't befriend the person at all, and 5 means s/he will surely make the person a friend. Once the N images are rated on this "friendship propensity" scale, we build a classifier to predict these ratings. Our aim was to proxy a Naïve Bayes approach in training a classifier. We deliberately went with discrete/categorical variables so that the exercise involved only counting and visualizing simple bar graphs instead of worrying about means, standard deviations and continuous-valued distributions!
  • We focused on four features for this exercise – gender, name, face and hobby. Each feature could take two values – male/female; old-sounding/new-sounding name; smiling/serious face; and sports/non-sports hobby. We did a quick pilot again, thanks to Varun's nieces, and saw that the hobby correlated best with the final output (0.6 points). This gave us confidence in the features we expected the kids to implicitly use, and we also felt this would be a good case to show poorly performing features.
  • In our pilot, we also realized that one of the nieces' responses looked different the first time around and were on par with her known "friend-quotient" only when she repeated the exercise a second time. We hence added 8-10 additional images as a practice set, making it a 56-point data set in total.
  • On finalizing the features, we had to think about their distribution in our set. We decided it would be best if each feature took two values (0/1), and we had four features in all – 16 unique permutations in the data-set. This created a balanced set, which meant we could trust more (why not completely? We'll leave you to think about that) an inference/visualization done with one feature at a time, since all the features were evenly distributed. (A short sketch of how such a balanced set can be constructed follows this list.)
  • Because the data-set was balanced, each category of a feature made up 50% of the set (say, 50% males and 50% females). As a result, when the kids looked at the "friend" class (those they'd rated 4 or 5), they only had to check whether one category of a feature was over-represented to see if they had a systematic preference. Had we not done this, we would first have had to look at the distribution of a feature in the entire training set and then compare it with the distribution in the "friend" or the "non-friend" class to understand whether the kid had a systematic inclination towards a feature when deciding whom to befriend.
  • Another challenge was that we were collecting a 5-valued output, but finally wanted to do a two-class classification. Would kids make this leap in understanding? It turned out to be pretty easy for them - it was intuitive for them to consider those they'd rated 4 and 5 as folks they'd befriend. (Each task here is a science in itself!)
  • We also had to have multiple points for each such permutation – we kept that number at three, giving us 48 points in our data-set. We decided to use 32 points to train our model and hold out 16 to validate it. We were cheating a little in how we chose our validation set, but that was to maximize our chances of success in building a good predictor!
  • When brainstorming about this with Prof. Una-May from the ALFA group at CSAIL, MIT, she suggested we use a digital platform to collect the kids' responses. This would have been a very nice layer atop our existing framework: the kids could play a 'game' at length before arriving at the camp, and we'd have their data digitized and prepped for them to start playing with! We couldn't pull this off given the time constraint, but we plan to tinker with it going forward. One thing that would have been lost by using such an app is that the kid would not have entered the information herself into an Excel sheet and so wouldn't have gotten first-hand exposure to data entry.
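To make the balanced-set idea concrete, here is a minimal sketch (in Python, purely for illustration – the camp itself worked with printed flash cards and Excel) of how a 48-point balanced data-set over four binary features can be generated and split into a 32-point training set and a 16-point hold-out set. The feature names and the split below are our illustrative assumptions, not the exact procedure behind flash_cards.pdf.

import itertools
import random

# Four binary features; each of the 16 permutations appears three times,
# giving the balanced 48-point data-set described above.
FEATURES = ["gender", "name_style", "expression", "hobby"]

def build_balanced_dataset(copies_per_permutation=3, seed=0):
    rows = []
    for values in itertools.product([0, 1], repeat=len(FEATURES)):
        for _ in range(copies_per_permutation):
            rows.append(dict(zip(FEATURES, values)))
    random.Random(seed).shuffle(rows)
    return rows

dataset = build_balanced_dataset()              # 16 x 3 = 48 points
train, validation = dataset[:32], dataset[32:]  # 32 to train on, 16 held out

# Sanity check: every feature is evenly split (24 zeros and 24 ones) overall.
for f in FEATURES:
    assert sum(row[f] for row in dataset) == 24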
It turned out quite well in the end. We got to see varied responses, and the features did predict well. Male/female was a stark differentiator for some, while indoor/outdoor activities was the expected differentiator for the rest. In all, the data-set worked! Let us know if you can better this! And this is just the tip of the iceberg of what can be done! While we leave you here, we ask: what is the Scratch of data science? Anyone?

Steps to replicate the experiment

These notes will help you replicate the data exercise we conducted. The Google Form we used for data camp registration is here. Let us know in case you have any queries – you could write to varun -AT- aspiringminds.com or shashank.srikant -AT- aspiringminds.com.

Collecting data
  1. Download the PDF document flash_cards.pdf and get it printed on A4 sheets. Each page of the document contains four flash cards which kids will rate.
  2. The first two pages of the document (eight flash cards) are for practice and can be excluded from the analysis.
  3. Each flashcard has a photo, a name and an interest written on it.
  4. Divide the kids into teams of two or three and give one set of flashcards to each kid in the team.
  5. Kids will rate these flashcards on a scale of 1 to 5 depending on the propensity with which they want to befriend the 'potential friend' on the flashcard, given the information on the card.
  6. Make sure you have some mechanism to identify who rated a given document/bunch of flash cards. One idea is to ask kids to write their names at the top of the flashcard document.
  7. Once all teams have finished rating, separate out the 'practice flash cards' (the first two sheets) and the 'validation flash cards' which will be used for the validation set (the last four sheets in our case – this depends on how you have arranged your flashcards; we went by the attached Excel) from the 'training flash cards' which will be used for model building. Put the practice and validation sets aside in an envelope and work with the training set.
Data entry
  1. Get all teams to a common place, assign each a mentor, and swap training flashcards between teams so that no one gets their own. This way they can analyze another team's data and learn how kids in other teams make choices about making friends.
  2. Download the two Excel files attached with this post (titled 'worksheet.xls' and 'worksheet-sample.xls'). The file 'worksheet-sample.xls' contains sample data filled in by one of the teams in our experiment, while the other file, 'worksheet.xls', is an empty template that can be distributed to teams in a new experiment.
  3. The Excel files contain 4 sheets – 'All Data' contains all the data along with features for each flash card in the document 'flash_cards.pdf'; 'Model' contains the prior and likelihood information for each feature; 'Training Data' contains the training data along with predictions based on the model; and finally 'Visualization' contains information on the visualizations the kids can make. Each team works with one Excel file.
  4. To keep things simple, we have only four binary features, viz. boy/girl (column C), happy/serious (column E – we did not use this feature in our experiment), traditional/modern name (column B) and indoor/outdoor activity (column D).
  5. Explain to the kids in detail the importance and role of features in data science. Get them to come up with features of their own for this experiment. Then explain the features present in the 'Training Data' sheet of the Excel file.
  6. Mentors will ask their team to feed the ratings (on the two sets of flash cards) into columns F and G of the Excel file – if there's a paucity of time, you could work with just one of the two kids' entries. The team should also confirm that each row of features makes sense and corresponds to the respective flashcard; this will also help them understand the features better.
  7. To convert this exercise from a multi-class (1-5) to a two-class (0/1) classification problem, columns H and I have Excel formulas that derive binary ratings from the five-valued ratings in columns F and G.
  8. Columns J, K and L hold the points (fixed scores) used by the Naïve Bayes-like classifier we designed [more information in the following section].
  9. Columns P and Q have formulas for computing the classifier's output probabilities for the two classes, viz. 'friend' and 'not friend'. Column R is the final verdict ('friend' or 'not friend') on the training set. (A short sketch mirroring these computations follows this list.)
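We don't reproduce the exact Excel formulas here, but the following minimal sketch (in Python, for illustration only – the worksheet remains the source of truth) shows the kind of computation these columns perform in our counting-based, Naïve Bayes-like scheme: binarize the 1-5 ratings, count how often each feature value occurs in the 'friend' and 'not friend' classes, and combine those counts into a per-class score. The function names and the exact way the counts are combined are assumptions made for this sketch.

from collections import Counter

FEATURES = ["gender", "name_style", "expression", "hobby"]

def binarize(rating):
    # Columns H/I: a rating of 4 or 5 counts as 'friend' (1), anything else as 'not friend' (0).
    return 1 if rating >= 4 else 0

def count_feature_values(rows, labels):
    # For each class, count how often every (feature, value) pair occurs in the training data.
    counts = {0: Counter(), 1: Counter()}
    for row, label in zip(rows, labels):
        for f in FEATURES:
            counts[label][(f, row[f])] += 1
    return counts

def verdict(row, counts, labels):
    # Stand-in for the points/probability columns (J-Q) and the final verdict column (R):
    # each class scores the smoothed fraction of its training points sharing the card's
    # feature values, summed over features; the higher-scoring class wins.
    totals = {c: labels.count(c) for c in (0, 1)}
    scores = {c: sum((counts[c][(f, row[f])] + 1) / (totals[c] + 2) for f in FEATURES)
              for c in (0, 1)}
    return max(scores, key=scores.get)

# Example: train on two binarized ratings and classify one new card.
rows = [{"gender": 1, "name_style": 0, "expression": 1, "hobby": 0},
        {"gender": 0, "name_style": 1, "expression": 0, "hobby": 1}]
labels = [binarize(5), binarize(2)]      # -> [1, 0]
counts = count_feature_values(rows, labels)
print(verdict({"gender": 1, "name_style": 0, "expression": 1, "hobby": 1}, counts, labels))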
Visualization
  1. In the visualization sheet, mentors will help each team count and input all the relevant fields for one of the sets of flashcards. The inputs expected are –
    1. Output visualizations
      1. Distribution of ratings (classes 1-5)
    2. Filtered visualizations (from the binary 'friend' class, column H or I). The reason for making visualizations within the friends category is to identify a variable that differentiates this output well. For instance, if 80% of those marked as friends were male, the data would suggest that the person has a propensity to befriend males more than females.
      1. Distribution of gender among friends
      2. Distribution of names among friends
      3. Distribution of activity (hobby) among friends.
  2. On completing this exercise, they can plot this data using bar or pie charts and make inferences.
  3. Similarly, teams can make such visualizations and inferences for the other set of flashcards as well (a short sketch of these counts in code follows this list).
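If you'd like to reproduce these counts outside Excel, here is a small illustrative sketch (Python with matplotlib; the variable and function names are our own, not part of the camp materials) of the two kinds of visualizations described above – the distribution of ratings and the distribution of a feature among the 'friends'.

from collections import Counter
import matplotlib.pyplot as plt

def rating_distribution(ratings):
    # Output visualization: how many cards received each rating from 1 to 5.
    return Counter(ratings)

def feature_distribution_among_friends(rows, ratings, feature):
    # Filtered visualization: distribution of one feature among cards rated 4 or 5.
    friend_rows = [row for row, r in zip(rows, ratings) if r >= 4]
    return Counter(row[feature] for row in friend_rows)

def plot_counts(counts, title):
    # A simple bar chart, much like the charts the teams make in the Excel sheet.
    plt.bar([str(k) for k in counts], [counts[k] for k in counts])
    plt.title(title)
    plt.show()

# e.g. plot_counts(feature_distribution_among_friends(rows, ratings, "gender"),
#                  "Gender among friends")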
Model building
  1. Based on what is seen in the above visualization exercise, teams need to decide which of the three features would best predict friendship. We suggest picking a feature that is a clear differentiator, i.e. one with more than a 60-40 split within the friends category [see the section above, where this information is learnt by visualizing the data]. If no feature is clearly able to differentiate, we suggest picking any one feature while explaining to the kid the consequence of doing so.
  2. On picking a feature and its differentiating value (say the visualizations revealed that 80% of the friends were male; the feature of interest is then gender and the feature value is male), the idea is to assign a score of 5 to every entry in the data that has that feature value, and 1 otherwise.
  3. Get the kid to use the Excel formula =IF(relevantcell="feature value",5,1); help her understand it, play with it, and teach her to drag it so as to apply it to every row of the entered data.
  4. Once the 1/5 column has been created, compute the classification accuracy by seeing how many of the points predicted as 5 were indeed friends. This number should match what was seen in the visualizations for this feature in the section above. (A short sketch of this computation follows this list.)
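For anyone replicating this step outside Excel, here is an illustrative sketch in Python of picking the differentiating feature and scoring with the 5/1 rule; the 60-40 threshold and the accuracy check follow the description above, while the function names are our own.

from collections import Counter

def pick_differentiating_feature(rows, ratings, features, threshold=0.6):
    # Pick the feature whose majority value among 'friends' clears the 60-40 split.
    friend_rows = [row for row, r in zip(rows, ratings) if r >= 4]
    if not friend_rows:
        return None
    best = None
    for f in features:
        value, count = Counter(row[f] for row in friend_rows).most_common(1)[0]
        share = count / len(friend_rows)
        if share >= threshold and (best is None or share > best[2]):
            best = (f, value, share)
    return best  # (feature, differentiating value, share among friends) or None

def rule_score(row, feature, value):
    # The =IF(relevantcell="feature value",5,1) rule from the Excel sheet.
    return 5 if row[feature] == value else 1

def accuracy(rows, ratings, feature, value):
    # How often the rule's verdict (a score of 5 means 'friend') agrees with the kid's rating.
    hits = sum((rule_score(row, feature, value) == 5) == (r >= 4)
               for row, r in zip(rows, ratings))
    return hits / len(rows)

Calling accuracy() on the 32 training cards gives the train accuracy, and calling it on the 16 held-out cards gives the test accuracy reported in the table in the next section.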
Validation
  1. Set up the atmosphere as though magic were to follow :)
  2. Get teams to come forward and enter the feature information for the sheets in the validation set (which were set aside in an envelope; see the section on collecting data).
  3. Once the information is filled in, use the same predictor that was chosen on the training set and compute the 1/5 column.
  4. Report the classification accuracies in a table similar to the one shown below.



Person | Who to make a friend? (feature) | Train accuracy (32) | Test accuracy (16)
ALPHA  | 'Outdoor' people                | 84.38               | 81.25
BETA   | 'Indoor' people                 | 81.25               | 81.25
GAMMA  | With 'new names'                | 62.50               | 75.00

Once validated, get everyone to appreciate the day's learnings by going through a quick recap of the steps.

We would like to hear about your experiences and inferences from replicating the experiment – write to us at the email addresses mentioned above.

Cheers!

Friday, June 12, 2015

Our work is accepted at SIGCSE 2017!

We're pleased to share that our work has been accepted at SIGCSE 2017. ACM SIGCSE is a premier conference focused on computer science education. Started in 1970, it's a fun conference where, each year, tons of new ideas get published and discussed by educators and computer scientists. We believed that the principles we followed in designing our tutorials should be shared with this community. There were a lot of nuances we considered to ensure that the tutorials ended up fun, hands-on and engaging without cognitively burdening the kids. This was a new way to teach kids a topic that is otherwise taught in undergraduate courses.

This year's submission process was fairly elaborate: five to seven reviewers went through each submission. A total of 300 papers were submitted this year, of which nearly 100 were selected. The reviewers were extremely pleased with our work, agreeing that this was definitely a template which educators could try at their schools.

A PDF of the paper can be downloaded from this page - http://research.aspiringminds.com/publications

The reviewers' detailed comments are reproduced below -

----------------------- REVIEW 1 ---------------------
PAPER: 68
TITLE: Introducing Data Science to School Kids
AUTHORS: Shashank Srikant and Varun Aggarwal

OVERALL EVALUATION: 5 (Clear Accept: Content, presentation, and writing meet professional norms; improvements may be advisable but acceptable as is)

----------- Summary -----------
Data Science is an important field of study as industry increasingly need to employ Data Scientist. Marketing and creating awareness of Data Science as a career opportunity to students and scholars is important. This paper proposes and evaluates a tutorial that is presented to scholars/students completing Grades 5-9 that allows then to gain hands-on exposure to the field of Data Science.

----------- Strengths -----------
The idea is novice. Other institutions can use this tutorial as a basis to develop similar interventions to expose young adults to the exciting field of Data Science and possible career opportunities. The tutorial is well motivated and discussed, allowing young scholars/students to identify possible friends by exposing them to the complete process a data scientist will generally follow and finally make decisions.

----------- OVERALL EVALUATION -----------
The idea and tutorial is new and provides the basis for similar studies at other institutions, promoting careers in Computer Science. Data Science is a new career opportunity and Higher Education Institutions are increasingly implementing Data Science programmes. The tutorial is a new novice approach and presents a well thought-through case study. A detailed discussion of the tutorial is provided, including motivation for certain decisions and providing hands-on experience to scholars/students. The authors' provide a detailed motivation for their choice of technologies and data, including the evaluation and discussion of the results.

Personally, I do not recommend the author's write a paper in the first person (We, I, Us). I'm further not sure but are Grade 5's Secondary School? Secondary school grades are generally Grade 8-12? Further, in the Abstract the authors' indicate that they "limited the pre-requisites for the kids to the knowledge of counting, addition, percentages and comparisons". Generalising that this knowledge allows scholars/students in different countries to use spread sheets easily is not true, for example in Africa limited scholars in Grade 5 has ever been exposed to computer technologies.

I suggest the authors review the paper critically to ensure a scientific writing style. Table 4 -the general scientific presentation of a 5 point Likert Scale is fro Strongly Disagree (1) to Strongly Agree (5). Finally the authors must indicate if the scholars/students learned or in the future will consider a career as a Data Scientist. This was the objective of the exercise.


----------------------- REVIEW 2 ---------------------

OVERALL EVALUATION: 4 (Marginal Tend to Accept: Content has merit, but accuracy, clarity, completeness, and/or writing should and could be improved in time)

----------- Summary -----------
This paper describes how school children in grades 5 - 9 were given a half day tutorial in data science in several cities. Efforts were made to minimise prerequisite knowledge,to  maximise engagement and avoid the need  to use complicated tools. The authors list their design principles for the hands-on exercise and the workflow of the tutorial as well as some feedback from the children participating. Less than 5% of the children did not find it interesting, according to the authors. Being able to build predictions and see if they worked on real data seemed to hold the children's attention.

Those colleagues who teach Data Science at undergraduate level may be interested in the lessons learned from this novel exercise of making the subject interesting to children.

----------- Strengths -----------
Novel nature of taking a subject , normally taught at undergraduate level, and engaging much younger students.
The authors' design principles used.

----------- OVERALL EVALUATION -----------
Data Science has not yet become standard material on many Computing Science courses. This paper may encourage conference attendees to consider its inclusion, based on what has been achieved with much younger students in less than a day. Others may be encouraged to use this approach to do 'outreach' and encourage potential students to consider enrolling for Computing related degrees and could challenge some misconceptions about Computing.


----------------------- REVIEW 3 ---------------------

OVERALL EVALUATION: 4 (Marginal Tend to Accept: Content has merit, but accuracy, clarity, completeness, and/or writing should and could be improved in time)

----------- Summary -----------
The submission reports on a half-day data science organised for school children, and presents design principles for creating such an exercise. It will be of interest to teachers who want to introduce data science ideas to young students

----------- Strengths -----------
The paper clearly describes the principles on which the tutorial was based, and gives a detailed account of the way the tutorial was conducted. This would be very helpful for anyone who wants to implement a similar activity.

----------- OVERALL EVALUATION -----------
The paper gives a detailed account of the design principles and conduct of the tutorial, from which it would be easy for other teachers to create similar activities. In fact, it appears that the supporting materials for this particular exercise will be made widely available. The design is strongly justified and this appears to be an interesting exercise that was well received. There is a mention of students being asked to blog about what they had learned, in addition to a questionnaire - was this separate from the "subjective comments" that are mentioned also? In some places the paper is a bit verbose or repetitive, though, for example much of table 1 repeats points made in the text.


Recommendations:
State the age range of the students - "5th to 9th grades" may bot be meaningful for all of the audience
Replace the word "kids" with a more formal term
"We also interchangeably refer to the dependent variables as output variables and independent variables as input variables respectively" - choose one set of terminology and stick to it
Table 2 - put data under the correct headings
Figure 2 - three parts in one figure, separate into different figures or clearly label a,b and c on figure


----------------------- REVIEW 4 ---------------------

OVERALL EVALUATION: 4 (Marginal Tend to Accept: Content has merit, but accuracy, clarity, completeness, and/or writing should and could be improved in time)

----------- Summary -----------
This is an interesting paper which presents the authors half day tutorial for pupils in grades 5 to 9 introducing Data Science. The approach appears to engage the pupils through, what seems, a fun and practical hands on experience. The authors have attempted to keep the activity highly visual and fun. This paper should appeal to teachers and the authors appear to be offering their work as a template for development. The authors have attempted to cover various aspects of Data science including gathering, process and visualising.

----------- Strengths -----------
I think the strength of this paper is the overall description of how the half day tutorial was conceived and implemented. The undoubted enthusiasm that the authors show throughout the paper for the subject area and the honesty with which they write.

----------- OVERALL EVALUATION -----------
The authors are clearly passionate about the subject area and have strived to create a tutorial that will cover the major aspects of data science in a fun and interactive way for the participants. They have attempted to keep the level of knowledge required by the participants as low as possible while still delivering a meaningful learning experience. On page 2 the symbol in front of each of the numbered sections should be replaced with the word section and the frequent use of the word "kid" be replaced with a more suitable term such as "pupil". I think the paper would be of interest to teachers and worth including at the conference.


----------------------- REVIEW 5 ---------------------

OVERALL EVALUATION: 5 (Clear Accept: Content, presentation, and writing meet professional norms; improvements may be advisable but acceptable as is)

----------- Summary -----------
Authors organized a half-day long data science tutorial for kids in grades 5 through 9. Their aim was to expose them to the full cycle of a typical supervised learning approach - data collection, data entry, data visualization, feature engineering, model building, model testing and data permissions. In the paper, they discuss the design choices made while developing the dataset, the method and the pedagogy for the tutorial.

----------- Strengths -----------
The approach draws from different pedagogic theories like experiential learning, problem-based learning, cooperative learning, cognitive apprenticeship, and blended learning; hence the design is theoretically grounded.

----------- OVERALL EVALUATION -----------
Text under the title "2 Design Consideration" does not match  the information in Table 1, hence this is confusing. Other than that the writing is good, and the material is validated with data from 4 different contexts.


-------------------------  METAREVIEW  ------------------------
PAPER: 68
TITLE: Introducing Data Science to School Kids
AUTHORS: Shashank Srikant and Varun Aggarwal

The authors provide a nice case study of the design of an introduction to data science, aimed at students in grades 5-9. The treatment is novel; the activities are interesting and fun. The overly informal presentation (which used words such as "kids") drew some criticism from reviewers. Other small flaws include wordiness and bugs in table setup.

Thursday, June 11, 2015

Our work's reviews at EAAI 2016

On the completion of our three camps, it dawned on us that the framework we'd created to teach kids data science had a lot of nuances - we'd taken a lot of decisions to ensure that the framework was accessible to a high-school audience while ensuring that the material was engaging and could be conveyed in a half-day setting.

We decided to write up a paper on it and submit it to an AI-education/pedagogy conference. We submitted it to EAAI 2016, a conference co-located with AAAI 2016.

Todd Neller and group put together a great conference at AAAI every year, focused on discussing interesting ways of teaching AI/data science to high schoolers and undergraduates. A regular feature of the conference is the Model AI Assignments track, which showcases fun ways to design assignments in AI.

Results were out in the first week of November - and we didn't make the cut.
However, we thought the reviewers' comments, which opine on various aspects of the paper, showed just how nascent this area is. We decided to post the reviews here and get a discussion going.

Here are the reviews - 

TITLE: Teaching Data Science to School Kids
AUTHORS: Shashank Srikant and Varun Aggarwal

REVIEW 1
The paper describes a half-day “camp” for elementary and middle school students on the full supervised learning pipeline, which assumes very basic background only.

In addition to the factors listed, the design decisions for the exercise are also (presumably) driven by a desire to fit this exercise into a half-day. This hardly satisfies my intuitions about what a camp is! Rather, it appears that you are designing an exercise that can be (presumably) easily adapted by teachers and camp leaders for inclusion as a lesson in a larger curriculum. This is good, and in my mind is the best framing of your project. Add “fit into one half day” as an explicit design constraint with this new framing.

Give an example of a simple balanced dataset to illustrate your intentions.

Why are you using Naive Bayes — I don’t imagine that school kids will view this as intuitive at all. Why not decision trees or decision rules, which began as computational models of concept formation in human subjects (Hunt, Marin, Stone 1966 Concept Learning System, CLS, if I recall correctly). Isn’t that basis in psychological modeling a compelling reason for selecting decision rules and trees?

A domain was ruled out (movie prediction), because of missing and noisy data (p. 4). Isn’t this something to cover in a longer camp, or perhaps another module? Again, I would be thinking about what you are doing as creating modules for adoption by instructors, rather than a standalone camp (but if you want to think of this as a camp, then one-half day does not satisfy the length criterion for a camp.

There are clearly opportunities for discussion of ethics here — a 61% average accuracy on unseen data is compelling caution for the exuberance of one student “I learned how to predict a stranger’s choices”, but moreover, the naive Bayes form may limit the ability to talk about overfitting as well (under fitting is the real “danger”)

I’m glad that students were asked to sign consent forms (for reasons related to their own learning and maturation), but was the process vetted by an IRB? I think it should be, even more particularly because it involves minors.

Overall, this is an interesting exercise, and while I question the particular  predictive modeling language, consent procedures, and some other particulars, it is within EAAI scope, I believe, to design lessons (again, I don’t think this is a camp) that can be adapted by middle and high school teachers.


REVIEW 2
This paper presents a case study where the authors taught a selection of data science topics to 5th-9th graders. The authors probably would have been better served by making this a "Model AI Assignment." As it is, their paper introduces some hypotheses and guidelines, but do not test any of them. They provide a brief qualitative assessment of whether students liked their module or if the students thought they learned.

There were minor typos, but overall the paper was well written.

Overall, I'm not sure how much people will learn from reading this paper / watching the authors present. However, it could spark some interesting discussion:

Is it reasonable to teach this age group data science? Would it be better to teach older students this material? Would students this age be better served by learning some programming rather than using MS Excel?

REVIEW 3
The paper presents the results of a data science camp for 5th through 9th graders.  It explains the goals of the camp and also tried to provide an analysis of the outcomes.

While this paper has a good idea in mind, there are some issues with the paper as written that would result in a stronger paper if they were addressed.  Those issues include:

* The authors state that their hypothesis is that a student learns best by problem solving themselves.  However, this hypothesis is never directly tested (for example, comparing a group of students who did hands on projects to those that learned from instructional videos on a common post-test).

* While I can understand an appreciation for manual data collection, there is a good deal of data that is collected automatically these days (or the people who collect the data are not the same as those who analyze it).  Why should 5th-9th graders be spending their time on data entry rather than analysis?  This needs more justification with respect to the actual learning outcomes students achieve by doing manual data entry.

* The sample space of the data analysis tasks is very small (8-16 instances).  Why even do inference (rather than just building a lookup table) for such a small problem?

* In several places you refer to reducing cognitive load, but this is something that is not actually measured at all.  You need more precision/assessment here to make claims about cognitive load.

* There is a significant problem in how probabilities are dealt with in the pseudo-Naive Bayes model.  By having students sum (rather than multiply) probabilities, you give students an erroneous sense of how probabilities work.  Having taught probability for many years, this is actually a big problem for students, and the task you give them further reinforces this incorrect interpretation.  That seems very problematic to me as an educator in this area.  If you are using addition for interpretability, then why use a Bayesian classifier at all?  Why not just use a decision tree which is more more readily interpretable.

* Similar to the point about data entry, having student each rate 50 images seems like a lot of effort whose educational outcome is unclear.  This needs more explanation (and perhaps some assessment) to determine if it's actually a worthwhile use of time by the students.

* The average validation accuracy is low (62%) for a binary classification task.  Is that due to a low Bayes rate for the task or is it that the models built were just not very good?  This really needs some exploration.

* It would be very useful to provide quantitative results from either the students or mentors or both as to their thoughts about the utility of the camp and the tasks they were involved in.

* The paper would benefit from a round of general English editing.

Wednesday, June 10, 2015

Predictive features - unleash your creativity

An old hand at machine learning and data science will tell you that she eats, sleeps and drinks feature engineering when it comes to building solid predictive models. We document here some of the interesting features tried out by the kids at the camps. As with any other experimental science, some were successful and some were not. The features listed here should motivate mentors at subsequent camps to get the kids thinking along these lines and open their minds to the possibility of such ideas having an impact on their models!

In the current setting, while introducing features, we motivate how the name (whether it's old-sounding or new-sounding), the hobby (whether it's an indoor or an outdoor activity) and gender affect how friends are made. Kids and mentors are then encouraged to go beyond these features and explore others which might signal friendship.

A note - the features listed below may have been motivated by looking at one particular data set which the kid was analyzing. The discrimination percentage mentioned below may not generalize to other data sets.

The information is presented in the following format -
[Feature description] - [Camp] - [Discrimination on the friend-set]*


  • Artsy vs Non-artsy hobbies - Bangalore - 88%
  • Happy vs Grumpy looking faces - Pune - 55%
  • Weird vs Common name - Pune 
  • Hobbies involving hand held tools vs otherwise - Pune

*Please read the experiment details to understand this better.