Monday, June 15, 2015

Steps to replicate the experiment

These notes will help you replicate the data exercise we conducted. The Google form that we used for registration for the data camp is here. Let us know if you have any questions: write to varun -AT- aspiringminds.com or shashank.srikant -AT- aspiringminds.com.

Collecting data
  1. Download the PDF document flash_cards.pdf and print it on A4 sheets. Each page of the document contains four flash cards, which the kids will rate.
  2. The first two pages of the document (eight flash cards) are for practice and can be excluded from the analysis.
  3. Each flash card shows a photo, a name and an interest.
  4. Divide the kids into teams of two or three and give one set of flash cards to each kid in the team.
  5. Kids rate each flash card on a scale of 1 to 5 according to how much they would like to befriend the 'potential friend' on the card, given the information on it.
  6. Make sure that you have some way to identify who rated each set of flash cards. One idea is to ask kids to write their names at the top of their flash card document.
  7. Once all teams have finished rating, separate the ‘practice flash cards’ (first two sheets) and the ‘validation flash cards’ (the last four sheets in our case; this depends on how you have arranged your flash cards, and we went by the attached Excel file) from the ‘training flash cards’, which are used for model building. Put the practice and validation sets aside in an envelope and work with the training set.
Data entry
  1. Get all teams to a common place, assign each team a mentor and swap the training flash cards between teams so that no team gets its own. This way they analyze another team’s data and learn how kids in other teams make choices about making friends.
  2. Download the two Excel files attached to this post (titled ‘worksheet.xls’ and ‘worksheet-sample.xls’). The file ‘worksheet-sample.xls’ contains sample data filled in by one of the teams in our experiment, while ‘worksheet.xls’ is empty and can be distributed to teams in a new experiment.
  3. Each Excel file contains four sheets: ‘All Data’ contains all the data along with the features for each flash card in ‘flash_cards.pdf’; ‘Model’ contains the prior and likelihood information for each feature; ‘Training Data’ contains the training data along with the predictions based on the models; and ‘Visualization’ contains information on the visualizations that kids can make. Each team works with one Excel file.
  4. To keep things simple, we have only four binary features: traditional/modern name (column B), boy/girl (column C), indoor/outdoor activity (column D) and happy/serious (column E, though we did not use this feature in our experiment).
  5. Explain to the kids in detail the importance and role of features in data science. Let them come up with features for this experiment themselves, then explain the features present in the ‘Training Data’ sheet of the Excel file.
  6. Mentors ask their team to enter the ratings (from the two sets of flash cards) in columns F and G of the Excel file. If time is short, you could work with just one of the two kids’ entries. The team should also check that each row of features corresponds to the respective flash card and makes sense to them. This helps them understand the features better.
  7. To convert this exercise from a multiclass problem (ratings 1-5) into a two-class (0, 1) classification problem, columns H and I have Excel formulas that derive binary ratings from the five-level ratings in columns F and G.
  8. Columns J, K and L hold the points (fixed cost) used by the naive-Bayes-like classifier we have designed. [more information in the following section]
  9. Columns P and Q have formulas that compute the classifier's output probabilities for the two classes, ‘friend’ and ‘not friend’. Column R is the final verdict (‘friend’ or ‘not friend’) on the training set.
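The conversion from 1-5 ratings to binary labels described in step 7 can be sketched in a few lines of Python. This is only an illustration: the post does not state which cut-off the Excel formulas in columns H and I use, so the threshold of 4 below is an assumption, and the ratings are made up.

```python
# Assumption: a rating of 4 or 5 counts as 'friend'. The post does not state
# the cut-off used by the Excel formulas, so adjust FRIEND_THRESHOLD as needed.
FRIEND_THRESHOLD = 4


def to_binary(rating: int, threshold: int = FRIEND_THRESHOLD) -> int:
    """Map a 1-5 rating to 1 ('friend') or 0 ('not friend')."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return 1 if rating >= threshold else 0


ratings = [5, 2, 4, 1, 3]  # made-up ratings for five flash cards
binary = [to_binary(r) for r in ratings]
print(binary)
```

With a different cut-off, say 3, pass `threshold=3` and the middle rating moves into the 'friend' class.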
Visualization
  1. In the ‘Visualization’ sheet, mentors help each team count and enter all the relevant fields for one of the sets of flash cards. The expected inputs are:
    1. Output visualizations
      1. Distribution of ratings (classes 1-5)
    2. Filtered visualizations (restricted to the binary class ‘friend’, column H or I). The reason for making visualizations within the friends category is to identify a variable that differentiates this output well. For instance, if 80% of those marked as friends were males, the data would suggest that the rater is more inclined to befriend males than females.
      1. Distribution of gender among friends
      2. Distribution of names among friends
      3. Distribution of activity (hobby) among friends
  2. On completing this exercise, teams can try plotting this data as bar or pie charts and draw inferences.
  3. Teams can make the same visualizations and inferences for the other set of flash cards as well.
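The counting behind the filtered visualizations can be sketched as follows. The cards below are made up for illustration, and the tuple layout and the helper name `share_among_friends` are hypothetical, not taken from the worksheet.

```python
from collections import Counter

# Made-up cards: (gender, name style, activity, binary label), 1 = 'friend'.
cards = [
    ("boy", "modern", "outdoor", 1),
    ("girl", "traditional", "indoor", 0),
    ("boy", "traditional", "outdoor", 1),
    ("girl", "modern", "outdoor", 1),
    ("boy", "modern", "indoor", 0),
    ("boy", "traditional", "indoor", 1),
]


def share_among_friends(cards, column):
    """Percentage of each feature value among cards labelled 'friend'."""
    friends = [card for card in cards if card[3] == 1]
    if not friends:
        return {}
    counts = Counter(card[column] for card in friends)
    return {value: 100 * n / len(friends) for value, n in counts.items()}


gender_share = share_among_friends(cards, 0)
activity_share = share_among_friends(cards, 2)
print(gender_share)
print(activity_share)
```

These percentages are the numbers a team would put into a bar or pie chart for each feature.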
Model building
  1. Based on what the above visualization exercise shows, teams need to decide which of the three features would best predict friendship. We suggest picking a feature that is a clear differentiator, i.e. one with more than a 60-40 split within the friends category [see the section above, where this information is learnt by visualizing the data]. If no feature clearly differentiates, we suggest picking any one feature and explaining to the kid the consequence of picking such a feature.
  2. Having picked a feature and its differentiating value (say the visualizations revealed that 80% of the friends were males; then the feature of interest is Gender and the feature value is Male), the idea is to assign a score of 5 to every entry in the Excel data that has that feature value, and a score of 1 otherwise.
  3. Introduce the kid to the Excel formula =IF(relevantcell="feature value",5,1). Get her to understand it, play with it, and teach her to drag it so that it is applied to every row of the entered data.
  4. Once the 1/5 column has been created, compute the classification accuracy by seeing how many of the points that were predicted as 5 were friends. This number should be the same as what was seen in the visualizations for this feature in the section above.
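The single-feature predictor and its accuracy can be sketched in Python as below. The data is made up, and the accuracy here counts every card where the prediction agrees with the binary label (5 with 'friend', 1 with 'not friend'); that is the usual definition and an assumption about how the worksheet scores it.

```python
def predict(value, target):
    """Mirror of the Excel formula IF(cell = target, 5, 1)."""
    return 5 if value == target else 1


def accuracy(values, labels, target):
    """Percentage of cards where the 1/5 prediction agrees with the label."""
    hits = sum(
        (predict(value, target) == 5) == (label == 1)
        for value, label in zip(values, labels)
    )
    return 100 * hits / len(labels)


# Made-up training data: activity on each card and its binary label.
activities = ["outdoor", "indoor", "outdoor", "outdoor", "indoor", "indoor"]
labels = [1, 0, 1, 1, 0, 1]

train_accuracy = accuracy(activities, labels, "outdoor")
print(f"{train_accuracy:.2f}")
```

The same `predict` call, with the same target value, is what gets reused on the validation cards later.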
Validation
  1. Set up the atmosphere as though magic were about to follow :)
  2. Get teams to come forward and enter the feature information for the sheets in the validation set (which were set aside in an envelope; see the section ‘Collecting data’).
  3. Once the information is filled in, use the same predictor that was chosen on the training set to compute the 1/5 column.
  4. Report the classification accuracies in a table similar to the one shown below.



Person   Features (who to make a friend?)   Train accuracy % (32)   Test accuracy % (16)
ALPHA    'Outdoor' people                   84.38                   81.25
BETA     'Indoor' people                    81.25                   81.25
GAMMA    With 'new names'                   62.50                   75.00
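As a quick sanity check, the percentages in the table can be turned back into counts of correctly classified cards. This sketch assumes that the (32) and (16) in the header are the numbers of training and validation flash cards; the counts themselves are inferred from the percentages, not reported in the post.

```python
# Assumption: 32 training cards and 16 validation cards, read off the header.
N_TRAIN, N_TEST = 32, 16

# Percentages copied from the table: (train accuracy, test accuracy).
reported = {
    "ALPHA": (84.38, 81.25),
    "BETA": (81.25, 81.25),
    "GAMMA": (62.50, 75.00),
}

correct = {
    person: (round(train / 100 * N_TRAIN), round(test / 100 * N_TEST))
    for person, (train, test) in reported.items()
}

for person, (train_hits, test_hits) in correct.items():
    print(f"{person}: {train_hits}/{N_TRAIN} train, {test_hits}/{N_TEST} test")
```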

Once validated, get everyone to appreciate the day's learnings by going through a quick recap of the steps.

We would like to hear about your experiences and inferences from replicating the experiment; write to us at the email addresses mentioned above.

Cheers!
