SAMPLE TESTS

Part I: Multiple Choice Questions 40%

This part has 20 questions, each with 4 answer choices (students select the best answer) and each worth 2 points. Some sample questions are:

  1. If a data set has "n" observations and "p" features, and it is high-dimensional data, which one of the following is true? a) n = p; b) n < p; c) n > p; d) log(n) = log(p).
  2. Which of the following is not a class characteristic? a) intensity data; b) imbalanced data; c) inaccurate data; d) incomplete data.
  3. Which of the following statements is correct? a) bagging is applied at the testing phase of random forest; b) bagging is applied at the training phase of random forest; c) bagging is applied only in support vector machine; d) bagging is randomly selected in random forest.


Part II: Essay Questions 60%

This part has 4 questions, of which students answer 3; each question is worth 20 points. Some sample questions are:

  1. Researchers and statisticians have studied the missing data problem for several decades; however, big data is emerging and its data complexity is unmanageable. Using modern examples, describe the missing data problem in the big data domain, and suggest a feasible solution.
  2. Suppose you have two data sets. The first set has "n" observations and "p" features with a class label "c1". The second set has "m" observations and the same features in the same order with a class label "c2". Write a program using a programming language (e.g., R, Matlab, Java) to merge these data sets with class randomization so that the result is ready for a classification problem.
  3. Explain the random forest algorithm using an example, figures, and pseudocode as appropriate. You must present all the steps involved in the training and testing phases of this algorithm.
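
As an illustration of what essay question 2 is asking for, the following is a minimal sketch in Python (not one of the languages named in the question, which leaves the choice open). It assumes both data sets are already in memory as lists of feature rows, and it reads "class randomization" as shuffling the merged rows so that the two classes are intermixed; that reading is an assumption. The function name and parameters are hypothetical.

```python
import random


def merge_with_class_randomization(first, second,
                                   label_first="c1", label_second="c2",
                                   seed=None):
    """Merge two data sets, label each row, and shuffle the result.

    first, second: lists of feature rows (assumed to share the same
    features in the same order). Returns a new list of rows in which
    the last element of each row is its class label.
    """
    # Both sets must have the same number of features per row.
    widths = {len(row) for row in first} | {len(row) for row in second}
    if len(widths) > 1:
        raise ValueError("all rows must have the same number of features")

    # Append the class label as the final column of every row.
    merged = [list(row) + [label_first] for row in first]
    merged += [list(row) + [label_second] for row in second]

    # Shuffle so the classes are intermixed; a seed makes the
    # ordering reproducible.
    random.Random(seed).shuffle(merged)
    return merged


if __name__ == "__main__":
    set_one = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.2]]  # n = 3, p = 2
    set_two = [[5.0, 8.0], [6.1, 7.7]]              # m = 2, p = 2
    for row in merge_with_class_randomization(set_one, set_two, seed=0):
        print(row)
```

The merged table has n + m rows and p + 1 columns, with the class label in the last column. Shuffling before any train/test split avoids a split in which one class dominates simply because its rows were listed first.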