About The Book:

This book presents machine learning models and algorithms for addressing big data classification problems. Existing machine learning techniques such as the decision tree (a hierarchical approach), random forest (an ensemble hierarchical approach), and deep learning (a layered approach) are well suited to systems that can handle such problems. This book helps readers, especially students and newcomers to the field of big data and machine learning, gain a quick understanding of these techniques and technologies; therefore, the theory, examples, and programs (MATLAB and R) presented in this book have been simplified, hardcoded, repeated, or spaced out for readability. They provide vehicles to test and understand the complicated concepts of various topics in the field. Readers are encouraged to use these programs to experiment with the examples, and then modify or write their own programs to advance their knowledge toward solving more complex and challenging problems.

The presentation format of this book focuses on simplicity, readability, and dependability so that both undergraduate and graduate students, as well as new researchers, developers, and practitioners in this field, can easily trust and grasp the concepts and learn them effectively. It has been written to reduce mathematical complexity and to help the vast majority of readers understand the topics and become interested in the field. This book consists of four parts, with a total of 14 chapters. The first part focuses on the topics needed to analyze and understand data and big data. The second part covers the systems required for processing big data. The third part presents the topics required to understand and select machine learning techniques for classifying big data. Finally, the fourth part concentrates on scaling up machine learning, an important solution for modern big data problems.

About the Author:

Shan Suthaharan is a Professor of Computer Science at the University of North Carolina at Greensboro (UNCG), North Carolina, USA. He also serves as the Director of Undergraduate Studies in the Department of Computer Science at UNCG. He has more than twenty-five years of university teaching and administrative experience and has taught both undergraduate and graduate courses. His aspiration is to educate and train students so that they can prosper in the computing field by understanding current real-world, complex problems and developing efficient techniques and technologies. His current teaching interests include big data analytics and machine learning, cryptography and network security, and computer networking and analysis. He earned his doctorate in Computer Science from Monash University, Australia. Since then, he has been actively disseminating his knowledge and experience through teaching, advising, seminars, research, and publications. Dr. Suthaharan enjoys investigating real-world, complex problems and developing and implementing algorithms to solve them using modern technologies. The main theme of his current research is signature discovery and event detection for a secure and reliable environment; the ultimate goal of his research is to build such an environment using modern and emerging technologies. His current research primarily focuses on the characterization and detection of environmental events, the exploration of machine learning techniques, and the development of advanced statistical and computational techniques to discover key signatures and detect emerging events from structured and unstructured big data. Dr. Suthaharan has authored or co-authored more than seventy-five research papers in the areas of computer science and published them in international journals and refereed conference proceedings.
He also invented a key management and encryption technology, which has been patented in Australia, Japan, and Singapore. He received visiting scholar awards from, and served as a visiting researcher at, the University of Sydney, Australia; the University of Melbourne, Australia; and the University of California, Berkeley, USA. He has been a senior member of the Institute of Electrical and Electronics Engineers and twice volunteered as elected chair of its Central North Carolina Section. He is a member of Sigma Xi, the Scientific Research Society, and a Fellow of the Institution of Engineering and Technology.


Each chapter below includes its abstract and a link to download the corresponding code and data.

Chapter 1: Science of Information

Abstract: The main objective of this chapter is to provide an overview of the modern field of data science and some of the current progress in this field. The overview focuses on two important paradigms: (1) the big data paradigm, which describes a problem space for big data analytics, and (2) the machine learning paradigm, which describes a solution space for big data analytics. It also includes a preliminary description of the important elements of data science: the data, the knowledge (also called responses), and the operations. The terms knowledge and responses are used interchangeably in the rest of the book. Preliminary information on the data format, the data types, and classification is also presented in this chapter. The chapter emphasizes the importance of collaboration between experts from multiple disciplines and provides information on some current institutions that support collaborative activities and offer useful resources.

Chapter 2: Big Data Essentials

Abstract: The main objective of this chapter is to systematically organize the big data essentials that contribute to big data analytics. It presents them in a simple form that can help readers conceptualize and summarize the classification objectives easily. The topics are organized into three sections: big data analytics, big data classification, and big data scalability. The big data analytics section presents and discusses in detail the big data controllers that play major roles in data representation and knowledge extraction, the problems and challenges these controllers bring to big data analytics, and the solutions to address those problems and challenges. The big data classification section discusses the machine learning processes, the classification modeling characterized by the big data controllers, and the classification algorithms that can manage the effect of those controllers. The big data scalability section discusses the importance of the low-dimensional structures that can be extracted from a high-dimensional system to address scalability issues.

Chapter 3: Big Data Analytics

Abstract: An in-depth analysis of data can reveal many interesting properties of the data, which can help us predict its future characteristics. The objective of this chapter is to illustrate some of the meaningful changes that may occur in a set of data as it evolves into big data. To make this objective practical and interesting, a split-merge-split framework is developed, presented, and applied in this chapter. A set of file-split, file-merge, and feature-split tasks is used in this framework, which helps explore the patterns that emerge as a set of data is transformed into a set of big data. Four digital images are used to create data sets, and statistical and geometrical techniques are applied with the split-merge-split framework to understand the evolution of patterns under different class-characteristics, domain-characteristics, and error-characteristics scenarios.

Download Codes and Data Files

Chapter 4: Distributed File System

Abstract: The main objective of this chapter is to provide information and guidance for building a Hadoop distributed file system to address the big data classification problem. Such a system can help one implement, test, and evaluate the various machine learning techniques presented in this book for learning purposes. The objectives include a detailed explanation of the Hadoop framework and the Hadoop system, a presentation of the Internet resources that can help you build a virtual-machine-based Hadoop distributed file system with the R programming platform, and easy-to-follow, step-by-step instructions for building RevolutionAnalytics’ RHadoop system for your big data computing environment. The objectives also include simple examples for testing the system to ensure that Hadoop works. A brief discussion on setting up a multi-node Hadoop system is also presented.

Chapter 5: MapReduce Programming Platform

Abstract: The main objective of this chapter is to explain the MapReduce framework based on RevolutionAnalytics’ RHadoop environment. The MapReduce framework relies on its underlying structures, parametrization and parallelization, which are explained clearly in this chapter. The implementation of these structures requires a MapReduce programming platform; an explanation of this platform is presented together with a discussion of its three important functions, mapper(), reducer(), and mapreduce(). These functions help implement the parametrization and parallelization structures to address scalability problems in big data classification. The chapter also presents a set of coding principles, which provide good programming practices for users of the MapReduce programming platform in the context of big data processing and analysis. Several programming examples are also presented to help the reader practice the coding principles and better understand the MapReduce framework.
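As a language-agnostic aside (the chapter's own examples use R on the RHadoop platform), the map-shuffle-reduce pattern behind mapper(), reducer(), and mapreduce() can be sketched in a few lines of plain Python. The in-memory mapreduce() helper below is a hypothetical stand-in for illustration only, not the rmr2 API:

```python
from collections import defaultdict

def mapreduce(data, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce pattern:
    map each record to (key, value) pairs, shuffle by key,
    then reduce each key's values to a single result."""
    shuffled = defaultdict(list)
    for record in data:
        for key, value in mapper(record):   # map phase
            shuffled[key].append(value)     # shuffle/group phase
    return {key: reducer(key, values)       # reduce phase
            for key, values in shuffled.items()}

# Word count: the canonical MapReduce example.
lines = ["big data big models", "big data algorithms"]
counts = mapreduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {"big": 3, "data": 2, "models": 1, "algorithms": 1}
```

A real MapReduce system runs the map and reduce phases on separate cluster nodes; the grouping logic, however, is exactly this.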

Download Codes and Data Files

Chapter 6: Modeling and Algorithms

Abstract: The main objective of this chapter is to explain the machine learning concepts, mainly modeling and algorithms; batch learning and online learning; and supervised learning (regression and classification) and unsupervised learning (clustering) using examples. Modeling and algorithms will be explained based on the domain division characteristics, batch learning and online learning will be explained based on the availability of the data domain, and supervised learning and unsupervised learning will be explained based on the labeling of the data domain. This objective will be extended to the comparison of the mathematical models, hierarchical models, and layered models, using programming structures, such as control structures, modularization, and sequential statements.
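The batch-versus-online distinction can be illustrated with a toy example. The Python sketch below (illustrative only; the book's programs are in MATLAB and R) fits the one-parameter model y ≈ w·x by squared-error gradient steps, once per pass over the whole data set (batch) and once per individual sample (online):

```python
def batch_step(w, data, lr):
    """Batch learning: one update from the gradient of the
    mean squared error (y - w*x)^2 over ALL samples."""
    grad = sum(-2 * x * (y - w * x) for x, y in data) / len(data)
    return w - lr * grad

def online_step(w, sample, lr):
    """Online learning: update immediately from a single sample."""
    x, y = sample
    return w + lr * 2 * x * (y - w * x)

data = [(1.0, 2.0), (2.0, 4.0)]  # exact relation y = 2x

w_batch = 0.0
for _ in range(50):
    w_batch = batch_step(w_batch, data, lr=0.1)

w_online = 0.0
for _ in range(50):
    for sample in data:
        w_online = online_step(w_online, sample, lr=0.1)

print(round(w_batch, 3), round(w_online, 3))  # both approach 2.0
```

Both learners recover the true coefficient; the difference is that the online learner never needs the whole data domain in memory at once, which is why availability of the data domain is the natural axis for this comparison.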

Download Codes and Data Files

Chapter 7: Supervised Learning Models

Abstract: The main objective of this chapter is to discuss various supervised learning models in detail. Supervised learning models provide a parametrized mapping that projects a data domain into a response set, and thus help extract knowledge (known) from data (unknown). These learning models, in simple form, can be grouped into predictive models and classification models. First, the predictive models, such as standard regression, ridge regression, lasso regression, and elastic-net regression, are discussed in detail with their mathematical and visual interpretations using simple examples. Second, the classification models are discussed and grouped into three families: mathematical models, such as logistic regression and the support vector machine; hierarchical models, such as the decision tree and the random forest; and layered models, such as deep learning. These are discussed only from the modeling point of view; their modeling and algorithms are discussed together in detail in separate chapters later in the book.
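For the predictive models, a minimal worked case may help: for a single feature with no intercept, ridge regression minimizes Σ(y − wx)² + λw² and has the closed-form solution w = Σxy / (Σx² + λ), so the penalty λ visibly shrinks the coefficient. The Python sketch below is illustrative only (the book's programs are in MATLAB and R):

```python
def ridge_1d(xs, ys, lam):
    """Single-feature ridge regression (no intercept):
    minimizes sum((y - w*x)^2) + lam*w^2, whose closed-form
    solution is w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]           # exact relation y = 2x
print(ridge_1d(xs, ys, 0.0))   # lam = 0: ordinary least squares -> 2.0
print(ridge_1d(xs, ys, 14.0))  # larger penalty shrinks w toward 0 -> 1.0
```

Lasso replaces the λw² penalty with λ|w| (which can zero coefficients out entirely), and elastic net combines both penalties.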

Download Codes and Data Files

Chapter 8: Supervised Learning Algorithms

Abstract: Supervised learning algorithms help the learning models to be trained efficiently, so that they can provide high classification accuracy. In general, the supervised learning algorithms support the search for optimal values for the model parameters by using large data sets without overfitting the model. Therefore, a careful design of the learning algorithms with systematic approaches is essential. The machine learning field suggests three phases for the design of a supervised learning algorithm: training phase, validation phase, and testing phase. Hence, it recommends three divisions (or subsets) of the data sets to carry out these tasks. It also suggests defining or selecting suitable performance evaluation metrics to train, validate, and test the supervised learning models. Therefore, the objectives of this chapter are to discuss these three phases of a supervised learning algorithm and the three performance evaluation metrics called domain division, classification accuracy, and oscillation characteristics. The chapter objectives include the introduction of five new performance evaluation metrics called delayed learning, sporadic learning, deteriorate learning, heedless learning, and stabilized learning, which can help to measure classification accuracy under oscillation characteristics.
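The three phases imply a three-way division of the data set. A minimal, hypothetical Python sketch of such a split (illustrative only; the book's programs are in MATLAB and R, and the 60/20/20 proportions are just one common choice):

```python
import random

def three_way_split(data, train=0.6, val=0.2, seed=7):
    """Shuffle and divide a data set into training, validation,
    and test subsets (the test subset gets the remainder)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = three_way_split(list(range(10)))
print(len(train_set), len(val_set), len(test_set))  # 6 2 2
```

The training subset fits the model parameters, the validation subset tunes design choices without touching the test data, and the test subset gives the final unbiased performance estimate.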

Download Codes and Data Files

Chapter 9: Support Vector Machine

Abstract: The support vector machine is one of the classical machine learning techniques that can still help solve big data classification problems. In particular, it can help multidomain applications in a big data environment. However, the support vector machine is mathematically complex and computationally expensive. The main objective of this chapter is to simplify this approach using process diagrams and data flow diagrams to help readers understand the theory and implement it successfully. To achieve this objective, the chapter is divided into three parts: (1) modeling of a linear support vector machine; (2) modeling of a nonlinear support vector machine; and (3) the Lagrangian support vector machine algorithm and its implementations. The Lagrangian support vector machine is also implemented with simple examples using the R programming platform on Hadoop and non-Hadoop systems.
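As an illustrative aside, a linear support vector machine can be trained with a simple subgradient method on the regularized hinge loss. The Python sketch below is a toy stand-in to show the core idea, not the Lagrangian algorithm the chapter implements in R:

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM trained by subgradient descent on the regularized
    hinge loss: lam*||w||^2 + mean(max(0, 1 - y*(w.x + b)))."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:                      # point inside the margin
                w[0] += lr * (y * x1 - 2 * lam * w[0])
                w[1] += lr * (y * x2 - 2 * lam * w[1])
                b += lr * y
            else:                               # only the regularizer acts
                w[0] -= lr * 2 * lam * w[0]
                w[1] -= lr * 2 * lam * w[1]
    return w, b

# Tiny linearly separable example (labels are +1 / -1).
points = [(2, 2), (3, 3), (-2, -2), (-3, -1)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
predict = lambda p: 1 if w[0] * p[0] + w[1] * p[1] + b >= 0 else -1
print([predict(p) for p in points])  # [1, 1, -1, -1]
```

The hinge loss pushes every training point to at least unit margin from the separating line, while the regularizer keeps the weights small, which is what maximizes that margin.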

Download Codes and Data Files

Chapter 10: Decision Tree Learning

Abstract: The main objective of this chapter is to introduce you to hierarchical supervised learning models. One of the main hierarchical models is the decision tree, which has two categories: the classification tree and the regression tree. The theory and applications of these decision trees are explained in this chapter. These techniques require tree-split algorithms to build the decision trees and quantitative measures to build an efficient tree via training. Hence, the chapter dedicates some discussion to measures such as entropy, cross-entropy, Gini impurity, and information gain. It also discusses the training algorithms suitable for classification tree and regression tree models. Simple examples and visual aids explain the difficult concepts so that readers can easily grasp the theory and applications of decision trees.
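The split measures the chapter discusses have short definitions that are easy to compute directly. The Python sketch below (illustrative only; the book's programs are in MATLAB and R) evaluates entropy, Gini impurity, and information gain on a tiny label set:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gini(labels):
    """Gini impurity: chance of mislabeling a randomly drawn item
    if it is labeled according to the class distribution."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, splits):
    """Entropy reduction achieved by splitting `parent` into `splits`."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

parent = ["a", "a", "b", "b"]
print(entropy(parent))                                     # 1.0
print(gini(parent))                                        # 0.5
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0 (pure split)
```

A tree-split algorithm simply tries candidate splits and keeps the one with the highest gain (or lowest impurity), then recurses on each child node.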

Download Codes and Data Files

Chapter 11: Random Forest Learning

Abstract: The main objective of this chapter is to introduce you to the random forest supervised learning model. The random forest technique uses the decision tree model for parametrization, but it integrates a sampling technique, a subspace method, and an ensemble approach to optimize the model building. The sampling approach is called the bootstrap, which adopts a random sampling approach with replacement. The subspace method also adopts a random sampling approach, but it helps extract smaller subsets (i.e., subspaces) of features. It also helps build decision trees based on them and select decision trees for the random forest construction. The ensemble approach helps build classifiers based on the so-called bagging approach. The objectives of this chapter include detailed discussions on these approaches. The chapter also discusses the training and testing algorithms that are suitable for the random forest supervised learning. The chapter also presents simple examples and visual aids to better understand the random forest supervised learning technique.
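The bootstrap and the subspace method are both simple random-sampling ideas. The hypothetical Python sketch below (illustrative only; the book's programs are in MATLAB and R) shows sampling with replacement and drawing a random feature subset, the two ingredients each tree in the forest is grown from:

```python
import random

def bootstrap_sample(data, rng):
    """Bootstrap: draw len(data) items WITH replacement, so some
    items appear more than once and others not at all."""
    return [rng.choice(data) for _ in data]

def feature_subspace(feature_names, k, rng):
    """Subspace method: pick a random subset of k features
    for one tree to consider."""
    return rng.sample(feature_names, k)

rng = random.Random(42)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print(len(sample))          # always 10; duplicates are likely
subset = feature_subspace(["f1", "f2", "f3", "f4"], 2, rng)
print(subset)               # two randomly chosen feature names
```

Bagging then aggregates the trees grown on these randomized samples and subspaces, typically by majority vote, which is what stabilizes the ensemble.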

Download Codes and Data Files

Chapter 12: Deep Learning Models

Abstract: The main objective of this chapter is to discuss in detail the modern deep learning techniques called no-drop, dropout, and dropconnect, and to provide programming examples that help you clearly understand these approaches. These techniques depend heavily on the stochastic gradient descent approach, which is also discussed in detail with simple iterative examples. These parametrized deep learning techniques also depend on two sets of parameters (weights), and the initial values of these parameters can significantly affect the deep learning models; therefore, a simple approach is presented to enhance classification accuracy and improve computing performance using perceptual weights. This approach, called the perceptually inspired deep learning framework, incorporates edge-sharpening filters and their frequency responses for the classifier and connector parameters of the deep learning models; these preserve class characteristics and regularize the deep learning model parameters.
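Dropout itself is a small mechanism: during training, zero each activation with probability p and rescale the survivors so the expected activation is unchanged (the common "inverted dropout" formulation). The Python sketch below is illustrative only, not the book's code:

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p) so the expected value is unchanged."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

rng = random.Random(0)
out = dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=rng)
# each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
print(out)
```

No-drop skips this step entirely, while dropconnect applies the same random masking to the individual connection weights rather than to the activations.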

Download Codes and Data Files

Chapter 13: Chandelier Decision Tree

Abstract: This chapter proposes two new techniques called the chandelier decision tree and the random chandelier. This pair of techniques is similar to the well-known pair of techniques, the decision tree and the random forest. The chapter also presents a previously proposed algorithm called the unit circle algorithm (UCA) and proposes a family of UCA-based algorithms called the unit circle machine (UCM), unit ring algorithm (URA), and unit ring machine (URM). The unit circle algorithm integrates a normalization process to define a unit circle domain, and thus the other proposed algorithms adopt the phrase “unit circle.” The chandelier decision tree and the random chandelier use the unit ring machine to build the chandelier trees.

Download Codes and Data Files

Chapter 14: Dimensionality Reduction

Abstract: The main objective of this chapter is to explain the two important dimensionality reduction techniques, feature hashing and principal component analysis, that can support scaling-up machine learning. The standard and flagged feature hashing approaches are explained in detail. The feature hashing approach suffers from the hash collision problem, and this problem is reported and discussed in detail in this chapter, too. Two collision controllers, feature binning and feature mitigation, are also proposed in this chapter to address this problem. The principal component analysis uses the concepts of eigenvalues and eigenvectors, and these terminologies are explained in detail with examples. The principal component analysis is also explained using a simple two-dimensional example, and several coding examples are also presented.
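The standard feature hashing idea fits in a few lines: each token is hashed to one of a fixed number of buckets, and distinct tokens that land in the same bucket collide, which is the problem the chapter's collision controllers address. The Python sketch below is illustrative only (md5 is used here purely as a deterministic hash across runs, unlike Python's built-in hash()):

```python
import hashlib

def hash_features(tokens, n_buckets):
    """Standard feature hashing: map each token to one of n_buckets
    index positions and count occurrences; distinct tokens that hash
    to the same bucket collide."""
    vec = [0] * n_buckets
    for tok in tokens:
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

tokens = ["big", "data", "big", "learning"]
vec = hash_features(tokens, n_buckets=8)
print(vec, sum(vec))  # counts are preserved: the sum equals len(tokens)
```

The fixed vector length is what makes the method scale: the dimensionality is n_buckets no matter how many distinct tokens appear, at the cost of occasional collisions.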

Download Codes and Data Files

