This book presents machine learning models and algorithms to address big data classification problems. Existing machine learning techniques like the decision tree (a hierarchical approach), random forest (an ensemble hierarchical approach), and deep learning (a layered approach) are highly suitable for systems that can handle such problems. This book helps readers, especially students and newcomers to the field of big data and machine learning, to gain a quick understanding of the techniques and technologies; therefore, the theory, examples, and programs (Matlab and R) presented in this book have been simplified, hardcoded, repeated, or spaced for improvements. They provide vehicles to test and understand the complicated concepts of various topics in the field. It is expected that readers will adopt these programs to experiment with the examples, and then modify them or write their own programs to advance their knowledge and solve more complex and challenging problems.

The presentation format of this book focuses on simplicity, readability, and dependability so that both undergraduate and graduate students as well as new researchers, developers, and practitioners in this field can easily trust and grasp the concepts, and learn them effectively. It has been written to reduce the mathematical complexity and to help the vast majority of readers understand the topics and get interested in the field. This book consists of four parts, with a total of 14 chapters. The first part mainly focuses on the topics that are needed to help analyze and understand data and big data. The second part covers the topics that can explain the systems required for processing big data. The third part presents the topics required to understand and select machine learning techniques to classify big data. Finally, the fourth part concentrates on the topics that explain scaling-up machine learning, an important solution for modern big data problems.

Chapter 1: Science of Information

Chapter 2: Big Data Essentials

Chapter 4: Distributed File System

Chapter 5: MapReduce Programming Platform

Chapter 6: Modeling and Algorithms

Chapter 7: Supervised Learning Models

Chapter 8: Supervised Learning Algorithms

Chapter 9: Support Vector Machine

Chapter 10: Decision Tree Learning

Chapter 11: Random Forest Learning

Chapter 12: Deep Learning Models

Chapter 13: Chandelier Decision Tree

Chapter 14: Dimensionality Reduction

Abstract: The main objective of this chapter is to provide an overview of the modern field of data science and some of the current progress in this field. The overview focuses on two important paradigms: (1) the big data paradigm, which describes a problem space for big data analytics, and (2) the machine learning paradigm, which describes a solution space for big data analytics. It also includes a preliminary description of the important elements of data science. These important elements are the data, the knowledge (also called responses), and the operations. The terms knowledge and responses will be used interchangeably in the rest of the book. Preliminary information on the data format, the data types, and the classification is also presented in this chapter. This chapter emphasizes the importance of collaboration between experts from multiple disciplines and provides information on some of the current institutions that show collaborative activities with useful resources.

Abstract: The main objective of this chapter is to systematically organize the big data essentials that contribute to the analytics of big data. It presents them in a simple form that can help readers conceptualize and summarize the classification objectives easily. The topics are organized into three sections: big data analytics, big data classification, and big data scalability. In the big data analytics section, the big data controllers that play major roles in data representation and knowledge extraction will be presented and discussed in detail. These controllers, the problems and challenges that they bring to big data analytics, and the solutions to address these problems and challenges will also be discussed. In the big data classification section, the machine learning processes, the classification modeling that is characterized by the big data controllers, and the classification algorithms that can manage the effect of big data controllers will be discussed. In the big data scalability section, the importance of the low-dimensional structures that can be extracted from a high-dimensional system for addressing scalability issues will be discussed as well.

Abstract: An in-depth analysis of data can reveal many interesting properties of the data, which can help us predict the future characteristics of the data. The objective of this chapter is to illustrate some of the meaningful changes that may occur in a set of data when it is transformed into big data through evolution. To make this objective practical and interesting, a split-merge-split framework is developed, presented, and applied in this chapter. A set of file-split, file-merge, and feature-split tasks is used in this framework. It helps explore how patterns evolve as a set of data is transformed into a set of big data. Four digital images are used to create data sets, and statistical and geometrical techniques are applied with the split-merge-split framework to understand the evolution of patterns under different scenarios of class characteristics, domain characteristics, and error characteristics.

Abstract: The main objective of this chapter is to provide information and guidance for building a Hadoop distributed file system to address the big data classification problem. This system can help one to implement, test, and evaluate various machine learning techniques presented in this book for learning purposes. The objectives include a detailed explanation of the Hadoop framework and the Hadoop system, the presentation of the Internet resources that can help you build a virtual machine-based Hadoop distributed file system with the R programming platform, and the establishment of easy-to-follow, step-by-step instructions to build the RevolutionAnalytics’ RHadoop system for your big data computing environment. The objectives also include the presentation of simple examples to test the system and ensure that the Hadoop system works. A brief discussion on setting up a multinode Hadoop system is also presented.

Abstract: The main objective of this chapter is to explain the MapReduce framework based on RevolutionAnalytics’ RHadoop environment. The MapReduce framework relies on its underlying structures, the parametrization and the parallelization. These structures are explained clearly in this chapter. The implementation of these structures requires a MapReduce programming platform. An explanation of this programming platform is also presented together with a discussion of the three important functions, mapper(), reducer(), and mapreduce(). These functions help implement the parametrization and parallelization structures to address scalability problems in big data classification. The chapter also presents a set of coding principles, which provide good programming practices to the users of the MapReduce programming platform in the context of big data processing and analysis. Several programming examples are also presented to help the reader practice the coding principles and better understand the MapReduce framework.
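As a rough illustration of how a mapper, a reducer, and a driver fit together, the following is a minimal single-process sketch in Python. It is not the RHadoop API: the function names only echo the mapper(), reducer(), and mapreduce() roles named above, and the word-count task and input records are assumptions chosen for illustration.

```python
from collections import defaultdict

def mapper(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]

def reducer(key, values):
    """Combine all values that share a key: here, sum the counts."""
    return key, sum(values)

def mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce stages."""
    groups = defaultdict(list)
    for record in records:                  # map stage
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce stage: one reducer call per distinct key
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical input records, used only to exercise the sketch.
counts = mapreduce(["big data", "big models"], mapper, reducer)
```

In a real distributed setting the map calls and the per-key reduce calls would run in parallel on different nodes; the sketch only shows the data flow between the stages.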

Abstract: The main objective of this chapter is to explain the machine learning concepts, mainly modeling and algorithms; batch learning and online learning; and supervised learning (regression and classification) and unsupervised learning (clustering) using examples. Modeling and algorithms will be explained based on the domain division characteristics, batch learning and online learning will be explained based on the availability of the data domain, and supervised learning and unsupervised learning will be explained based on the labeling of the data domain. This objective will be extended to the comparison of the mathematical models, hierarchical models, and layered models, using programming structures, such as control structures, modularization, and sequential statements.

Abstract: The main objective of this chapter is to discuss various supervised learning models in detail. The supervised learning models provide a parametrized mapping that projects a data domain into a response set, and thus help extract knowledge (known) from data (unknown). These learning models, in simple form, can be grouped into predictive models and classification models. Firstly, the predictive models, such as the standard regression, ridge regression, lasso regression, and elastic-net regression, are discussed in detail with their mathematical and visual interpretations using simple examples. Secondly, the classification models are discussed and grouped into three types: mathematical models, hierarchical models, and layered models. Also discussed are the mathematical models, such as the logistic regression and support vector machine; the hierarchical models, like the decision tree and the random forest; and the layered models, like deep learning. They are discussed only from the modeling point of view, and they will be discussed in detail together with their algorithms in separate chapters later in the book.

Abstract: Supervised learning algorithms help the learning models to be trained efficiently, so that they can provide high classification accuracy. In general, the supervised learning algorithms support the search for optimal values for the model parameters by using large data sets without overfitting the model. Therefore, a careful design of the learning algorithms with systematic approaches is essential. The machine learning field suggests three phases for the design of a supervised learning algorithm: training phase, validation phase, and testing phase. Hence, it recommends three divisions (or subsets) of the data sets to carry out these tasks. It also suggests defining or selecting suitable performance evaluation metrics to train, validate, and test the supervised learning models. Therefore, the objectives of this chapter are to discuss these three phases of a supervised learning algorithm and the three performance evaluation metrics called domain division, classification accuracy, and oscillation characteristics. The chapter objectives include the introduction of five new performance evaluation metrics called delayed learning, sporadic learning, deteriorate learning, heedless learning, and stabilized learning, which can help to measure classification accuracy under oscillation characteristics.
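The three-way division of a data set and a basic accuracy measure can be sketched as follows. This is only a minimal illustration in Python: the 60/20/20 proportions, the function names, and the fixed seed are assumptions, not values taken from the chapter, and the other metrics named above are not covered.

```python
import random

def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, validation, and test subsets.

    The fractions are illustrative assumptions; whatever remains after the
    training and validation subsets becomes the test subset.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)       # seeded for reproducibility
    n_train = round(len(rows) * train_frac)
    n_valid = round(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test

def accuracy(predicted, actual):
    """Fraction of predictions that match the true class labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

train, valid, test = split_dataset(range(10))
```

The three subsets are disjoint and together cover the whole data set, so parameters fitted on the training subset can be tuned on the validation subset and finally assessed on data the model has never seen.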

Abstract: The support vector machine is one of the classical machine learning techniques that can still help solve big data classification problems. In particular, it can help multidomain applications in a big data environment. However, the support vector machine is mathematically complex and computationally expensive. The main objective of this chapter is to simplify this approach using process diagrams and data flow diagrams to help readers understand the theory and implement it successfully. To achieve this objective, the chapter is divided into three parts: (1) modeling of a linear support vector machine; (2) modeling of a nonlinear support vector machine; and (3) the Lagrangian support vector machine algorithm and its implementations. The Lagrangian support vector machine is also implemented, with simple examples, using the R programming platform on Hadoop and non-Hadoop systems.

Abstract: The main objective of this chapter is to introduce you to hierarchical supervised learning models. One of the main hierarchical models is the decision tree. It has two categories: classification tree and regression tree. The theory and applications of these decision trees are explained in this chapter. These techniques require tree split algorithms to build the decision trees and require quantitative measures to build an efficient tree via training. Hence, the chapter dedicates some discussion to measures such as entropy, cross-entropy, Gini impurity, and information gain. It also discusses the training algorithms suitable for classification tree and regression tree models. Simple examples and visual aids explain the difficult concepts so that readers can easily grasp the theory and applications of decision trees.
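Three of the split measures named above can be written down in a few lines. The sketch below uses the standard textbook definitions of entropy (in bits), Gini impurity, and information gain for a binary split; the function names and the toy labels are assumptions for illustration, and cross-entropy is left out.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Relative frequency of each class among the labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    return -sum(p * log2(p) for p in class_probabilities(labels))

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p^2) over the classes present."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels: a perfectly mixed parent node and a perfect split.
parent = ["a", "a", "b", "b"]
gain = information_gain(parent, ["a", "a"], ["b", "b"])
```

A tree split algorithm would evaluate such a gain for every candidate split and keep the one with the largest value.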

Abstract: The main objective of this chapter is to introduce you to the random forest supervised learning model. The random forest technique uses the decision tree model for parametrization, but it integrates a sampling technique, a subspace method, and an ensemble approach to optimize the model building. The sampling approach is called the bootstrap, which adopts a random sampling approach with replacement. The subspace method also adopts a random sampling approach, but it helps extract smaller subsets (i.e., subspaces) of features. It also helps build decision trees based on them and select decision trees for the random forest construction. The ensemble approach helps build classifiers based on the so-called bagging approach. The objectives of this chapter include detailed discussions of these approaches. The chapter also discusses the training and testing algorithms that are suitable for random forest supervised learning, and presents simple examples and visual aids to help readers better understand the technique.
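The three ingredients described above can be sketched independently of any tree-building code. The following minimal Python illustration shows bootstrap sampling with replacement, random selection of a feature subspace, and a majority vote as one common way to aggregate an ensemble; the function names, the toy rows, and the seed are assumptions, and no decision tree is actually trained here.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random WITH replacement (rows may repeat)."""
    return [rng.choice(rows) for _ in rows]

def random_subspace(n_features, k, rng):
    """Pick k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate the class predictions of several trees into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
# Hypothetical rows: two feature values followed by a class label.
rows = [(1.0, 2.0, 0), (1.5, 1.8, 0), (5.0, 8.0, 1), (6.0, 9.0, 1)]
sample = bootstrap_sample(rows, rng)
features = random_subspace(n_features=2, k=1, rng=rng)
label = majority_vote(["a", "b", "a"])
```

Each tree of a forest would be trained on its own `sample` restricted to its own `features`, and the forest's prediction would be the vote over all trees.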

Abstract: The main objective of this chapter is to discuss the modern deep learning techniques, called the no-drop, the dropout, and the dropconnect, in detail and provide programming examples that help you clearly understand these approaches. These techniques depend heavily on the stochastic gradient descent approach, which is also discussed in detail with simple iterative examples. These parametrized deep learning techniques are also dependent on two parameters (weights), and the initial values of these parameters can significantly affect the deep learning models; therefore, a simple approach is presented to enhance the classification accuracy and improve computing performance using perceptual weights. The approach is called the perceptually inspired deep learning framework, and it incorporates edge-sharpening filters and their frequency responses for the classifier and the connector parameters of the deep learning models. They preserve class characteristics and regularize the deep learning model parameters.
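To give a feel for the iterative nature of stochastic gradient descent, here is a minimal Python sketch that fits a straight line to a handful of points, updating the two parameters after every single sample. The model, learning rate, number of epochs, and data are all assumptions chosen for illustration; this is not a deep learning model and not the chapter's own example.

```python
import random

def sgd_fit(points, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by stochastic gradient descent.

    Each update uses one sample and follows the gradient of half the
    squared error, 0.5 * (w*x + b - y)**2, with respect to w and b.
    """
    rng = random.Random(seed)
    points = list(points)
    w, b = 0.0, 0.0                      # initial parameter values
    for _ in range(epochs):
        rng.shuffle(points)              # visit samples in random order
        for x, y in points:
            error = (w * x + b) - y
            w -= lr * error * x          # gradient with respect to w
            b -= lr * error              # gradient with respect to b
    return w, b

# Hypothetical noise-free data generated from y = 2x + 1.
w, b = sgd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

Because the data are noise-free, the estimates approach w = 2 and b = 1; with noisy data the parameters would instead hover around the least-squares solution, which is one reason the initial values and the learning rate matter in practice.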

Abstract: This chapter proposes two new techniques called the chandelier decision tree and the random chandelier. This pair of techniques is similar to the well-known pair of techniques, the decision tree and the random forest. The chapter also presents a previously proposed algorithm called the unit circle algorithm (UCA) and proposes a family of UCA-based algorithms called the unit circle machine (UCM), unit ring algorithm (URA), and unit ring machine (URM). The unit circle algorithm integrates a normalization process to define a unit circle domain, and thus the other proposed algorithms adopt the phrase “unit circle.” The chandelier decision tree and the random chandelier use the unit ring machine to build the chandelier trees.

Abstract: The main objective of this chapter is to explain two important dimensionality reduction techniques, feature hashing and principal component analysis, that can support scaling-up machine learning. The standard and flagged feature hashing approaches are explained in detail. The feature hashing approach suffers from the hash collision problem, which is also reported and discussed in detail in this chapter. Two collision controllers, feature binning and feature mitigation, are proposed in this chapter to address this problem. The principal component analysis uses the concepts of eigenvalues and eigenvectors, and these terms are explained in detail with examples. The principal component analysis is also explained using a simple two-dimensional example, and several coding examples are presented.
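The hash collision problem mentioned above is easy to reproduce. The following minimal Python sketch implements a plain form of feature hashing, in which each feature name is hashed to one of a fixed number of buckets and colliding features are summed. The choice of MD5 as the hash, the function names, and the feature names are assumptions for illustration; the flagged variant and the two collision controllers are not shown.

```python
import hashlib

def bucket_of(feature, n_buckets):
    """Map a feature name to a bucket index using a stable hash (MD5 here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_features(features, n_buckets):
    """Fold a {name: value} dictionary into a vector of fixed length n_buckets."""
    vector = [0.0] * n_buckets
    for name, value in features.items():
        vector[bucket_of(name, n_buckets)] += value   # colliding features add up
    return vector

def collisions(names, n_buckets):
    """Return the groups of feature names that share a bucket."""
    buckets = {}
    for name in names:
        buckets.setdefault(bucket_of(name, n_buckets), []).append(name)
    return {b: group for b, group in buckets.items() if len(group) > 1}

# Hypothetical feature names and values.
vector = hash_features({"age": 1.0, "height": 2.0, "weight": 3.0}, n_buckets=4)
clashes = collisions(["age", "height", "weight", "income", "zip"], n_buckets=2)
```

With five names and only two buckets at least one collision is unavoidable, which shows the trade-off: fewer buckets mean a lower dimension but more features whose values become indistinguishable.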

Preparation of a cluster node with Ubuntu

Installation of Hadoop on Ubuntu