
Summer Training Report On “Data Science” Submitted to Kurukshetra University in partial fulfilment of the requirement for the award of the Degree of Bachelor of Technology

ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted by: Buland, 251701150, ECE-B, 7th Sem
Submitted to: Mr. Puneet Bansal, Asst. Prof.

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING, UNIVERSITY INSTITUTE OF ENGINEERING AND TECHNOLOGY, KURUKSHETRA UNIVERSITY, KURUKSHETRA


DECLARATION

I hereby certify that the work presented in the report entitled "Data Science", in fulfilment of the requirement for the completion of one-month industrial training in the Department of Electronics and Communication Engineering of the University Institute of Engineering and Technology, Kurukshetra University, is an authentic record of my own work carried out during the industrial training.

Buland 251701150 ECE-B 7th sem.


ACKNOWLEDGEMENT

The work in this report is the outcome of continuous work over a period of time and drew intellectual support from Internshala and other sources. I would like to articulate my profound gratitude and indebtedness to Internshala, which helped me in the completion of the training. I am thankful to the Internshala Training Associates for teaching and assisting me in making the training successful.

Buland 251701150 ECE-B 7th sem.


Introduction to Organization:

Internshala is an internship and online training platform based in Gurgaon, India. Founded in 2010 by Sarvesh Agrawal, an IIT Madras alumnus, the website helps students find internships with organisations in India. The platform started in 2010 as a WordPress blog which aggregated internships across India and published articles on education, technology and the skill gap. The website was launched in 2013, and Internshala launched its online trainings in 2014. The platform is used by 2.0 million+ students and 70,000+ companies. At the core of the idea is the belief that internships, if managed well, can make a positive difference to the student, to the employer, and to society at large. Hence, the ad-hoc culture surrounding internships in India should and would change, and Internshala aims to be the driver of this change.


About Training:

The Data Science Training by Internshala is a 6-week online training program which aims to provide a comprehensive introduction to data science. In this training program, you learn the basics of Python, statistics, predictive modeling, and machine learning. The program has video tutorials and is packed with assignments, assessment tests, quizzes, and practice exercises for a hands-on learning experience. At the end of this training program, you will have a solid understanding of data science and will be able to build an end-to-end predictive model. For doubt clearing, you can post your queries on the forum and get answers within 24 hours.


Table of Contents

Introduction to Organization
About Training
Module-1: Introduction to Data Science
  1.1. Data Science Overview
Module-2: Python for Data Science
  2.1. Introduction to Python
  2.2. Understanding Operators
  2.3. Variables and Data Types
  2.4. Conditional Statements
  2.5. Looping Constructs
  2.6. Functions
  2.7. Data Structure
  2.8. Lists
  2.9. Dictionaries
  2.10. Understanding Standard Libraries in Python
  2.11. Reading a CSV File in Python
  2.12. Data Frames and basic operations with Data Frames
  2.13. Indexing Data Frame
Module-3: Understanding the Statistics for Data Science
  3.1. Introduction to Statistics
  3.2. Measures of Central Tendency
  3.3. Understanding the spread of data
  3.4. Data Distribution
  3.5. Introduction to Probability
  3.6. Probabilities of Discrete and Continuous Variables
  3.7. Central Limit Theorem and Normal Distribution
  3.8. Introduction to Inferential Statistics
  3.9. Understanding the Confidence Interval and margin of error


  3.10. Hypothesis Testing
  3.11. T tests
  3.12. Chi Squared Tests
  3.13. Understanding the concept of Correlation
Module-4: Predictive Modeling and Basics of Machine Learning
  4.1. Introduction to Predictive Modeling
  4.2. Understanding the types of Predictive Models
  4.3. Stages of Predictive Models
  4.4. Hypothesis Generation
  4.5. Data Extraction
  4.6. Data Exploration
  4.7. Reading the data into Python
  4.8. Variable Identification
  4.9. Univariate Analysis for Continuous Variables
  4.10. Univariate Analysis for Categorical Variables
  4.11. Bivariate Analysis
  4.12. Treating Missing Values
  4.13. How to treat Outliers
  4.14. Transforming the Variables
  4.15. Basics of Model Building
  4.16. Linear Regression
  4.17. Logistic Regression
  4.18. Decision Trees
  4.19. K-means


Module-1: Introduction to Data Science

1.1. Data Science Overview

Data science is the study of data. Just as the biological sciences are the study of biology and the physical sciences are the study of physical reactions, data science studies data: data is real, data has real properties, and we need to study it if we are going to work with it. Data science involves data and some science. It is a process, not an event: the process of using data to understand many different things, to understand the world. Suppose you have a model or a proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. Data science is the skill of unfolding the insights and trends that are hidden (or abstract) behind data. It is when you translate data into a story, using storytelling to generate insight, and with these insights you can make strategic choices for a company or an institution. We can also define data science as a field about the processes and systems used to extract insights from data in various forms and from various sources, whether the data is unstructured or structured.

Predictive modeling:

Predictive modeling is a form of artificial intelligence that uses data mining and probability to forecast or estimate more granular, specific outcomes. For example, predictive modeling could help identify customers who are likely to purchase our new One AI software over the next 90 days.

Machine Learning:

Machine learning is a branch of artificial intelligence (AI) in which computers learn to act and adapt to new data without being explicitly programmed to do so; the computer is able to act independently of human interaction.

Forecasting:

Forecasting is a process of predicting or estimating future events based on past and present data and most commonly by analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This logic is what differentiates it from the magic 8 ball's lucky guess. After all, even a broken watch is right two times a day.


Applications of Data Science:

Data science and big data are making an undeniable impact on businesses, changing day-to-day operations, financial analytics, and especially interactions with customers. It is clear that businesses can gain enormous value from the insights data science can provide, but sometimes it is hard to see exactly how, so let's look at some examples.

In this era of big data, almost everyone generates masses of data every day, often without being aware of it. This digital trace reveals the patterns of our online lives. If you have ever searched for or bought a product on a site like Amazon, you will notice that it starts making recommendations related to your search. This type of system, known as a recommendation engine, is a common application of data science. Companies like Amazon, Netflix, and Spotify use algorithms to make specific recommendations derived from customer preferences and historical behavior. Personal assistants like Siri on Apple devices use data science to devise answers to the infinite number of questions end users may ask. Google watches your every move in the world, your online shopping habits, and your social media, and then analyzes that data to create recommendations for restaurants, bars, shops, and other attractions based on the data collected from your device and your current location. Wearable devices like Fitbits, Apple watches, and Android watches add information about your activity levels, sleep patterns, and heart rate to the data you generate.

Now that we know how consumers generate data, let's take a look at how data science is impacting business. In 2011, McKinsey & Company said that data science was going to become the key basis of competition, supporting new waves of productivity, growth, and innovation. In 2013, UPS announced that it was using data from customers, drivers, and vehicles in a new route guidance system aimed at saving time, money, and fuel. Initiatives like this support the statement that data science will fundamentally change the way businesses compete and operate. How does a firm gain a competitive advantage? Let's take Netflix as an example. Netflix collects and analyzes massive amounts of data from millions of users, including which shows people are watching at what time of day, when people pause, rewind, and fast-forward, and which shows, directors, and actors they search for. Netflix can be confident that a show will be a hit before filming even begins, by analyzing users' preferences for certain directors and acting talent and discovering which combinations people enjoy. Add this to the success of earlier versions of a show and you have a hit. For example, Netflix knew many of its users had streamed the work of David Fincher. They also knew that films featuring Robin Wright had always done well, and that the British version of House of Cards was very successful.


Netflix knew that significant numbers of people who liked Fincher also liked Wright. All this information combined to suggest that buying the series would be a good investment for the company.


Module-2: Python for Data Science

2.1. Introduction to Python

Python is a high-level, general-purpose and very popular programming language. Python (latest Python 3) is being used in web development and machine learning applications, along with all cutting-edge technology in the software industry. Python is well suited for beginners, and also for experienced programmers coming from other languages such as C++ and Java. Below are some facts about the Python programming language:

• Python is currently the most widely used multi-purpose, high-level programming language.
• Python allows programming in Object-Oriented and Procedural paradigms.
• Python programs are generally smaller than programs in other languages like Java. Programmers have to type relatively less, and the indentation requirement of the language keeps the code readable all the time.
• Python is used by almost all tech-giant companies like Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.
• The biggest strength of Python is its huge collection of standard and third-party libraries, which can be used for the following:
  - Machine Learning
  - GUI Applications (like Kivy, Tkinter, PyQt, etc.)
  - Web frameworks like Django (used by YouTube, Instagram, Dropbox)
  - Image processing (like OpenCV, Pillow)
  - Web scraping (like Scrapy, BeautifulSoup, Selenium)
  - Test frameworks
  - Multimedia
  - Scientific computing
  - Text processing and many more.
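The following is a minimal, illustrative snippet (not taken from the training material) showing the indentation-based, concise syntax described above:

# A small self-contained example: a function, a loop and a conditional
def count_even(numbers):
    """Return how many values in the list are even."""
    count = 0
    for n in numbers:
        if n % 2 == 0:
            count += 1
    return count

print(count_even([1, 2, 3, 4, 5, 6]))  # prints 3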


2.2. Understanding Operators

a. Arithmetic Operators:

Arithmetic operators are used to perform mathematical operations like addition, subtraction, multiplication and division.

OPERATOR   DESCRIPTION                                                                      SYNTAX
+          Addition: adds two operands                                                      x + y
-          Subtraction: subtracts two operands                                              x - y
*          Multiplication: multiplies two operands                                          x * y
/          Division (float): divides the first operand by the second                        x / y
//         Division (floor): divides the first operand by the second                        x // y
%          Modulus: returns the remainder when the first operand is divided by the second   x % y
**         Power: returns the first operand raised to the power of the second               x ** y
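A short illustrative session (operand values chosen arbitrarily) demonstrating these operators:

# Arithmetic operators in Python
x, y = 7, 3
print(x + y)   # 10   addition
print(x - y)   # 4    subtraction
print(x * y)   # 21   multiplication
print(x / y)   # 2.3333333333333335   float division
print(x // y)  # 2    floor division
print(x % y)   # 1    modulus (remainder)
print(x ** y)  # 343  power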


b. Relational Operators:

Relational operators compare the values of their operands. They return either True or False according to the condition.

OPERATOR   DESCRIPTION                                                                        SYNTAX
>          Greater than: True if the left operand is greater than the right                   x > y
<          Less than: True if the left operand is less than the right                         x < y
==         Equal to: True if both operands are equal                                          x == y
!=         Not equal to: True if the operands are not equal                                   x != y
>=         Greater than or equal to: True if the left operand is greater than or equal        x >= y
<=         Less than or equal to: True if the left operand is less than or equal              x <= y

Bitwise and Assignment Operators:

The bitwise right shift operator >> shifts the bits of the first operand to the right (syntax: x >> y). Compound assignment operators perform the operation and assign the result to the left operand; for example, a >>= b is equivalent to a = a >> b.

OPERATOR   DESCRIPTION
&=         Performs bitwise AND on the operands and assigns the value to the left operand
|=         Performs bitwise OR on the operands and assigns the value to the left operand
^=         Performs bitwise XOR on the operands and assigns the value to the left operand
>>=        Performs bitwise right shift on the operands and assigns the value to the left operand
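An illustrative snippet (values chosen arbitrarily) showing the relational and compound-assignment operators described above:

# Relational operators return True or False
x, y = 7, 3
print(x > y)    # True
print(x == y)   # False
print(x != y)   # True

# Compound assignment combines an operation with assignment
a = 12          # binary 1100
a >>= 2         # same as a = a >> 2
print(a)        # 3
a &= 2          # same as a = a & 2
print(a)        # 2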

Module-3: Understanding the Statistics for Data Science

3.4 Data Distribution

The distribution of data can be examined using a Boxplot, a Frequency Table, a Histogram and a Density Plot.


Boxplot: It is based on the percentiles of the data. The top and bottom of the boxplot are the 75th and 25th percentiles of the data, and the extended lines, known as whiskers, cover the range of the rest of the data.

Code – Boxplot
# BoxPlot Population In Millions
import matplotlib.pyplot as plt
import seaborn as sns
# 'data' is assumed to be a pandas DataFrame, loaded earlier, with a PopulationInMillions column
fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)
ax1 = sns.boxplot(x = data.PopulationInMillions, orient ="v")
ax1.set_ylabel("Population by State in Millions", fontsize = 15)
ax1.set_title("Population - BoxPlot", fontsize = 20)

Frequency Table: It is a tool to distribute the data into equally spaced ranges (segments) and tells us how many values fall in each segment.

Histogram: It is a way of visualizing the data distribution through a frequency table, with bins on the x-axis and the data count on the y-axis.

Code – Histogram
# Histogram Population In Millions
fig, ax2 = plt.subplots()
fig.set_size_inches(9, 15)
ax2 = sns.distplot(data.PopulationInMillions, kde = False)
ax2.set_ylabel("Frequency", fontsize = 15)
ax2.set_xlabel("Population by State in Millions", fontsize = 15)
ax2.set_title("Population - Histogram", fontsize = 20)



Density Plot: It is related to the histogram as it shows the data values distributed as a continuous line; it is a smoothed version of the histogram, plotted superposed over the histogram.

Code – Density Plot for the data
# Density Plot - Population
fig, ax3 = plt.subplots()
fig.set_size_inches(7, 9)
ax3 = sns.distplot(data.Population, kde = True)
ax3.set_ylabel("Density", fontsize = 15)
ax3.set_xlabel("Population by State", fontsize = 15)
ax3.set_title("Density Plot - Population", fontsize = 20)


3.5 Introduction to Probability

Probability refers to the extent of occurrence of events. When an event occurs, like throwing a ball or picking a card from a deck, there must be some probability associated with that event.


In terms of mathematics, probability refers to the ratio of wanted outcomes to the total number of possible outcomes. There are three approaches to the theory of probability, namely:
1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach

Here, we study the Axiomatic Approach, in which we represent probability in terms of the sample space (S) and other terms.

Basic Terminologies:
• Random Event – If an experiment is repeated several times under similar conditions and it does not produce the same outcome every time, but the outcome of a trial is one of several possible outcomes, then such an experiment is called a random (or probabilistic) event.
• Elementary Event – The elementary event refers to the outcome of each random event performed. Whenever the random event is performed, each associated outcome is known as an elementary event.
• Sample Space – The sample space refers to the set of all possible outcomes of a random event. For example, when a coin is tossed, the possible outcomes are head and tail.
• Event – An event refers to a subset of the sample space associated with a random event.
• Occurrence of an Event – An event associated with a random event is said to occur if any one of the elementary events belonging to it is an outcome.
• Sure Event – An event associated with a random event is said to be a sure event if it always occurs whenever the random event is performed.
• Impossible Event – An event associated with a random event is said to be an impossible event if it never occurs whenever the random event is performed.
• Compound Event – An event associated with a random event is said to be a compound event if it is the disjoint union of two or more elementary events.
• Mutually Exclusive Events – Two or more events associated with a random event are said to be mutually exclusive if the occurrence of any one of them prevents the occurrence of all the others; no two of them can occur simultaneously.


• Exhaustive Events – Two or more events associated with a random event are said to be exhaustive if their union is the sample space.
• Probability of an Event – If there are a total of p possible outcomes associated with a random experiment and q of them are favourable to the event A, then the probability of event A is denoted by P(A) and is given by P(A) = q/p.

3.6 Probabilities of Discrete and Continuous Variables

A random variable is basically a function which maps from the sample space to the set of real numbers. The purpose is to get an idea about the result of a particular situation where we are given the probabilities of the different outcomes.

Discrete Random Variable: A random variable X is said to be discrete if it takes on a finite number of values. The probability function associated with it is called the PMF (Probability Mass Function). P(xi) = probability that X = xi = PMF of X = pi, with:
1. 0 ≤ pi ≤ 1
2. ∑ pi = 1, where the sum is taken over all possible values of x.

Continuous Random Variable: A random variable X is said to be continuous if it takes on an infinite number of values. The probability function associated with it is called the PDF (Probability Density Function). If X is a continuous random variable, then P(x < X < x + dx) = f(x) dx, with:
1. f(x) ≥ 0 for all x
2. ∫ f(x) dx = 1 over all values of x.

Then f(x) is said to be the PDF of the distribution.
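As an illustrative sketch (not part of the original material; the binomial and normal distributions are arbitrary choices, and the scipy library is assumed to be available), scipy.stats can evaluate a PMF for a discrete random variable and a PDF for a continuous one:

from scipy import stats

# Discrete random variable: number of heads in 10 fair coin tosses
X = stats.binom(n=10, p=0.5)
print(X.pmf(5))                      # P(X = 5)
print(X.pmf(range(11)).sum())        # the PMF values sum to 1

# Continuous random variable: standard normal
Z = stats.norm(loc=0, scale=1)
print(Z.pdf(0.0))                    # density at 0 (about 0.3989)
print(Z.cdf(1.0) - Z.cdf(-1.0))      # P(-1 < Z < 1), about 0.6827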


3.7 Central Limit Theorem and Normal Distribution

Whenever a random experiment is replicated, the random variable that equals the average (or total) result over the replicates tends to have a normal distribution as the number of replicates becomes large. The normal distribution is one of the cornerstones of probability theory and statistics because of the role it plays in the Central Limit Theorem, and because many real-world phenomena involve random quantities that are approximately normal (e.g., errors in scientific measurement). It is also known by other names such as the Gaussian distribution or bell-shaped distribution.

The graph of the normal distribution is symmetric about its center, which is also the mean (0 in the standard case). This makes events at equal deviations from the mean equally probable. The density is highly concentrated around the mean, which translates to lower probabilities for values away from the mean.

Probability Density Function – the probability density function of the general normal distribution is given as:

f(x) = (1 / (σ * sqrt(2π))) * e^( -(x - μ)² / (2σ²) )

In the above formula the symbols have their usual meanings: σ is the standard deviation and μ is the mean. It is easy to get overwhelmed by this formula while trying to understand everything in one glance, but we can break it down into smaller pieces to get an intuition for what is going on. The z-score is a measure of how many standard deviations away a data point is from the mean. Mathematically,

z = (x - μ) / σ

The exponent of e in the above formula is the square of the z-score times -1/2. This is in accordance with the observations made above: values away from the mean have a higher z-score and consequently a lower probability, since the exponent is negative, while the opposite is true for values closer to the mean. This gives way to the 68-95-99.7 rule, which states that the percentages of values lying within bands around the mean with widths of two, four and six standard deviations are 68%, 95% and 99.7% of all the values.

The effects of μ and σ on the distribution: μ repositions the center of the distribution, moving the graph left or right, and σ flattens or inflates the curve.
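A small illustrative check (not from the training material) of the 68-95-99.7 rule using scipy:

from scipy import stats

mu, sigma = 0, 1
Z = stats.norm(loc=mu, scale=sigma)
for k in (1, 2, 3):
    prob = Z.cdf(mu + k * sigma) - Z.cdf(mu - k * sigma)
    print(f"Within {k} standard deviation(s) of the mean: {prob:.4f}")
# Prints approximately 0.6827, 0.9545 and 0.9973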


3.8 Introduction to Inferential Statistics

Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes from a large dataset and applies probability theory to draw conclusions; it is used to analyze and interpret results, draw conclusions, and explain the meaning of descriptive statistics. Inferential statistics is mainly associated with hypothesis testing, whose main task is to test whether the null hypothesis can be rejected. Hypothesis testing is a type of inferential procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a population. Inferential statistics is generally used to determine how strong a relationship is within the sample, since it is very difficult to obtain a full population list and draw a random sample from it.

Types of inferential statistics – various types of inferential statistics are widely used nowadays and are easy to interpret. These include:
• One sample test of difference / one sample hypothesis test
• Confidence Interval
• Contingency Tables and Chi-Square Statistic
• T-test or ANOVA


3.9 Understanding the Confidence Interval and margin of error

In simple terms, a confidence interval is a range within which we are fairly certain the true value lies. The selection of a confidence level for an interval determines the probability that the confidence interval will contain the true parameter value. This range of values is generally used to deal with population-based data, extracting specific, valuable information with a certain amount of confidence, hence the term "Confidence Interval".
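An illustrative computation (the sample values are invented for the example) of a 95% confidence interval and margin of error for a sample mean, using the t distribution from scipy:

import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
mean = sample.mean()
sem = stats.sem(sample)        # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
margin_of_error = (high - low) / 2
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f}), "
      f"margin of error = {margin_of_error:.2f}")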

3.10 Hypothesis Testing

Hypotheses are statements about the given problem. Hypothesis testing is a statistical method used to make a statistical decision using experimental data; it is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

Parameters of hypothesis testing:

• Null hypothesis (H0): In statistics, the null hypothesis is a general given statement or default position that there is no relationship between two measured cases or no relationship among groups.








In other words, it is a basic assumption made based on knowledge of the problem. Example: a company's production is 50 units per day.

• Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis. Example: a company's production is not equal to 50 units per day.
• Level of significance: It refers to the degree of significance at which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting a hypothesis, we select a level of significance, usually 5%. It is normally denoted by α and is generally 0.05 or 5%, which means your output should be 95% likely to give a similar result in each sample.
• P-value: The P value, or calculated probability, is the probability of finding the observed or more extreme results when the null hypothesis (H0) of the given problem is true. If the P-value is less than the chosen significance level, you reject the null hypothesis, i.e. you accept that your sample supports the alternative hypothesis.
• Errors in hypothesis testing:
  - Type I error: we reject the null hypothesis although it was true. The Type I error rate is denoted by alpha (α).
  - Type II error: we accept the null hypothesis although it is false. The Type II error rate is denoted by beta (β).

3.11 T tests

A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. A large t-value suggests that the two groups are different; a small t-value suggests that they belong to the same group. There are three types of t-tests, and they are categorized as dependent and independent t-tests:
1. Independent samples t-test: compares the means of two groups.
2. Paired sample t-test: compares means from the same group at different times (say, one year apart).
3. One sample t-test: tests the mean of a single group against a known mean.
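An illustrative use of scipy (synthetic data generated only for demonstration) for the three kinds of t-test listed above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=53, scale=5, size=30)
before = rng.normal(loc=70, scale=8, size=25)
after = before + rng.normal(loc=2, scale=3, size=25)

print(stats.ttest_ind(group_a, group_b))        # independent samples t-test
print(stats.ttest_rel(before, after))           # paired sample t-test
print(stats.ttest_1samp(group_a, popmean=50))   # one sample t-test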


3.12 Chi Squared Tests

The chi-square test is used for categorical features in a dataset. We calculate the chi-square statistic between each feature and the target and select the desired number of features with the best chi-square scores. It determines whether the association between two categorical variables of the sample would reflect their real association in the population. The chi-square score is given by:

χ² = Σ (Observed frequency - Expected frequency)² / Expected frequency
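A sketch (toy data set, not the training's own example) of chi-square based feature selection with scikit-learn's SelectKBest:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # features are non-negative, as chi2 requires
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)                    # chi-square score of each feature vs. the target
print(X_selected.shape)                    # only the two best-scoring features are kept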

3.13 Understanding the concept of Correlation

Correlation:
1. It shows whether and how strongly pairs of variables are related to each other.
2. Correlation takes values between -1 and +1, where values close to +1 represent a strong positive correlation and values close to -1 represent a strong negative correlation.
3. In correlation, the variables are mutually related to each other; neither is designated as dependent.
4. It gives the direction and strength of the relationship between variables.

Formula (Pearson correlation coefficient):

r = Σ (xi - x')(yi - y') / sqrt( Σ (xi - x')² * Σ (yi - y')² )

Here, x' and y' = means of the given sample sets, n = total number of samples, and xi and yi = individual samples of the sets.
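An illustrative computation (sample values invented for the example) of the Pearson correlation coefficient, both from the formula above and with numpy's helper:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 5, 4, 5, 7])

# Pearson correlation via the formula above
num = ((x - x.mean()) * (y - y.mean())).sum()
den = np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(num / den)

# Same result from numpy's built-in helper
print(np.corrcoef(x, y)[0, 1])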


Module-4: Predictive Modeling and Basics of Machine Learning

4.1. Introduction to Predictive Modeling

Predictive analytics involves certain manipulations on data from existing data sets with the goal of identifying new trends and patterns. These trends and patterns are then used to predict future outcomes and trends. By performing predictive analysis, we can predict future trends and performance. It is also called prognostic analysis; the word prognostic means prediction. Predictive analytics uses data, statistical algorithms and machine learning techniques to identify the probability of future outcomes based on historical data.

4.2. Understanding the types of Predictive Models

Supervised learning: Supervised learning, as the name indicates, implies the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data which is well labelled, meaning that the data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labelled data.

Unsupervised learning: Unsupervised learning is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.

4.3. Stages of Predictive Models

Steps to Perform Predictive Analysis: Some basic steps should be performed in order to carry out predictive analysis.

1. Define Problem Statement: Define the project outcomes, the scope of the effort and the objectives, and identify the data sets that are going to be used.
2. Data Collection: Data collection involves gathering the necessary details required for the analysis.


It involves the historical or past data, from an authorized source, over which the predictive analysis is to be performed.
3. Data Cleaning: Data cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data, including redundant and duplicate data.
4. Data Analysis: It involves the exploration of data. We explore the data and analyze it thoroughly in order to identify patterns or new outcomes from the data set. In this stage, we discover useful information and conclude by identifying patterns or trends.
5. Build Predictive Model: In this stage, we use various algorithms to build predictive models based on the patterns observed. It requires knowledge of Python, R, statistics, MATLAB and so on. We also test our hypothesis using standard statistical models.
6. Validation: It is a very important step in predictive analysis. Here we check the efficiency of our model by performing various tests, providing sample inputs to check the validity of the model. The model is evaluated for its accuracy in this stage.
7. Deployment: In deployment we make our model work in a real environment so that it helps in everyday decision making and is available for use.
8. Model Monitoring: Regularly monitor the model to check its performance and ensure proper results, i.e. see how the model's predictions perform against actual data.
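A compressed, hypothetical sketch of the build and validation stages using scikit-learn (the data set and model choice are placeholders, not the training's own example):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
# Partition the data so a held-out test set can be used for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Build the predictive model (feature scaling followed by logistic regression)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Validation: check the model's accuracy on unseen data
print(accuracy_score(y_test, model.predict(X_test)))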

4.4. Hypothesis Generation

A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space and a hypothesis, consider a coordinate plot that shows the distribution of some data.


4.5. Data Extraction

In general terms, "mining" is the process of extracting some valuable material from the earth, e.g. coal mining, diamond mining, etc. In the context of computer science, "Data Mining" refers to the extraction of useful information from a bulk of data or from data warehouses. One can see that the term itself is a little confusing: in the case of coal or diamond mining, the result of the extraction process is coal or diamond, but in the case of data mining, the result of the extraction process is not data. Instead, the result of data mining is the patterns and knowledge that we gain at the end of the extraction process. In that sense, data mining is also known as Knowledge Discovery or Knowledge Extraction.

Data Mining as a whole process – the whole process of data mining comprises three main phases:
1. Data Pre-processing – data cleaning, integration, selection and transformation take place.
2. Data Extraction – the actual data mining occurs.
3. Data Evaluation and Presentation – analyzing and presenting the results.

4.6. Data Exploration


Steps of Data Exploration and Preparation

Remember, the quality of your inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning and preparation can take up to 70% of your total project time. Below are the steps involved to understand, clean and prepare your data for building your predictive model:

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation

Finally, we will need to iterate over steps 4-7 multiple times before we come up with our refined model.

4.7. Reading the data into Python

Python provides inbuilt functions for creating, writing and reading files. There are two types of files that can be handled in Python: normal text files and binary files (written in binary language, 0s and 1s).

• Text files: In this type of file, each line of text is terminated with a special character called EOL (End of Line), which is the newline character ('\n') in Python by default.
• Binary files: In this type of file, there is no terminator for a line and the data is stored after converting it into machine-understandable binary language.

Access modes govern the type of operations possible in the opened file; they refer to how the file will be used once it is opened. These modes also define the location of the file handle in the file. The file handle is like a cursor, which defines from where the data has to be read or written in the file. Different access modes for reading a file are:

1. Read Only ('r'): Open a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, an I/O error is raised. This is also the default mode in which a file is opened.
2. Read and Write ('r+'): Open the file for reading and writing. The handle is positioned at the beginning of the file. Raises an I/O error if the file does not exist.


3. Append and Read (‘a+’) : Open the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file. The data being written will be inserted at the end, after the existing data.
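An illustrative sketch (the file names are hypothetical placeholders) of opening a text file with the default read mode and of reading a CSV file into a pandas DataFrame:

import pandas as pd

# Reading a plain text file with the default 'r' mode
with open("notes.txt", "r") as f:      # 'notes.txt' is a placeholder file name
    for line in f:
        print(line.strip())

# Reading a CSV file into a DataFrame
data = pd.read_csv("train.csv")        # 'train.csv' is a placeholder file name
print(data.shape)                      # number of rows and columns
print(data.head())                     # first five records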

4.8. Variable Identification

First, identify the Predictor (input) and Target (output) variables. Next, identify the data type and category of each variable. Example: suppose we want to predict whether students will play cricket or not. Here you need to identify the predictor variables, the target variable, the data type of each variable and the category of each variable.

The variables are then classified into the different categories (predictor vs. target, and by data type).
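A hedged illustration with a made-up miniature version of the play-cricket data set, showing how pandas can help identify variable data types and categories:

import pandas as pd

df = pd.DataFrame({
    "Gender":      ["M", "F", "M", "F"],
    "Age":         [14, 15, 13, 14],
    "Height":      [5.2, 5.1, 5.4, 5.0],
    "PlayCricket": ["Y", "N", "Y", "N"],   # target variable
})

print(df.dtypes)                                                     # data type of each variable
print("Categorical:", df.select_dtypes("object").columns.tolist())
print("Continuous:", df.select_dtypes("number").columns.tolist())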


4.9. Univariate Analysis for Continuous Variables

Continuous variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods.

Note: Univariate analysis is also used to highlight missing and outlier values. Methods to handle missing and outlier values are covered later in this report.
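An illustrative one-variable summary (values invented for the example) of central tendency and spread with pandas:

import pandas as pd

height = pd.Series([5.2, 5.1, 5.4, 5.0, 5.6, 5.3])   # a made-up continuous variable
print(height.describe())       # count, mean, std, min, quartiles, max
print("Median:", height.median())
print("IQR:", height.quantile(0.75) - height.quantile(0.25))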

4.10. Univariate Analysis for Categorical Variables

For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization.
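An illustrative frequency table for a made-up categorical variable using pandas:

import pandas as pd

gender = pd.Series(["M", "F", "M", "M", "F", "M"])   # a made-up categorical variable
print(gender.value_counts())                  # Count per category
print(gender.value_counts(normalize=True))    # Count% per category
gender.value_counts().plot(kind="bar")        # bar chart (requires matplotlib)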

4.11. Bivariate Analysis Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous and Continuous & Continuous. Different methods are used to tackle these combinations during analysis process. Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of scatter plot indicates the relationship between variables. The relationship can be linear or non-linear.


A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation, which varies between -1 and +1:

• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: no correlation

Correlation can be derived using the following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X) * Var(Y) )

Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns the correlation between two variables, and SAS uses the procedure PROC CORR. These functions return the Pearson correlation value to identify the relationship between two variables:


In the above example, we have a good positive relationship (0.65) between the two variables X and Y.

Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:

• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
• Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square distribution with the degrees of freedom.
  - Probability of 0: it indicates that both categorical variables are dependent.
  - Probability of 1: it shows that both variables are independent.
  - Probability less than 0.05: it indicates that the relationship between the variables is significant at 95% confidence.

The chi-square test statistic for a test of independence of two categorical variables is found by:

χ² = Σ (O - E)² / E

where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed for each cell as:

E = (row total × column total) / sample size

From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2), then dividing by the sample size (81). This procedure is conducted for each cell.

Statistical measures used to analyze the power of the relationship are:
• Cramer's V for nominal categorical variables
• Mantel-Haenszel chi-square for ordinal categorical variables

Different data science languages and tools have specific methods to perform the chi-square test. In SAS, we can use Chisq as an option with Proc freq to perform this test.

Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the number of levels is small, the plots will not show the statistical significance. To look at the statistical significance we can perform a Z-test, T-test or ANOVA.



• Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other or not. If the probability of Z is small, then the difference between the two averages is more significant. The T-test is very similar to the Z-test, but it is used when the number of observations for both categories is less than 30.



• ANOVA: It assesses whether the averages of more than two groups are statistically different.
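A hedged sketch (counts and group values invented for illustration) of these tests with scipy: a chi-square test of independence for two categorical variables, and a one-way ANOVA for a categorical vs. continuous pair:

import numpy as np
from scipy import stats

# Chi-square test of independence on a two-way table of observed counts
observed = np.array([[20, 15],
                     [30, 16]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p)
print(expected)    # expected counts: (row total * column total) / sample size

# One-way ANOVA: does a continuous measure differ across three category levels?
group1 = [23, 25, 27, 22, 26]
group2 = [30, 31, 29, 32, 28]
group3 = [24, 26, 25, 27, 23]
print(stats.f_oneway(group1, group2, group3))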


4.12. Treating Missing Values

Why is missing values treatment required? Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analysed the behaviour and relationships with other variables correctly. It can lead to wrong prediction or classification.

Consider the play-cricket example: before treating the missing values, the inference from the data set is that the chances of playing cricket are higher for males than for females. After treating the missing values (based on gender), we can see that females have a higher chance of playing cricket compared to males.

Why does my data have missing values? We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons for the occurrence of these missing values. They may occur at two stages:

1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check for correct data with the data guardians. Some hashing procedures can also be used to make sure the data extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data Collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:

• Missing completely at random: This is a case when the probability of a missing value is the same for all observations. For example, respondents in a data collection process decide to declare their earnings after tossing a fair coin: if a head occurs, the respondent declares his or her earnings, and vice versa. Here each observation has an equal chance of a missing value.
• Missing at random: This is a case when the variable is missing at random and the missing ratio varies for different values or levels of other input variables. For example, while collecting data for age, females have a higher missing rate compared to males.
• Missing that depends on unobserved predictors: This is a case when the missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic causes discomfort, then there is a higher chance of dropout from the study. This missing value is not at random unless we have included "discomfort" as an input variable for all patients.
• Missing that depends on the missing value itself: This is a case when the probability of a missing value is directly correlated with the missing value itself. For example, people with higher or lower incomes are more likely to give a non-response about their earnings.

Which are the methods to treat missing values?

1. Deletion: It is of two types: list wise deletion and pair wise deletion.
   o In list wise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
   o In pair wise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis; a disadvantage is that it uses different sample sizes for different variables.
   Deletion methods are used when the nature of the missing data is "missing completely at random"; otherwise, non-random missing values can bias the model output.

2. Mean / Mode / Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or the mode (qualitative attribute) of all known values of that variable. It can be of two types:
   o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of that variable and then replace the missing values with it. For example, if the variable "Manpower" has missing values, we take the average of all non-missing values of "Manpower" (28.33) and then replace the missing values with it.
   o Similar case Imputation: In this case, we calculate the average for gender "Male" (29.75) and "Female" (25) individually over the non-missing values, and then replace the missing values based on gender: for "Male" we replace missing values of Manpower with 29.75, and for "Female" with 25.

3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modeling techniques to do this. There are two drawbacks to this approach:
   o The model-estimated values are usually more well-behaved than the true values.
   o If there are no relationships between the attribute with missing values and the other attributes in the data set, then the model will not be precise for estimating missing values.

4. KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function. It has certain advantages and disadvantages.
   o Advantages:
     - k-nearest neighbour can predict both qualitative and quantitative attributes.
     - Creation of a predictive model for each attribute with missing data is not required.
     - Attributes with multiple missing values can be easily treated.
     - The correlation structure of the data is taken into consideration.
   o Disadvantages:
     - The KNN algorithm is very time-consuming when analyzing a large database; it searches through the entire dataset looking for the most similar instances.
     - The choice of the k-value is very critical. A higher value of k would include attributes which are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
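An illustrative sketch (toy data and hypothetical column names) of the deletion and imputation strategies with pandas and scikit-learn:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Gender":   ["Male", "Female", "Male", "Female", "Male"],
    "Manpower": [30.0, 25.0, np.nan, np.nan, 29.5],
    "Sales":    [310, 280, 305, 260, 300],
})

# 1. Deletion (list wise): drop any row that has a missing value
print(df.dropna())

# 2. Generalized imputation: fill with the overall mean
print(df["Manpower"].fillna(df["Manpower"].mean()))

# 2b. Similar case imputation: fill with the gender-wise mean
print(df.groupby("Gender")["Manpower"].transform(lambda s: s.fillna(s.mean())))

# 4. KNN imputation on the numeric columns
print(KNNImputer(n_neighbors=2).fit_transform(df[["Manpower", "Sales"]]))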

4.13. How to treat Outliers

Outlier is a term commonly used by analysts and data scientists because it needs close attention, else it can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from and diverges from the overall pattern in a sample. The most commonly used method to detect outliers is visualization, using methods like the Box-plot, Histogram and Scatter Plot (above, we have used the box plot and scatter plot for visualization). Some analysts also use various thumb rules to detect outliers. Some of them are:

• Any value which is beyond the range of -1.5 x IQR to 1.5 x IQR.
• Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier.
• Data points three or more standard deviations away from the mean are considered outliers.
• Outlier detection is merely a special case of the examination of data for influential data points, and it also depends on the business understanding.
• Bivariate and multivariate outliers are typically measured using an index of influence, leverage, or distance. Popular indices such as Mahalanobis' distance and Cook's D are frequently used to detect outliers. In SAS, we can use PROC Univariate and PROC SGPLOT; to identify outliers and influential observations, we also look at statistical measures like STUDENT, COOKD, RSTUDENT and others.

How to remove outliers? Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:


Deleting observations: We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the decision tree algorithm deals with outliers well due to the binning of variables. We can also assign weights to different observations.

Imputing: As with the imputation of missing values, we can also impute outliers, using mean, median or mode imputation methods. Before imputing values, we should analyse whether an outlier is natural or artificial. If it is artificial, we can go ahead and impute it. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.

Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat them as a separate group, build individual models for both groups and then combine the output.
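A short illustrative sketch (invented values) of detecting outliers with the 1.5 x IQR thumb rule and capping them:

import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 13, 90, 12, 15, 11])   # 90 is an artificial outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers:", values[(values < lower) | (values > upper)].tolist())

# Capping: clip extreme values to the IQR-based bounds
print(values.clip(lower=lower, upper=upper).tolist())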

4.14. Transforming the Variables

Variable transformation is used in the following situations:

• When we want to change the scale of a variable or standardize its values for better understanding. This transformation is a must if you have data on different scales, and it does not change the shape of the variable's distribution.
• When we want to transform complex non-linear relationships into linear relationships. The existence of a linear relationship between variables is easier to comprehend than a non-linear or curved relation, and transformation helps us convert a non-linear relation into a linear one. A scatter plot can be used to find the relationship between two continuous variables. These transformations also improve prediction. Log transformation is one of the transformation techniques commonly used in these situations.





• When a symmetric distribution is preferred over a skewed distribution, as it is easier to interpret and generate inferences from. Some modeling techniques require a normal distribution of variables, so whenever we have a skewed distribution we can use transformations which reduce skewness. For a right-skewed distribution, we take the square root, cube root or logarithm of the variable; for a left-skewed distribution, we take the square, cube or exponential of the variable.

• Variable transformation is also done from an implementation point of view (human involvement). In one of my projects on employee performance, I found that age has a direct correlation with the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint, launching an age-based programme might present a challenge. However, categorizing the sales agents into three age-group buckets (under 30 years, 30 to 45 years, and over 45 years) and then formulating three different strategies for each group is a judicious approach. This categorization technique is known as Binning of Variables.
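A brief illustrative sketch (invented values) of a log transformation for a right-skewed variable and of binning with pandas:

import numpy as np
import pandas as pd

income = pd.Series([20000, 25000, 30000, 45000, 60000, 250000])   # right-skewed values
log_income = np.log(income)            # the log reduces the influence of extreme values
print(income.skew(), log_income.skew())

age = pd.Series([22, 27, 34, 41, 48, 56])
# Binning ages into three buckets
print(pd.cut(age, bins=[0, 30, 45, 120], labels=["under 30", "30-45", "over 45"]))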


4.15. Basics of Model Building

Lifecycle of Model Building:
1. Define success
2. Explore data
3. Condition data
4. Select variables
5. Balance data
6. Build models
7. Validate
8. Deploy
9. Maintain

Data exploration is used to figure out the gist of the data and to develop a first-step assessment of its quality, quantity, and characteristics. Visualization techniques can also be applied, though this can be a difficult task in high-dimensional spaces with many input variables. In the conditioning of data, we group the functional data to which the modeling techniques are applied, and then rescaling is done; in some cases rescaling is an issue if variables are coupled. Variable selection is very important to develop a quality model. This process is implicitly model-dependent, since it is used to configure which combination of variables should be used in ongoing model development. Data balancing is to partition the data into appropriate subsets for training, test, and validation. Model building focuses on the desired algorithms; the most famous technique is symbolic regression, though other techniques can also be preferred.

4.16. Linear Regression

Linear regression is a machine learning algorithm based on supervised learning. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Different regression models differ based on the kind of relationship between the dependent and independent variables they consider and on the number of independent variables being used.
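A minimal scikit-learn sketch (synthetic data) of fitting a linear regression model and reading off its coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 3*x + 4 plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 4 + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # close to 3 and 4
print(model.predict([[5.0]]))          # prediction for x = 5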

4.17. Logistic Regression

Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.


Any change in the coefficient leads to a change in both the direction and the steepness of the logistic function. It means positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
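An illustrative scikit-learn sketch (synthetic binary data) of logistic regression producing class probabilities through the logistic (S-shaped) function:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))        # classification accuracy
print(clf.predict_proba(X_test[:3]))    # class probabilities from the logistic function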

4.18. Decision Trees

Decision Tree: The decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Decision Tree Representation: Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute. This process is then repeated for the subtree rooted at the new node.

Strengths of decision tree methods:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction or classification.

Weaknesses of decision tree methods:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
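An illustrative scikit-learn sketch (iris data, arbitrary depth limit) of training a decision tree and printing its rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree can be printed as human-readable rules
print(export_text(tree, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))   # classify one new instance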


4.19. K-means

k-means clustering tries to group similar kinds of items in the form of clusters. It finds the similarity between items and groups them into clusters. The k-means clustering algorithm works in three steps:
1. Select the value of k.
2. Initialize the centroids.
3. Select the group and find the average.

Let us understand the above steps with the help of the figures, because a good picture is better than a thousand words.

We will understand each figure one by one.

• Figure 1 shows the representation of data for two different items: the first item is shown in blue and the second item in red. Here the value of K is chosen randomly as 2 (there are different methods by which we can choose the right k value).
• In figure 2, we join the two selected points. To find the centroid, we draw a perpendicular line to that line, and the points move to their centroid. You will notice that some of the red points have now moved to the blue points; these points now belong to the group of blue items.
• The same process continues in figure 3: we join the two points, draw a perpendicular line, and find the centroid. The two points move to their centroid and again some red points are converted to blue points.
• The same process happens in figure 4. This process continues until we get two completely different clusters of these groups.

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is to choose the right value of k. What should the right k-value be, and how do we choose it? If you choose the k value randomly, it may be correct or it may be wrong, and choosing the wrong value will directly affect your model performance. There are two methods by which you can select the right value of k:

1. Elbow Method
2. Silhouette Method

Let us understand both concepts one by one in detail.

Elbow Method

The elbow method is one of the most famous methods by which you can select the right value of k and boost your model performance; we can also perform hyperparameter tuning to choose the best value of k. It is an empirical method to find the best value of k: it picks a range of candidate values and takes the best among them. It calculates the within-cluster sum of squares of the points and the average distance.


When the value of k is 1, the within-cluster sum of squares will be high. As the value of k increases, the within-cluster sum of squares will decrease. Finally, we plot a graph between the k-values and the within-cluster sum of squares to get the value of k. We examine the graph carefully: at some point the graph will decrease abruptly, and that point is taken as the value of k.

Silhouette Method

The silhouette method is somewhat different. Like the elbow method, it also picks a range of k values and draws the silhouette graph. It calculates the silhouette coefficient of every point: the average distance of a point to the other points within its cluster, a(i), and the average distance of the point to the points of its next closest cluster, called b(i).


Note: The a(i) value must be less than the b(i) value, that is a(i) << b(i).
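A hedged sketch (synthetic blobs, arbitrary parameter choices) of k-means with scikit-learn, printing the elbow quantity (inertia, the within-cluster sum of squares) and the silhouette score for a range of k values:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia (within-cluster sum of squares) and silhouette score for each candidate k
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# Fit the chosen model and inspect cluster assignments and centroids
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])
print(km.cluster_centers_)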