Super simple multiple linear regression application

First, here is my understanding of multiple linear regression.

Equation:

y = p0 + p1*x1 + p2*x2 + p3*x3 + ... + e

Here y is the target value, p0 is the constant term, and e is the error. p1, p2, p3, etc. are the regression coefficients that we need to obtain by training on the data set with sklearn, and x1, x2, x3, etc. are the feature values in our training set.
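
To make the equation concrete, here is a minimal sketch in NumPy with made-up numbers. The coefficient values and the five sample rows are assumptions chosen for illustration, not values from the admission data. It evaluates y = p0 + p1*x1 + p2*x2 + p3*x3 for each row (with the error e set to zero) and then recovers the coefficients by ordinary least squares.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
p0 = 0.5                          # constant term
p = np.array([2.0, -1.0, 0.25])   # p1, p2, p3

# Five made-up samples, each with three features x1, x2, x3.
X = np.array([
    [1.0, 0.0, 4.0],
    [0.0, 1.0, 8.0],
    [2.0, 2.0, 0.0],
    [1.0, 3.0, 4.0],
    [3.0, 1.0, 2.0],
])

# y = p0 + p1*x1 + p2*x2 + p3*x3 (error term e is zero in this toy case)
y = p0 + X @ p

# Recover the coefficients: prepend a column of ones so p0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])
estimate, *_ = np.linalg.lstsq(A, y, rcond=None)

print(y)         # target values
print(estimate)  # estimated [p0, p1, p2, p3]
```

Because the toy data contains no noise, the least-squares estimate matches the chosen coefficients up to floating-point rounding. With real data the error term is not zero and the fit is only approximate.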

The data set I used this time is Kaggle's admission probability prediction data set.

Go to Kaggle and search for "admission":

https://www.kaggle.com/datasets

It looks like this:

[Image: preview of the data table]

Chance of Admit is the label that we ultimately want to predict.

The idea is very simple, so let's go straight to the code.
(I am running this in Jupyter.)

One: Data Exploration

# Data exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the CSV file
csv_data = pd.read_csv('./data/Admission_Predict.csv')

# Basic information about the table: number of rows and columns, plus the
# data type and non-null count of each column. Every column has 500 non-null
# entries, so there are no missing values.
csv_data.info()

# Summary statistics such as count, mean and standard deviation
print(csv_data.describe())

# See what the data looks like
print(csv_data.head())

# Remove the useless ID column
csv_data.drop('Serial No.', axis=1, inplace=True)

# Normalize the data by simply dividing each column by its maximum possible value
csv_data['GRE Score'] = csv_data['GRE Score'] / 340
csv_data['TOEFL Score'] = csv_data['TOEFL Score'] / 120
csv_data['University Rating'] = csv_data['University Rating'] / 5
csv_data['SOP'] = csv_data['SOP'] / 5
csv_data['LOR '] = csv_data['LOR '] / 5
csv_data['CGPA'] = csv_data['CGPA'] / 10

Run results:

[Image: output of the data exploration code]
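
The divide-by-maximum normalization can also be written as a single loop over a dictionary of maxima. This is only a sketch: the helper name `scale_by_max` and the two-row sample frame are made up for illustration, and only the maxima (340, 120, 5, 10) come from the code above.

```python
import pandas as pd

# Maximum possible value of each column, as used in the post.
MAXIMA = {
    'GRE Score': 340,
    'TOEFL Score': 120,
    'University Rating': 5,
    'SOP': 5,
    'LOR ': 5,   # the post's code uses this column name with a trailing space
    'CGPA': 10,
}

def scale_by_max(df, maxima):
    """Return a copy of df with each listed column divided by its maximum."""
    scaled = df.copy()
    for column, maximum in maxima.items():
        scaled[column] = scaled[column] / maximum
    return scaled

# Two made-up rows, only to show the effect of the scaling.
sample = pd.DataFrame({
    'GRE Score': [340, 170],
    'TOEFL Score': [120, 60],
    'University Rating': [5, 1],
    'SOP': [5.0, 2.5],
    'LOR ': [5.0, 2.5],
    'CGPA': [10.0, 5.0],
})
scaled = scale_by_max(sample, MAXIMA)
print(scaled)
```

Working on a copy keeps the original frame unchanged, which helps in a notebook: re-running the cell does not divide the same columns a second time.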

Two: Simple visualization

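import seaborn as sns

# List the column names (some, such as 'LOR ' and 'Chance of Admit ', end with a space)
print(csv_data.columns)

# Scatter plot with a fitted regression line: GRE score against chance of admission
sns.regplot(x='GRE Score', y='Chance of Admit ', data=csv_data)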

[Image: regression plot of GRE Score against Chance of Admit]

View the relationships between all features:

sns.pairplot(csv_data,diag_kind='kde',plot_kws={'alpha':0.2})

[Image: pair plot of all columns]

As you can see from the plots, the features do show a roughly linear relationship with the chance of admission.
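
A numeric complement to the pair plot is the Pearson correlation matrix from pandas `DataFrame.corr()`. The sketch below uses a small synthetic frame rather than the admission data. The column names `x` and `y`, the slope, and the noise level are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus a little Gaussian noise.
x = rng.uniform(0.0, 1.0, size=200)
noise = rng.normal(0.0, 0.05, size=200)
frame = pd.DataFrame({'x': x, 'y': 0.3 + 0.6 * x + noise})

# Pearson correlation between every pair of columns.
corr = frame.corr()
print(corr)
```

On the post's data, something like `csv_data.corr()['Chance of Admit ']` should list how strongly each column correlates with the label, assuming every column is numeric.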

Three: Model Construction

from sklearn import linear_model

# Feature selection
features = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']

# Training set: the first 420 rows.
# Note: [:, :-1] drops the last listed feature ('Research'), so the model is fitted on the other six.
X = csv_data[features].iloc[:420, :-1]
Y = csv_data.iloc[:420, -1]

# Test set: the remaining rows
X_test = csv_data[features].iloc[420:, :-1]
Y_test = csv_data.iloc[420:, -1]

# Construct a linear regression model
regr = linear_model.LinearRegression()

# Train the model
regr.fit(X, Y)

print(regr.predict(X_test))        # predictions
print(list(Y_test))                # true values
print(regr.score(X_test, Y_test))  # R² score (coefficient of determination)

Result:

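[Image: printed predictions, true values, and score]

Hey, the score reaches about 0.88 (this is the R² of the regression, not a classification accuracy). Useful, and I'm happy!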
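
For a regressor, `score` returns the coefficient of determination R², not a classification accuracy. The minimal sketch below uses synthetic data, and the coefficients, noise level, and 80/20 split are assumptions for illustration. It shows that computing R² = 1 - SS_res / SS_tot by hand gives the same number as `regr.score`.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(42)

# Synthetic data: 3 features, known coefficients, small Gaussian noise.
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = 0.1 + X @ np.array([0.4, 0.3, 0.2]) + rng.normal(0.0, 0.02, size=100)

# Same style of split as in the post: first rows for training, the rest for testing.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
pred = regr.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, regr.score(X_test, y_test))
```

An R² of 1 means the predictions explain all of the variance in the test labels. A value near 0 means the model does no better than always predicting the mean.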

The End~
