A super simple multiple linear regression application

First, let me lay out my understanding of multiple linear regression:

The equation: y = p0 + p1*x1 + p2*x2 + p3*x3 + ... + e

y is the target value; p0 is the constant (intercept) term; e is the error term; p1, p2, p3, etc. are the regression coefficients we will learn with sklearn from the training set; and x1, x2, x3, etc. are the features in our training set.
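To make the equation concrete, here is a minimal sketch with made-up numbers (the coefficient and feature values below are hypothetical, not taken from any dataset):

import numpy as np

# Hypothetical values, purely to illustrate the equation above
p0 = 0.1                       # constant term
p = np.array([0.5, 0.3, 0.2])  # regression coefficients p1, p2, p3
x = np.array([0.9, 0.8, 0.7])  # feature values x1, x2, x3

y_hat = p0 + np.dot(p, x)      # the error term e is whatever the model cannot explain
print(y_hat)                   # 0.93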

The data set I used this time is Kaggle's admission probability prediction data set.

Go to Kaggle and search for "admission":

https://www.kaggle.com/datasets

It looks like this (screenshot of the data table omitted):

Chance of Admit is the label we will ultimately predict.

The idea is very simple, so let's go straight to the code ~
(Oh, I'm running this in Jupyter.)

One: Data Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

csv_data = pd.read_csv('./data/Admission_Predict.csv')
# read the csv file
print(csv_data.info())
# Basic overview of the table: number of rows and columns, the data type and
# completeness of each column. Every column has 500 non-null values, so there are no missing values.
print(csv_data.describe())
# Summary statistics such as count, mean and standard deviation
print(csv_data.head())
# A quick look at what the data actually looks like
csv_data.drop('Serial No.', axis=1, inplace=True)
# drop the useless ID column

# Normalize the data by simply dividing each column by its maximum possible value
csv_data['GRE Score'] = csv_data['GRE Score'] / 340
csv_data['TOEFL Score'] = csv_data['TOEFL Score'] / 120
csv_data['University Rating'] = csv_data['University Rating'] / 5
csv_data['SOP'] = csv_data['SOP'] / 5
csv_data['LOR '] = csv_data['LOR '] / 5
csv_data['CGPA'] = csv_data['CGPA'] / 10

Run results: (output of info(), describe() and head() omitted)
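As a side note, the same kind of scaling could also be done with scikit-learn's MinMaxScaler instead of dividing by hand. A minimal sketch, assuming the numeric columns listed above; note that MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min), which is close to, but not exactly the same as, dividing by the maximum possible score:

from sklearn.preprocessing import MinMaxScaler

# Scale each numeric column to the [0, 1] range
cols = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA']
scaler = MinMaxScaler()
csv_data[cols] = scaler.fit_transform(csv_data[cols])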

Two: Simple Visualization

import seaborn as sns

print(csv_data.columns)
sns.regplot(x='GRE Score', y='Chance of Admit ', data=csv_data)


View the relationships between all pairs of features:

sns.pairplot(csv_data, diag_kind='kde', plot_kws={'alpha': 0.2})


As you can see from the plots, there really is a fairly linear relationship ~
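To put numbers on these relationships, a correlation heatmap is another quick check. A minimal sketch using the same DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between every pair of columns, drawn as an annotated heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(csv_data.corr(), annot=True, cmap='coolwarm')
plt.show()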

Three: Model Construction
from sklearn import linear_model

features = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']
# feature selection
X = csv_data[features].iloc[:420, :-1]
Y = csv_data.iloc[:420, -1]
# training set: the first 420 rows (note that the :-1 slice drops the last feature, 'Research')
X_test = csv_data[features].iloc[420:, :-1]
Y_test = csv_data.iloc[420:, -1]
# test set: the remaining rows

regr = linear_model.LinearRegression()
# build the linear regression model
regr.fit(X, Y)
# train the model
print(regr.predict(X_test))        # predictions
print(list(Y_test))                # ground truth
print(regr.score(X_test, Y_test))  # R^2 score on the test set

Result: (prediction output omitted)

Hey, the model reaches a score (R²) of about 0.88 on the test set. Useful and happy~
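To connect the fitted model back to the equation at the top, the learned p0 and p1, p2, ... can be read straight off the trained model. A minimal sketch; mean_squared_error is just an extra metric beyond the R² that score() reports:

from sklearn.metrics import mean_squared_error

# p0 (constant term) and p1..pn (regression coefficients) learned by sklearn
print(regr.intercept_)
print(dict(zip(X.columns, regr.coef_)))

# Mean squared error on the test set
print(mean_squared_error(Y_test, regr.predict(X_test)))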

The End~

