First, my understanding of multiple linear regression:
Equation:
y = p0 + p1*x1 + p2*x2 + p3*x3 + ... + e
y is the true value (the label). p0 is the constant term, e is the error, p1, p2, p3, etc. are the regression coefficients that sklearn estimates from the training set, and x1, x2, x3, etc. are the features in our training set.
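To make the equation concrete, here is a minimal sketch on synthetic data (the numbers and the names `X_demo`, `true_p0`, `true_p` are made up for illustration, they are not from the admission data set). It fits sklearn's `LinearRegression` and then rebuilds the prediction by hand as p0 + p1*x1 + p2*x2 + p3*x3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 200 samples, 3 features (purely illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_p0 = 0.5
true_p = np.array([1.0, -2.0, 0.3])
e = rng.normal(scale=0.01, size=200)  # small error term
y_demo = true_p0 + X_demo @ true_p + e

model = LinearRegression().fit(X_demo, y_demo)

# intercept_ plays the role of p0, coef_ holds p1, p2, p3
manual_pred = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual_pred, model.predict(X_demo)))
print(model.intercept_, model.coef_)
```

The fitted `intercept_` and `coef_` should land close to the values used to generate the data, which is all that "training" means here.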
The data set I used this time is Kaggle's admission probability prediction data set:
Go to Kaggle and search for "admission".
https://www.kaggle.com/datasets
It looks like this:
Chance of Admit is the label that we will ultimately predict.
The idea is very simple, so let's go straight to the code ~
(Oh, I am running this in Jupyter)
One: Data Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
csv_data = pd.read_csv('./data/Admission_Predict.csv')
# Read the CSV file
print(csv_data.info())
# Basic overview of the table: number of rows, number of columns, and the data type and completeness of each column. Every column has 500 non-null entries, so there are no missing values.
print(csv_data.describe())
# Summary statistics such as count, mean, and standard deviation
print(csv_data.head())
# See what the data looks like~
csv_data.drop('Serial No.',axis=1,inplace=True)
# Remove the useless ID column
# Normalize the data by simply dividing each column by its maximum possible value...
csv_data['GRE Score'] = csv_data['GRE Score']/340
csv_data['TOEFL Score'] = csv_data['TOEFL Score']/120
csv_data['University Rating'] = csv_data['University Rating']/5
csv_data['SOP'] = csv_data['SOP']/5
csv_data['LOR '] = csv_data['LOR ']/5
csv_data['CGPA'] = csv_data['CGPA']/10
#Data exploration
Run results:
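As an aside, the six divisions above can also be written as one loop. The sketch below assumes the same column names and maxima used above (340, 120, 5, 5, 5, 10) and runs on a tiny made-up DataFrame; `SCALE_MAX` and `normalize_by_max` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Maximum used for each column, same values as in the divisions above
SCALE_MAX = {
    'GRE Score': 340,
    'TOEFL Score': 120,
    'University Rating': 5,
    'SOP': 5,
    'LOR ': 5,  # trailing space, as in the code above
    'CGPA': 10,
}

def normalize_by_max(df, scale_max):
    """Return a copy of df with each listed column divided by its maximum."""
    out = df.copy()
    for col, max_value in scale_max.items():
        out[col] = out[col] / max_value
    return out

# Two made-up rows, only to show the call
demo = pd.DataFrame({
    'GRE Score': [340, 170], 'TOEFL Score': [120, 60],
    'University Rating': [5, 1], 'SOP': [5.0, 2.5],
    'LOR ': [5.0, 2.5], 'CGPA': [10.0, 5.0],
})
scaled = normalize_by_max(demo, SCALE_MAX)
print(scaled)
```

Working on a copy keeps the raw values around, which is handy if a notebook cell gets re-run and would otherwise divide the same column twice.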
Two: Simple Visualization
import seaborn as sns
print(csv_data.columns)
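sns.regplot(x='GRE Score',y='Chance of Admit ',data=csv_data)
# Recent seaborn versions require x and y to be passed as keywords
View the pairwise relationships between all features: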
sns.pairplot(csv_data,diag_kind='kde',plot_kws={'alpha':0.2})
As you can see from the picture, the features do show a roughly linear relationship with Chance of Admit, so regression looks reasonable~
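The visual impression can also be backed with a number. A minimal sketch, on synthetic data rather than the admission table, that computes the Pearson correlation of each column with a target using pandas `DataFrame.corr`; the column names here are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
strong = rng.normal(size=n)
noise = rng.normal(size=n)
demo = pd.DataFrame({
    'strong_feature': strong,        # drives the target
    'unrelated_feature': noise,      # independent of the target
    'target': 2.0 * strong + rng.normal(scale=0.5, size=n),
})

# Correlation of every column with the target, without the target itself
corr_with_target = demo.corr()['target'].drop('target')
print(corr_with_target.sort_values(ascending=False))
```

On the admission table the analogous call would be `csv_data.corr()['Chance of Admit ']`, with the trailing space in the column name as used above.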
Three: Model Construction
from sklearn import linear_model
features = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research']
# Feature selection
X = csv_data[features].iloc[:420,:-1]
# Note: the ':-1' slice drops the last listed feature ('Research'), so the model is trained on the first six
Y = csv_data.iloc[:420,-1]
# Training set: the first 420 rows; Y is the last column, 'Chance of Admit '
X_test = csv_data[features].iloc[420:,:-1]
Y_test = csv_data.iloc[420:,-1]
# Test set: the remaining rows
regr = linear_model.LinearRegression()
#Construct a linear regression model
regr.fit(X,Y)
#Model training
print(regr.predict(X_test)) # Predictions
print(list(Y_test)) # True values
print(regr.score(X_test,Y_test)) # R^2 score on the test set (not a classification accuracy)
Result:
Hey, the score reaches about 0.88 (this is the R^2 value returned by score, not an accuracy in the classification sense), useful and happy!
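For a regressor, R^2 can be computed by hand as 1 - SS_res/SS_tot. A small sketch on synthetic data (not the admission table; all names and numbers are illustrative) that checks the manual value against `score` and sklearn's `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(500, 2))
y_demo = (0.2 + 0.5 * X_demo[:, 0] + 0.3 * X_demo[:, 1]
          + rng.normal(scale=0.05, size=500))

# Same style of split as above: first 420 rows train, the rest test
X_tr, y_tr = X_demo[:420], y_demo[:420]
X_te, y_te = X_demo[420:], y_demo[420:]

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

ss_res = np.sum((y_te - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_te, y_te), r2_score(y_te, pred))
```

All three printed numbers should agree, since `score` is defined as this same quantity.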
The End~