PANDAS packet and aggregation - aggregation, Group, Pandas

Pandas grouping and aggregation

groupby

Group the data sets, and then perform statistical analysis on each group
SQL can filter data, group and aggregate
Pandas can use groupby to perform more complex grouping operations
The grouping operation process: split->apply->combine

Split: the basis for grouping

2. Application: the calculation rules for each grouping operation

3. Combine: combine each Combine the calculation results of the groups

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1': ['a','b','a','b', 
'a','b','a','a'],
'key2': ['one','one','two','three',
'two','two','one','three'],
'data1' : np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Run result:

data1 data2 key1 key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

1. GroupBy objects: DataFrameGroupBy, SeriesGroupBy

1. Grouping operations

groupby() performs grouping, the GroupBy object does not perform actual calculations, but contains the intermediate data of the grouping

Group by column name: obj.groupby(‘label’)< /p>

Sample code:

# Dataframe is grouped according to key1
print(type(df_obj.groupby('key1')))

# The data1 column of the dataframe is grouped according to key1
print(type(df_obj['data1'].groupby(df_obj['key1']))))

Run result:

2. Grouping operation

Perform grouping operations/multiple grouping operations on GroupBy objects, such as mean()

Non-numeric data does not perform grouping operations

Sample code:

# Grouping operation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby (df_obj['key1'])
print(grouped2.mean())

Run result:

data1 data2
key1 
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64

size() returns the number of elements in each group Number

Sample code:

# size
print(grouped1.size())
print(grouped2.size()) Run result:
 key1
a 5
b 3
dtype: int64
key1
a 5
 b 3
dtype: int64
3. Group by custom key

obj.groupby(self_def_key)
Custom The key can be a list or a multi-layer list
obj.groupby(['label1','label2'])->Multi-layer dataframe

Sample code:
# Group by custom key, list
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())< br />
# Group by custom key, multi-layer list
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# Multi-layer grouping by multiple columns
grouped2 = df_obj.groupby(['key1','key2'])
print(grouped2.size())
 
# Multi-layer grouping is carried out in the order of key
grouped3 = df_obj.groupby(['key2','key1'])
print(grouped3.mean())
# Unstack can convert the result of multi-layer index into a single-layer dataframeprint(grouped3.mean().unstack())

Run result:
0 1
1 1
2 1
 3 2
4 1
5 1
7 1
dtype: int64

key1 key2 
a one 2
 three 1 
 two 2
b one 1
 three 1
 two 1
dtype: int64

key1 key2 
a one 2< br /> three 1
 two 2
b one 1
 three 1
 two 1
dtype: int64

 data1 data2
key2 key1 
one a 0.174988 -0.110804
 b -0.214324 0.758372
three a -1.019229 -1.143825
 b 0.522911 0.630814
two a 1.428099 0.107465
 b -0.264616 1.017155

 data1 data2 
key1 abab
key2 
one 0.174988 -0.214324 -0.110804 0 .758372
three -1.019229 0.522911 -1.143825 0.630814
two 1.428099 -0.264616 0.107465 1.017155
Second, the GroupBy object supports iteration operations

each iteration Return a tuple (group_name, group_data)
Specific operations that can be used to group data

1. Single layer grouping
Sample code:< /p> 
# Single layer grouping, according to key1
for group_name, group_data in grouped1:
 print(group_name)
 print(group_data)
Run result: 
a
 data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

b
 data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two
2. Multi-layer grouping
Sample code:
# Multi-layer grouping, according to key1 and key2 
for group_name, group_data in grouped2:
 print(group_name)
 print(group_data)
Run result:
( 'a','one')
 data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one

('a', ' three')
 data1 data2 key1 key2
7 -1.019229 -1.143825 a three

('a','two')
 data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two

('b','one')
 data1 data2 key1 key2
1 -0.214324 0.758372 b one

('b','three')
 data1 data2 key1 key2
3 0.522911 0.630814 b three

('b','two ')
 data1 data2 key1 key2
5 -0.264616 1.017155 b two
Three, GroupBy objects can be converted into lists or dictionaries
Sample code:
# GroupBy object conversion list
print(list(grouped1))

# GroupBy object conversion dict
print(dict(list(grouped1)))
 Run result:
 [('a', data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359- 0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three), 
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]

{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three, 
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}
1. Group by column, group by data type
 < p>Sample code:
# Group by column
print(df_obj.dtypes)

# Group by data type
print(df_obj.groupby(df_obj .dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Run result:
 < pre>data1 float64
data2 float64
key1 object
key2 object
dtype: obje ct
float64 2
object 2
dtype: int64
 float64 object
0 0.302191 a one
1 0.544048 b one 
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three  
2. Other grouping methods
Sample code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5) ),
 columns=['a','b','c','d','e'],
 index=['A','B','C', ' D','E'])
df_obj2.ix[1, 1:4] = np.NaN
print(df_obj2)
Run result:
abcde
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1
3. Group by dictionary
Sample code:
# Group by dictionary
mapping_dict = {'a':'Python', 'b':'Python','c':'Java','d':'C','e':'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size ())
print(df_obj2.groupby(mapping_dict, axis=1 ).count()) # Number of non-NaN
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Run result:
 C 1
Java 2
Python 2
dtype: int64

 C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2

 C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0< br />C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0
4. Group by function, the parameter passed in by function is row index or column index
Sample code:
# Group by function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
 columns =['a','b','c','d','e'],
 index=['AA','BBB','CC','D','EE'] )
#df_obj3

def group_key(idx):
 """
 idx is column index or row index
 """
 #return idx
 return len(idx)

print(df_obj3.groupby(group_key).size())

# The above custom function is equivalent to< br />#df_obj3.groupby(len).size()
Run Result:
1 1
2 3
3 1
dtype: int64
5. Group by index level
 Sample code:
# Group by index level
columns = pd.MultiIndex.from_arrays([['Python','Java','Python','Java','Python'], 
 ['A','A','B','C','B']], names=['language','index'])
df_obj4 = pd.DataFrame(np .random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# Group by language
print(df_obj4.groupby (level='language', axis=1).sum())
# Group by index
print(df_obj4.groupby(level='index', axis=1).sum()) 
Run results:
language Python Java Python Java Python
index AABCB
0 2 7 8 4 3
1 5 2 6 1 2< br />2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8

language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18

index ABC
0 9 11 4
1 7 8 1
2 10 6 5 
3 11 5 3
4 11 11 4

aggregation


The process of generating a scalar from an array, Such as mean(), count(), etc.
Commonly used to calculate the grouped data

Sample code:
dict_obj = {' key1': ['a','b','a','b', 
'a','b','a','a'],
'key2': [ 'one','one','two','three',
'two','two','one','three'],
'data1': np.random.randint (1,10, 8),
'data2': np.random.randint(1,10, 8))
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)< /pre> Run result:
 data1 data2 key1 key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three
1. Built-in aggregate functions

sum(), mean(), max(), min() , count(), size(), describe()

Sample code:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1') .mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5 .groupby('key1').describe())
Run result:
data1 data2
key1 
a 27 24
b 12 18 

 data1 data2 key2
key1 
a 8 7 two
b 9 9 two

 data1 data2 key2
key1 
a 3 4 one
b 1 4 one

 data1 data2
key1 
a 5.4 4.8
b 4.0 6.0

key1
a 5
b 3
dtype: int64< br />
 data1 data2 key2
key1 
a 5 5 5
b 3 3 3

 data1 data2
key1 
 a count 5.000000 5.000000
 mean 5.400000 4.800000
 std 2.302173 1.303840
 min 3.000000 4.000000
 25% 3.000000 4.000000
 50% 6.000000 4.000000
 75% 7.000000 5.000000< br /> max 8.000000 7.000000
b count 3.000000 3.000000
 mean 4.000000 6.000000
 std 4.358899 2.645751
 min 1.000000 4.000000
 25% 1.500000 4.500000
 50% 2.000000 5.000000
 75% 5.500000 7.000000
 max 9.000000 9.000000
2. Customizable functions, passed into the agg method

grouped.agg(func )
The parameter of func is the record corresponding to the groupby index

Sample code:
# Custom aggregate function
def peak_range( df):
 """
 Return value range
 """ #print type(df) #The parameter is the record corresponding to the index
 return df.max()-df.min()

print(df_obj5.groupby('key1' ).agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df: df.max()-df.min()))

Run result: 
data1 data2
key1 
a 5 3
b 8 5

 data1 data2
key1 
a 2.528067 1.594711 
b 0.787527 0.386341
In [25]:
3. Apply multiple aggregation functions

Apply multiple functions at the same time for aggregation operations, use Function list

Sample code:
# Apply multiple aggregate functions

# Apply multiple aggregate functions at the same time
print( df_obj.groupby('key1').agg(['mean','std','count', peak_range])) # The default column name is the function name

print(df_obj.groupby(' key1').agg(['mean','std','count', ('range', peak_range)])) # Provide new column names through tuples
Run results: 
data1 data2 
 mean std count peak_range mean std count peak_range
key1 
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

 data1 data2 
 mean std count range mean std count range< br />key1 
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341
4. Apply different aggregate functions to different columns , Use dict
Sample code:
# Different aggregation functions for each column
dict_mapping = {'data1':'mean',
'data2' :'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1':['mean','max'], 
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
Run result:
data 1 data2
key1 
a 0.437389 -1.150505
b 0.014657 2.406341

 data1 data2
 mean max sum
key1 
a 0.437389 1.508838 -1.150505
b 0.014657 0.522911 2.406341
5. Commonly used built-in aggregate functions

###
Function name description
count: grouping Number of non-NA values
sum: sum of non-NA values
mean: average of non-NA values
median: arithmetic median of non-NA values
std, var: unbiased ( The denominator is n-1) Standard deviation and variance
min, max: the minimum and maximum of non-NA values
prod: the product of non-NA values
first, last: the first and last non-NA values NA value

Data grouping operation
Sample code:
import pandas as pd
import numpy as np
 
dict_obj = {'key1': ['a','b','a','b', 
'a','b','a','a'],< br />'key2': ['one','one','two','three',
'two','two','one','three'],
 ' data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj )
print(df_obj)

# After grouping by key1, calculate the statistics of data1 and data2 and append them to the original table, and add the header prefix
k1_sum = df_obj. groupby('key1').sum().add_prefix('sum_')
print(k1_sum)
Run result:
data1 data2 key1 key2
 0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two 
6 3 5 a one
7 8 3 a three

 sum_data1 sum_data2
key1 
a 26 26
b 17 17
  The shape of the original data will be changed after the aggregation operation.
 How to keep the shape of the original data?
 
 1. merge
 < blockquote> External connection using merge is more complicated
  Sample code:
 # Method 1, use merge
k1_sum_merge = pd.merge( df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)
Run result:
data1 data2 key1 key2 sum_data1 sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
 1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17
2. transform

transform The calculation result of the data is consistent with the shape of the original data,
For example: grouped.transform(np.sum)

Sample code:
# Method 2, use transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print (df_obj)
Run results:
data1 data2 key1 key2 sum_data1 sum_data2 sum_key2
0 5 1 a one 26 26 onetwotwoonethree
1 7 8 b one 17 17 onethreetwo
2 1 9 a two 26 26 one twotwoonethree
3 2 6 b three 17 17 onethreetwo
4 9 8 a two 26 26 onetwotwoonethree
5 8 3 b two 17 17 onethreetwo
6 3 5 a one 26 26 onetwotwoonethree< br />7 8 3 a three 26 26 onetwotwoonethree

Custom functions can also be passed in,

Sample code:
 # Custom function input transform
def diff_mean(s):
 """
 Return the difference between the data and the mean value
 """
 return s-s .mean()

print(df_obj.groupby('key1').transform(diff_mean))
Run result:
data1 data2 sum_data1 sum_data2< br />0 -0.200000 -4.200000 0 0
1 1.333333 2.333333 0 0
2 -4.200000 3.800000 0 0
3 -3.666667 0.333333 0 0
4 3.800000 2.800000 0 0
5 2.333333 -2.666667 0 0
6 -2.200000 -0.200000 0 0
7 2.800000 -2.200000 0 0

groupby.apply(func)
The func function can also be called on each group separately, and the final result is passed through pd .concat assembled together (data merging)

Sample code:
import pandas as pd
import numpy as np

 dataset_path ='./starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex','Age','HoursPerWeek', 
'TotalHours','APM'])< br />
def top_n(df, n=3, column='APM'):
 """
 Return the top n data of each group by column
 "" "
 return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))
Run results:
LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex 
1 2214 1 20.0 12.0 730.0 172.9530
 2246 1 27.0 8.0 250.0 141.6282
 1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
 3229 2 16.0 24.0 110.0 156.7380
 1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
 484 3 19.0 42.0 450.0 220.0692
 2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
 1759 4 16.0 6.0 75.0 229.9122
 2637 4 23.0 24.0 650.0 227.2272< br />5 3277 5 18.0 16.0 950.0 372.6426
 93 5 17.0 36.0 720.0 335.4990
 202 5 37.0 14.0 800.0 327.7218
 6 734 6 16.0 28.0 730.0 389.8314
 2746 6 16.0 28.0 4000.0 350.4114
 1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
 104 7 21.0 24.0 1000.0 286.4538
 1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
 3373 8 NaN NaN NaN 364.8504
 3372 8 NaN NaN NaN 355.3518
1. Generation level Index: The outer index is the group name, and the inner index is the row index of df_obj
Sample code:
# The parameters received by the apply function will be passed into the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))
Run result:
LeagueIndex Age HoursPerWeek TotalHours APM 
LeagueIndex 
1 3146 1 40.0 12.0 150.0 38.5590
 3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
 2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
 2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
 1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
 2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
 2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300. 0 188.5512
 1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
 3341 8 NaN NaN NaN 287.8128
2. Hierarchical indexing is prohibited, group_keys=False< /h2> 
Sample code:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
Run result:
LeagueIndex Age HoursPerWeek TotalHours APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

apply can be used to fill in missing data in different groups and fill in the mean of the group.

groupby

1. GroupBy objects: DataFrameGroupBy, SeriesGroupBy

1. Grouping operations

groupby() performs grouping, the GroupBy object does not perform actual calculations, but contains the intermediate data of the grouping

2. Grouping operation

Perform grouping operations/multiple grouping operations on GroupBy objects, such as mean()

size() returns the number of elements in each group Number

3. Group by custom key

obj.groupby(self_def_key)

Second, the GroupBy object supports iteration operations

each iteration Return a tuple (group_name, group_data)

1. Single layer grouping

2. Multi-layer grouping

Three, GroupBy objects can be converted into lists or dictionaries

1. Group by column, group by data type

2. Other grouping methods

3. Group by dictionary

4. Group by function, the parameter passed in by function is row index or column index

5. Group by index level

aggregation

1. Built-in aggregate functions

sum(), mean(), max(), min() , count(), size(), describe()

2. Customizable functions, passed into the agg method

grouped.agg(func )

3. Apply multiple aggregation functions

Apply multiple functions at the same time for aggregation operations, use Function list

4. Apply different aggregate functions to different columns , Use dict

5. Commonly used built-in aggregate functions

Data grouping operation

The shape of the original data will be changed after the aggregation operation.

1. merge

External connection using merge is more complicated

2. transform

transform The calculation result of the data is consistent with the shape of the original data,

Custom functions can also be passed in,

groupby.apply(func)

1. Generation level Index: The outer index is the group name, and the inner index is the row index of df_obj

apply can be used to fill in missing data in different groups and fill in the mean of the group.

Leave a Comment Cancel reply