PANDAS packet and aggregation

Pandas grouping and aggregation

groupby

  • Group the data sets, and then perform statistical analysis on each group
  • SQL can filter data, group and aggregate
  • Pandas can use groupby to perform more complex grouping operations
  • The grouping operation process: split->apply->combine

  • p>

    1. Split: the basis for grouping

    2. Application: the calculation rules for each grouping operation

    3. Combine: combine each Combine the calculation results of the groups

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1': ['a','b','a','b',
'a','b','a','a'],
'key2': ['one','one','two','three',
'two','two','one','three'],
'data1' : np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Run result:

data1 data2 key1 key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

1. GroupBy objects: DataFrameGroupBy, SeriesGroupBy

1. Grouping operations

groupby() performs grouping, the GroupBy object does not perform actual calculations, but contains the intermediate data of the grouping

Group by column name: obj.groupby(‘label’)< /p>

Sample code:

# Dataframe is grouped according to key1
print(type(df_obj.groupby('key1')))

# The data1 column of the dataframe is grouped according to key1
print(type(df_obj['data1'].groupby(df_obj['key1']))))

Run result:


2. Grouping operation

Perform grouping operations/multiple grouping operations on GroupBy objects, such as mean()

Non-numeric data does not perform grouping operations

Sample code:

# Grouping operation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby (df_obj['key1'])
print(grouped2.mean())

Run result:

data1 data2
key1
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64

size() returns the number of elements in each group Number

Sample code:

# size
print(grouped1.size())
print(grouped2.size())

Run result:

key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64

3. Group by custom key

obj.groupby(self_def_key)

Custom The key can be a list or a multi-layer list
obj.groupby(['label1','label2'])->Multi-layer dataframe

Sample code:

# Group by custom key, list
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())< br />
# Group by custom key, multi-layer list
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# Multi-layer grouping by multiple columns
grouped2 = df_obj.groupby(['key1','key2'])
print(grouped2.size())

# Multi-layer grouping is carried out in the order of key
grouped3 = df_obj.groupby(['key2','key1'])
print(grouped3.mean())
# Unstack can convert the result of multi-layer index into a single-layer dataframeprint(grouped3.mean().unstack())

Run result:

0 1
1 1
2 1
3 2
4 1
5 1
7 1
dtype: int64

key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64

key1 key2
a one 2< br /> three 1
two 2
b one 1
three 1
two 1
dtype: int64

data1 data2
key2 key1
one a 0.174988 -0.110804
b -0.214324 0.758372
three a -1.019229 -1.143825
b 0.522911 0.630814
two a 1.428099 0.107465
b -0.264616 1.017155

data1 data2
key1 abab
key2
one 0.174988 -0.214324 -0.110804 0 .758372
three -1.019229 0.522911 -1.143825 0.630814
two 1.428099 -0.264616 0.107465 1.017155

Second, the GroupBy object supports iteration operations

each iteration Return a tuple (group_name, group_data)

Specific operations that can be used to group data

1. Single layer grouping

Sample code:< /p>

# Single layer grouping, according to key1
for group_name, group_data in grouped1:
print(group_name)
print(group_data)

Run result:

a
data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

b
data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two

2. Multi-layer grouping

Sample code:

# Multi-layer grouping, according to key1 and key2 
for group_name, group_data in grouped2:
print(group_name)
print(group_data)

Run result:

( 'a','one')
data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one

('a', ' three')
data1 data2 key1 key2
7 -1.019229 -1.143825 a three

('a','two')
data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two

('b','one')
data1 data2 key1 key2
1 -0.214324 0.758372 b one

('b','three')
data1 data2 key1 key2
3 0.522911 0.630814 b three

('b','two ')
data1 data2 key1 key2
5 -0.264616 1.017155 b two

Three, GroupBy objects can be converted into lists or dictionaries

Sample code:

# GroupBy object conversion list
print(list(grouped1))

# GroupBy object conversion dict
print(dict(list(grouped1)))

Run result:

[('a', data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359- 0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three),
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]

{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three,
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}

1. Group by column, group by data type

< p>Sample code:

# Group by column
print(df_obj.dtypes)

# Group by data type
print(df_obj.groupby(df_obj .dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

Run result:

< pre>data1 float64
data2 float64
key1 object
key2 object
dtype: obje ct

float64 2
object 2
dtype: int64

float64 object
0 0.302191 a one
1 0.544048 b one
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three

2. Other grouping methods

Sample code:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5) ),
columns=['a','b','c','d','e'],
index=['A','B','C', ' D','E'])
df_obj2.ix[1, 1:4] = np.NaN
print(df_obj2)

Run result:

abcde
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1

3. Group by dictionary

Sample code:

# Group by dictionary
mapping_dict = {'a':'Python', 'b':'Python','c':'Java','d':'C','e':'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size ())
print(df_obj2.groupby(mapping_dict, axis=1 ).count()) # Number of non-NaN
print(df_obj2.groupby(mapping_dict, axis=1).sum())

Run result:

 C 1
Java 2
Python 2
dtype: int64

C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2

C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0< br />C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0

4. Group by function, the parameter passed in by function is row index or column index

Sample code:

# Group by function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns =['a','b','c','d','e'],
index=['AA','BBB','CC','D','EE'] )
#df_obj3

def group_key(idx):
"""
idx is column index or row index
"""
#return idx
return len(idx)

print(df_obj3.groupby(group_key).size())

# The above custom function is equivalent to< br />#df_obj3.groupby(len).size()

Run Result:

1 1
2 3
3 1
dtype: int64

5. Group by index level

Sample code:

# Group by index level
columns = pd.MultiIndex.from_arrays([['Python','Java','Python','Java','Python'],
['A','A','B','C','B']], names=['language','index'])
df_obj4 = pd.DataFrame(np .random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# Group by language
print(df_obj4.groupby (level='language', axis=1).sum())
# Group by index
print(df_obj4.groupby(level='index', axis=1).sum())

Run results:

language Python Java Python Java Python
index AABCB
0 2 7 8 4 3
1 5 2 6 1 2< br />2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8

language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18

index ABC
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4

aggregation

  • The process of generating a scalar from an array, Such as mean(), count(), etc.
  • Commonly used to calculate the grouped data

Sample code:

dict_obj = {' key1': ['a','b','a','b', 
'a','b','a','a'],
'key2': [ 'one','one','two','three',
'two','two','one','three'],
'data1': np.random.randint (1,10, 8),
'data2': np.random.randint(1,10, 8))
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)< /pre>

Run result:

data1 data2 key1 key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three

1. Built-in aggregate functions

sum(), mean(), max(), min() , count(), size(), describe()

Sample code:

print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1') .mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5 .groupby('key1').describe())

Run result:

data1 data2
key1
a 27 24
b 12 18

data1 data2 key2
key1
a 8 7 two
b 9 9 two

data1 data2 key2
key1
a 3 4 one
b 1 4 one

data1 data2
key1
a 5.4 4.8
b 4.0 6.0

key1
a 5
b 3
dtype: int64< br />
data1 data2 key2
key1
a 5 5 5
b 3 3 3

data1 data2
key1
a count 5.000000 5.000000
mean 5.400000 4.800000
std 2.302173 1.303840
min 3.000000 4.000000
25% 3.000000 4.000000
50% 6.000000 4.000000
75% 7.000000 5.000000< br /> max 8.000000 7.000000
b count 3.000000 3.000000
mean 4.000000 6.000000
std 4.358899 2.645751
min 1.000000 4.000000
25% 1.500000 4.500000
50% 2.000000 5.000000
75% 5.500000 7.000000
max 9.000000 9.000000

2. Customizable functions, passed into the agg method

grouped.agg(func )

The parameter of func is the record corresponding to the groupby index

Sample code:

# Custom aggregate function
def peak_range( df):
"""
Return value range
""" #print type(df) #The parameter is the record corresponding to the index
return df.max()-df.min()

print(df_obj5.groupby('key1' ).agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df: df.max()-df.min()))

Run result:

data1 data2
key1
a 5 3
b 8 5

data1 data2
key1
a 2.528067 1.594711
b 0.787527 0.386341
In [25]:

3. Apply multiple aggregation functions

Apply multiple functions at the same time for aggregation operations, use Function list

Sample code:

# Apply multiple aggregate functions

# Apply multiple aggregate functions at the same time
print( df_obj.groupby('key1').agg(['mean','std','count', peak_range])) # The default column name is the function name

print(df_obj.groupby(' key1').agg(['mean','std','count', ('range', peak_range)])) # Provide new column names through tuples

Run results:

data1 data2 
mean std count peak_range mean std count peak_range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

data1 data2
mean std count range mean std count range< br />key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

4. Apply different aggregate functions to different columns , Use dict

Sample code:

# Different aggregation functions for each column
dict_mapping = {'data1':'mean',
'data2' :'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1':['mean','max'],
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

Run result:

data 1 data2
key1
a 0.437389 -1.150505
b 0.014657 2.406341

data1 data2
mean max sum
key1
a 0.437389 1.508838 -1.150505
b 0.014657 0.522911 2.406341

5. Commonly used built-in aggregate functions

###
Function name description
count: grouping Number of non-NA values
sum: sum of non-NA values
mean: average of non-NA values
median: arithmetic median of non-NA values
std, var: unbiased ( The denominator is n-1) Standard deviation and variance
min, max: the minimum and maximum of non-NA values
prod: the product of non-NA values
first, last: the first and last non-NA values NA value

Data grouping operation

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1': ['a','b','a','b',
'a','b','a','a'],< br />'key2': ['one','one','two','three',
'two','two','one','three'],
' data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj )
print(df_obj)

# After grouping by key1, calculate the statistics of data1 and data2 and append them to the original table, and add the header prefix
k1_sum = df_obj. groupby('key1').sum().add_prefix('sum_')
print(k1_sum)

Run result:

data1 data2 key1 key2
0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two
6 3 5 a one
7 8 3 a three

sum_data1 sum_data2
key1
a 26 26
b 17 17

The shape of the original data will be changed after the aggregation operation.

How to keep the shape of the original data?

1. merge

< blockquote>

External connection using merge is more complicated

Sample code:

# Method 1, use merge
k1_sum_merge = pd.merge( df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)

Run result:

data1 data2 key1 key2 sum_data1 sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17

2. transform

transform The calculation result of the data is consistent with the shape of the original data,

For example: grouped.transform(np.sum)

Sample code:

# Method 2, use transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print (df_obj)

Run results:

data1 data2 key1 key2 sum_data1 sum_data2 sum_key2
0 5 1 a one 26 26 onetwotwoonethree
1 7 8 b one 17 17 onethreetwo
2 1 9 a two 26 26 one twotwoonethree
3 2 6 b three 17 17 onethreetwo
4 9 8 a two 26 26 onetwotwoonethree
5 8 3 b two 17 17 onethreetwo
6 3 5 a one 26 26 onetwotwoonethree< br />7 8 3 a three 26 26 onetwotwoonethree

Custom functions can also be passed in,

Sample code:

 # Custom function input transform
def diff_mean(s):
"""
Return the difference between the data and the mean value
"""
return s-s .mean()

print(df_obj.groupby('key1').transform(diff_mean))

Run result:

data1 data2 sum_data1 sum_data2< br />0 -0.200000 -4.200000 0 0
1 1.333333 2.333333 0 0
2 -4.200000 3.800000 0 0
3 -3.666667 0.333333 0 0
4 3.800000 2.800000 0 0
5 2.333333 -2.666667 0 0
6 -2.200000 -0.200000 0 0
7 2.800000 -2.200000 0 0

groupby.apply(func)

The func function can also be called on each group separately, and the final result is passed through pd .concat assembled together (data merging)

Sample code:

import pandas as pd
import numpy as np

dataset_path ='./starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex','Age','HoursPerWeek',
'TotalHours','APM'])< br />
def top_n(df, n=3, column='APM'):
"""
Return the top n data of each group by column
"" "
return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))

Run results:

LeagueIndex Age HoursPerWeek TotalHours APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272< br />5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

1. Generation level Index: The outer index is the group name, and the inner index is the row index of df_obj

Sample code:

# The parameters received by the apply function will be passed into the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))

Run result:

LeagueIndex Age HoursPerWeek TotalHours APM 
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 ​​3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300. 0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
3341 8 NaN NaN NaN 287.8128

2. Hierarchical indexing is prohibited, group_keys=False< /h2>

Sample code:

print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

Run result:

LeagueIndex Age HoursPerWeek TotalHours APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

apply can be used to fill in missing data in different groups and fill in the mean of the group.

Leave a Comment

Your email address will not be published.