Python/pandas data mining (fourteen) - groupby, aggregation, group-level operations

import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df


grouped = df['data1'].groupby(df['key1'])
grouped.mean()

The grouping key above is a Series; in fact, the grouping key can be any array of appropriate length.

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()


df.groupby('key1').mean()


Notice that there is no key2 column in the result: df['key2'] is not numeric data, so it is excluded. By default, all numeric columns are aggregated, although the aggregation can also be restricted to a subset of them.
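As a hedged aside for newer pandas releases, where non-numeric columns are no longer dropped silently, you can request numeric-only aggregation or select the numeric columns explicitly; a minimal sketch:

df.groupby('key1').mean(numeric_only=True)     # aggregate only the numeric columns
df.groupby('key1')[['data1', 'data2']].mean()  # or select the columns yourself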

Iterate over the groups

for name, group in df.groupby('key1'):
    print(name)
    print(group)


You can see that name is the value of key1 for each group, and group is the corresponding subset of the DataFrame.

The same applies when grouping by multiple keys:

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('===k1,k2:')
    print(k1, k2)
    print('===k3:')
    print(group)


You can operate on the pieces produced by groupby, for example converting them to a dictionary:

piece = dict(list(df.groupby('key1')))
piece
{'a':       data1     data2 key1 key2
 0  -0.233405 -0.756316    a  one
 1  -0.232103 -0.095894    a  two
 4   1.056224  0.736629    a  one,
 'b':       data1     data2 key1 key2
 2   0.200875  0.598282    b  one
 3  -1.437782  0.107547    b  two}
piece['a']


groupby groups on axis=0 by default; by setting the axis argument you can group on any other axis.

grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
{dtype('float64'):       data1     data2
 0 -0.233405 -0.756316
 1 -0.232103 -0.095894
 2  0.200875  0.598282
 3 -1.437782  0.107547
 4  1.056224  0.736629,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

Select a column or a subset of columns


For large datasets, you often only need to aggregate a few of the columns.

df.groupby(['key1', 'key2'])[['data2']].mean()
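As a related sketch: indexing the grouped object with a single column name returns a grouped Series, while a list of names returns a grouped DataFrame.

df.groupby(['key1', 'key2'])['data2'].mean()    # Series indexed by (key1, key2)
df.groupby(['key1', 'key2'])[['data2']].mean()  # one-column DataFrame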


Group by dictionary or Series

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.loc['Wes', ['b', 'c']] = np.nan  # set a few NaN values
people


Suppose we know a grouping relationship for the columns:

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()


If axis=1 is not specified, the mapping is applied along the rows instead of the columns, so only the columns a through e appear in the result.

The same works with a Series:

map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series, axis=1).count()


Group by function

Compared with a dict or Series, a Python function is a more flexible way to define a grouping mapping. Any function passed as a grouping key is called once per index value, and its return value is used as the group name. Suppose you want to group by the length of each person's name: simply pass len.

people.groupby(len).sum()
          a         b         c         d         e
3 -1.308709 -2.353354  1.585584  2.908360 -1.267162
5 -0.688506 -0.187575 -0.048742  1.491272 -0.636704
6  0.110028 -0.932493  1.343791 -1.928363 -0.364745

Mixing functions with arrays, lists, dictionaries, and Series is not a problem, because everything is eventually converted to an array.

key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).sum()

Group by index level

The most convenient aspect of a hierarchical index is that it lets you aggregate by index level. To do this, pass the level number or name via the level keyword:

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df


hier_df.groupby(level='cty', axis=1).count()
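The level can also be given by number instead of name; a minimal sketch (level 0 corresponds to 'cty' here):

hier_df.groupby(level=0, axis=1).count()  # same grouping, addressed by level number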


Data aggregation

Call a custom aggregate function

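A minimal sketch of passing a custom aggregation function to agg; the peak_to_peak helper is illustrative:

def peak_to_peak(arr):
    # spread between the largest and smallest value in each group
    return arr.max() - arr.min()

df.groupby('key1')[['data1', 'data2']].agg(peak_to_peak)

Custom functions passed this way are generally slower than optimized built-ins such as mean or sum.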

Column-oriented multi-function application

Aggregating a Series or the columns of a DataFrame really means using aggregate (agg) or calling methods such as mean or std. Next we want to apply different aggregation functions to different columns, or apply several functions at once. (The tips dataset used below, with its tip_pct column, is constructed later in this post in the apply section.)

grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')  # for the statistics described in Table 9-1, you can pass the function name as a string
# if you pass a list of functions, the columns of the resulting DataFrame are named after those functions

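A minimal sketch of passing a list of functions, reusing the illustrative peak_to_peak helper from above; each resulting column is named after its function:

grouped_pct.agg(['mean', 'std', peak_to_peak])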

The automatically generated column names are not very descriptive. If you pass a list of (name, function) tuples, the first element of each tuple is used as the column name in the resulting DataFrame.

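A minimal sketch of the (name, function) form; the labels 'foo' and 'bar' are arbitrary:

grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])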

With a DataFrame you can apply the same list of functions to all columns, or apply different functions to different columns.

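A minimal sketch of applying one list of functions to several columns, which produces a result with hierarchical columns:

functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result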

To apply different functions to different columns, pass agg a dictionary that maps column names to functions, as sketched below.


The resulting DataFrame has hierarchical columns only when multiple functions are applied to at least one column.
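A minimal sketch of the dictionary form; passing a list of functions for one column is what yields the hierarchical columns:

grouped.agg({'tip': 'max', 'size': 'sum'})
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})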

Group-level operations and transformations

Aggregation is only one kind of grouped operation; it is a special case of data transformation. transform and apply are more flexible.

transform applies a function to each group and places the results in the appropriate locations. If each group produces a scalar value, that scalar is broadcast across the group.

transform is also a restricted method: the passed function must either produce a scalar value that can be broadcast (e.g. np.mean) or an array of results of the same size as the group.

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people


key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()


people.groupby(key).transform(np.mean)


You can see that the group means from the previous result have been broadcast back to every row of the original data.

def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned

demeaned.groupby(key).mean()  # the demeaned group means are now (approximately) zero

The most general groupby method is apply.

tips = pd.read_csv('C:\\Users\\ecaoyng\\Desktop\\work space\\Python\\py_for_analysis_code\\pydata-book-master\\ch08\\tips.csv')
tips[:5]


Generate a new column:

tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]


Define a function that selects the rows with the largest values in a given column (top 5 tip_pct by default):

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)


Group by smoker and apply the function:

tips.groupby('smoker').apply(top)


Multi-parameter version

tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')


Quantile and bucket analysis

Combining cut and qcut with groupby makes it easy to perform bucket or quantile analysis on a dataset.

frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
frame[:5]


factor = pd.cut(frame.data1, 4)
factor[:10]
0     (0.281, 2.00374]
1     (0.281, 2.00374]
2    (-3.172, -1.442]
3     (-1.442, 0.281]
4     (0.281, 2.00374]
5     (0.281, 2.00374]
6     (-1.442, 0.281]
7     (-1.442, 0.281]
8     (-1.442, 0.281]
9     (-1.442, 0.281]
Name: data1, dtype: category
Categories (4, object): [(-3.172, -1.442] < (-1.442, 0.281] < (0.281, 2.00374] < (2.00374, 3.727]]

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()


These were buckets of equal length. To get buckets containing an equal number of data points, use qcut.

Equal-length buckets: equal-width intervals.

Equal-size buckets: equal number of data points.

grouping = pd.qcut(frame.data1, 10, labels=False)  # labels=False returns just the quantile numbers
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()

