Eight Pandas Functions That Cover Most Data Analysis - Nathan

#Python tutorial #Data analysis guide

2022-07-08 17K banq

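Most of the time, carrying out a data analysis task is like following a cake recipe.
To bake a cake you need some tools, don't you? A mixer, a spoon, an oven...
With pandas you have the basic tools for any data analysis task. Let's see what your "kitchen" cannot do without.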

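Reading a data file
Okay, first things first. This step is like saying: let me see what I have in my fridge!

Here I read an Excel file stored in a folder named "dados" that sits outside the directory of my current file (hence the '../' in the path).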

import pandas as pd
import numpy as np 
#Reading an Excel File
df = pd.read_excel('../dados/Olist-full.xlsx')

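As you must have noticed, I am using a static file. In my view, this is the best way to start when you are doing an initial exploratory analysis.

After that, you can connect your code directly to a data lake.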

1. Renaming columns
First, let's see what our column names are, using the .columns attribute.

#Showing columns names
df.columns

Index(['Unnamed: 0', 'order_id', 'customer_id', 'order_status',
       'order_purchase_timestamp', 'order_approved_at',
       'order_delivered_carrier_date', 'order_delivered_customer_date',
       'order_estimated_delivery_date', 'order_item_id', 'product_id',
       'seller_id', 'shipping_limit_date', 'price', 'freight_value',
       'payment_sequential', 'payment_type', 'payment_installments',
       'payment_value', 'review_id', 'review_score', 'review_comment_title',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp', 'product_name_lenght',
       'product_description_lenght', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'customer_unique_id',
       'customer_zip_code_prefix', 'customer_city', 'customer_state',
       'seller_zip_code_prefix', 'seller_city', 'seller_state'],
      dtype='object')

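So what if I want to change the name of one or more columns? We can use the rename() function and pass it a dictionary that maps the old names to the new ones.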

#Changing columns names
df = df.rename(columns = {
    'order_id': 'id_order_number',
    'customer_id': 'customer_number'
})

If we print the column names again, we get:

Index(['Unnamed: 0', 'id_order_number', 'customer_number', 'order_status',
       'order_purchase_timestamp', 'order_approved_at',
       'order_delivered_carrier_date', 'order_delivered_customer_date',
       'order_estimated_delivery_date', 'order_item_id', 'product_id',
       'seller_id', 'shipping_limit_date', 'price', 'freight_value',
       'payment_sequential', 'payment_type', 'payment_installments',
       'payment_value', 'review_id', 'review_score', 'review_comment_title',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp', 'product_name_lenght',
       'product_description_lenght', 'product_photos_qty', 'product_weight_g',
       'product_length_cm', 'product_height_cm', 'product_width_cm',
       'product_category_name_english', 'customer_unique_id',
       'customer_zip_code_prefix', 'customer_city', 'customer_state',
       'seller_zip_code_prefix', 'seller_city', 'seller_state'],
      dtype='object')

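Renaming columns matters because our code needs to be readable and understandable to everyone who uses it. Remember to help your future colleague, who may have to build on top of the code you write today. Choose meaningful names, not only for columns but for variables too.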
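A typo in the mapping is easy to miss, because rename() silently ignores keys that do not match any column. Here is a minimal sketch, on a made-up three-column frame rather than the real Olist file, that passes errors="raise" so a wrong key fails loudly:

```python
import pandas as pd

# Toy frame standing in for the Olist data (values are made up for illustration).
df = pd.DataFrame({
    "order_id": ["a1", "b2"],
    "customer_id": ["c1", "c2"],
    "price": [10.0, 20.0],
})

# errors="raise" makes rename() raise a KeyError if a key is not an existing column.
renamed = df.rename(
    columns={"order_id": "id_order_number", "customer_id": "customer_number"},
    errors="raise",
)

print(list(renamed.columns))
```

rename() returns a new data frame and leaves the original untouched, which is why the article assigns the result back to df.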

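2. Inspecting and describing
Now let's inspect our data frame. I always start with the info() function to see the data type of each column we are working with.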

#Basic information about our dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119143 entries, 0 to 119142
Data columns (total 40 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Unnamed: 0                     119143 non-null  int64  
 1   id_order_number                119143 non-null  object 
 2   customer_number                119143 non-null  object 
 3   order_status                   119143 non-null  object 
 4   order_purchase_timestamp       119143 non-null  object 
 5   order_approved_at              118966 non-null  object 
 6   order_delivered_carrier_date   117057 non-null  object 
 7   order_delivered_customer_date  115722 non-null  object 
 8   order_estimated_delivery_date  119143 non-null  object 
 9   order_item_id                  118310 non-null  float64
 10  product_id                     118310 non-null  object 
 11  seller_id                      118310 non-null  object 
 12  shipping_limit_date            118310 non-null  object 
 13  price                          118310 non-null  float64
 14  freight_value                  118310 non-null  float64
 15  payment_sequential             119140 non-null  float64
 16  payment_type                   119140 non-null  object 
 17  payment_installments           119140 non-null  float64
 18  payment_value                  119140 non-null  float64
 19  review_id                      118146 non-null  object 
 20  review_score                   118146 non-null  float64
 21  review_comment_title           119143 non-null  object 
 22  review_comment_message         119143 non-null  object 
 23  review_creation_date           118146 non-null  object 
 24  review_answer_timestamp        118146 non-null  object 
 25  product_name_lenght            116601 non-null  float64
 26  product_description_lenght     116601 non-null  float64
 27  product_photos_qty             116601 non-null  float64
 28  product_weight_g               118290 non-null  float64
 29  product_length_cm              118290 non-null  float64
 30  product_height_cm              118290 non-null  float64
 31  product_width_cm               118290 non-null  float64
 32  product_category_name_english  116576 non-null  object 
 33  customer_unique_id             119143 non-null  object 
 34  customer_zip_code_prefix       119143 non-null  int64  
 35  customer_city                  119143 non-null  object 
 36  customer_state                 119143 non-null  object 
 37  seller_zip_code_prefix         118310 non-null  float64
 38  seller_city                    118310 non-null  object 
 39  seller_state                   118310 non-null  object 
dtypes: float64(15), int64(2), object(23)
memory usage: 36.4+ MB

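Now look at the Dtype values and check whether they match your expectations. Here I expected order_purchase_timestamp to be a datetime, but I got an object column instead. We will see how to change that further on.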

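It is very important here to look at how your variables behave. For that we have two powerful pandas functions, describe() and value_counts(). The first is for quantitative variables, the second for categorical ones.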

#Describing the 'price' variable
df['price'].describe()

Here I describe the price variable, and it returns:

count    118310.000000
mean        120.646603
std         184.109691
min           0.850000
25%          39.900000
50%          74.900000
75%         134.900000
max        6735.000000
Name: price, dtype: float64

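With this I can see how the variable behaves across the whole dataset, along with key measures such as the mean, standard deviation, minimum and maximum.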
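The default quartiles do not always tell you much about the tails. As a small sketch on a made-up price series (not the Olist data), describe() also accepts a percentiles argument:

```python
import pandas as pd

# Made-up prices with one extreme value, for illustration only.
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0], name="price")

# Ask for the 5th, 50th and 95th percentiles instead of the default quartiles.
summary = prices.describe(percentiles=[0.05, 0.5, 0.95])

print(summary)
```

Comparing the mean with the 50% row is a quick way to spot skew: here a single large value pulls the mean far above the median.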

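With the value_counts() function I can explore the behavior of a categorical variable such as payment_type. I can get either the raw counts or the normalized proportions by changing a single parameter.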

#Counting the values of the 'payment_type' variable
df['payment_type'].value_counts()
#Normalizing the counts into proportions
df['payment_type'].value_counts(normalize = True)

Look at the results, first the raw counts:

credit_card    87776
boleto         23190
voucher         6465
debit_card      1706
not_defined        3
Name: payment_type, dtype: int64

And now the normalized ones:

credit_card    0.736747
boleto         0.194645
voucher        0.054264
debit_card     0.014319
not_defined    0.000025
Name: payment_type, dtype: float64

Here I can see (again) that across my whole dataset, 73.67% of payments were made by credit card.
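One thing to keep in mind is that value_counts() leaves out missing values by default, so the proportions are relative to the non-null rows only. A minimal sketch on a made-up payment_type series shows the difference, with dropna=False bringing the missing entries back:

```python
import pandas as pd

# Made-up payment types, including one missing value.
payments = pd.Series(
    ["credit_card", "credit_card", "boleto", None, "credit_card"],
    name="payment_type",
)

counts = payments.value_counts()                 # raw counts, NaN excluded
shares = payments.value_counts(normalize=True)   # proportions of non-null rows
with_missing = payments.value_counts(dropna=False)  # NaN counted as its own group

print(counts)
print(shares)
print(with_missing)
```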

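3. Dropping values
Dropping values is like getting rid of the rotten eggs in the fridge. There are a few ways to go about it.

Let's use the isna() function to check whether our data frame has any NaN values: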

#Checking if we have any NaN value
df.isna().any()

It returns the following:

Unnamed: 0                       False
id_order_number                  False
customer_number                  False
order_status                     False
order_purchase_timestamp         False
order_approved_at                 True
order_delivered_carrier_date      True
order_delivered_customer_date     True
order_estimated_delivery_date    False
order_item_id                     True
product_id                        True
seller_id                         True
shipping_limit_date               True
price                             True
freight_value                     True
payment_sequential                True
payment_type                      True
payment_installments              True
payment_value                     True
review_id                         True
review_score                      True
review_comment_title             False
review_comment_message           False
review_creation_date              True
review_answer_timestamp           True
product_name_lenght               True
product_description_lenght        True
product_photos_qty                True
product_weight_g                  True
product_length_cm                 True
product_height_cm                 True
product_width_cm                  True
product_category_name_english     True
customer_unique_id               False
customer_zip_code_prefix         False
customer_city                    False
customer_state                   False
seller_zip_code_prefix            True
seller_city                       True
seller_state                      True
dtype: bool

Now we can decide to drop the rows that have NaN in a given column, for example product_id:

#Dropping NaN values of the product_id column
df = df.dropna(subset = ['product_id'])
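Before dropping anything, it often helps to know how many values are missing, not just whether any are. A small sketch on a made-up frame, using isna().sum() for the counts and then dropna() with subset as in the step above:

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing product_id and one missing price.
df = pd.DataFrame({
    "product_id": ["p1", None, "p3"],
    "price": [10.0, 5.0, np.nan],
})

# Number of missing values per column (isna().any() only says yes/no).
missing_per_column = df.isna().sum()

# Drop only the rows where product_id is missing; other NaNs are kept.
cleaned = df.dropna(subset=["product_id"])

print(missing_per_column)
print(cleaned)
```

Note that the row with a missing price survives, because subset limits the check to product_id.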

4. Changing column types
As we saw above, order_purchase_timestamp is an object column, but it should be a datetime column. So how do we do that?

#Changing the dtype of the order_purchase_timestamp
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])

There are other conversion functions you can use as well, such as to_numeric()!
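Real files often contain a few values that cannot be converted. As a sketch on made-up strings, both to_numeric() and to_datetime() accept errors="coerce", which turns unparseable entries into NaN or NaT instead of raising:

```python
import pandas as pd

# Made-up raw values; "n/a" and "not a date" cannot be converted.
raw_numbers = pd.Series(["10.5", "7", "n/a"])
raw_dates = pd.Series(["2018-01-15 10:30:00", "not a date"])

# errors="coerce" replaces unparseable entries with NaN / NaT.
numbers = pd.to_numeric(raw_numbers, errors="coerce")
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers)
print(dates)
```

Coercing hides bad values rather than fixing them, so it is worth counting the resulting NaN and NaT entries afterwards.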

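5. Working with dates and times
Knowing how to manipulate dates and times is important.

How can I get only the year or the month of a given date column? Understanding the cycles in our data frame is crucial: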

#Create a new column with the month of the order
df['order_month'] = df['order_purchase_timestamp'].dt.month
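The .dt accessor exposes more than the month. A short sketch on two made-up timestamps, pulling out the year, month, weekday name and a year-month label:

```python
import pandas as pd

# Two made-up purchase timestamps.
ts = pd.Series(pd.to_datetime(["2018-01-15 10:30:00", "2018-11-03 22:05:00"]))

parts = pd.DataFrame({
    "year": ts.dt.year,
    "month": ts.dt.month,
    "weekday": ts.dt.day_name(),
    # Year-month label, handy for grouping across several years.
    "year_month": ts.dt.to_period("M").astype(str),
})

print(parts)
```

A plain month number mixes January 2017 with January 2018, so the year_month label is usually the safer grouping key when the data spans more than one year.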

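6. Grouping
Grouping is essential to any data analysis. I have published an article that shows the best ways to group a data frame with Pandas.

Let's look at a very nice example: I want to compute some measures over different columns for a specific time window, every 3 months.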

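#Grouping for each 3 months by customer_state
#Note: pandas 2.2+ deprecates the '3M' alias in favor of '3ME' (month end)
buys_3m = df.groupby([pd.Grouper(key = 'order_purchase_timestamp', freq = '3M'), 'customer_state']).agg({
    'id_order_number': 'nunique',
    'price': ['sum', 'mean', 'max'],
    'freight_value': ['mean', 'median'],
}).reset_index()
#Flatten the two-level column names; strip the trailing '_' left on the group keys
buys_3m.columns = ['_'.join(col).strip('_') for col in buys_3m.columns]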
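An alternative worth knowing is named aggregation, which gives flat column names directly and avoids the join step. The sketch below runs on a made-up four-row frame and, as an assumption of mine rather than the article's choice, uses freq="QS" (calendar-quarter start), which keeps it clear of the deprecated 'M' alias:

```python
import pandas as pd

# Made-up orders: two in Q1 2018 and two in Q2 2018.
orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-15", "2018-02-20", "2018-04-03", "2018-05-10"]
    ),
    "customer_state": ["SP", "SP", "SP", "RJ"],
    "id_order_number": ["a", "b", "c", "d"],
    "price": [10.0, 30.0, 50.0, 70.0],
})

quarterly = (
    orders.groupby(
        [pd.Grouper(key="order_purchase_timestamp", freq="QS"), "customer_state"]
    )
    # Named aggregation: new_column=(source_column, function)
    .agg(
        n_orders=("id_order_number", "nunique"),
        price_sum=("price", "sum"),
        price_mean=("price", "mean"),
    )
    .reset_index()
)

print(quarterly)
```

Each output column is named on the left-hand side of the agg() call, so there is no MultiIndex to flatten afterwards.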

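7. Filtering
Filtering lets us narrow the data down to what really matters to us. Suppose I only want sales that come from a specific state and are above a certain price. How do we do that?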

#Filtering for SP state and price greater than or equal to 115
sp_above_mean = df[(df['price'] >= 115) & (df['seller_state'] == 'SP')]
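The same boolean-mask idea also covers the quantile filter that appears in the complete code at the end. A minimal sketch on a made-up five-row frame, combining two conditions with & and then trimming extremes with between():

```python
import pandas as pd

# Made-up sales rows for illustration.
df = pd.DataFrame({
    "price": [5.0, 120.0, 300.0, 90.0, 5000.0],
    "seller_state": ["SP", "SP", "RJ", "SP", "SP"],
})

# Each condition needs its own parentheses; & means "and", | means "or".
mask = (df["price"] >= 115) & (df["seller_state"] == "SP")
sp_expensive = df[mask]

# Keep only prices between the 1st and 99th percentiles (bounds included).
low, high = df["price"].quantile([0.01, 0.99])
trimmed = df[df["price"].between(low, high)]

print(sp_expensive)
print(trimmed)
```

The parentheses matter because & binds more tightly than >= and ==, so leaving them out changes how the expression is evaluated.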

8. Customizing with apply and map
We can customize our analysis further with our own functions, using apply() and map(). Take a look.

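#Creating a new column with apply (the mean is computed once, outside the lambda)
mean_price = df['price'].mean()
df['price_status'] = df['price'].apply(lambda x: 'UP' if x >= mean_price else 'DOWN')
#Creating a new column using map
#Note: credit_cards is not defined in this article; it must be a dict, Series or function created beforehand
df['seller_by_payment'] = df['payment_type'].map(credit_cards)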
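For a simple two-way label like this, a vectorized np.where() can stand in for apply(), and map() works with a plain dictionary. The sketch below uses a made-up frame, and payment_labels is a hypothetical mapping invented for the illustration, not the credit_cards object from the article:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "price": [10.0, 20.0, 90.0],
    "payment_type": ["credit_card", "boleto", "voucher"],
})

# Vectorized alternative to apply(): compare the whole column at once.
mean_price = df["price"].mean()
df["price_status"] = np.where(df["price"] >= mean_price, "UP", "DOWN")

# Hypothetical mapping; keys missing from the dict become NaN after map().
payment_labels = {"credit_card": "card", "boleto": "bank slip"}
df["payment_label"] = df["payment_type"].map(payment_labels)

print(df)
```

The NaN produced for "voucher" is a useful signal that the mapping is incomplete.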

Here is the complete code:

import pandas as pd
import numpy as np

#Reading an Excel File
df = pd.read_excel('../dados/Olist-full.xlsx')

#Showing columns names
df.columns

#Changing columns names
df = df.rename(columns = {
    'order_id': 'id_order_number',
    'customer_id': 'customer_number'
})

#Checking the new names
df.columns

#Basic information about our dataframe
df.info()

#Describing the 'price' variable
df['price'].describe()

#Describing the 'payment_type' variable
df['payment_type'].value_counts()

#Normalizing the results
df['payment_type'].value_counts(normalize = True)

#Checking if we have any NaN value
df.isna().any()

#Dropping NaN values of the product_id column
df = df.dropna(subset = ['product_id'])

#Changing the dtype of the order_purchase_timestamp
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])

#Create a new column with the month of the order
df['order_month'] = df['order_purchase_timestamp'].dt.month

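#Grouping for each 3 months by customer_state
#Note: pandas 2.2+ deprecates the '3M' alias in favor of '3ME' (month end)
buys_3m = df.groupby([pd.Grouper(key = 'order_purchase_timestamp', freq = '3M'), 'customer_state']).agg({
    'id_order_number': 'nunique',
    'price': ['sum', 'mean', 'max'],
    'freight_value': ['mean', 'median'],
}).reset_index()

#Flatten the two-level column names; strip the trailing '_' left on the group keys
buys_3m.columns = ['_'.join(col).strip('_') for col in buys_3m.columns]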

#Filtering for SP state and price greater than or equal to 115
sp_above_mean = df[(df['price'] >= 115) & (df['seller_state'] == 'SP')]

#Filtering by the quantile - we can remove outliers with this
q1 = df['price'].quantile(0.01)
q2 = df['price'].quantile(0.99)

df_price_outliers = df[(df['price'] >= q1) & (df['price'] <= q2)]

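#Creating a new column with apply (the mean is computed once, outside the lambda)
mean_price = df['price'].mean()
df['price_status'] = df['price'].apply(lambda x: 'UP' if x >= mean_price else 'DOWN')

#Creating a new column using map
#Note: credit_cards is not defined in this article; it must be a dict, Series or function created beforehand
df['seller_by_payment'] = df['payment_type'].map(credit_cards)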
