Introduction

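In Python data analysis, the Pandas library is a well-deserved Swiss Army knife. Its two core data structures, Series and DataFrame, cover the vast majority of everyday data-processing scenarios. This article takes a close look at the features and usage of both structures to help you master the logic behind them.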

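1. Series: A Labeled One-Dimensional Array

1.1 What Is a Series?

A Series is the most basic Pandas data structure: a labeled one-dimensional array. You can think of it as a single column in Excel, but with far more capabilities.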

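Core features:

It holds a set of values (any NumPy data type)
It holds a set of index labels (by default, integers counting up from 0)
Values are aligned by index label automatically when Series are combined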

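import numpy as np
import pandas as pd

# Create from a list (note that the label '1' is deliberately repeated)
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9],
                   index=['A', 'B', 'C', 'D', 'E', '1', '1', '1', '1'],
                   name='PRICES')
# Create from a dict (the keys become the index)
volumes = pd.Series({'KYU': 2.1E6, 'CGN': 1.8E6, 'ASDF': 3.2E6}, name='VOLUMES')
# Create from a NumPy array, indexed by five consecutive dates
values = pd.Series(np.random.randn(5), index=pd.date_range('20250101', periods=5))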
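
The automatic alignment listed among the core features is easiest to see when two Series share only some labels. The following is a minimal sketch; the `stock` and `sold` Series are made-up data for illustration, not part of the examples above.

```python
import pandas as pd

# Hypothetical data: two Series whose labels only partly overlap.
stock = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
sold = pd.Series([1, 2, 3], index=['B', 'C', 'D'])

# Arithmetic pairs values by label, not by position.
remaining = stock - sold
print(remaining)
# Labels found in only one operand ('A' and 'D') become NaN,
# and the result is upcast to float64 so it can hold them.

# The method form accepts fill_value, which treats a missing label as 0.
remaining_filled = stock.sub(sold, fill_value=0)
print(remaining_filled)
```

If you want missing labels to be an error rather than NaN, compare the two indexes with `stock.index.equals(sold.index)` before doing the arithmetic.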

1.2 Key Operations

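# Label lookup: a unique label returns a single value
print(prices['A'])
# Output:
# 1
# A repeated label returns every matching row
print(prices['1'])
# Output:
# 1    6
# 1    7
# 1    8
# 1    9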
# Vectorized arithmetic
print(prices * 2)   # multiply every element by 2
# Output:
# A     2
# B     4
# C     6
# D     8
# E    10
# 1    12
# 1    14
# 1    16
# 1    18
# Boolean filtering
print(prices[prices > 7])  # keep the elements greater than 7
# Output:
# 1    8
# 1    9
# Method call
print(prices.value_counts())  # count how often each value occurs
# Output (every value occurs exactly once here):
# 1    1
# 2    1
# 3    1
# 4    1
# 5    1
# 6    1
# 7    1
# 8    1
# 9    1
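
The lookups above return a scalar for `'A'` but several rows for `'1'`, because the label `'1'` is repeated. The sketch below shows one way to detect and handle repeated labels; keeping only the first occurrence is an assumption made for illustration, not a rule from this article.

```python
import pandas as pd

prices = pd.Series(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=['A', 'B', 'C', 'D', 'E', '1', '1', '1', '1'],
    name='PRICES',
)

unique_hit = prices['A']      # unique label: a single scalar value
repeated_hit = prices['1']    # repeated label: a Series with all matches

# The index can report whether its labels are unique.
print(prices.index.is_unique)

# duplicated() flags every repeat after the first occurrence of a label.
repeat_count = prices.index.duplicated().sum()
print(repeat_count)

# Keep only the first row for each label (an illustrative policy).
first_only = prices[~prices.index.duplicated(keep='first')]
print(first_only)
```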

1.3 Series Attributes and Methods Cheat Sheet

Attribute/Method	Purpose	Example
`s.values`	Get the data (returns a NumPy array)	`prices.values` → `array([1, 2, 3, 4, 5, 6, 7, 8, 9])`
`s.index`	Get the index	`prices.index` → `Index(['A', 'B', 'C', 'D', 'E', '1', '1', '1', '1'])`
`s.dtype`	Check the data type	`prices.dtype` → `int64`
`s.name`	Name of the Series	`prices.name` → `'PRICES'`
`s.head(n)`	View the first n rows (5 by default)	`prices.head(2)` → prints the first 2 rows
`s.unique()`	Return an array of the unique values	`prices.unique()` → `array([1, 2, 3, 4, 5, 6, 7, 8, 9])`
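
As a quick check of the table, the sketch below exercises each attribute on the `prices` Series from section 1.1. It also calls `to_numpy()`, which is not in the table; the pandas documentation recommends it over `.values` when you need a NumPy array.

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    index=['A', 'B', 'C', 'D', 'E', '1', '1', '1', '1'],
    name='PRICES',
)

raw = prices.values                 # one-dimensional NumPy array
print(type(raw).__name__, raw.shape)

print(prices.index.tolist())        # labels as a plain Python list
print(prices.dtype)                 # integer dtype of the values
print(prices.name)                  # 'PRICES'

first_two = prices.head(2)          # still a Series, with labels A and B
distinct = prices.unique()          # NumPy array of distinct values

# to_numpy() returns the same data as .values for this Series.
same_data = np.array_equal(raw, prices.to_numpy())
print(same_data)
```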


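2. DataFrame: The Core Two-Dimensional Table

2.1 What a DataFrame Really Is

A DataFrame is a two-dimensional labeled data structure. You can think of it as a collection of Series that share one row index, similar to:

An Excel spreadsheet
A SQL database table
A data.frame in R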

# Create from a dict (keys become column names, index supplies the row labels)
data = {
    'Product': ['A', 'B', 'C'],
    'Price': [99, 200, 150],
    'Sales': [1200, 800, 1500]
}
df = pd.DataFrame(data, index=['Q1', 'Q2', 'Q3'])
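
To make the "collection of Series" idea concrete, the sketch below pulls one column out of `df` and then builds a second DataFrame from individual Series. The `rebuilt` frame and its incomplete `Sales` Series are hypothetical, chosen to show that columns are aligned on the row labels.

```python
import pandas as pd

data = {
    'Product': ['A', 'B', 'C'],
    'Price': [99, 200, 150],
    'Sales': [1200, 800, 1500],
}
df = pd.DataFrame(data, index=['Q1', 'Q2', 'Q3'])

# Selecting a single column yields a Series that shares the row index.
price_col = df['Price']
print(type(price_col).__name__)
print(price_col.name)
print(price_col.index.equals(df.index))

# Hypothetical example: build a DataFrame from Series with different labels.
# 'Sales' lacks Q2 and lists its rows in another order.
rebuilt = pd.DataFrame({
    'Price': pd.Series([99, 200, 150], index=['Q1', 'Q2', 'Q3']),
    'Sales': pd.Series([1500, 1200], index=['Q3', 'Q1']),
})
print(rebuilt)   # Sales for Q2 is NaN; the other rows line up by label
```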

2.2 Indexing

# loc vs iloc demo
'''
Example data:
   Product  Price  Sales
Q1       A     99   1200
Q2       B    200    800
Q3       C    150   1500
'''

# Label-based indexing (loc)
print(df.loc['Q2', 'Price'])          # 200
print(df.loc['Q2':, 'Price':'Sales'])  # label slices include the end point
# Output:
#     Price  Sales
# Q2    200    800
# Q3    150   1500
# Position-based indexing (iloc)
print(df.iloc[0, 1])                   # 99
print(df.iloc[1:3, [0, 2]])            # position slices exclude the end point
# Output:
#    Product  Sales
# Q2       B    800
# Q3       C   1500
# Boolean filtering: keep the rows whose Sales exceed 1000
tech_stocks = df[df['Sales'] > 1000]
print(tech_stocks)
# Output:
#    Product  Price  Sales
# Q1       A     99   1200
# Q3       C    150   1500
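
Filters are often combined in practice. The sketch below extends the single-condition filter above; the thresholds and the `isin` list are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Product': ['A', 'B', 'C'],
        'Price': [99, 200, 150],
        'Sales': [1200, 800, 1500],
    },
    index=['Q1', 'Q2', 'Q3'],
)

# Combine conditions with & (and) or | (or). Each condition needs its own
# parentheses because & and | bind more tightly than comparisons.
mask = (df['Sales'] > 1000) & (df['Price'] < 120)

# loc accepts a boolean mask for rows plus a list of column labels.
selected = df.loc[mask, ['Product', 'Price']]
print(selected)

# isin() tests membership in a list of allowed values.
subset = df[df['Product'].isin(['A', 'B'])]
print(subset)
```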

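2.3 Inspecting Data

df.head(2)       # view the first two rows
df.describe()    # summary statistics for the numeric columns
df.info()        # overview of column dtypes and non-null counts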
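
These inspection calls differ in what they return, which matters once you use them outside an interactive session. A small sketch under that framing, reusing the `df` from section 2.1:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Product': ['A', 'B', 'C'],
        'Price': [99, 200, 150],
        'Sales': [1200, 800, 1500],
    },
    index=['Q1', 'Q2', 'Q3'],
)

# head() returns a new DataFrame containing the first n rows.
top = df.head(2)

# describe() returns a DataFrame too, so single statistics can be read
# back with loc. By default only the numeric columns are summarized.
summary = df.describe()
mean_price = summary.loc['mean', 'Price']
print(mean_price)

# info() prints its report and returns None.
info_result = df.info()
```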

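2.4 Column Operations

# Add a column
df['Revenue'] = df['Price'] * df['Sales']

# Drop a column (drop returns a new DataFrame, so reassign the result)
df = df.drop(columns=['Sales'])

# Convert a column's data type
df['Price'] = df['Price'].astype('float32')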
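
The three steps above modify or reassign `df` one statement at a time. As an alternative, the sketch below chains the same operations with `assign`, `drop`, and `astype`, leaving the original DataFrame untouched. This is a stylistic option, not something the steps above require.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Product': ['A', 'B', 'C'],
        'Price': [99, 200, 150],
        'Sales': [1200, 800, 1500],
    },
    index=['Q1', 'Q2', 'Q3'],
)

result = (
    df.assign(Revenue=lambda d: d['Price'] * d['Sales'])  # add a column
      .drop(columns=['Sales'])                            # drop a column
      .astype({'Price': 'float32'})                       # convert one column
)

print(result)
print(list(df.columns))   # the original df still has Product, Price, Sales
```

Each method returns a new DataFrame, so intermediate results never leak into `df`.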

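3. Hands-On Practice

3.1 Exercise

# Generate the dataset: five days of random stock data
dates = pd.date_range('2025-03-25', periods=5)
stock_data = pd.DataFrame({
    'Open': np.random.uniform(100, 200, 5),
    'Close': np.random.uniform(105, 210, 5),
    'Volume': np.random.randint(1_000_000, 5_000_000, 5)  # integer bounds
}, index=dates)

# Tasks:
# 1. Add a 20-day moving average column (assume enough data; use a rolling mean)
# 2. Count the days on which the close was higher than the open
# 3. Select the 3 trading days with the largest volume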

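3.2 Solution

# Task 1: add the 20-day moving average column (rolling window).
# With only 5 rows, min_periods=1 averages however many rows exist so far.
stock_data['Close_mean_20'] = stock_data['Close'].rolling(window=20, min_periods=1).mean()

# Task 2: count the days on which the close was higher than the open
close_higher_mask = stock_data['Close'] > stock_data['Open']
close_higher_days = close_higher_mask.sum()  # counts the True values

# Task 3: select the 3 trading days with the largest volume
top3_volume = stock_data.nlargest(3, 'Volume')

# Print the results
print("Task 1: data with the 20-day moving average added")
print(stock_data)
print("\nTask 2: days with close above open:", close_higher_days)
print("\nTask 3: top 3 trading days by volume")
print(top3_volume)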
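
Because the exercise data is random, the effect of `min_periods` is hard to see in its output. The sketch below uses fixed, made-up closing prices and a 3-day window so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical closing prices, chosen so the averages are easy to verify.
close = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range('2025-03-25', periods=5),
)

# Default behaviour: no result until the window holds 3 observations,
# so the first two entries are NaN.
strict = close.rolling(window=3).mean()

# min_periods=1: average whatever is available, even a single value.
lenient = close.rolling(window=3, min_periods=1).mean()

# A window longer than the data combined with min_periods=1 never fills up,
# so it behaves like an expanding (cumulative) mean.
wide = close.rolling(window=20, min_periods=1).mean()
cumulative = close.expanding().mean()

print(pd.DataFrame({'strict': strict, 'lenient': lenient, 'wide': wide}))
```

This is why `Close_mean_20` in the solution is really a running average over the five available days, not a true 20-day average.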

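4. Series vs DataFrame: Key Differences

Feature	Series	DataFrame
Dimensions	One-dimensional	Two-dimensional
Index	Single index	Row index and column index
Data storage	A single column of data	Multiple columns that may have different dtypes
Creation	list / dict / scalar	dict / 2D array / CSV
Data access	`s[key]`	`df[col]`, `df.loc[row, col]`
Use case	Single-variable analysis	Multi-variable relationship analysis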
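
The data-access row of the table can be sketched as follows, reusing the `df` from section 2.1. The sketch reads and writes through `loc` instead of chained brackets such as `df[col][row]`, because the pandas documentation discourages chained indexing for assignment. The new value 210 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Product': ['A', 'B', 'C'],
        'Price': [99, 200, 150],
        'Sales': [1200, 800, 1500],
    },
    index=['Q1', 'Q2', 'Q3'],
)

s = df['Price']                      # one column: a Series

by_key = s['Q2']                     # Series access by label
by_pair = df.loc['Q2', 'Price']      # DataFrame access by (row, column)
by_at = df.at['Q2', 'Price']         # at: scalar-only label lookup

row = df.loc['Q2']                   # a row is a Series indexed by column names
as_frame = s.to_frame()              # a Series becomes a one-column DataFrame

# Assign through a single loc call so the write targets df itself.
df.loc['Q2', 'Price'] = 210
updated = df.loc['Q2', 'Price']
print(by_key, by_pair, by_at, updated)
```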