python-pandas: The Series Object

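pandas is built around two core data structures, and every data-analysis task in the library revolves around them:

Series
DataFrame

A Series stores one-dimensional data, such as a sequence of values.

A DataFrame, the more complex structure, stores multi-dimensional (tabular) data.

What sets both apart is that they integrate Index objects and labels into their own structure.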

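1. The Series object

The pandas Series object represents a one-dimensional data structure. It resembles an array but offers some extra features. Internally it consists of two associated arrays: the main array holds the data (of any NumPy type), and each of its elements has a label, stored in a second array called the Index.

Declaring a Series object
Call the Series() constructor and pass in, as an array, the data the Series should hold.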
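import pandas as pd

s = pd.Series([12, -4, 7, 9])
s
# 0    12
# 1    -4
# 2     7
# 3     9
# dtype: int64
# Labels (the index) are on the left; the element for each label is on the right.

If you do not specify labels when declaring a Series, pandas uses integers that start at 0 and increase by 1 as the labels. (Meaningful labels are preferable, because they help distinguish and identify each element.)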
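s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])
s
# a    12
# b    -4
# c     7
# d     9
# dtype: int64
# Labels (the index) are on the left; the element for each label is on the right.

To inspect the two arrays that make up a Series separately, use its two attributes: index (the labels) and values (the elements).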
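s.values   # NumPy array holding the elements
s.index    # Index object holding the labels

Selecting elements
To get an element of a Series, either treat the Series like an ordinary NumPy array and give the element's position, or give the label stored at that position in the index.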
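s.iloc[2]   # by position; plain s[2] works in older pandas but is deprecated when the index holds labels
# 7
s['b']      # by label
# -4

Selecting multiple items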
s[0:2]          # slice by position
# a    12
# b    -4
# dtype: int64

# Select by label
s[['b', 'c']]
# b    -4
# c     7
# dtype: int64
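
The selections above mix positions and labels in the same bracket syntax. As a minimal sketch, the explicit .iloc (position) and .loc (label) accessors make the intent unambiguous. The variable names first_two, b_to_c and picked are illustrative only and do not come from the original text.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

# .iloc always selects by integer position
first_two = s.iloc[0:2]        # positions 0 and 1 (the end position is excluded)

# .loc always selects by label
b_to_c = s.loc['b':'c']        # label slices include both endpoints
picked = s.loc[['b', 'c']]     # a list of labels

print(first_two)
print(b_to_c)
print(picked)
```

Note that a label slice includes its end label, whereas a positional slice excludes its end position.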

Assigning values to elements

Select an element by position or by label, then assign to it.

s.iloc[1] = 0   # assign by position
s
# a    12
# b     0
# c     7
# d     9
# dtype: int64
s['b'] = 1      # assign by label
s
# a    12
# b     1
# c     7
# d     9
# dtype: int64

Defining a new Series from a NumPy array or another Series

You can define a new Series from a NumPy array or from an existing Series.

import numpy as np

arr = np.array([1, 2, 3, 4])
s3 = pd.Series(arr)
s3
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64  (int32 on platforms where NumPy's default integer is 32-bit)

s4 = pd.Series(s)
s4
# a    12
# b     1
# c     7
# d     9
# dtype: int64

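Keep in mind that, by default, the elements of the new Series are not copies of the elements of the original NumPy array or Series but references to them; the data is shared rather than duplicated. If you change a value in the original object, the corresponding value in the new Series changes too, and vice versa. (This describes pandas without Copy-on-Write; where Copy-on-Write is enabled, the constructor may copy the data instead.)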

arr = np.array([1, 2, 3, 4])
s3 = pd.Series(arr)
print(s3)
arr[1] = 3       # change the array; the Series sees it
print(arr)
print(s3)
s3[2] = 4        # change the Series; the array sees it
print(arr)
print(s3)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64
# [1 3 3 4]
# 0    1
# 1    3
# 2    3
# 3    4
# dtype: int64
# [1 3 4 4]
# 0    1
# 1    3
# 2    4
# 3    4
# dtype: int64
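
When the shared data described above is not what you want, you can request an independent copy. The following is a small sketch, assuming the constructor's copy parameter and the Series.copy() method; the names independent and s_copy are illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=True asks the constructor to copy the data instead of sharing it
independent = pd.Series(arr, copy=True)

arr[1] = 99                     # modify the original array
print(independent.tolist())     # the Series still holds 1, 2, 3, 4

# Series.copy() does the same for an existing Series
s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])
s_copy = s.copy()
s['a'] = 0
print(s_copy['a'])              # unaffected by the change to s
```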

Filtering elements
pandas is built on top of NumPy, so many operations available on NumPy arrays carry over to Series objects, including filtering by a condition.
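
This section gives no example, so here is a minimal sketch of condition-based filtering using the same values as the earlier Series. The names mask and large are illustrative.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

mask = s > 8       # Boolean Series: True where the condition holds
large = s[mask]    # keeps only the elements whose mask value is True
print(large)
```

Only the elements labelled a (12) and d (9) satisfy the condition, so only they remain.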
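Series operations and mathematical functions
Arithmetic operators work on a Series just as they do on a NumPy array.

NumPy's mathematical functions must be qualified with their namespace, np.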
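
As a brief sketch of both points, the example below applies an arithmetic operator and a NumPy function to a Series. It uses positive values only (an assumption made here so that the logarithm is defined for every element); the names halved and logs are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, 4, 7, 9], index=['a', 'b', 'c', 'd'])

halved = s / 2      # element-wise division, labels are preserved
logs = np.log(s)    # NumPy function applied element-wise, returns a Series

print(halved)
print(logs)
```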

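Elements that make up a Series

To find out which distinct elements a Series contains, use unique(). It returns an array of the de-duplicated values, in order of first appearance. The related value_counts() function returns each distinct value together with the number of times it occurs.

isin() tests membership: it checks whether each element of the data structure appears in a given list of values. It returns Boolean values, which makes it useful for filtering a Series or a DataFrame column.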

serd = pd.Series([1, 0, 2, 1, 2, 3], index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd.unique()
# array([1, 0, 2, 3])
serd.value_counts()   # the order among equal counts may vary by pandas version
# 2    2
# 1    2
# 3    1
# 0    1
# dtype: int64

serd.isin([0, 3])
# white     False
# white      True
# blue      False
# green     False
# green     False
# yellow     True
# dtype: bool
serd[serd.isin([0, 3])]
# white     0
# yellow    3
# dtype: int64

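NaN
NaN stands for Not a Number.

This special value marks a field in a data structure that is empty or that does not fit the definition of a number.

Generally, NaN values signal a problem with the data that must be dealt with, especially during data analysis. They often arise when something goes wrong while extracting data from a source, or when the source itself has missing data. In addition, certain situations, such as taking the logarithm of a negative number or an exception raised during a calculation or function call, can also produce them.

You can also declare this value explicitly in pandas.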
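s2 = pd.Series([5, -3, np.nan, 14])   # np.nan; the np.NaN alias was removed in NumPy 2.0
s2
# 0     5.0
# 1    -3.0
# 2     NaN
# 3    14.0
# dtype: float64

The isnull() and notnull() functions are handy for identifying the index entries that have no corresponding element.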
s2.isnull()
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool
s2.notnull()
# 0     True
# 1     True
# 2    False
# 3     True
# dtype: bool
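
Because these functions return Boolean Series, they can also serve as filters. The sketch below shows one way to keep only the valid elements, plus two related methods, dropna() and fillna(), which the original text does not cover. The names valid, dropped and filled are illustrative.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([5, -3, np.nan, 14])

valid = s2[s2.notnull()]   # keep only the elements that are not NaN
dropped = s2.dropna()      # equivalent shortcut
filled = s2.fillna(0)      # alternatively, replace NaN with a chosen value

print(valid)
print(filled)
```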
Using a Series as a dictionary
You can also think of a Series as a dictionary (dict) object, and you can create one from a previously defined dictionary.
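mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}
myseries = pd.Series(mydict)
myseries
# red       2000
# blue      1000
# yellow     500
# orange    1000
# dtype: int64
# (recent pandas keeps the dictionary's insertion order; older versions sorted the keys)

In the example above, the index is filled with the dictionary's keys, and the element at each index label is the value the dictionary holds for that key. You can also specify the index separately; pandas then matches the dictionary's keys against the index labels, and inserts NaN wherever a label has no matching key.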
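
A minimal sketch of the separately specified index follows. The colors list and the name indexed are assumptions made for illustration; 'green' is deliberately absent from the dictionary so that a missing value appears.

```python
import pandas as pd

mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}
colors = ['red', 'yellow', 'orange', 'blue', 'green']   # 'green' is not a key in mydict

indexed = pd.Series(mydict, index=colors)
print(indexed)
```

The result follows the order of colors, and the 'green' entry is NaN, which also turns the values into floating-point numbers.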
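Operations between Series objects
Series objects support arithmetic with one another, and the labels take part in the operation too. A major strength of the Series structure is that it aligns data by label, even when the indexes of the two operands differ.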
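mydict2 = {'red': 400, 'yellow': 1000, 'black': 700}
myseries2 = pd.Series(mydict2)
myseries + myseries2
# black        NaN
# blue         NaN
# orange       NaN
# red       2400.0
# yellow    1500.0
# dtype: float64

The operation above yields a new Series in which only elements with matching labels are summed. Labels that belong to just one of the two Series are also included in the new object, but their values are NaN.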
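
If NaN is not the desired result for labels present in only one operand, one option (not covered in the original text) is the add() method with its fill_value parameter, which substitutes a default for the missing side before adding. A short sketch, with total as an illustrative name:

```python
import pandas as pd

myseries = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000})
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# Treat a label missing from one Series as 0 instead of producing NaN
total = myseries.add(myseries2, fill_value=0)
print(total)
```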
