```python import numpy as np import pandas as pd import torch from sklearn.preprocessing import LabelEncoder from torch.utils.data import Dataset, DataLoader import torch.nn.functional as F import torch.nn as nn from collections import Counter torch.__version__ ``` '1.4.0' # 5.2 Pytorch处理结构化数据 ## 简介 在介绍之前,我们首先要明确下什么是结构化的数据。结构化数据,可以从名称中看出,是高度组织和整齐格式化的数据。它是可以放入表格和电子表格中的数据类型。对我们来说,结构化数据可以理解为就是一张2维的表格,例如一个csv文件,就是结构化数据,在英文一般被称作Tabular Data或者叫 structured data,下面我们来看一下结构化数据的例子。 一下文件来自于fastai的自带数据集: https://github.com/fastai/fastai/blob/master/examples/tabular.ipynb fastai样例在这里 ## 数据预处理 我们拿到的结构化数据,一般都是一个csv文件或者数据库中的一张表格,所以对于结构化的数据,我们直接使用pasdas库处理就可以了 ```python #读入文件 df = pd.read_csv('./data/adult.csv') #salary是这个数据集最后要分类的结果 df['salary'].unique() ``` array(['>=50k', '<50k'], dtype=object) ```python #查看数据类型 df.head() ```
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
```python #pandas的describe可以告诉我们整个数据集的大概结构,是非常有用的 df.describe() ```
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.000000 3.256100e+04 32074.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.079815 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572999 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000
```python #查看一共有多少数据 len(df) ``` 32561 对于模型的训练,只能够处理数字类型的数据,所以这里面我们首先要将数据分成三个类别 - 训练的结果标签:即训练的结果,通过这个结果我们就能够明确的知道我们这次训练的任务是什么,是分类的任务,还是回归的任务。 - 分类数据:这类的数据是离散的,无法通过直接输入到模型中进行训练,所以我们在预处理的时候需要优先对这部分进行处理,这也是数据预处理的主要工作之一 - 数值型数据:这类数据是直接可以输入到模型中的,但是这部分数据有可能还是离散的,所以如果需要也可以对其进行处理,并且处理后会对训练的精度有很大的提升,这里暂且不讨论 ```python #训练结果 result_var = 'salary' #分类型数据 cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race','sex','native-country'] #数值型数据 cont_names = ['age', 'fnlwgt', 'education-num','capital-gain','capital-loss','hours-per-week'] ``` 人工确认完数据类型后,我们可以看一下分类类型数据的数量和分布情况 ```python for col in df.columns: if col in cat_names: ccol=Counter(df[col]) print(col,len(ccol),ccol) print("\r\n") ``` workclass 9 Counter({' Private': 22696, ' Self-emp-not-inc': 2541, ' Local-gov': 2093, ' ?': 1836, ' State-gov': 1298, ' Self-emp-inc': 1116, ' Federal-gov': 960, ' Without-pay': 14, ' Never-worked': 7}) education 16 Counter({' HS-grad': 10501, ' Some-college': 7291, ' Bachelors': 5355, ' Masters': 1723, ' Assoc-voc': 1382, ' 11th': 1175, ' Assoc-acdm': 1067, ' 10th': 933, ' 7th-8th': 646, ' Prof-school': 576, ' 9th': 514, ' 12th': 433, ' Doctorate': 413, ' 5th-6th': 333, ' 1st-4th': 168, ' Preschool': 51}) marital-status 7 Counter({' Married-civ-spouse': 14976, ' Never-married': 10683, ' Divorced': 4443, ' Separated': 1025, ' Widowed': 993, ' Married-spouse-absent': 418, ' Married-AF-spouse': 23}) occupation 16 Counter({' Prof-specialty': 4073, ' Craft-repair': 4028, ' Exec-managerial': 4009, ' Adm-clerical': 3720, ' Sales': 3590, ' Other-service': 3247, ' Machine-op-inspct': 1968, ' ?': 1820, ' Transport-moving': 1566, ' Handlers-cleaners': 1347, ' Farming-fishing': 977, ' Tech-support': 905, ' Protective-serv': 643, nan: 512, ' Priv-house-serv': 147, ' Armed-Forces': 9}) relationship 6 Counter({' Husband': 13193, ' Not-in-family': 8305, ' Own-child': 5068, ' Unmarried': 3446, ' Wife': 1568, ' Other-relative': 981}) race 5 Counter({' White': 27816, ' Black': 3124, ' Asian-Pac-Islander': 1039, ' Amer-Indian-Eskimo': 311, ' Other': 271}) sex 2 Counter({' Male': 21790, ' Female': 10771}) native-country 42 Counter({' United-States': 29170, ' Mexico': 643, ' ?': 583, ' Philippines': 198, ' Germany': 137, ' Canada': 121, ' Puerto-Rico': 114, ' El-Salvador': 106, ' India': 100, ' Cuba': 95, ' England': 90, ' Jamaica': 81, ' South': 80, ' China': 75, ' Italy': 73, ' Dominican-Republic': 70, ' Vietnam': 67, ' Guatemala': 64, ' Japan': 62, ' Poland': 60, ' Columbia': 59, ' Taiwan': 51, ' Haiti': 44, ' Iran': 43, ' Portugal': 37, ' Nicaragua': 34, ' Peru': 31, ' Greece': 29, ' France': 29, ' Ecuador': 28, ' Ireland': 24, ' Hong': 20, ' Trinadad&Tobago': 19, ' Cambodia': 19, ' Thailand': 18, ' Laos': 18, ' Yugoslavia': 16, ' Outlying-US(Guam-USVI-etc)': 14, ' Hungary': 13, ' Honduras': 13, ' Scotland': 12, ' Holand-Netherlands': 1}) 下一步就是要将分类型数据转成数字型数据,在这部分里面,我们还做了对于缺失数据的填充 ```python for col in df.columns: if col in cat_names: df[col].fillna('---') df[col] = LabelEncoder().fit_transform(df[col].astype(str)) if col in cont_names: df[col]=df[col].fillna(0) ``` 上面的代码中: 我们首先使用了pandas的fillna函数对分类的数据做了空值的填充,这里面标识成一个与其他现有值不一样的值就可以,这里面我使用的三个中划线 --- 作为标记,然后就是使用了sklearn的LabelEncoder函数进行了数据的处理 然后有对我们的数值型的数据做了0填充的处理,对于数值型数据的填充,也可以使用平均值,或者其他方式填充,这个不是我们的重点,就不详细说明了。 ```python df.head() ```
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 4 101320 7 12.0 2 15 5 4 0 0 1902 40 39 >=50k
1 44 4 236746 12 14.0 0 4 1 4 1 10520 0 45 39 >=50k
2 38 4 96185 11 0.0 0 15 4 2 0 0 0 32 39 <50k
3 38 5 112847 14 15.0 2 10 0 1 1 0 0 40 39 >=50k
4 42 6 82297 5 0.0 2 8 5 2 0 0 0 50 39 <50k
数据处理完成后可以看到,现在所有的数据都是数字类型的了,可以直接输入到模型进行训练了. ```python #分割下训练数据和标签 Y = df['salary'] Y_label = LabelEncoder() Y=Y_label.fit_transform(Y) Y ``` array([1, 1, 0, ..., 1, 0, 0]) ```python X=df.drop(columns=result_var) X ```
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 49 4 101320 7 12.0 2 15 5 4 0 0 1902 40 39
1 44 4 236746 12 14.0 0 4 1 4 1 10520 0 45 39
2 38 4 96185 11 0.0 0 15 4 2 0 0 0 32 39
3 38 5 112847 14 15.0 2 10 0 1 1 0 0 40 39
4 42 6 82297 5 0.0 2 8 5 2 0 0 0 50 39
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 36 4 297449 9 13.0 0 10 1 4 1 14084 0 40 39
32557 23 0 123983 9 13.0 4 0 3 3 1 0 0 40 39
32558 53 4 157069 7 12.0 2 7 0 4 1 0 0 40 39
32559 32 2 217296 11 9.0 2 14 5 4 0 4064 0 22 39
32560 26 4 182308 15 10.0 2 10 0 4 1 0 0 40 39

32561 rows × 14 columns

以上,基本的数据预处理已经完成了,上面展示的只是一些必要的处理,如果要提高训练准确率还有很多技巧,这里就不详细说明了。 ## 定义数据集 要使用pytorch处理数据,肯定要使用Dataset进行数据集的定义,下面定义一个简单的数据集 ```python class tabularDataset(Dataset): def __init__(self, X, Y): self.x = X.values self.y = Y def __len__(self): return len(self.y) def __getitem__(self, idx): return (self.x[idx], self.y[idx]) ``` ```python train_ds = tabularDataset(X, Y) ``` 可以直接使用索引访问定义好的数据集中的数据 ```python train_ds[0] ``` (array([4.9000e+01, 4.0000e+00, 1.0132e+05, 7.0000e+00, 1.2000e+01, 2.0000e+00, 1.5000e+01, 5.0000e+00, 4.0000e+00, 0.0000e+00, 0.0000e+00, 1.9020e+03, 4.0000e+01, 3.9000e+01]), 1) ## 定义模型 数据已经准备完毕了,下一步就是要定义我们的模型了,这里使用了3层线性层的简单模型作为处理 ```python class tabularModel(nn.Module): def __init__(self): super().__init__() self.lin1 = nn.Linear(14, 500) self.lin2 = nn.Linear(500, 100) self.lin3 = nn.Linear(100, 2) self.bn_in = nn.BatchNorm1d(14) self.bn1 = nn.BatchNorm1d(500) self.bn2 = nn.BatchNorm1d(100) def forward(self,x_in): #print(x_in.shape) x = self.bn_in(x_in) x = F.relu(self.lin1(x)) x = self.bn1(x) #print(x) x = F.relu(self.lin2(x)) x = self.bn2(x) #print(x) x = self.lin3(x) x=torch.sigmoid(x) return x ``` 在定义模型的时候看到了我们加入了Batch Normalization来做批量的归一化: 批量归一化的内容请见这篇文章:https://mp.weixin.qq.com/s/FFLQBocTZGqnyN79JbSYcQ 或者扫描这个二维码,在微信中查看: ![](https://raw.githubusercontent.com/zergtant/pytorch-handbook/master/deephub.jpg) ## 训练 ```python #训练前指定使用的设备 DEVICE=torch.device("cpu") if torch.cuda.is_available(): DEVICE=torch.device("cuda") print(DEVICE) ``` cuda ```python #损失函数 criterion =nn.CrossEntropyLoss() ``` ```python #实例化模型 model = tabularModel().to(DEVICE) print(model) ``` tabularModel( (lin1): Linear(in_features=14, out_features=500, bias=True) (lin2): Linear(in_features=500, out_features=100, bias=True) (lin3): Linear(in_features=100, out_features=2, bias=True) (bn_in): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (bn1): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (bn2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ```python #测试模型是否没问题 rn=torch.rand(3,14).to(DEVICE) model(rn) ``` tensor([[0.5110, 0.1931], [0.4274, 0.5801], [0.5549, 0.7322]], device='cuda:0', grad_fn=) ```python #学习率 LEARNING_RATE=0.01 #BS batch_size = 1024 #优化器 optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE) ``` ```python #DataLoader加载数据 train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True) ``` 以上的基本步骤是每个训练过程都需要的,所以就不多介绍了,下面开始进行模型的训练 ```python %%time model.train() #训练10轮 TOTAL_EPOCHS=100 #记录损失函数 losses = []; for epoch in range(TOTAL_EPOCHS): for i, (x, y) in enumerate(train_dl): x = x.float().to(DEVICE) #输入必须未float类型 y = y.long().to(DEVICE) #结果标签必须未long类型 #清零 optimizer.zero_grad() outputs = model(x) #计算损失函数 loss = criterion(outputs, y) loss.backward() optimizer.step() losses.append(loss.cpu().data.item()) print ('Epoch : %d/%d, Loss: %.4f'%(epoch+1, TOTAL_EPOCHS, np.mean(losses))) ``` Epoch : 1/100, Loss: 0.4936 Epoch : 2/100, Loss: 0.4766 Epoch : 3/100, Loss: 0.4693 Epoch : 4/100, Loss: 0.4653 Epoch : 5/100, Loss: 0.4627 Epoch : 6/100, Loss: 0.4606 Epoch : 7/100, Loss: 0.4591 Epoch : 8/100, Loss: 0.4582 Epoch : 9/100, Loss: 0.4573 Epoch : 10/100, Loss: 0.4565 Epoch : 11/100, Loss: 0.4557 Epoch : 12/100, Loss: 0.4551 Epoch : 13/100, Loss: 0.4546 Epoch : 14/100, Loss: 0.4540 Epoch : 15/100, Loss: 0.4535 Epoch : 16/100, Loss: 0.4530 Epoch : 17/100, Loss: 0.4526 Epoch : 18/100, Loss: 0.4522 Epoch : 19/100, Loss: 0.4519 Epoch : 20/100, Loss: 0.4515 Epoch : 21/100, Loss: 0.4511 Epoch : 22/100, Loss: 0.4508 Epoch : 23/100, Loss: 0.4504 Epoch : 24/100, Loss: 0.4502 Epoch : 25/100, Loss: 0.4499 Epoch : 26/100, Loss: 0.4496 Epoch : 27/100, Loss: 0.4492 Epoch : 28/100, Loss: 0.4489 Epoch : 29/100, Loss: 0.4486 Epoch : 30/100, Loss: 0.4483 Epoch : 31/100, Loss: 0.4480 Epoch : 32/100, Loss: 0.4477 Epoch : 33/100, Loss: 0.4474 Epoch : 34/100, Loss: 0.4471 Epoch : 35/100, Loss: 0.4469 Epoch : 36/100, Loss: 0.4466 Epoch : 37/100, Loss: 0.4463 Epoch : 38/100, Loss: 0.4460 Epoch : 39/100, Loss: 0.4458 Epoch : 40/100, Loss: 0.4455 Epoch : 41/100, Loss: 0.4452 Epoch : 42/100, Loss: 0.4449 Epoch : 43/100, Loss: 0.4447 Epoch : 44/100, Loss: 0.4445 Epoch : 45/100, Loss: 0.4442 Epoch : 46/100, Loss: 0.4439 Epoch : 47/100, Loss: 0.4437 Epoch : 48/100, Loss: 0.4434 Epoch : 49/100, Loss: 0.4432 Epoch : 50/100, Loss: 0.4429 Epoch : 51/100, Loss: 0.4426 Epoch : 52/100, Loss: 0.4424 Epoch : 53/100, Loss: 0.4421 Epoch : 54/100, Loss: 0.4419 Epoch : 55/100, Loss: 0.4417 Epoch : 56/100, Loss: 0.4414 Epoch : 57/100, Loss: 0.4411 Epoch : 58/100, Loss: 0.4409 Epoch : 59/100, Loss: 0.4406 Epoch : 60/100, Loss: 0.4404 Epoch : 61/100, Loss: 0.4402 Epoch : 62/100, Loss: 0.4399 Epoch : 63/100, Loss: 0.4397 Epoch : 64/100, Loss: 0.4394 Epoch : 65/100, Loss: 0.4392 Epoch : 66/100, Loss: 0.4390 Epoch : 67/100, Loss: 0.4387 Epoch : 68/100, Loss: 0.4384 Epoch : 69/100, Loss: 0.4382 Epoch : 70/100, Loss: 0.4380 Epoch : 71/100, Loss: 0.4377 Epoch : 72/100, Loss: 0.4375 Epoch : 73/100, Loss: 0.4373 Epoch : 74/100, Loss: 0.4371 Epoch : 75/100, Loss: 0.4368 Epoch : 76/100, Loss: 0.4366 Epoch : 77/100, Loss: 0.4364 Epoch : 78/100, Loss: 0.4362 Epoch : 79/100, Loss: 0.4360 Epoch : 80/100, Loss: 0.4358 Epoch : 81/100, Loss: 0.4356 Epoch : 82/100, Loss: 0.4353 Epoch : 83/100, Loss: 0.4351 Epoch : 84/100, Loss: 0.4348 Epoch : 85/100, Loss: 0.4346 Epoch : 86/100, Loss: 0.4344 Epoch : 87/100, Loss: 0.4342 Epoch : 88/100, Loss: 0.4340 Epoch : 89/100, Loss: 0.4338 Epoch : 90/100, Loss: 0.4336 Epoch : 91/100, Loss: 0.4333 Epoch : 92/100, Loss: 0.4331 Epoch : 93/100, Loss: 0.4329 Epoch : 94/100, Loss: 0.4328 Epoch : 95/100, Loss: 0.4326 Epoch : 96/100, Loss: 0.4324 Epoch : 97/100, Loss: 0.4322 Epoch : 98/100, Loss: 0.4320 Epoch : 99/100, Loss: 0.4318 Epoch : 100/100, Loss: 0.4316 Wall time: 49.6 s 训练完成后我们可以看一下模型的准确率 ```python model.eval() correct = 0 total = 0 for i,(x, y) in enumerate(train_dl): x = x.float().to(DEVICE) y = y.long() outputs = model(x).cpu() _, predicted = torch.max(outputs.data, 1) total += y.size(0) correct += (predicted == y).sum() print('准确率: %.4f %%' % (100 * correct / total)) ``` 准确率: 89.0000 % 以上就是基本的流程了