Sequence labeling with embedding-level attention

On November 16, 2018, my first SCI paper was accepted! It was published in IEEE Access, which currently has an impact factor of 3.557 and sits in JCR Q3 (reportedly about to move up to Q2).

The paper is titled "An Attentive Neural Sequence Labeling Model for Adverse Drug Reactions Mentions Extraction". It tackles a sequence labeling task with an embedding-level attention mechanism and an auxiliary classifier, and achieves state-of-the-art results on two adverse drug reaction extraction datasets. This post documents the paper; the code will also be released on my GitHub: code for sequence labeling. Paper link: https://ieeexplore.ieee.org/document/8540859


This post walks through the paper in several parts:

  1. What the task is
  2. How others have approached it, and where those approaches fall short
  3. My model
  4. Data, code, and results

What the task is

Given a piece of text, train a model that can label the spans in it that mention adverse drug reactions.

[Figure: two example sentences with the adverse drug reaction (I-ADR) spans highlighted]

The figure above shows two texts (two sentences, in fact). Under the I-O tagging scheme, the orange spans are the adverse drug reaction (I-ADR) parts the model has to label, and every other token in the sentence is O. In effect, the model performs a binary classification for each word in the sentence; a toy example is sketched below.
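To make this concrete, here is a toy Python illustration (a made-up sentence, not from the datasets used in the paper):

tokens = ['this', 'drug', 'gave', 'me', 'terrible', 'headaches']
labels = ['O',    'O',    'O',    'O',  'I-ADR',    'I-ADR']

# Sequence labeling reduces to a per-token decision: I-ADR or O.
for tok, lab in zip(tokens, labels):
    print(tok, lab)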

Two abbreviations worth noting: ADR stands for Adverse Drug Reaction and ADE for Adverse Drug Event. They mean the same thing and are used interchangeably.

How others do it & where it falls short

Conditional Random Field

A Conditional Random Field (CRF) is one of the most widely used algorithms in natural language processing in recent years, applied to syntactic parsing, named entity recognition, part-of-speech tagging, and more. Most mainstream sequence labeling models today (the monstrous BERT aside) are essentially a bidirectional LSTM followed by a CRF (BiLSTM + CRF).

The main prior work here is Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features (earlier work on ADR labeling), which proposed ADRMine, a CRF-based system for ADR extraction. ADRMine takes hand-crafted features as input, such as context features, whether a word appears in an ADR lexicon, POS tags, and the embedding cluster features that are the paper's main contribution. The results were decent for their time, but a model like this needs many hand-crafted features, which is laborious and time-consuming.

Bi-LSTM

Yes, you read that right: with nothing but a plain bidirectional LSTM, the paper Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts got published in JAMIA (Journal of the American Medical Informatics Association) (show me the contribution, please (^▽^)). That said, my paper largely builds on this one, so I am grateful to the author for sharing the dataset and her source code. Honestly, I never quite understood her results: she used task-specific word embeddings, yet simply swapping in generic 300-dimensional GloVe vectors already put me well ahead (maybe the author left some headroom on purpose so we could pick up her leftovers).

Semi-Supervised Bi-LSTM

Semi-supervised learning: the paper Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction applies semi-supervised learning to this sequence labeling task. Concretely, the authors used drug names as search keywords and crawled about 100,000 related tweets spanning two months. The model has two parts. In the unsupervised part, the drug name in each crawled tweet is masked out and replaced with a <DRUG> token; then, following the CBOW idea (BERT borrows the same idea, though it calls it a cloze task), a bidirectional LSTM is trained to predict the masked drug name from its context. This is essentially a bidirectional language model, which looks quite ahead of its time in hindsight, since BERT's bidirectional language modeling is the same idea. The second part is supervised sequence classification, where the pretrained bidirectional LSTM is used to predict a label for every token in a tweet. The results are good, but needing extra crawled data plus bidirectional language model training makes the approach somewhat cumbersome.

My model

Main ideas: 1) In addition to word embeddings we also use character embeddings, but instead of simply concatenating the two granularities as other papers do, we apply an embedding-level attention mechanism. This assigns each of the two feature types a weight, so the model learns during training which granularity matters more for the prediction. 2) Besides the model's normal output (the main classifier), we introduce an auxiliary classifier built on the output of the attention layer. We add the main and auxiliary classifier outputs, which amounts to exploiting an intermediate layer of the model and enriching the feature representation, further improving performance.

[Figure: overall model architecture]

Below I explain in detail how the character embeddings are obtained, how the embedding-level attention combines the two feature types, and how the auxiliary classifier is used.

Character embedding

Compared with word embeddings, character embeddings capture finer-grained features of a word, such as prefixes, suffixes, and capitalization, and they improve classification of OOV tokens. There are two common ways to obtain character embeddings: Char CNN and Char LSTM. I compared both for producing character representations, and Char LSTM worked better than Char CNN, although this may vary by dataset; see the comparison in the COLING 2018 best paper Design Challenges and Misconceptions in Neural Sequence Labeling. A minimal sketch of a char-level encoder follows.
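For reference, here is a minimal Keras sketch of a char-level BiGRU encoder for a single word; maxlen_t, charsize and char_dim are illustrative values rather than the paper's settings (the full model code appears later in this post):

from keras.layers import Input, Embedding, Bidirectional, GRU
from keras.models import Model

maxlen_t = 25    # max characters per word (assumed)
charsize = 100   # character vocabulary size (assumed)
char_dim = 100   # character embedding dimension (assumed)

char_in = Input(shape=(maxlen_t,), dtype='int32')
char_emb = Embedding(input_dim=charsize, output_dim=char_dim)(char_in)
# Concatenate a forward and a backward GRU into one word-level character representation
char_repr = Bidirectional(GRU(150), merge_mode='concat')(char_emb)
char_encoder = Model(char_in, char_repr)  # output shape: (batch, 300)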

[Figure: obtaining the character embedding]

First we tokenize each sentence and compute the maximum sentence length $maxlen_s$ and the maximum word length $maxlen_t$ over the whole dataset, then pad twice: once at the word level and once at the character level. The input to the character embedding layer therefore has shape (batch_size, $maxlen_s$, $maxlen_t$). To apply attention against the word embeddings, the output of the character embedding layer must match the dimensionality of the word embedding layer's output, which can be done with a reshape in Keras.
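Here is a small sketch of the two-level padding with made-up $maxlen_s$ and $maxlen_t$ values, using pad_sequences at the character level and zero rows at the word level:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

sents_as_char_ids = [
    [[3, 1, 4], [1, 5, 9, 2, 6]],   # one sentence with two words, as char indices
]
maxlen_s, maxlen_t = 4, 6

padded = []
for sent in sents_as_char_ids:
    words = pad_sequences(sent, maxlen=maxlen_t)                              # pad each word to maxlen_t
    pad_rows = np.zeros((maxlen_s - len(words), maxlen_t), dtype=words.dtype)
    padded.append(np.vstack([pad_rows, words]))                               # pad the sentence to maxlen_s

batch = np.array(padded)
print(batch.shape)  # (1, maxlen_s, maxlen_t) == (1, 4, 6)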

Embedding-level attention

The idea of combining the two embedding granularities with attention comes from the paper Attending to Characters in Neural Sequence Labeling Models. It is essentially a two-layer perceptron:

$a = \sigma(V_a \tanh(W_a x + U_a q))$
$\tilde{x} = a \cdot x + (1 - a) \cdot q$

Here $x$ is the word embedding representation, $q$ is the character embedding representation, and $\sigma$ is the sigmoid function, used to produce the weight $a$ over the two embedding granularities. The new representation $\tilde{x}$ is obtained by weighting the two embeddings with $a$ and $1-a$ and summing them.
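A small numpy sketch of these two formulas for a single token (the parameter shapes are illustrative, not the trained weights):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 300
rng = np.random.RandomState(0)
x, q = rng.randn(d), rng.randn(d)                      # word and char representations
W_a, U_a, V_a = rng.randn(d, d), rng.randn(d, d), rng.randn(d, d)

a = sigmoid(V_a @ np.tanh(W_a @ x + U_a @ q))          # per-dimension gate in (0, 1)
x_tilde = a * x + (1.0 - a) * q                        # gated mix of the two embeddings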

Auxiliary classifier

The idea of the auxiliary classifier (thanks to my labmate Yao Hongdou, who works on computer vision) comes mainly from GoogLeNet and Fully Convolutional Networks, which target image classification and image segmentation respectively and combine intermediate-layer outputs with the final-layer output (by summation or concatenation) to good effect. Our model simply adds them: the attention layer's output is passed through a Dense layer and a softmax, and added to the bidirectional GRU's output (also passed through a Dense layer and a softmax) to produce the model's final result. The formulas are:

$y_{main} = softmax(W_th + b_t)$
$y_{auxiliary} = softmax(W_p\tilde{x} + b_p)$
$y = y_{main} + y_{auxiliary}$

where $h$ is the concatenation of the bidirectional GRU outputs at each time step.
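And a matching numpy sketch of how the two classifiers are combined for one time step (dimensions are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

nclasses, d_h, d_x = 3, 128, 300
rng = np.random.RandomState(1)
h, x_tilde = rng.randn(d_h), rng.randn(d_x)
W_t, b_t = rng.randn(nclasses, d_h), rng.randn(nclasses)
W_p, b_p = rng.randn(nclasses, d_x), rng.randn(nclasses)

y_main = softmax(W_t @ h + b_t)
y_aux = softmax(W_p @ x_tilde + b_p)
y = y_main + y_aux   # the summed scores used for the final prediction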

Data, code & results

Data

We use two datasets. The first is the Twitter data used in Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts, consisting of Twitter ADR Dataset v1.0 plus the tweets the author added, 844 tweets in total. The second is the data used in An Attentive Sequence Model for Adverse Drug Event Extraction from Biomedical Text, called ADE-Corpus-V2: sentences extracted from PubMed abstracts, 6,821 in total.

For the tagging scheme we follow the original papers and use I-O tagging: every word in a sentence is either I-ADR or O. The Twitter dataset additionally has an I-Indication class (it comes with the original data).

Code

Twitter data code

Twitter data preprocessing

The Twitter data can be requested from the author of Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts, who is very nice; her email is acocos@seas.upenn.edu. The raw data lives in a raw folder with the following structure:

  • raw
    • train
      • asu_train
      • chop_train
    • test
      • asu_test
      • chop_test

Here asu_train/test stands for the Twitter ADR Dataset v1.0. "Stands for" because it is against Twitter's Terms of Service to publish the text of tweets, so the official release only provides tweet ids and a download script for you to crawl the tweets yourself. Just as on Weibo, people often delete their own posts, so the amount of data each person can crawl with the script will differ. To keep the experimental data consistent and the comparison fair, I asked the author of the paper mentioned above for her data, so asu_train/test matches exactly what she crawled at the time. The other part, chop_train/test, is additional data the author crawled herself to supplement the Twitter dataset. Altogether the training set (asu_train + chop_train) has 634 tweets and the test set (asu_test + chop_test) has 210. Each record has 8 fields, described below:

  • id: tweet id
  • start: start offset of the labeled span
  • end: end offset of the labeled span
  • semantic_type: the label
  • span: the labeled text
  • reldrug: related drug
  • tgtdrug: target drug
  • text: the full tweet text

The main preprocessing steps are:

  1. Replace @mentions with <USER>
  2. Replace http… links with <URL>
  3. Replace picture links (pic.twitter…) with <PIC>
  4. Lowercase everything

The preprocessing code is data_processing.py; the OOV analysis part at the bottom should only be run after the model has produced its prediction file:

import os
import re
import csv
from nltk.tokenize import TweetTokenizer
import sys
import numpy as np
import pickle as pkl

# PICKLEFILE: the processed data will be stored as a .pkl file
# SEQLAB_DATA_DIR: directory holding the data
# LABELSET: the label set; the Beneficial part is not used
SETTINGS = {
    'PICKLEFILE': 'H:/twitter_adr/data/processed/ade.full.pkl',
    'SEQLAB_DATA_DIR': 'H:/twitter_adr/data',

    ## The B-I-O labeling scheme
    'LABELSET': {'ADR': {'b': 'I-ADR', 'i': 'I-ADR'},
                 'Indication': {'b': 'I-Indication', 'i': 'I-Indication'},
                 'Beneficial': {'b': 'I-Indication', 'i': 'I-Indication'}}
}

Processed_data_dir = SETTINGS['SEQLAB_DATA_DIR']+'/processed'
Raw_data_dir = SETTINGS['SEQLAB_DATA_DIR']+'/raw'
Out_file = 'adr.full.pkl'

labelset = SETTINGS['LABELSET']
tokset = {'<UNK>'}
raw_headers = ['id', 'start', 'end', 'semantic_type', 'span', 'reldrug', 'tgtdrug', 'text']

# Create the directory for the processed data
if not os.path.isdir(Processed_data_dir):
    os.makedirs(Processed_data_dir)

trainfiles = 'asu,chop'.split(',')
testfiles = 'asu,chop'.split(',')

# Helper functions for merging labels
def comp_labels(l1, l2):
    if l1 != 'O':
        return l1
    elif l2 != 'O':
        return l2
    else:
        return 'O'

def combine_labels(seq_old, seq_new):
    if len(seq_old) == 0:
        return seq_new
    seq_combined = []
    for (o, n) in zip(seq_old, seq_new):
        seq_combined.append(comp_labels(o, n))
    return seq_combined

# Replace URLs and picture links, and lowercase everything
def clean_str(string):

    string = re.sub(r'http\S+', '<URL>', string)
    string = re.sub(r'pic.twitter\S+', '<PIC>', string)

    return string.strip().lower()

# Main preprocessing function: replace @mentions, tokenize, assign a label to
# every token in the sentence, etc.
def create_adr_dataset(t, files, tokset, labelset):

    tokset |= {'<UNK>'}
    atmention = re.compile('@\w+')
    tt = TweetTokenizer()

    try:
        os.makedirs(os.path.join(Processed_data_dir, 'train'))
        os.makedirs(os.path.join(Processed_data_dir, 'test'))
    except:
        pass

    for f in ['_'.join([d, t]) for d in files]:
        processed_rows = {}
        fout = open(re.sub(r'\\', r'/', os.path.join(Processed_data_dir, t, f)), 'w', newline='')
        fnames = raw_headers + ['tokens', 'labels', 'norm_text']
        wrt = csv.DictWriter(fout, fieldnames=fnames)
        wrt.writeheader()
        fname = re.sub(r"\\", r"/", os.path.join(Raw_data_dir, t, f))
        with open(fname, 'r', errors='ignore') as fin:
            dr = csv.DictReader(fin)
            for row in dr:
                # Pull from processed_rows dir so we can combine multiple annotations in a single tweet
                pr = processed_rows.get(row['id'], {h: row.get(h, []) for h in fnames})

                text = row['text']
                span = row['span']

                text = clean_str(text)
                span = clean_str(span)

                # Tokenize
                tok_text = tt.tokenize(text)
                tok_span = tt.tokenize(span)

                # Add sequence labels to raw data
                labels = ['O'] * len(tok_text)
                if len(row['span']) > 0 and row['semantic_type'] != 'NEG':
                    s = row['semantic_type']
                    for i in range(len(tok_text)):
                        if tok_text[i:i+len(tok_span)] == tok_span:
                            labels[i] = labelset[s]['b']
                            if len(tok_span) > 1:
                                labels[i+1:i+len(tok_span)] = [labelset[s]['i']] * (len(tok_span)-1)

                # Combine spans and labels if duplicate
                pr['labels'] = combine_labels(pr['labels'], labels)
                if pr['span'] != row['span']:
                    pr['span'] = '|'.join([pr['span'], row['span']])
                pr['tokens'] = tok_text

                # Normalize text
                tok_text = [ttw if not atmention.match(ttw) else '<USER>' for ttw in tok_text]  # normalize @user
                lower_text = [w.lower() for w in tok_text]
                pr['norm_text'] = lower_text
                tokset |= set(lower_text)
                processed_rows[row['id']] = pr
        for pr, dct in processed_rows.items():
            wrt.writerow(dct)
        fout.close()
    return tokset

tokset |= create_adr_dataset('train', trainfiles, tokset, labelset)
tokset |= create_adr_dataset('test', testfiles, tokset, labelset)

def flatten(l):
    return [item for sublist in l for item in sublist]

# Build index dictionaries
labels = ['O'] + sorted(list(set(flatten([subdict.values() for subdict in labelset.values()])))) + ['<UNK>']
labels2idx = dict(zip(labels, range(1,len(labels)+1)))
tok2idx = dict(zip(tokset, range(1,len(tokset)+1))) # leave 0 for padding

train_toks_raw = []
train_lex_raw = []
train_y_raw = []
valid_toks_raw = []
valid_lex_raw = []
valid_y_raw = []
t_toks = []
t_lex = []
t_y = []
t_class = []

def parselist(strlist):
    '''
    Parse list from string representation of list
    :param strlist: string
    :return: list
    '''
    return [w[1:-1] for w in strlist[1:-1].split(', ')]

for dtype in trainfiles:
    with open(re.sub(r"\\", r"/", os.path.join(Processed_data_dir, 'train', dtype+'_train')), 'r') as fin:
        rd = csv.DictReader(fin)
        for row in rd:
            t_toks.append(parselist(row['tokens']))
            t_lex.append(parselist(row['norm_text']))
            t_y.append(parselist(row['labels']))
            if '<UNK>' in parselist(row['labels']):
                sys.stderr.write('<UNK> found in labels for tweet %s' % row['tokens'])
            t_class.append(row['semantic_type'])


def vectorize(listoftoklists, idxdict):
    '''
    Turn each list of tokens or labels in listoftoklists to an equivalent list of indices
    :param listoftoklists: list of lists
    :param idxdict: {tok->int}
    :return: list of np.array
    '''
    res = []
    for toklist in listoftoklists:
        res.append(np.array(list(map(lambda x: idxdict.get(x, idxdict['<UNK>']), toklist))).astype('int32'))
    return res

def load_adefull(fname):
    if not os.path.isfile(fname):
        print('Unable to find file', fname)
        return None
    with open(fname, 'rb') as f:
        train_set, valid_set, test_set, dicts = pkl.load(f)
    return train_set, valid_set, test_set, dicts

train_toks_raw = t_toks
train_lex_raw = t_lex
train_y_raw = t_y
valid_toks_raw = []
valid_lex_raw = []
valid_y_raw = []


test_toks_raw = []
test_lex_raw = []
test_y_raw = []
for dtype in testfiles:
    with open(os.path.join(Processed_data_dir, 'test', dtype+'_test'), 'r') as fin:
        rd = csv.DictReader(fin)
        for row in rd:
            test_toks_raw.append(parselist(row['tokens']))
            test_lex_raw.append(parselist(row['norm_text']))
            test_y_raw.append(parselist(row['labels']))
# Convert each sentence of normalized tokens and labels into arrays of indices
train_lex = vectorize(train_lex_raw, tok2idx)
train_y = vectorize(train_y_raw, labels2idx)
valid_lex = vectorize(valid_lex_raw, tok2idx)
valid_y = vectorize(valid_y_raw, labels2idx)
test_lex = vectorize(test_lex_raw, tok2idx)
test_y = vectorize(test_y_raw, labels2idx)

# Pickle the resulting data set
with open(os.path.join(Processed_data_dir, Out_file), 'wb') as fout:
    pkl.dump([[train_toks_raw, train_lex, train_y], [valid_toks_raw, valid_lex, valid_y], [test_toks_raw, test_lex, test_y],
              {'labels2idx': labels2idx, 'words2idx': tok2idx}], fout)

"""
# OOV分析

from gensim.models import KeyedVectors

glove_300d_path = 'H:/twitter_adr/embeddings/glove.840B.300d.txt'
print("Loading embeddings...")
w2v = KeyedVectors.load_word2vec_format(glove_300d_path, binary=False, unicode_errors='ignore')

IV = []
OOTV = []
OOEV = []
OOBV = []

train_tokens = []

for train_senc in train_toks_raw:
for train_token in train_senc:
train_tokens.append(train_token)

unique_train_tokens = list(set(train_tokens))

test_tokens = []

for test_senc in test_toks_raw:
for test_token in test_senc:
test_tokens.append(test_token)

unique_test_tokens = list(set(test_tokens))

all_tokens = list(set(unique_train_tokens + unique_test_tokens))

for i in test_tokens:
if i in train_tokens and i in w2v:
IV.append(i)

if i in w2v and i not in train_tokens:
OOTV.append(i)

if i in train_tokens and i not in w2v:
OOEV.append(i)

if i not in train_tokens and i not in w2v:
OOBV.append(i)

# result_file是你预测后的文件的路径,注意需要删掉预测文件底部的F1等指标的显示信息
result_file = list(open('your prediction path', 'r'))

# 输入一个文件地址用来写入OOV分析结果
with open('your new file path', 'w') as fout:
bos = 'BOS\tO\tO\n'
eos = 'EOS\tO\tO\n'

for line in bgru_attention:
line = line.strip()
line = line.split("\t")
if line[0] == 'BOS':
fout.write(bos)
elif line[0] == 'EOS':
fout.write(eos)
# 每次注释掉其余三行来获取对应的未注释的那行的OOV分析结果
elif line[0] in IV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOTV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOEV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOBV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
fout.write('\t'.join([line[0], line[1], line[2]])+'\n')
else:
fout.write('\t'.join([line[0], 'O', 'O'])+'\n')

# 调用approximateMatch的get_approx_match方法,计算OOV分析结果
import approximateMatch
scores = approximateMatch.get_approx_match('your new file path')
"""

After running the preprocessing code above, train and test folders containing the processed data are created under processed (which sits next to raw in the data directory), along with an adr.full.pkl file that stores all the processed data.
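For reference, the pickle can be loaded back like this (a sketch assuming the same list layout the script above dumps):

import pickle as pkl

with open('H:/twitter_adr/data/processed/adr.full.pkl', 'rb') as f:
    train_set, valid_set, test_set, dicts = pkl.load(f)

train_toks, train_lex, train_y = train_set
print(len(train_toks), 'training tweets')
print(dicts['labels2idx'])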

Twitter data model

model.py

import numpy as np
from gensim.models import KeyedVectors
import sys, os, re
import pickle as pkl
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Lambda, merge, dot, Subtract
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.core import Dropout, Activation, Reshape
from keras.models import Model
import approximateMatch
from keras.layers.wrappers import Bidirectional, TimeDistributed
from keras.layers.merge import Concatenate
import keras.backend as K
import collections


# Paths to the word vectors and to the data produced by the preprocessing step
glove_300d_path = 'H:/twitter_adr/embeddings/glove.840B.300d.txt'
datapickle_path = 'H:/twitter_adr/data/processed/adr.full.pkl'

# Random seed
seed = 10
np.random.seed(seed)

# Load the word embeddings
print("Loading word embeddings...")
w2v = KeyedVectors.load_word2vec_format(glove_300d_path, binary=False, unicode_errors='ignore')
print('word embeddings loading done!')

# Function to load the data
def load_adefull(fname):
    if not os.path.isfile(fname):
        print('Unable to find file', fname)
        return None
    with open(fname, 'rb') as f:
        train_set, valid_set, test_set, dicts = pkl.load(f)
    return train_set, valid_set, test_set, dicts

# Load the data. Because the dataset is small, we do not split a validation
# set off the training set.
train_set, valid_set, test_set, dic = load_adefull(datapickle_path)
idx2label = dict((k, v) for v, k in dic['labels2idx'].items())
idx2word = dict((k, v) for v, k in dic['words2idx'].items())

# 0 is used as padding
if 0 in idx2label:
    sys.stderr.write('Index 0 found in labels2idx: data may be lost because 0 used as padding\n')
if 0 in idx2word:
    sys.stderr.write('Index 0 found in words2idx: data may be lost because 0 used as padding\n')
idx2word[0] = 'PAD'
idx2label[0] = 'PAD'
idx2label.pop(4)  # drop the <UNK> label

train_toks, train_lex, train_y = train_set
test_toks, test_lex, test_y = test_set

vocsize = max(idx2word.keys()) + 1
nclasses = max(idx2label.keys()) + 1

maxlen = max(max([len(l) for l in train_lex]), max([len(l) for l in test_lex]))

"""
char embedding
"""
char_per_word = []
char_word = []
char_senc = []
maxlen_char_word = 0
a = []

# Build per-word character lists; words longer than 25 characters are truncated
for s in (train_toks + test_toks):
    for w in s:
        for c in w.lower():
            char_per_word.append(c)
        if len(char_per_word) > 25:
            a.append(char_per_word)
            char_per_word = char_per_word[:25]
        if len(char_per_word) > maxlen_char_word:
            maxlen_char_word = len(char_per_word)

        char_word.append(char_per_word)
        char_per_word = []

    char_senc.append(char_word)
    char_word = []


# Build the character vocabulary
charcounts = collections.Counter()
for senc in char_senc:
    for word in senc:
        for charac in word:
            charcounts[charac] += 1
chars = [charcount[0] for charcount in charcounts.most_common()]
char2idx = {c: i+1 for i, c in enumerate(chars)}

char_word_lex = []
char_lex = []
char_word = []
for senc in char_senc:
    for word in senc:
        for charac in word:
            char_word_lex.append([char2idx[charac]])

        char_word.append(char_word_lex)
        char_word_lex = []

    char_lex.append(char_word)
    char_word = []

char_per_word = []
char_per_senc = []
char_senc = []
for s in char_lex:
    for w in s:
        for c in w:
            for e in c:
                char_per_word.append(e)
        char_per_senc.append(char_per_word)
        char_per_word = []
    char_senc.append(char_per_senc)
    char_per_senc = []

pad_char_all = []
for senc in char_senc:
    while len(senc) < 36:   # 36 is the maximum tweet length (maxlen) for this dataset
        senc.insert(0, [])
    pad_senc = pad_sequences(senc, maxlen=maxlen_char_word)
    pad_char_all.append(pad_senc)
    pad_senc = []

pad_char_all = np.array(pad_char_all)

pad_train_lex = pad_char_all[:634]
pad_test_lex = pad_char_all[634:]

idx2char = dict((k,v) for v,k in char2idx.items())
idx2char[0] = 'PAD'
charsize = max(idx2char.keys()) + 1

def init_embedding_weights(i2w, w2vmodel):
    # Create initial embedding weights matrix
    # Return: np.array with dim [vocabsize, embeddingsize]

    d = 300
    V = len(i2w)
    assert sorted(i2w.keys()) == list(range(V))  # verify indices are sequential

    emb = np.zeros([V, d])
    num_unknownwords = 0
    unknow_words = []
    for i, l in i2w.items():
        if i == 0:
            continue
        if l in w2vmodel.vocab:
            emb[i, :] = w2vmodel[l]
        else:
            num_unknownwords += 1
            unknow_words.append(l)
            emb[i] = np.random.uniform(-1, 1, d)
    return emb, num_unknownwords, unknow_words

def vectorize_set(lexlists, maxlen, V):
    nb_samples = len(lexlists)
    X = np.zeros([nb_samples, maxlen, V])
    for i, lex in enumerate(lexlists):
        for j, tok in enumerate(lex):
            X[i, j, tok] = 1
    return X

def predict_score(model, x, toks, y, pred_dir, i2l, padlen, metafile=0, fileprefix=''):

    pred_probs = model.predict(x, verbose=0)
    test_loss = model.evaluate(x=x, y=y, batch_size=1, verbose=0)
    pred = np.argmax(pred_probs, axis=2)

    N = len(toks)

    # If the name of a metafile is passed, simply write this round of predictions to file
    if metafile > 0:
        meta = open(metafile, 'a')

    fname = re.sub(r'\\', r'/', os.path.join(pred_dir, fileprefix+'approxmatch_test'))
    with open(fname, 'w') as fout:
        for i in range(N):
            bos = 'BOS\tO\tO\n'
            fout.write(bos)
            if metafile > 0:
                meta.write(bos)

            sentlen = len(toks[i])
            startind = padlen - sentlen

            preds = [i2l[j] for j in pred[i][startind:]]
            actuals = [i2l[j] for j in np.argmax(y[i], axis=1)[startind:]]
            for (w, act, p) in zip(toks[i], actuals, preds):
                line = '\t'.join([w, act, p])+'\n'
                fout.write(line)
                if metafile > 0:
                    meta.write(line)

            eos = 'EOS\tO\tO\n'
            fout.write(eos)
            if metafile > 0:
                meta.write(eos)
    scores = approximateMatch.get_approx_match(fname)
    scores['loss'] = test_loss
    if metafile > 0:
        meta.close()

    with open(fname, 'a') as fout:
        fout.write('\nTEST Approximate Matching Results:\n ADR: Precision ' + str(scores['p']) + ' Recall ' + str(scores['r']) + ' F1 ' + str(scores['f1']))
    return scores

# Pad inputs to max sequence length and turn into one-hot vectors
train_lex = pad_sequences(train_lex, maxlen=maxlen)
test_lex = pad_sequences(test_lex, maxlen=maxlen)

train_y = pad_sequences(train_y, maxlen=maxlen)
test_y = pad_sequences(test_y, maxlen=maxlen)

train_y = vectorize_set(train_y, maxlen, nclasses)
test_y = vectorize_set(test_y, maxlen, nclasses)

# Build the model
print('Building the model...')

HIDDEN_DIM = 64
BATCH_SIZE = 1
NUM_EPOCHS = 8

hiddendim = HIDDEN_DIM

main_input = Input(shape=[maxlen], dtype='int32', name='input') # (None, 36)
char_input = Input(shape=[maxlen, maxlen_char_word], dtype='int32', name='char_input') # (None, 36, 25)

embeds, num_unk, unk_words = init_embedding_weights(idx2word, w2v)

embed_dim = 300
char_embed_dim = 100

# I found that setting mask_zero to False does not hurt the results, and even helps slightly
embed = Embedding(input_dim=vocsize, output_dim=embed_dim, input_length=maxlen,
weights=[embeds], mask_zero=False, name='embedding', trainable=False)(main_input)
embed = Dropout(0.1, name='embed_dropout')(embed)

"""
双向LSTM 获取 Char embedding
"""
char_embed = Embedding(input_dim=charsize, output_dim=char_embed_dim, embeddings_initializer='lecun_uniform',
input_length=maxlen_char_word, mask_zero=False, name='char_embedding')(char_input)
s = char_embed.shape
char_embed = Lambda(lambda x: K.reshape(x, shape=(-1, s[-2], char_embed_dim)))(char_embed)

fwd_state = GRU(150, return_state=True)(char_embed)[-2]
bwd_state = GRU(150, return_state=True, go_backwards=True)(char_embed)[-2]
char_embed = Concatenate(axis=-1)([fwd_state, bwd_state])
char_embed = Lambda(lambda x: K.reshape(x, shape=[-1, s[1], 2 * 150]))(char_embed)
char_embed = Dropout(0.1, name='char_embed_dropout')(char_embed)

"""
使用attention将word embedding和character embedding结合起来
"""
W_embed = Dense(300, name='Wembed')(embed)
W_char_embed = Dense(300, name='W_charembed')(char_embed)
merged1 = merge([W_embed, W_char_embed], name='merged1', mode='sum')
tanh = Activation('tanh', name='tanh')(merged1)
W_tanh = Dense(300, name='w_tanh')(tanh)
a = Activation('sigmoid', name='sigmoid')(W_tanh)

t = Lambda(lambda x: K.ones_like(x, dtype='float32'))(a)

merged2 = merge([a, embed], name='merged2', mode='mul')
sub = Subtract()([t, a])
merged3 = merge([sub, char_embed], name='merged3', mode='mul')
x_wave = merge([merged2, merged3], name='final_re', mode='sum')

# Auxiliary classifier
auxc = Dense(nclasses, name='auxiliary_classifier')(x_wave)
auxc = Activation('softmax')(auxc) # (None, 36, 5)

# Bidirectional GRU
bi_gru = Bidirectional(GRU(hiddendim, return_sequences=True, name='gru'), merge_mode='concat', name='bigru')(x_wave) # (None, None, 128)
bi_gru = Dropout(0.1, name='bigru_dropout')(bi_gru)

# Main classifier
mainc = TimeDistributed(Dense(nclasses), name='main_classifier')(bi_gru) # (None, 36, 5)
mainc = Activation('softmax')(mainc)

# Sum the auxiliary and main classifiers to get the model's final output
final_output = merge([auxc, mainc], mode='sum')

model = Model(inputs=[main_input, char_input], outputs=final_output, name='output')
model.compile(optimizer='adam', loss='categorical_crossentropy')

print('Training...')
history = model.fit([train_lex, pad_train_lex], train_y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS)

# Predict and score
predir = 'H:/twitter_adr/model_output/predictions'
fileprefix = 'embedding_level_attention_'

scores = predict_score(model, [test_lex, pad_test_lex], test_toks, test_y, predir, idx2label,
maxlen, fileprefix=fileprefix)

PubMed data code

PubMed data preprocessing

The PubMed abstracts data can be downloaded from this page; we use the DRUG-AE.rel file (which provides relations between drugs and adverse effects). Its format is as follows:

The format of DRUG-AE.rel is as follows with pipe delimiters:
Column-1: PubMed-ID
Column-2: Sentence
Column-3: Adverse-Effect
Column-4: Begin offset of Adverse-Effect at ‘document level’
Column-5: End offset of Adverse-Effect at ‘document level’
Column-6: Drug
Column-7: Begin offset of Drug at ‘document level’
Column-8: End offset of Drug at ‘document level’
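A minimal sketch of reading these pipe-delimited fields (column indices follow the description above; the path is wherever you saved the file):

with open('DRUG-AE.rel', 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        pmid, sentence, adverse_effect = fields[0], fields[1], fields[2]
        drug = fields[5]
        print(pmid, drug, '->', adverse_effect)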

It is similar to the Twitter data, but one thing deserves attention: the PubMed dataset contains duplicated sentences. Take the (made-up) sentence "I took amoxicillin and banlangen and felt dizzy and weak." There are two related drugs ("amoxicillin" and "banlangen") and two adverse reactions ("dizzy" and "weak"), so this sample appears four times in the dataset (2 drugs x 2 reactions). We wanted every sentence to appear only once, so in the first version of the paper we deduplicated all repeated sentences, which reduced the data from 6,821 to 4,271 samples. A reviewer pointed out that this makes the experiments incomparable, so in the later experiments we followed exactly the preprocessing of the original paper An Attentive Sequence Model for Adverse Drug Event Extraction from Biomedical Text. In the end the processed data still comes to a bit over four thousand samples (4,858), but for a fair comparison we had to follow their procedure.

We have modified the data samples such that each sample is annotated with one drug and a list of ADE’s caused by that drug. This allows us to condition the model on a given drug and focus on the segments of text corresponding to the effect of the drug. As suggested in prior works, 120 sentences with overlapping entities (e.g., ”lithium intoxication”, where ”lithium” is a drug that causes ”lithium intoxication”) are removed from the dataset.

That is the preprocessing described in the original paper: each sample becomes one sentence paired with one drug and the set of all adverse reactions caused by that drug. This has its own problems: in the earlier example, "I took amoxicillin and banlangen and felt dizzy and weak." becomes two samples (one per drug), each annotated with the reaction set [dizzy, weak]. Emmm, what can I do, the reviewer is always right ╮(╯▽╰)╭. The quoted passage also mentions removing 120 sentences whose adverse reaction annotations overlap with the drug mention itself. I processed the PubMed data following these two steps; the code is below:

data_processing.py

def data_processing(data_path):

    from nltk.tokenize import TweetTokenizer
    import collections
    from keras.preprocessing.sequence import pad_sequences
    import numpy as np

    labelset = ['O', 'I-ADR']
    ADE_V2 = list(open(data_path, 'r'))

    raw_data = []

    for line in ADE_V2:
        line = line.strip()
        line = line.split('|')
        raw_data.append(line)  # the raw data has 6,821 samples

    senc_drug_ade = []
    new_sample = []

    # Identify samples whose adverse-effect span overlaps the drug mention itself
    # (the "overlapping entity" sentences mentioned above); they are removed below
    haha = []
    heihei = []

    for sample in raw_data:

        if sample[5] in sample[2]:
            haha.append([sample[0], sample[3], sample[4], 'I-ADR', sample[2], sample[6], sample[7], sample[5], sample[5], sample[1]])
            if sample[2][:int(sample[7])-int(sample[6])] == sample[5]:
                heihei.append([sample[0], sample[3], sample[4], 'I-ADR', sample[2], sample[6], sample[7], sample[5], sample[5], sample[1]])

    b = []
    for i in haha:
        if i not in heihei:
            b.append(i)

    # manually picked additional overlapping samples to remove
    nested_idx = [1, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15, 16, 19, 20, 21, 30, 31]
    for idx in nested_idx:
        heihei.append(b[idx])

    for sample in raw_data:
        new_sample.append([sample[0], sample[3], sample[4], 'I-ADR', sample[2], sample[6], sample[7], sample[5], sample[5], sample[1]])
        senc_drug_ade.append(new_sample[0])
        new_sample = []

    removed_senc = []
    for sample in senc_drug_ade:
        if sample not in heihei:
            removed_senc.append(sample)

    # Keep one sample per (sentence, drug) pair, merging the ADE annotations of duplicates
    new_data = []
    uni_senc = []
    for sample in removed_senc:
        if sample[9] not in uni_senc:
            uni_senc.append(sample[9])
            new_data.append(sample)
        else:
            if sample[7] != new_data[-1][7] and sample[7] not in new_data[-1][7]:
                if sample[4] not in new_data[-1][4]:
                    new_data[-1][4] += '\t' + sample[4]
                sample[4] = new_data[-1][4]
                sample[7] += ', ' + new_data[-1][7]
                new_data.append(sample)
            else:
                if sample[4] not in new_data[-1][4]:
                    new_data[-1][4] += '\t' + sample[4]

    pre_senc = new_data[0][9]
    pre_drug = new_data[0][7]
    final_data = []
    final_data.append(new_data[0])

    for sample in new_data[1:]:
        if sample[9] != pre_senc:
            final_data.append(sample)
            pre_senc = sample[9]
            pre_drug = sample[7]
        else:
            if len(sample[7].split()) > len(pre_drug.split()):
                final_data[-1][4] = sample[4]
            final_data.append(sample)  # 4,858 samples remain after processing

    senc_adr = []
    tt = TweetTokenizer()

    for i in final_data:
        senc_adr.append([i[9], i[4].split('\t')])

    tok_senc_adr = []
    tok_span = []
    sub_tok_span = []
    for i in senc_adr:
        tok_text = tt.tokenize(i[0])
        tok_text = [w.lower() for w in tok_text]
        for j in i[1]:
            sub_tok_span = tt.tokenize(j)
            sub_tok_span = [w.lower() for w in sub_tok_span]
            tok_span.append(sub_tok_span)
        tok_senc_adr.append([tok_text, tok_span])
        tok_span = []
        sub_tok_span = []

    all_labels = []

    for i in tok_senc_adr:
        labels = ['O'] * len(i[0])
        for j in i[1]:
            for k in range(len(i[0])):
                if i[0][k:k+len(j)] == j:
                    labels[k] = 'I-ADR'
                    if len(j) > 1:
                        labels[k+1:k+len(j)] = ['I-ADR'] * (len(j)-1)
        all_labels.append(labels)

    wordcounts = collections.Counter()

    for i in tok_senc_adr:
        for word in i[0]:
            wordcounts[word] += 1

    words = [wordcount[0] for wordcount in wordcounts.most_common()]
    word2idx = {w: i+1 for i, w in enumerate(words)}

    labelcounts = collections.Counter()
    for l in labelset:
        labelcounts[l] += 1

    labels = [labelcount[0] for labelcount in labelcounts.most_common()]
    label2idx = {l: i+1 for i, l in enumerate(labels)}

    idx2label = dict((k, v) for v, k in label2idx.items())
    idx2word = dict((k, v) for v, k in word2idx.items())

    idx2label[0] = 'PAD'
    idx2word[0] = 'PAD'

    vec_senc_adr = []
    vec_senc = []
    vec_adr = []

    for i, j in zip(tok_senc_adr, all_labels):
        vec_senc_adr.append([[word2idx[word] for word in i[0]], [label2idx[l] for l in j]])
        vec_senc.append([word2idx[word] for word in i[0]])
        vec_adr.append([label2idx[l] for l in j])


    maxlen = max([len(l) for l in vec_senc])  # 93

    vocsize = max(idx2word.keys()) + 1
    nclasses = max(idx2label.keys()) + 1

    pad_senc = pad_sequences(vec_senc, maxlen=maxlen)
    pad_adr = pad_sequences(vec_adr, maxlen=maxlen)

    def vectorize_set(lexlists, maxlen, V):
        nb_samples = len(lexlists)
        X = np.zeros([nb_samples, maxlen, V])
        for i, lex in enumerate(lexlists):
            for j, tok in enumerate(lex):
                X[i, j, tok] = 1
        return X

    pad_adr = vectorize_set(pad_adr, maxlen, nclasses)

    train_lex = pad_senc[:4372]  # 4858 * 0.9 = 4372
    test_lex = pad_senc[4372:]

    train_y = pad_adr[:4372]
    test_y = pad_adr[4372:]

    return final_data, idx2word, idx2label, maxlen, vocsize, nclasses, tok_senc_adr, train_lex, test_lex, train_y, test_y

"""
data_path = "H:/pubmed_adr/data/ADE-Corpus-V2/DRUG-AE.rel"
final_data, idx2word, idx2label, maxlen, vocsize, nclasses, tok_senc_adr, train_lex, test_lex, train_y, test_y = data_processing(data_path)
"""

PubMed data model

model.py

from data_processing import data_processing
from nltk.tokenize import TweetTokenizer
import collections
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from gensim.models import KeyedVectors
from keras.layers import Dense, Input, Lambda, merge, dot, Subtract
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.core import Dropout, Activation, Reshape
from keras.models import Model
import approximateMatch
from keras.layers.wrappers import Bidirectional, TimeDistributed
from keras.layers.merge import Concatenate
import keras.backend as K
import os, re

data_path = "H:/pubmed_adr/data/ADE-Corpus-V2/DRUG-AE.rel"
final_data, idx2word, idx2label, maxlen, vocsize, nclasses, tok_senc_adr, train_lex, test_lex, train_y, test_y = data_processing(data_path)

test_toks = []
test_tok_senc_adr = tok_senc_adr[4372:]
for i in test_tok_senc_adr:
    test_toks.append(i[0])

train_toks = []
train_tok_senc_adr = tok_senc_adr[:4372]
for i in train_tok_senc_adr:
    train_toks.append(i[0])

# Char embedding
char_per_word = []
char_word = []
char_senc = []
maxlen_char_word = 0
a = []

for s in (train_toks + test_toks):
    for w in s:
        for c in w.lower():
            char_per_word.append(c)

        if len(char_per_word) > 37:
            a.append(char_per_word)
            char_per_word = char_per_word[:37]
        if len(char_per_word) > maxlen_char_word:
            maxlen_char_word = len(char_per_word)

        char_word.append(char_per_word)
        char_per_word = []

    char_senc.append(char_word)
    char_word = []


charcounts = collections.Counter()
for senc in char_senc:
    for word in senc:
        for charac in word:
            charcounts[charac] += 1
chars = [charcount[0] for charcount in charcounts.most_common()]
char2idx = {c: i+1 for i, c in enumerate(chars)}

char_word_lex = []
char_lex = []
char_word = []
for senc in char_senc:
    for word in senc:
        for charac in word:
            char_word_lex.append([char2idx[charac]])

        char_word.append(char_word_lex)
        char_word_lex = []

    char_lex.append(char_word)
    char_word = []

char_per_word = []
char_per_senc = []
char_senc = []
for s in char_lex:
    for w in s:
        for c in w:
            for e in c:
                char_per_word.append(e)
        char_per_senc.append(char_per_word)
        char_per_word = []
    char_senc.append(char_per_senc)
    char_per_senc = []

pad_char_all = []
for senc in char_senc:
    while len(senc) < maxlen:
        senc.insert(0, [])
    pad_senc = pad_sequences(senc, maxlen=maxlen_char_word)
    pad_char_all.append(pad_senc)
    pad_senc = []

pad_char_all = np.array(pad_char_all)

pad_train_lex = pad_char_all[:4372]
pad_test_lex = pad_char_all[4372:]

idx2char = dict((k,v) for v,k in char2idx.items())
idx2char[0] = 'PAD'
charsize = max(idx2char.keys()) + 1

# Some parameter definitions
seed = 10
np.random.seed(seed)

glove_300d_path = 'H:/twitter_adr/embeddings/glove.840B.300d.txt'
HIDDEN_DIM = 128
NUM_EPOCHS = 10
BATCH_SIZE = 16

c2v = None
embed_dim = 300
char_embed_dim = 100

def init_embedding_weights(i2w, w2vmodel):
    # Create initial embedding weights matrix
    # Return: np.array with dim [vocabsize, embeddingsize]

    d = 300
    V = len(i2w)
    assert sorted(i2w.keys()) == list(range(V))  # verify indices are sequential

    emb = np.zeros([V, d])
    num_unknownwords = 0
    unknow_words = []
    for i, l in i2w.items():
        if i == 0:
            continue
        if l in w2vmodel.vocab:
            emb[i, :] = w2vmodel[l]
        else:
            num_unknownwords += 1
            unknow_words.append(l)
            emb[i] = np.random.uniform(-1, 1, d)
    return emb, num_unknownwords, unknow_words

def predict_score(model, x, toks, y, pred_dir, i2l, padlen, metafile=0, fileprefix=''):

    pred_probs = model.predict(x, verbose=0)
    test_loss = model.evaluate(x=x, y=y, batch_size=1, verbose=0)
    pred = np.argmax(pred_probs, axis=2)

    N = len(toks)

    # If the name of a metafile is passed, simply write this round of predictions to file
    if metafile > 0:
        meta = open(metafile, 'a')

    fname = re.sub(r'\\', r'/', os.path.join(pred_dir, fileprefix+'approxmatch_test'))
    with open(fname, 'w') as fout:
        for i in range(N):
            bos = 'BOS\tO\tO\n'
            fout.write(bos)
            if metafile > 0:
                meta.write(bos)

            sentlen = len(toks[i])
            startind = padlen - sentlen

            preds = [i2l[j] for j in pred[i][startind:]]
            actuals = [i2l[j] for j in np.argmax(y[i], axis=1)[startind:]]
            for (w, act, p) in zip(toks[i], actuals, preds):
                line = '\t'.join([w, act, p])+'\n'
                fout.write(line)
                if metafile > 0:
                    meta.write(line)

            eos = 'EOS\tO\tO\n'
            fout.write(eos)
            if metafile > 0:
                meta.write(eos)
    scores = approximateMatch.get_approx_match(fname)
    scores['loss'] = test_loss
    if metafile > 0:
        meta.close()

    with open(fname, 'a') as fout:
        fout.write('\nTEST Approximate Matching Results:\n ADR: Precision ' + str(scores['p']) + ' Recall ' + str(scores['r']) + ' F1 ' + str(scores['f1']))
    return scores

# Load the word embeddings
print('Loading word embeddings...')
w2v = KeyedVectors.load_word2vec_format(glove_300d_path, binary=False, unicode_errors='ignore')
print('word embeddings loading done!')

# Build the model
print('Building the model...')

hiddendim = HIDDEN_DIM
main_input = Input(shape=[maxlen], dtype='int32', name='input') # (None, maxlen)
char_input = Input(shape=[maxlen, maxlen_char_word], dtype='int32', name='char_input') # (None, maxlen, maxlen_char_word)

embeds, num_unk, unk_words = init_embedding_weights(idx2word, w2v)

embed = Embedding(input_dim=vocsize, output_dim=embed_dim, input_length=maxlen,
weights=[embeds], mask_zero=False, name='embedding', trainable=False)(main_input)

embed = Dropout(0.5, name='embed_dropout')(embed)

"""
双向LSTM 获取Char embedding
"""
char_embed = Embedding(input_dim=charsize, output_dim=char_embed_dim, embeddings_initializer='lecun_uniform',
input_length=maxlen_char_word, mask_zero=False, name='char_embedding')(char_input)

s = char_embed.shape
char_embed = Lambda(lambda x: K.reshape(x, shape=(-1, s[-2], char_embed_dim)))(char_embed)

fwd_state = GRU(150, return_state=True)(char_embed)[-2]
bwd_state = GRU(150, return_state=True, go_backwards=True)(char_embed)[-2]
char_embed = Concatenate(axis=-1)([fwd_state, bwd_state])
char_embed = Lambda(lambda x: K.reshape(x, shape=[-1, s[1], 2 * 150]))(char_embed)
char_embed = Dropout(0.5, name='char_embed_dropout')(char_embed)

"""
使用attention将word和character embedding结合起来
"""
W_embed = Dense(300, name='Wembed')(embed)
W_char_embed = Dense(300, name='W_charembed')(char_embed)
merged1 = merge([W_embed, W_char_embed], name='merged1', mode='sum')
tanh = Activation('tanh', name='tanh')(merged1)
W_tanh = Dense(300, name='w_tanh')(tanh)
a = Activation('sigmoid', name='sigmoid')(W_tanh)

t = Lambda(lambda x: K.ones_like(x, dtype='float32'))(a)

merged2 = merge([a, embed], name='merged2', mode='mul')
sub = Subtract()([t, a])
merged3 = merge([sub, char_embed], name='merged3', mode='mul')
x_wave = merge([merged2, merged3], name='final_re', mode='sum')

# Auxiliary classifier
auxc = Dense(nclasses, name='auxiliary_classifier')(x_wave)
auxc = Activation('softmax')(auxc)

# Bidirectional GRU
bi_gru = Bidirectional(GRU(hiddendim, return_sequences=True, name='gru'), merge_mode='concat', name='bigru')(x_wave) # (None, None, 256)
bi_gru = Dropout(0.5, name='bigru_dropout')(bi_gru)

# Main classifier
mainc = TimeDistributed(Dense(nclasses), name='main_classifier')(bi_gru)
mainc = Activation('softmax')(mainc)

# Sum the auxiliary and main classifiers to get the model's final output
final_output = merge([auxc, mainc], mode='sum')

model = Model(inputs=[main_input, char_input], outputs=final_output, name='output')
model.compile(optimizer='adam', loss='categorical_crossentropy')

print('Training...')
history = model.fit([train_lex, pad_train_lex], train_y, batch_size=BATCH_SIZE, validation_split=0.1, epochs=NUM_EPOCHS)

# Predict and score
predir = 'H:/pubmed_adr/model_output/predictions'
fileprefix = 'embedding_level_attention_'

scores = predict_score(model, [test_lex, pad_test_lex], test_toks, test_y, predir, idx2label,
maxlen, fileprefix=fileprefix)

"""
# OOV分析

train_tokens = []

for i in tok_senc_adr[:4372]:
for w in i[0]:
train_tokens.append(w)

test_tokens = []

for i in tok_senc_adr[4372:]:
for w in i[0]:
test_tokens.append(w)

IV = []
OOTV = []
OOEV = []
OOBV = []

for i in test_tokens:
if i in train_tokens and i in w2v:
IV.append(i)

if i in w2v and i not in train_tokens:
OOTV.append(i)

if i in train_tokens and i not in w2v:
OOEV.append(i)

if i not in train_tokens and i not in w2v:
OOBV.append(i)

# result_file是你预测后的文件的路径,注意需要删掉预测文件底部的F1等指标的显示信息
result_file = list(open('your prediction path', 'r'))

# 输入一个文件地址用来写入OOV分析结果
with open('your new file path', 'w') as fout:
bos = 'BOS\tO\tO\n'
eos = 'EOS\tO\tO\n'

for line in bgru_attention:
line = line.strip()
line = line.split("\t")
if line[0] == 'BOS':
fout.write(bos)
elif line[0] == 'EOS':
fout.write(eos)
# 每次注释掉其余三行来获取对应的未注释的那行的OOV分析结果
elif line[0] in IV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOTV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOEV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
# elif line[0] in OOBV and (line[1] == 'I-ADR' or line[2] == 'I-ADR'):
fout.write('\t'.join([line[0], line[1], line[2]])+'\n')
else:
fout.write('\t'.join([line[0], 'O', 'O'])+'\n')

# 调用approximateMatch的get_approx_match方法,计算OOV分析结果
import approximateMatch
scores = approximateMatch.get_approx_match('your new file path')
"""

Results

[Figure: results on the Twitter and PubMed datasets]

These are the results reported in the paper: on the Twitter dataset the F1 is about 0.84, roughly 10% above the previous SOTA (state of the art), and on the PubMed dataset the F1 is about 0.91, roughly 5% above the previous SOTA. Because of randomness, the numbers can vary from run to run, so I recommend running the model ten times and averaging.

Since the model is essentially doing sequence labeling, it generalizes to any token-level classification task, such as named entity recognition (NER) or part-of-speech (POS) tagging. Next I want to validate the model on larger datasets and explore the magic of pretraining.


Title: Sequence labeling with embedding-level attention

Author: 丁鹏

Published: November 18, 2018, 09:11

Last updated: November 27, 2018, 19:11

Original link: http://deepon.me/2018/11/18/Sequence-labeling-with-embedding-level-attention/

License: CC BY-NC-ND 4.0 International. Please keep the original link and attribution when reposting.
