• 3.3 Training Classifier-Based Chunkers

Both the regular-expression-based chunker and the n-gram chunker decide what chunks to create based entirely on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked. In order to make use of information about the content of words, in addition to their part-of-speech tags, we can use a classifier-based tagger to chunk the sentence, as in the following pair of classes:

```python
import nltk

class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):

    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
```
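To see what the training loop in `ConsecutiveNPChunkTagger.__init__` produces, here is a minimal pure-Python sketch of that loop applied to one hand-built IOB-tagged sentence. The sentence and the `toy_features` stand-in for `npchunk_features` are invented for illustration; note that during training, `history` holds the gold chunk tags, while at tagging time it holds the classifier's own predictions.

```python
# A hand-built sentence in the ((word, pos), chunk_tag) format that
# ConsecutiveNPChunker passes to ConsecutiveNPChunkTagger.
tagged_sent = [(("the", "DT"), "B-NP"),
               (("little", "JJ"), "I-NP"),
               (("dog", "NN"), "I-NP"),
               (("barked", "VBD"), "O")]

def toy_features(sentence, i, history):
    # Stand-in for npchunk_features: current POS tag plus the most
    # recent chunk tag, taken from the running `history` list.
    (word, pos) = sentence[i]
    prevtag = history[-1] if history else "<START>"
    return {"pos": pos, "prevtag": prevtag}

# What nltk.tag.untag does here: strip off the chunk tags.
untagged_sent = [wp for (wp, c) in tagged_sent]

train_set, history = [], []
for i, (wp, chunk_tag) in enumerate(tagged_sent):
    featureset = toy_features(untagged_sent, i, history)
    train_set.append((featureset, chunk_tag))
    history.append(chunk_tag)

for featureset, chunk_tag in train_set:
    print(featureset, "->", chunk_tag)
```

Each `(featureset, chunk_tag)` pair is one training instance for the classifier; the `history` argument is what lets a feature extractor condition on earlier chunking decisions.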

The only piece left to fill in is the feature extractor. First, we define a simple feature extractor which just provides the part-of-speech tag of the current token. Using this feature extractor, our classifier-based chunker performs very similarly to the unigram chunker:

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%
```
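One way to see why this behaves like the unigram chunker: the feature dict depends only on the current POS tag, ignoring both the word and the history, so any two tokens with the same tag always receive the same prediction. A quick sanity check with two toy sentences:

```python
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

s1 = [("the", "DT"), ("dog", "NN")]
s2 = [("a", "DT"), ("cat", "NN")]

# Same POS sequence -> identical feature dicts, regardless of the words
# or of the history, so the classifier can only learn a POS -> chunk-tag
# mapping, just like a unigram chunk tagger.
assert npchunk_features(s1, 1, ["B-NP"]) == npchunk_features(s2, 1, [])
print(npchunk_features(s1, 1, []))  # {'pos': 'NN'}
```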

We can also add a feature for the previous part-of-speech tag. Adding this feature allows the chunker to model interactions between adjacent tags, and results in a chunker that is closely related to the bigram chunker.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.2%
    F-Measure:     84.5%
```

Next, we'll try adding a feature for the current word, since we hypothesized that word content should be useful for chunking. We find that this feature does indeed improve the chunker's performance, by about 1.5 percentage points (which corresponds to about a 10% reduction in the error rate).

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%
```

Finally, we can try extending the feature extractor with a variety of additional features, such as lookahead features [1], paired features [2], and complex contextual features [3]. This last feature, called tags-since-dt, creates a string describing the set of all part-of-speech tags that have been encountered since the most recent determiner, or since the beginning of the sentence if there is no determiner before index i.

```python
>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,                           # [1]
...             "prevpos+pos": "%s+%s" % (prevpos, pos),      # [2]
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}  # [3]
```
```python
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
```
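To make tags-since-dt concrete, here is the same function applied to a small hand-tagged sentence (the sentence is invented for illustration):

```python
def tags_since_dt(sentence, i):
    # Collect the set of POS tags seen before index i, resetting the
    # set each time a determiner ('DT') is encountered.
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

sent = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
        ("dog", "NN"), ("saw", "VBD"), ("a", "DT"), ("cat", "NN")]

print(tags_since_dt(sent, 4))  # tags since "the/DT": 'JJ+NN'
print(tags_since_dt(sent, 6))  # "a/DT" at index 5 resets the set: ''
```

Because the tags are collected into a set and then sorted, the feature captures *which* tags have appeared since the last determiner, not their order or their counts.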
```python
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
```

Note

Your Turn: Try adding different features to the feature extractor function npchunk_features, and see if you can further improve the performance of the NP chunker.
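As one hypothetical starting point, a word-shape feature (the word's capitalization and digit pattern) is often useful for chunking. The `shape` helper below is not part of NLTK; it is only a sketch of one feature you might experiment with:

```python
def shape(word):
    # Map a word to a coarse orthographic shape, e.g. "NLTK" -> "upcase",
    # "3.14" -> "number", "Dog" -> "initcap".
    if word and all(c.isdigit() or c in ".," for c in word):
        return "number"
    elif word.isupper():
        return "upcase"
    elif word[:1].isupper():
        return "initcap"
    elif word.islower():
        return "downcase"
    else:
        return "mixedcase"

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos, "word": word, "shape": shape(word)}

print(npchunk_features([("NLTK", "NNP")], 0, []))
```

Whether this particular feature helps on the CoNLL-2000 data is an empirical question; the point is that any function of the sentence, the index, and the history can be folded into the feature dict and evaluated with `chunker.evaluate(test_sents)`.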