    2.8 Exploring Tagged Corpora

    Let's briefly return to the kinds of corpus exploration we saw in earlier chapters, this time focusing on part-of-speech tags.

    Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often:

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> brown_learned_text = brown.words(categories='learned')
    >>> sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often'))
    [',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
    'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]

    However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tags of the following words:

    >>> brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
    >>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
    >>> fd = nltk.FreqDist(tags)
    >>> fd.tabulate()
     PRT  ADV  ADP    . VERB  ADJ
       2    8    7    4   37    6

    Notice that the most frequent part of speech following often is the verb. Nouns never appear in this position (in this particular corpus).
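The bigram-plus-FreqDist recipe above can be wrapped into a small reusable helper for any target word. The sketch below is illustrative only: tags_following and the sample data are invented for this example and are not part of NLTK; with the corpus loaded you would pass brown.tagged_words(tagset='universal') instead of the toy list.

```python
from collections import Counter

def tags_following(word, tagged_words):
    """Count the POS tags of the tokens that immediately follow `word`.

    `tagged_words` is a sequence of (token, tag) pairs, such as the
    output of brown.tagged_words(tagset='universal').
    """
    return Counter(
        b_tag
        for (a_tok, _), (_, b_tag) in zip(tagged_words, tagged_words[1:])
        if a_tok == word
    )

# Tiny hand-made example (hypothetical data, not from the Brown Corpus):
sample = [('He', 'PRON'), ('often', 'ADV'), ('ran', 'VERB'),
          ('and', 'CONJ'), ('often', 'ADV'), ('fell', 'VERB')]
print(tags_following('often', sample))  # Counter({'VERB': 2})
```

Because the helper only needs (token, tag) pairs, the same function works unchanged on any tagged corpus reader's output.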

    Next, let's look at some larger context, and find words involving particular sequences of tags and words (in this case "<Verb> to <Verb>"). In code-three-word-phrase we consider each three-word window in the sentence [1], and check whether it meets our criterion [2]. If the tags match, we print the corresponding words [3].

    import nltk
    from nltk.corpus import brown
    def process(sentence):
        for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence): # [1]
            if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')): # [2]
                print(w1, w2, w3) # [3]

    >>> for tagged_sent in brown.tagged_sents():
    ...     process(tagged_sent)
    ...
    combined to achieve
    continue to place
    serve to protect
    wanted to wait
    allowed to place
    expected to become
    ...
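The same windowing idea generalizes to arbitrary tag patterns. In the sketch below, find_pattern and the one-sentence mini-corpus are hypothetical (not NLTK APIs); it takes three tag predicates instead of hard-coding the <Verb> to <Verb> test, and uses zip in place of nltk.trigrams so the example runs without NLTK installed.

```python
def find_pattern(tagged_sents, p1, p2, p3):
    """Yield (w1, w2, w3) triples whose tags satisfy the three predicates."""
    for sent in tagged_sents:
        for (w1, t1), (w2, t2), (w3, t3) in zip(sent, sent[1:], sent[2:]):
            if p1(t1) and p2(t2) and p3(t3):
                yield (w1, w2, w3)

# Hypothetical mini-corpus using Brown-style tags:
sents = [[('They', 'PPSS'), ('wanted', 'VBD'), ('to', 'TO'), ('wait', 'VB')]]
matches = list(find_pattern(
    sents,
    lambda t: t.startswith('V'),   # first word: any verb form
    lambda t: t == 'TO',           # middle word: infinitival "to"
    lambda t: t.startswith('V'),   # third word: any verb form
))
print(matches)  # [('wanted', 'to', 'wait')]
```

Passing predicates rather than literal tags makes it easy to search for other constructions, e.g. determiner–adjective–noun, without rewriting the loop.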

    Finally, let's look at words that are highly ambiguous as to their part-of-speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.

    >>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
    >>> data = nltk.ConditionalFreqDist((word.lower(), tag)
    ...                                 for (word, tag) in brown_news_tagged)
    >>> for word in sorted(data.conditions()):
    ...     if len(data[word]) > 3:
    ...         tags = [tag for (tag, _) in data[word].most_common()]
    ...         print(word, ' '.join(tags))
    ...
    best ADJ ADV NP V
    better ADJ ADV V DET
    close ADV ADJ V N
    cut V N VN VD
    even ADV DET ADJ V
    grant NP N V -
    hit V VD VN N
    lay ADJ V NP VD
    left VD ADJ N VN
    like CNJ V ADJ P -
    near P ADV ADJ DET
    open ADJ V N ADV
    past N ADJ DET P
    present ADJ ADV V N
    read V VN VD NP
    right ADJ N DET ADV
    second NUM ADV DET N
    set VN V VD N -
    that CNJ V WH DET

    Note

    Your Turn: Open the POS concordance tool nltk.app.concordance() and load the complete Brown Corpus (simplified tagset). Now pick some of the words listed at the end of the previous code example and see how the tag of the word correlates with the context of the word. E.g. search for near to see all forms mixed together, near/ADJ to see it used as an adjective, near N to see just those cases where a noun follows, and so forth. For a larger set of examples, modify the supplied code so that it lists words having three distinct tags.
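For the last part of the exercise, one possible approach is sketched below on made-up data; words_with_n_tags is not an NLTK function, and the sample pairs are invented. With NLTK available, you would feed it brown.tagged_words(categories='news', tagset='universal') instead.

```python
from collections import defaultdict

def words_with_n_tags(tagged_words, n):
    """Return the words (lowercased) that occur with exactly n distinct tags."""
    tagsets = defaultdict(set)
    for word, tag in tagged_words:
        tagsets[word.lower()].add(tag)
    return sorted(w for w, tags in tagsets.items() if len(tags) == n)

# Hypothetical data for illustration:
sample = [('Close', 'ADJ'), ('close', 'VERB'), ('close', 'ADV'),
          ('near', 'ADP'), ('near', 'ADJ'), ('the', 'DET')]
print(words_with_n_tags(sample, 3))  # ['close']
```

This is equivalent to changing the len(data[word]) > 3 condition in the supplied code, but grouping by a plain dict of sets keeps the sketch free of NLTK dependencies.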