    2.7 Unsimplified Tags

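    Let's find the most frequent nouns of each noun part-of-speech type. The program in 2.2 finds all tags starting with NN and provides a few example words for each one. You will see that there are many variants of NN; the most important, as the output shows, contain $ for possessive nouns and S for plural nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines, and -TL for titles (a feature of Brown tags).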

    import nltk

    def findtags(tag_prefix, tagged_text):
        # Count words per tag, keeping only tags that start with tag_prefix.
        cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                       if tag.startswith(tag_prefix))
        # Map each matching tag to its five most frequent words.
        return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

    >>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
    >>> for tag in sorted(tagdict):
    ...     print(tag, tagdict[tag])
    ...
    NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
    NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("company's", 6)]
    NN$-HL [("Golf's", 1), ("Navy's", 1)]
    NN$-TL [("President's", 11), ("Army's", 3), ("Gallery's", 3), ("University's", 3), ("League's", 3)]
    NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('business', 2), ('Salary', 2)]
    NN-NC [('eva', 1), ('aya', 1), ('ova', 1)]
    NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
    NN-TL-HL [('Fort', 2), ('Dr.', 1), ('Oak', 1), ('Street', 1), ('Basin', 1)]
    NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
    NNS$ [("children's", 7), ("women's", 5), ("janitors'", 3), ("men's", 3), ("taxpayers'", 2)]
    NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
    NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Bros.'", 1), ("Writers'", 1)]
    NNS-HL [('comments', 1), ('Offenses', 1), ('Sacrifices', 1), ('funds', 1), ('Results', 1)]
    NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Communists', 9)]
    NNS-TL-HL [('Nations', 1)]
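
The suffix modifiers in these tags can be separated from the base tag with a few lines of plain Python. The sketch below is only an illustration: `split_brown_tag` and `BROWN_MODIFIERS` are hypothetical names, not part of NLTK, and the code assumes that the only suffix modifiers of interest are the three mentioned in this section (-HL, -TL, -NC).

```python
# Hypothetical helper (not an NLTK function): split a Brown-style tag
# into its base tag and its trailing suffix modifiers.
# Assumption: only the three modifiers named in this section are handled.
BROWN_MODIFIERS = {"HL": "headline", "TL": "title", "NC": "citation"}


def split_brown_tag(tag):
    """Return (base_tag, modifiers) for a tag such as 'NN-TL-HL'."""
    parts = tag.split("-")
    modifiers = []
    # Peel recognized modifiers off the right-hand end, keeping their order.
    while len(parts) > 1 and parts[-1] in BROWN_MODIFIERS:
        modifiers.insert(0, parts.pop())
    return "-".join(parts), modifiers


if __name__ == "__main__":
    for tag in ["NN", "NN$-TL", "NN-TL-HL", "NNS-HL", "NN-NC"]:
        base, modifiers = split_brown_tag(tag)
        print(tag, "->", base, [BROWN_MODIFIERS[m] for m in modifiers])
```

Peeling modifiers from the right, rather than splitting on every hyphen, leaves any tag that contains hyphens for other reasons intact as the base tag.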

    When we come to build part-of-speech taggers later in this chapter, we will use the unsimplified tags.