    2.4 Exploring Text Corpora


    >>> import nltk
    >>> # Chunk pattern: any verb tag, then "to", then any verb tag
    >>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
    >>> brown = nltk.corpus.brown
    >>> for sent in brown.tagged_sents():
    ...     tree = cp.parse(sent)
    ...     # Print only the subtrees that the chunk rule produced
    ...     for subtree in tree.subtrees():
    ...         if subtree.label() == 'CHUNK': print(subtree)
    ...
    (CHUNK combined/VBN to/TO achieve/VB)
    (CHUNK continue/VB to/TO place/VB)
    (CHUNK serve/VB to/TO protect/VB)
    (CHUNK wanted/VBD to/TO wait/VB)
    (CHUNK allowed/VBN to/TO place/VB)
    (CHUNK expected/VBN to/TO become/VB)
    ...
    (CHUNK seems/VBZ to/TO overtake/VB)
    (CHUNK want/VB to/TO buy/VB)


    Your Turn: Encapsulate the above example inside a function find_chunks() that takes a chunk string such as "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the corpus for several other patterns, such as four or more nouns in a row, e.g. "NOUNS: {<N.*>{4,}}".
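
    One possible way to approach the exercise is sketched below. It is not the book's reference solution. The second parameter, tagged_sents, is an addition assumed here so that the function can be tried on a small hand-tagged list without downloading a corpus, and the sketch also assumes the chunk string contains a single rule whose label is the text before the first colon. The toy sentence is invented for illustration.

```python
import nltk

def find_chunks(chunk_string, tagged_sents):
    """Yield each subtree labelled with the rule name given in chunk_string.

    Assumes chunk_string holds a single rule such as
    'CHUNK: {<V.*> <TO> <V.*>}'; tagged_sents is any iterable of
    sentences, each a list of (word, tag) pairs.
    """
    cp = nltk.RegexpParser(chunk_string)
    # The rule label is whatever precedes the first colon, e.g. 'CHUNK'.
    label = chunk_string.split(':', 1)[0].strip()
    for sent in tagged_sents:
        tree = cp.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == label:
                yield subtree

# Hand-tagged toy sentence (hypothetical data, not taken from a corpus).
toy_sents = [
    [('They', 'PPSS'), ('wanted', 'VBD'), ('to', 'TO'), ('wait', 'VB'), ('.', '.')],
]

for chunk in find_chunks('CHUNK: {<V.*> <TO> <V.*>}', toy_sents):
    print(chunk)
```

    To search the Brown Corpus instead, pass nltk.corpus.brown.tagged_sents() as the second argument; this requires the corpus data to be installed, for example with nltk.download('brown'). Writing the function as a generator lets a caller stop after the first few matches rather than scanning every sentence.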