    2.2 Counting Words by Genre

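    In Section 1, we saw a conditional frequency distribution whose conditions were the sections of the Brown Corpus, with words counted for each condition. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.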

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> cfd = nltk.ConditionalFreqDist(
    ...     (genre, word)
    ...     for genre in brown.categories()
    ...     for word in brown.words(categories=genre))
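
    The difference between the two input shapes can be tried without downloading any corpus. The following is a minimal sketch on made-up data: the lists `words` and `pairs` are hypothetical and do not come from the Brown Corpus.

```python
import nltk

# Toy data (hypothetical, not taken from the Brown Corpus).
words = ["the", "cat", "the", "dog"]
pairs = [("news", "the"), ("news", "cat"),
         ("romance", "the"), ("romance", "the")]

# FreqDist counts the items of a flat list.
fd = nltk.FreqDist(words)

# ConditionalFreqDist counts the second element of each pair,
# grouped by the first element (the condition).
cfd = nltk.ConditionalFreqDist(pairs)

print(fd["the"])              # 2
print(cfd["news"]["the"])     # 1
print(cfd["romance"]["the"])  # 2
```

    Each pair has the form (condition, sample), so the same word can have a different count under each condition.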

    Let's break this down and look at just two genres, news and romance. For each genre [2], we loop over every word in that genre [3], producing pairs consisting of the genre and the word [1]:

    >>> genre_word = [(genre, word)                              # [1]
    ...               for genre in ['news', 'romance']           # [2]
    ...               for word in brown.words(categories=genre)] # [3]
    >>> len(genre_word)
    170576

    So, as the code below shows, the pairs at the beginning of the list genre_word have the form ('news', word) [1], while those at the end have the form ('romance', word) [2].

    >>> genre_word[:4]
    [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] # [1]
    >>> genre_word[-4:]
    [('romance', 'afraid'), ('romance', 'not'), ("romance", "''"), ('romance', '.')] # [2]
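
    The ordering of the pairs follows from the order of the two `for` clauses: the outer loop fixes a genre, and the inner loop runs through all of that genre's words before the next genre starts. Here is a small sketch of the same comprehension in plain Python, where the dictionary `toy_corpus` is a hypothetical stand-in for brown.words(categories=genre):

```python
# Hypothetical toy corpus standing in for brown.words(categories=genre).
toy_corpus = {
    "news": ["The", "jury", "said"],
    "romance": ["She", "smiled", "."],
}

# Outer loop over genres, inner loop over the words of that genre.
genre_word = [(genre, word)
              for genre in ["news", "romance"]
              for word in toy_corpus[genre]]

print(len(genre_word))   # 6
print(genre_word[:2])    # [('news', 'The'), ('news', 'jury')]
print(genre_word[-2:])   # [('romance', 'smiled'), ('romance', '.')]
```

    The length of the list is simply the total number of tokens across the selected genres, which is why the Brown Corpus example yields one pair per word of news and romance text.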

    We can now use this list of pairs to create a ConditionalFreqDist and save it in a variable cfd. As usual, we can type the name of the variable to inspect it [1], and verify that it has two conditions [2]:

    >>> cfd = nltk.ConditionalFreqDist(genre_word)
    >>> cfd # [1]
    <ConditionalFreqDist with 2 conditions>
    >>> cfd.conditions()
    ['news', 'romance'] # [2]
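
    To see what building the distribution from pairs amounts to, the grouping can be imitated with the standard library alone. This is only a conceptual sketch, not NLTK's code: the helper `count_by_condition` and the sample pairs are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

def count_by_condition(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Toy pairs (hypothetical, not taken from the Brown Corpus).
pairs = [("news", "the"), ("news", "jury"),
         ("romance", "the"), ("romance", "the")]

table = count_by_condition(pairs)
print(list(table))              # ['news', 'romance']
print(table["romance"]["the"])  # 2
```

    The keys of the outer mapping play the role of the conditions, and each inner Counter plays the role of one frequency distribution.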

    Let's access the two conditions and confirm that each is just a frequency distribution:

    >>> print(cfd['news'])
    <FreqDist with 14394 samples and 100554 outcomes>
    >>> print(cfd['romance'])
    <FreqDist with 8452 samples and 70022 outcomes>
    >>> cfd['romance'].most_common(20)
    [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
    ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
    ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
    ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
    >>> cfd['romance']['could']
    193
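
    The "samples" and "outcomes" figures printed above correspond to the number of distinct items and the total number of items counted. A minimal sketch on a made-up sentence (the tokens are hypothetical, not Brown Corpus data) shows how these numbers and the lookups relate:

```python
from nltk.probability import ConditionalFreqDist

# Hypothetical toy tokens, all filed under a single condition.
tokens = "she said she could not stay".split()
cfd = ConditionalFreqDist(("romance", word) for word in tokens)

romance = cfd["romance"]        # an ordinary FreqDist
print(romance.N())              # 6 outcomes (total tokens counted)
print(romance.B())              # 5 samples (distinct tokens)
print(romance.most_common(1))   # [('she', 2)]
print(romance["could"])         # 1
print(romance["never"])         # 0, unseen samples have a count of zero
```

    Indexing first by condition and then by sample, as in cfd['romance']['could'], is therefore just a lookup in the frequency distribution stored for that condition.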