    4.3 The ElementTree Interface

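    Python's ElementTree module provides a convenient way to access data stored in XML files. ElementTree has been part of Python's standard library since Python 2.5, and is also provided as part of NLTK in case you are using Python 2.4.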

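    We will illustrate the use of ElementTree with a collection of Shakespeare plays that have been formatted using XML. Let's load the XML file and inspect the raw data, first at the top of the file [1], where we see some XML headers and the name of a schema called play.dtd, followed by the root element PLAY. We pick up the data again at the start of Act 1 [2]. (Some blank lines have been omitted from the output.)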

    >>> import nltk
    >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
    >>> raw = open(merchant_file).read()
    >>> print(raw[:163])  # [1]
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="shakes.css"?>
    <!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
    <PLAY>
    <TITLE>The Merchant of Venice</TITLE>
    >>> print(raw[1789:2006])  # [2]
    <TITLE>ACT I</TITLE>
    <SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
    <STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
    <SPEECH>
    <SPEAKER>ANTONIO</SPEAKER>
    <LINE>In sooth, I know not why I am so sad:</LINE>

    We have just accessed the XML data as a string. As we can see, the string at the start of Act 1 contains XML tags for the title, the scene, stage directions, and so forth.

    The next step is to process the file contents as structured XML data, using ElementTree. We are processing a file (a multi-line string) and building a tree, so it is not surprising that the method name is parse [1]. The variable merchant contains an XML element PLAY [2]. This element has internal structure; we can use an index to get its first child, a TITLE element [3]. We can also see the text content of this element, the title of the play [4]. To get a list of all the child elements, we call list(merchant) [5]. (The older getchildren() method did the same job but was removed in Python 3.9.)

    >>> from xml.etree.ElementTree import ElementTree
    >>> merchant = ElementTree().parse(merchant_file)  # [1]
    >>> merchant  # [2]
    <Element 'PLAY' at 0x10ac43d18>
    >>> merchant[0]  # [3]
    <Element 'TITLE' at 0x10ac43c28>
    >>> merchant[0].text  # [4]
    'The Merchant of Venice'
    >>> list(merchant)  # [5]
    [<Element 'TITLE' at 0x10ac43c28>, <Element 'PERSONAE' at 0x10ac43bd8>,
     <Element 'SCNDESCR' at 0x10b067f98>, <Element 'PLAYSUBT' at 0x10af37048>,
     <Element 'ACT' at 0x10af37098>, <Element 'ACT' at 0x10b936368>,
     <Element 'ACT' at 0x10b934b88>, <Element 'ACT' at 0x10cfd8188>,
     <Element 'ACT' at 0x10cfadb38>]
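
    The same navigation can be tried without the Shakespeare corpus installed. The following is a minimal sketch, assuming a tiny made-up document (the `sample` string and its contents are invented for illustration and are not part of the corpus). It parses the string with `xml.etree.ElementTree.fromstring`, which returns the root element directly, and then uses indexing, `.text`, and `list()` in the way described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
sample = """<PLAY>
<TITLE>A Tiny Play</TITLE>
<ACT><TITLE>ACT I</TITLE></ACT>
<ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

root = ET.fromstring(sample)      # parse a string; the result is the root Element
first_child = root[0]             # index into the children of the root
child_tags = [child.tag for child in list(root)]  # list(root) gives the child elements
last_act_title = root[-1][0].text  # negative indexes count from the end

print(root.tag)
print(first_child.text)
print(child_tags)
print(last_act_title)
```

    Chained indexes such as `root[-1][0]` walk one level deeper per index, which is how the deeper lookups into the real play work.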

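    The play consists of a title, the personae, a scene description, a subtitle, and five acts. Each act has a title and some scenes, and each scene consists of speeches, which are made up of lines: a structure with four levels of nesting. Let's dig down into Act IV: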

    >>> merchant[-2][0].text
    'ACT IV'
    >>> merchant[-2][1]
    <Element 'SCENE' at 0x10cfd8228>
    >>> merchant[-2][1][0].text
    'SCENE I. Venice. A court of justice.'
    >>> merchant[-2][1][54]
    <Element 'SPEECH' at 0x10cfb02c8>
    >>> merchant[-2][1][54][0]
    <Element 'SPEAKER' at 0x10cfb0318>
    >>> merchant[-2][1][54][0].text
    'PORTIA'
    >>> merchant[-2][1][54][1]
    <Element 'LINE' at 0x10cfb0368>
    >>> merchant[-2][1][54][1].text
    "The quality of mercy is not strain'd,"

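    Note

    Your Turn: Repeat some of the above methods for one of the other Shakespeare plays included in the corpus, such as Romeo and Juliet or Macbeth; for a list of the available files, see nltk.corpus.shakespeare.fileids().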

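    Although we can access the entire tree this way, it is more convenient to search for sub-elements with particular names. Recall that the elements at the top level have several types. We can iterate over the types of interest (such as the acts) using merchant.findall('ACT'). Here is an example of doing such tag-specific searches at every level of nesting: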

    >>> for i, act in enumerate(merchant.findall('ACT')):
    ...     for j, scene in enumerate(act.findall('SCENE')):
    ...         for k, speech in enumerate(scene.findall('SPEECH')):
    ...             for line in speech.findall('LINE'):
    ...                 if 'music' in str(line.text):
    ...                     print("Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text))
    Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
    Act 3 Scene 2 Speech 9: Fading in music: that the comparison
    Act 3 Scene 2 Speech 9: And what is music then? Then music is
    Act 5 Scene 1 Speech 23: And bring your music forth into the air.
    Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
    Act 5 Scene 1 Speech 23: And draw her home with music.
    Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
    Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
    Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
    Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
    Act 5 Scene 1 Speech 25: The man that hath no music in himself,
    Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
    Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
    Act 5 Scene 1 Speech 32: No better a musician than the wren.
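
    The nested search can also be wrapped in a function and tried on a small document. The following is a sketch under stated assumptions: the `sample` XML and the helper name `find_lines` are invented for illustration. It also shows why the loop calls `str(line.text)`: when a LINE begins with a child element, its `.text` is `None`, and the text after the child is stored as that child's tail. Joining `itertext()` is one way to recover the full text of such a line.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration only.
sample = """<PLAY>
<ACT><SCENE>
<SPEECH><SPEAKER>A</SPEAKER><LINE>Let music sound</LINE><LINE>No tune here</LINE></SPEECH>
<SPEECH><SPEAKER>B</SPEAKER><LINE><STAGEDIR>Aside</STAGEDIR> soft music</LINE></SPEECH>
</SCENE></ACT>
</PLAY>"""
root = ET.fromstring(sample)

def find_lines(play, word):
    """Return (act, scene, speech, text) for lines whose leading text contains word."""
    hits = []
    for i, act in enumerate(play.findall('ACT'), start=1):
        for j, scene in enumerate(act.findall('SCENE'), start=1):
            for k, speech in enumerate(scene.findall('SPEECH'), start=1):
                for line in speech.findall('LINE'):
                    # line.text is None when the LINE starts with a child element;
                    # str() keeps the membership test from raising a TypeError.
                    if word in str(line.text):
                        hits.append((i, j, k, line.text))
    return hits

hits = find_lines(root, 'music')

# The third LINE starts with a STAGEDIR, so its own .text is None.
nested_line = root.findall('ACT/SCENE/SPEECH/LINE')[2]
full_text = ''.join(nested_line.itertext())

print(hits)
print(nested_line.text)
print(full_text)
```

    Note that the `str(line.text)` test silently skips the line in the second speech even though it mentions music; whether that matters depends on how often lines in the real files begin with embedded elements.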

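    Instead of navigating each step of the way down the hierarchy, we can search for particular embedded elements. For example, let's examine the sequence of speakers. We can use a frequency distribution to see who has the most to say: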

    >>> from collections import Counter
    >>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
    >>> speaker_freq = Counter(speaker_seq)
    >>> top5 = speaker_freq.most_common(5)
    >>> top5
    [('PORTIA', 117), ('SHYLOCK', 79), ('BASSANIO', 73),
     ('GRATIANO', 48), ('LORENZO', 47)]
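
    A path passed to `findall` is matched step by step from the element it is called on, so 'ACT/SCENE/SPEECH/SPEAKER' picks out exactly the elements at that depth. The following sketch, using an invented `sample` document (an assumption for illustration), shows the path search feeding a `Counter`, and the `.//SPEAKER` form, which matches SPEAKER elements at any depth.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration only.
sample = """<PLAY>
<ACT>
<SCENE>
<SPEECH><SPEAKER>X</SPEAKER><LINE>one</LINE></SPEECH>
<SPEECH><SPEAKER>Y</SPEAKER><LINE>two</LINE></SPEECH>
<SPEECH><SPEAKER>X</SPEAKER><LINE>three</LINE></SPEECH>
</SCENE>
</ACT>
</PLAY>"""
root = ET.fromstring(sample)

# Explicit path: one step per level of nesting, in document order.
speakers = [s.text for s in root.findall('ACT/SCENE/SPEECH/SPEAKER')]

# './/SPEAKER' searches all descendants, whatever their depth.
speakers_any_depth = [s.text for s in root.findall('.//SPEAKER')]

freq = Counter(speakers)
print(speakers)
print(freq.most_common(1))
```

    The explicit path is stricter: it would ignore a SPEAKER element that appeared somewhere other than inside ACT, SCENE, and SPEECH, whereas the `.//` form would count it.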

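    We can also look for patterns in who follows whom in the dialogues. Since there are 23 speakers, we first need to reduce the "vocabulary" to a manageable size, using the method described in 3.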

    >>> from collections import defaultdict
    >>> abbreviate = defaultdict(lambda: 'OTH')
    >>> for speaker, _ in top5:
    ...     abbreviate[speaker] = speaker[:4]
    ...
    >>> speaker_seq2 = [abbreviate[speaker] for speaker in speaker_seq]
    >>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(speaker_seq2))
    >>> cfd.tabulate()
         ANTO BASS GRAT  OTH PORT SHYL
    ANTO    0   11    4   11    9   12
    BASS   10    0   11   10   26   16
    GRAT    6    8    0   19    9    5
     OTH    8   16   18  153   52   25
    PORT    7   23   13   53    0   21
    SHYL   15   15    2   26   21    0

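    Ignoring the entry of 153, which counts exchanges between characters outside the top five (all labelled OTH), the largest values involve Portia: she alternates most often with the OTH group (52 and 53), and among the named characters most often with Bassanio (26 and 23).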
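
    The who-follows-whom table is a count of adjacent pairs in the abbreviated speaker sequence. The following is a minimal sketch of that idea in plain Python, without NLTK; the short `speaker_seq` list and the `top` list are made up for illustration, and `transitions` is a hypothetical stand-in for the conditional frequency distribution, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical speaker sequence, invented for illustration only.
speaker_seq = ['PORTIA', 'BASSANIO', 'PORTIA', 'NERISSA', 'PORTIA', 'BASSANIO']
top = ['PORTIA', 'BASSANIO']  # assumed "most frequent" speakers

# Any speaker not explicitly mapped falls back to 'OTH' (other).
abbreviate = defaultdict(lambda: 'OTH')
for speaker in top:
    abbreviate[speaker] = speaker[:4]

short_seq = [abbreviate[s] for s in speaker_seq]

# Count each (previous speaker, next speaker) pair of adjacent speeches.
transitions = defaultdict(Counter)
for prev, nxt in zip(short_seq, short_seq[1:]):
    transitions[prev][nxt] += 1

print(short_seq)
for prev in sorted(transitions):
    print(prev, dict(transitions[prev]))
```

    Each row of the table corresponds to one `prev` key and each column to a `nxt` key, so an entry such as the count for PORT followed by BASS reads as "a speech by Portia immediately followed by a speech by Bassanio".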