Skip to content

Dissociated Parse #62

@cpressey

Description

@cpressey

It's well known that Markov chains don't understand grammar. Any sequences in the output that might look grammatical are only there because grammatical-looking sequences are statistically likely.

Off and on I've been interested in seeing if one can make variations of Markov generators that do retain some of the syntactic structure of the original text.

I had some success with it in 2019 with Anne of Green Garbles, which shows that (roughly speaking) one can combine two Markov models to obtain a third model where the generation can switch between states (like "in narration" and "in dialogue").

This is a second experiment with this goal.

tl;dr -- we can run the Dissociated Press algorithm in a context where we retain the syntactic structure associated with the text, that is, not just on a list of words like usual, but on a forest of parse trees. I call this variation Dissociated Parse.

Obtaining some parse trees from The Wonderful Wizard of Oz using the link-grammar parser, and running Dissociated Parse on them, produces text like:

everything had had it .

she dearly looked the bread were Dorothy carried the rest and filled dried leaves to the Emerald City ,
not fall into aunt Em , and then shook the tree .

near Dorothy as the country were her been lived as long she he sang briskly . .

to then , the room did not had about it .

the strong pressure at first know Toto until morning dressed than Dorothy that she oiled the house .

So, this is clearly generating nonsense, like a Markov chain would, but it is also retaining some of the source's syntactic structure, like a Markov chain wouldn't. (It still doesn't understand anything about tense and aspect, though; and the link-grammar parser doesn't always parse exactly as you'd expect; so for these reasons it still frequently falls short of "syntactically well-formed".)

Dissociated Press

For background, a description of the Dissociated Press algorithm.

Just as there is more than one algorithm for sorting, there is more than one algorithm for generating a Markov chain.

The usual algorithm involves analyzing the source text and building a weighted transition table, then finding a random (but likely) path
through this table.

The probably less well-known algorithm called Dissociated Press goes like this:

  1. load all the words into a list in memory
  2. select some word as the starting word
  3. print the current word
  4. find all occurrences of the current word in the text
  5. select one of those occurrences at random and jump to it
  6. move to the next word in the text
  7. repeat from step 3

Even though this works rather differently from the transition table algorithm, it produces the same result. (Exercise for the reader: convince yourself that it does in fact produce the same result.)

One downside of this algorithm is that it requires the entire text be kept in memory, rather than just a transition table. But, this is also an upside, in the sense that variations on the algorithm can exploit structure in the text which would not be retained in the transition table.

Dissociated Parse adapts this to work recursively on parse trees. Consider a parse tree to consist of a word, a part-of-speech tag, and zero or more child trees. Here is a sketch of the algorithm:

traverse(tree):
    1. find all trees that have the same part-of-speech tag as this tree
    2. select one of those trees at random and replace this tree with it
    3. print the word of this tree
    4. for each child of this tree, traverse(child)

That's all it is. One obvious improvement that I haven't tried yet is to have it find all trees that have the same part of speech and word as the current tree. I hope to try that. And to submit a 50,000-word long run too. But in case I do not get around to those, I wanted to at least get this writeup out.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions