July 12, 2003
Be Careful What You Wish For

Colby Cosh is looking for software to analyze his style:

You know what I’d like to have? I was thinking about this. I’ve put almost a year’s worth of text here: at a guess, I’ve written three or four hundred thousand words for the site. What I’d like is a program that analyzed my word frequencies and compared them to some background standard to see which ones I might be overusing. I don’t know if other writers have this phobia . . . sometimes I’ll use some slightly obscure adjective, and I’ll realize, “Hmmm, I’ve written that word, what, three times in the last two months? It’s kind of an unusual word. How often can I get away with this before people start to notice?” And so I have to strike it out of my vocabulary for a while. But maybe there are subconsciously irritating “favourites” I’m not aware of. If there were a way for a large cross-section of a person’s prose to be analyzed in this manner . . . well, I’m perfectly aware that there are sophisticated forensic tools for the analysis of word frequencies, I just don’t know if any of them have ever been adapted to a specifically literary purpose. Even a crude application would be useful: the old Smith-Corona electronic typewriter that got me through college was able to do this kind of analysis on a single document, and I came to rely on it to save me from embarrassing word repeats.

Knowledge of one’s own style is not necessarily a good thing. Here’s one of my favorite passages from David Lodge’s Small World (1984), Part III, Chapter I. It is set in a pub, where Persse, a young Irish poet, is talking to Frobisher, a washed-up proletarian novelist:

“How did you come to lose faith in your style?” Persse enquired.

“I’ll tell you. I can date it precisely from a trip I made to Darlington six years ago. There’s a new university there, you know, one of those plateglass and poured-concrete affairs on the edge of the town. They wanted to give me an honorary degree. Not the most prestigious university in the world, but nobody else had offered to give me a degree. The idea was, Darlington’s a working-class, industrial town, so they’d honour a writer who wrote about working-class, industrial life. I bought that. I was sort of flattered, to tell the truth. So I went up there to receive this degree. The usual flummery of robes and bowing and lifting your cap to the vice-chancellor and so on. Bloody awful lunch. But it was all right, I didn’t mind. But then, when the official part was over, I was nobbled by a man in the English Department. Name of Dempsey.”

“Robin Dempsey,” said Persse.

“Oh, you know him? Not a friend of yours, I hope?”

“Definitely not.”

“Good. Well, as you probably know, this Dempsey character is gaga about computers. I gathered this over lunch, because he was sitting opposite me. ‘I’d like to take you over to our Computer Centre this afternoon,’ he said. ‘We’ve got something set up for you that I think you’ll find interesting.’ He was sort of twitching in his seat with excitement as he said it, like a kid who can’t wait to unwrap his Christmas presents. So when the degree business was finished, I went with him to this Computer Centre. Rather grand name, actually, it was just a prefabricated hut, with a couple of sheep cropping the grass outside. There was another chap there, sort of running the place, called Josh. But Dempsey did all the talking. ‘You’ve probably heard,’ he said, ‘of our Centre for Computational Stylistics.’ ‘No,’ I said, ‘Where is it?’ ‘Where? Well, it’s here, I suppose,’ he said. ‘I mean, I’m it, so it’s wherever I am. That is, wherever I am when I’m doing computational stylistics, which is only one of my research interests. It’s not so much a place,’ he said, ‘as a headed notepaper. Anyway,’ he went on, ‘when we heard that the University was going to give you an honorary degree, we decided to make yours the first complete corpus in our tape archive.’ ‘What does that mean?’ I said. ‘It means,’ he said, holding up a flat metal canister rather like the sort you keep film spools in, ‘It means that every word you’ve ever published is in here.’ His eyes gleamed with a kind of manic glee, like he was Frankenstein, or some kind of wizard, as if he had me locked up in that flat metal box. Which, in a way, he had. ‘What’s the use of that?’ I asked. ‘What’s the use of it?’ he said, laughing hysterically. ‘What’s the use? Let’s show him, Josh.’ And he passed the canister to the other guy, who takes out a spool of tape and fits it on to one of the machines. ‘Come over here,’ says Dempsey, and sits me down in front of a kind of typewriter with a TV screen attached. 
‘With that tape,’ he said, ‘we can request the computer to supply us with any information we like about your idiolect.’ ‘Come again?’ I said. ‘Your own special, distinctive, unique way of using the English language. What’s your favourite word?’ ‘My favourite word? I don’t have one.’ ‘Oh yes you do!’ he said. ‘The word you use most frequently.’ ‘That’s probably the or a or and,’ I said. He shook his head impatiently. ‘We instruct the computer to ignore what we call grammatical words—articles, prepositions, pronouns, modal verbs, which have a high frequency rating in all discourse. Then we get to the real nitty-gritty, what we call the lexical words, the words that carry a distinctive semantic content. Words like love or dark or heart or God. Let’s see.’ So he taps away on the keyboard and instantly my favourite word appears on the screen. What do you think it was?”

“Beer?” Persse ventured.

Frobisher looked at him a shade suspiciously through his owlish spectacles, and shook his head. “Try again.”

“I don’t know, I’m sure,” said Persse.

Frobisher paused to drink and swallow, then looked solemnly at Persse. “Grease,” he said, at length.

“Grease?” Persse repeated blankly.

“Grease. Greasy. Greased. Various forms and applications of the root, literal and metaphorical. I didn’t believe him at first, I laughed in his face. Then he pressed a button and the machine began listing all the phrases in my works in which the word grease appears in one form or another. There they were, streaming across the screen in front of me, faster than I could read them, with page references and line numbers. The greasy floor, the roads greasy with rain, the grease-stained cuff, the greasy jam butty, his greasy smile, the grease-smeared table, the greasy small change of their conversation, even, would you believe it, his body moved in hers like a well-greased piston. I was flabbergasted, I can tell you. My entire oeuvre seemed to be saturated with grease. I’d never realized I was so obsessed with the stuff. Dempsey was chortling with glee, pressing buttons to show what my other favourite words were. Grey and grime were high on the list, I seem to remember. I seemed to have a penchant for depressing words beginning with a hard ‘g’. Also sink, smoke, feel, struggle, run and sensual. Then he started to refine the categories. The parts of the body I mentioned most often were hand and breast, usually one on the other. The direct speech of male characters was invariably introduced by the simple tag he said, but the speech of women by a variety of expressive verbal groups, she gasped, she sighed, she whispered urgently, she cried passionately. All my heroes have brown eyes, like me. Their favourite expletive is bugger. The women they fall in love with tend to have Biblical names, especially ones beginning with ‘R’—Ruth, Rachel, Rebecca, and so on. I like to end chapters with a short moodless sentence.”

“You remember all this from six years ago?” Persse marvelled.

“Just in case I might forget, Robin Dempsey gave me a printout of the whole thing, popped it into a folder and gave it to me to take home. ‘A little souvenir of the day,’ he was pleased to call it. Well, I took it home, read it on the train, and the next morning, when I sat down at my desk and tried to get on with my novel, I found I couldn’t. Every time I wanted an adjective, greasy would spring into my mind. Every time I wrote he said, I would scratch it out and write he groaned or he laughed, but it didn’t seem right—but when I went back to he said, that didn’t seem right either, it seemed predictable and mechanical. Robin and Josh had really fucked me up between them. I’ve never been able to write fiction since.”

He ended, and emptied his tankard in a single draught.

“That’s the saddest story I ever heard,” said Persse.

Of course, none of this answers Cosh’s question, whether such programs are available outside the pages of a comic novel.
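
For what it is worth, the crude version of what Cosh describes (and what Dempsey does to poor Frobisher) is not hard to sketch. The Python below is only an illustration under stated assumptions, not a description of any existing program: it drops a tiny, made-up list of "grammatical" words, counts the remaining "lexical" words, and ranks them by how much more often they turn up in a writer's sample than in some background text. The names STOP_WORDS, tokenize, lexical_counts and overused are all hypothetical, the stop list is far too short for real use, and the add-one smoothing is just one simple way to keep words absent from the background from causing a division by zero.

```python
import re
from collections import Counter

# Illustrative stop list of "grammatical" words; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "with",
    "it", "he", "she", "his", "her", "i", "my", "was", "is",
}


def tokenize(text):
    """Lowercase the text and pull out runs of letters (with an optional
    internal straight apostrophe, as in "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_counts(text):
    """Count every word that is not on the stop list."""
    return Counter(w for w in tokenize(text) if w not in STOP_WORDS)


def overused(sample, background, min_count=2, top=10):
    """Rank words by the ratio of their rate in `sample` to their
    (smoothed) rate in `background`, highest ratio first."""
    s = lexical_counts(sample)
    b = lexical_counts(background)
    s_total = sum(s.values())
    b_total = sum(b.values())
    vocab = len(set(s) | set(b))
    scores = {}
    for word, n in s.items():
        if n < min_count:
            continue  # ignore words used too rarely to matter
        sample_rate = n / s_total
        # Add-one smoothing: unseen background words get a small nonzero rate.
        background_rate = (b[word] + 1) / (b_total + vocab)
        scores[word] = sample_rate / background_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    sample = ("The greasy floor, the roads greasy with rain, "
              "his greasy smile, the grey town.")
    background = ("The floor was clean. The roads were dry. "
                  "The town had a smile. Rain fell on the grey town.")
    for word, score in overused(sample, background):
        print(word, round(score, 2))
```

Fed a year of blog posts as the sample and a large pile of ordinary prose as the background, something along these lines would at least flag the candidates for "subconsciously irritating favourites"; whether one really wants to see the list is another matter.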

Posted by Dr. Weevil at July 12, 2003 09:01 AM
Comments

Wasn't the Unabomber identified using just this sort of stylistic analysis?

I vaguely recall that the folks trying to verify the identity of William Shakespeare have been using something similar (i.e., comparing the plays to the writing style of Francis Bacon, Edward DeVere, etc.)

Posted by: Kevn Shaum on July 13, 2003 01:03 AM

Joe Klein was identified as the author of "Primary Colors" by a computer analysis, run by Professor Don Foster: http://www.amazon.com/exec/obidos/tg/detail/-/0805068120/104-1275882-3465532?vi=glance

Posted by: pj on July 13, 2003 05:45 PM

It shouldn't be too hard to write something that at the very least does a concordance.

Posted by: David Gillies on July 14, 2003 03:30 PM

There are some standard Unix tools that could answer Colby's question quickly. I assume all have been ported to Windows and Mac systems though I haven't used them there.

Posted by: Jim Miller on July 15, 2003 02:09 PM

No idea if it's any good, but...

http://www.variagate.com/wordfreq.htm

Posted by: Geoffrey Barto on July 19, 2003 05:09 AM

More than a little off-topic, but there's a Unix program called Dissociated Press that, given a corpus of text, produces an often wicked parody of the writer. It requires about 5-10 pages of work for the best results.

If you're interested, do a google search for "random text markov model".

Posted by: Eric Brown on July 21, 2003 08:34 PM