Friday, July 16, 2010

"I Write Like" filter-like algorithm compares and categorizes your writing style

Have you heard about "I Write Like"? Go to http://iwl.me and paste in a couple of paragraphs of your writing. In a few seconds the algorithm will tell you "which famous writer you write like by analyzing your word choice and writing style and comparing them with those of the famous writers".

The idea belongs to Dmitry Chestnykh, a 27-year-old Russian software programmer. "Chestnykh modeled the site on software for e-mail spam filters. This means that the site's text analysis is largely keyword based. Even if you write in short, declarative, Hemingwayesque sentences, it's your word choice that may determine your comparison" (NPR: http://alturl.com/8qurx).

The Russian blogosphere is surprisingly quiet about both the author, Dmitry Chestnykh, and the site. As for Dmitry himself, he tweeted "I write like Douglas Adams. Proof:" on July 8. One week later he is the Internet's newest superstar, busy giving interviews. Here is one of them from THEAWL.COM (http://alturl.com/4h8tx):

"...How many authors are currently in the database? How did you decide which authors to include?
The current version includes 50 writers. First versions included authors from the bestsellers list on Wikipedia, top downloaded books from The Gutenberg Project (a public library of out-of-copyright books), and the ones I could remember. Later versions included authors suggested by users.

When are you going to add explanations for the algorithm for each author? Why haven't you included this already — why keep it secret?
I wanted to write a blog post about it, and to open-source the code, but haven't had time for it yet, because I've been busy updating the program and handling all the traffic, emails and comments I received. Also, it's really interesting to read how people try to explain the results they got.

Actually, the algorithm is not a rocket science, and you can find it on every computer today. It's a Bayesian classifier, which is widely used to fight spam on the Internet. Take for example the "Mark as spam" button in Gmail or Outlook. When you receive a message that you think is spam, you click this button, and the internal database gets trained to recognize future messages similar to this one as spam. This is basically how "I Write Like" works on my side: I feed it with "Frankenstein" and tell it, "This is Mary Shelley. Recognize works similar to this as Mary Shelley." Of course, the algorithm is slightly different from the one used to detect spam, because it takes into account more stylistic features of the text, such as the number of words in sentences, the number of commas, semicolons, and whether the sentence is a direct speech or a quotation..."
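To make the idea concrete, here is a minimal sketch of a naive Bayes author classifier in Python. It is not the site's code, which had not been published at the time of the interview. The feature choices (lowercase words, plus bucketed sentence length and comma count treated as pseudo-words), the add-one smoothing, and the names `features` and `AuthorClassifier` are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Turn text into tokens: lowercase words plus crude style markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.split()
        if not words:
            continue
        # Assumed style features: bucketed sentence length and comma count,
        # added to the token list as pseudo-words.
        tokens.append(f"<len:{min(len(words) // 5, 6)}>")
        tokens.append(f"<commas:{min(sentence.count(','), 3)}>")
    return tokens


class AuthorClassifier:
    def __init__(self):
        self.counts = defaultdict(Counter)  # author -> token counts
        self.vocab = set()                  # every token seen in training

    def train(self, author, text):
        """'This is Mary Shelley. Recognize works similar to this.'"""
        tokens = features(text)
        self.counts[author].update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        """Return the author with the highest log-likelihood for the text."""
        tokens = features(text)
        best_author, best_score = None, -math.inf
        for author, counts in self.counts.items():
            total = sum(counts.values())
            # Uniform prior over authors; add-one (Laplace) smoothing so
            # unseen tokens do not zero out the probability.
            score = sum(
                math.log((counts[t] + 1) / (total + len(self.vocab)))
                for t in tokens
            )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


clf = AuthorClassifier()
clf.train("Mary Shelley",
          "The wretched creature fled into the darkness, "
          "and I was filled with despair.")
clf.train("Douglas Adams",
          "The spaceship hung in the galaxy. Don't panic, bring a towel.")
print(clf.classify("I was filled with despair, and the creature fled."))
```

With only one sentence of training text per author the result mostly reflects shared vocabulary, which matches the NPR remark that word choice dominates. A real version would train on whole books and probably weight the style features separately.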

Well, I pasted in three sample essays from my blog. Twice the result was David Foster Wallace (http://alturl.com/w2quj) and once Arthur C. Clarke (http://alturl.com/rmixh).

5 comments:

  1. Great concept.
    All I can say is that their algorithm isn't disastrous. You will notice that they kept their training samples to an 'optimal' level of 50. Fewer samples would mean that the training isn't meaningful, and more samples would mean that a heavy amount of algorithm optimization would be needed to produce good results. Obviously there is only so much distance that they can cover with this tool.

    I tried ten different writing samples and it told me that I stand between Dan Brown, David Foster Wallace and someone called Cory Doctorow. Not bad.

  2. I write like Cory Doctorow. I had no idea ;-)

  3. Who is Cory Doctorow?

  4. Thanks, I checked: he has a page on Wikipedia.
