alexr_rwx: (Default)
[personal profile] alexr_rwx
kompressr: make text shorter harnessing the power of acronyms (MTSHTPOA)

Try it out, let me know what you think and if it breaks :) Should be useful for making long papers shorter, by automatically extracting acronyms and using them wherever possible.

Public beta!™

(running on App Engine, with NLTK!)
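
For the curious, here is a minimal sketch of the general idea, and definitely not kompressr's actual code: count repeated word sequences, introduce each one with its acronym in parentheses the first time it shows up, and use the bare acronym after that. The names `acronym_for`, `repeated_ngrams`, and `compress` are made up for illustration, the fixed phrase length `n` is an assumption, and it uses plain string splitting rather than NLTK.

```python
from collections import Counter


def acronym_for(words):
    # First letter of each word, uppercased: ('during', 'the') -> 'DT'.
    return "".join(w[0].upper() for w in words)


def repeated_ngrams(tokens, n, min_count=2):
    # Set of n-grams (as tuples) that occur at least min_count times.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram for gram, c in counts.items() if c >= min_count}


def compress(text, n=2):
    # Scan left to right; a repeated n-gram is written out once as
    # 'phrase (ACR)' and replaced by 'ACR' on later occurrences.
    tokens = text.split()
    repeated = repeated_ngrams(tokens, n)
    seen = set()
    out = []
    i = 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if gram in repeated:
            acr = acronym_for(gram)
            if gram in seen:
                out.append(acr)
            else:
                out.extend(gram)
                out.append("(%s)" % acr)
                seen.add(gram)
            i += n  # skip past the whole phrase
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

With the default two-word phrases, `compress("space exploration is hard but space exploration is fun")` gives `space exploration (SE) is hard but SE is fun`. Nothing here respects phrase boundaries, so a pair like "during the" would happily become DT.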

Date: 2010-06-22 07:04 am (UTC)
From: [personal profile] chrisamaphone
cool! i tried it on my recent LJ posts to get an idea for what it did, and then learned a bunch about my own writing patterns. :) some of the acronyms were pretty awkward; at first i thought because they include common short words but maybe more because they don't fall on natural phrase boundaries. stuff like "dinner at", "talking about", "and the", "i had a"...
Edited Date: 2010-06-22 07:05 am (UTC)

Date: 2010-06-22 05:22 pm (UTC)
From: [personal profile] lindseykuper
"and then learned a bunch about my own writing patterns."

Unintended benefit! That's awesome.

Date: 2010-06-23 01:31 am (UTC)
From: [identity profile] oniugnip.livejournal.com
Thanks thanks!

Not breaking along phrase boundaries is both easier to implement and what I was going for (i.e., kind of senseless, but following a pattern that's intuitively clear).

The alternative would require some light parsing -- maybe chunking for noun phrases...
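
As a rough illustration of what "chunking for noun phrases" could look like: given tokens that already carry part-of-speech tags, collect runs of an optional determiner, any adjectives, and then one or more nouns. NLTK's `RegexpParser` is the usual tool for this kind of thing; the sketch below hand-rolls a tiny version so it runs without any tagger data. The tags follow the Penn Treebank convention, and `noun_phrases` is a hypothetical helper, not something from kompressr.

```python
def noun_phrases(tagged):
    # Return noun-phrase chunks from a list of (word, tag) pairs.
    # A chunk is: optional determiner (DT), any adjectives (JJ*),
    # then one or more nouns (NN*), roughly <DT>?<JJ.*>*<NN.*>+.
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun was found: emit everything from i to k.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            # No noun follows here; move on by one token.
            i += 1
    return chunks
```

If only phrases inside such chunks were allowed to become acronyms, a pair like "dinner at" would never qualify, since it crosses a chunk boundary.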

Date: 2010-06-22 03:21 pm (UTC)
From: [identity profile] lyceum-arabica.livejournal.com
(laughs) Very cool! ...with some pretty funny pre-tweaking results. I fed it the first couple paragraphs of the Wikipedia article on space flight, and it decided that 'Space Exploration' (SE), 'during the' (DT), and 'of Space Exploration' (OSE) were all good picks :-) This isn't my area, but I'd think the problem of identifying important common phrases particular to a piece of work (like SE), while weeding out generally common phrases and things that are a concatenation of the two, might be paper-worthy.

Also, you might want to keep it from acronyming its acronyms:

"Various criticisms of SE (OSE) are sometimes made. SE has often been used as a proxy competition for geopolitical rivalries such as the (AT) (SAT) Cold War. The early era OSE was driven by a "Space Race" between the Soviet Union and the United States (US); the launch of the (OT) first (TF) (OTF) (TLOTF) man-made object to orbit the Earth..."
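
One possible way to tame the nesting, sketched under the assumption that candidate phrases are available as tuples of words: keep only the candidates that are not contained in a longer candidate. The helper names `contains_span` and `maximal_phrases` are hypothetical.

```python
def contains_span(long, short):
    # True if tuple `short` appears contiguously inside tuple `long`.
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_phrases(candidates):
    # Keep only candidates that are not contained in a longer candidate.
    cands = set(candidates)
    return {
        c for c in cands
        if not any(len(o) > len(c) and contains_span(o, c) for o in cands)
    }
```

This is a crude filter: it would also throw out SE whenever OSE is a candidate, even though SE shows up plenty on its own. A gentler version might keep a shorter phrase if it also occurs outside the longer one.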

Date: 2010-06-22 05:18 pm (UTC)
From: [personal profile] lindseykuper
I wasn't a fan of the acronyming-of-acronyms, either, but Alex claims that that's a feature!

Date: 2010-06-23 01:32 am (UTC)
From: [identity profile] oniugnip.livejournal.com
I tried it both ways :)

Date: 2010-06-23 01:37 am (UTC)
From: [identity profile] oniugnip.livejournal.com
Thanks for playing with it! :)

"This isn't my area, but I'd think the problem of identifying important common phrases particular to a piece of work (like SE), while weeding out generally common phrases and things that are a concatenation of the two, might be paper-worthy."

It's a good intuition that that's an important problem, but it's way been done. There's a whole literature on finding sequences that commonly appear together (say, in a set of documents, or characteristically of particular documents)... words that NLP/corpus linguistics people say when they're discussing such a thing include "collocation" (http://en.wikipedia.org/wiki/Collocation) and "TF/IDF" (http://en.wikipedia.org/wiki/Tf-idf), if you're interested.
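
To make the TF/IDF idea concrete, here is a small sketch that scores two-word phrases: a phrase scores high if it is frequent in one document but rare across the rest of the collection. It uses the plain `count * log(N / df)` form; real systems vary the weighting, and the function names are just for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    # Lowercased adjacent word pairs.
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def tfidf_scores(doc, corpus):
    # Score each bigram in `doc` by term frequency times inverse
    # document frequency. Assumes `corpus` (a list of strings)
    # includes `doc`, so every bigram has document frequency >= 1.
    n_docs = len(corpus)
    doc_sets = [set(bigrams(d)) for d in corpus]
    tf = Counter(bigrams(doc))
    scores = {}
    for gram, count in tf.items():
        df = sum(1 for s in doc_sets if gram in s)  # docs containing the bigram
        scores[gram] = count * math.log(n_docs / df)
    return scores
```

A phrase like "during the" that appears in every document gets a score of zero, while "space exploration" appearing twice in only one document out of three scores about 2.2.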

Date: 2010-06-22 05:20 pm (UTC)
From: [personal profile] lindseykuper
*applauds!* I love you.

Bug report!: if I paste in text with newlines, it doesn't recognize previously acronym'd phrases that straddle the line breaks.

Date: 2010-06-23 01:39 am (UTC)
From: [identity profile] oniugnip.livejournal.com
<3!

Working on reproducing...

Date: 2010-06-23 01:57 am (UTC)
From: [personal profile] lindseykuper
Sorry, that was a bad explanation! You can repro with this, from my notes from work from Monday:

When the analyzer starts, it displays the message "Starting
emufuzzer_analyser XMLRPC server on port 55555".  It then sits in a
tight loop, waiting for the emulator module to pass it some
information: a machine state and an instruction to execute.  [TODO:
Where in the code does this happen?]

EmuFuzzer starts by both write-protecting and read-protecting all
pages of memory on the real machine.  [TODO: Where in the code does
this happen?]  The emulator sends over instructions, one at a time, to
be run on the real machine.  When an instruction tries to read memory,
we intercept the access via the page fault and go and get the page
from the emulated environment (so we "lazily" grab pages from the
emulated environment as needed).  Then we have the page in the
physical environment, readable but still not writable.

"Where in the code does this happen?" appears twice, so there ought to be a "WITCDTH?" the second time. But because it appears across a line break the second time, it doesn't get fully acronym'd even once. (Somewhat excitingly, I know the answer to both questions now!)

Date: 2010-06-23 02:35 am (UTC)
From: [identity profile] oniugnip.livejournal.com
!!!

There seem to be carriage returns passed in from the form? I certainly didn't expect that.

Date: 2010-06-23 02:44 am (UTC)
From: [identity profile] oniugnip.livejournal.com
Should be fixed now! It was the carriage returns. (In this day and age, \r\n? I wonder where they get introduced!)
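
A small sketch of how this kind of bug can arise, assuming the tokenizer splits only on spaces and newlines (a guess, not necessarily what kompressr did): the carriage return stays glued to the word before it, so "does\r" no longer matches "does". As far as I know, browsers normalize line breaks in a textarea to CRLF when a form is submitted, which would explain where they get introduced. The function names are hypothetical.

```python
def naive_tokens(text):
    # Splits on single spaces and newlines only; a stray \r stays
    # attached to the word in front of it.
    return text.replace("\n", " ").split(" ")


def normalized_tokens(text):
    # str.split() with no arguments splits on any run of whitespace,
    # which includes \r\n, so the carriage return disappears.
    return text.split()


form_text = "Where in the code does\r\nthis happen?"
plain_text = "Where in the code does this happen?"
```

Here `naive_tokens(form_text)` contains the token `"does\r"`, so the phrase never matches its first occurrence, while `normalized_tokens(form_text)` equals `normalized_tokens(plain_text)`.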
lindseykuper: Photo of me outside. (Default)
From: [personal profile] lindseykuper
I believe what the kids say is "o_O". Or "...".
ext_110843: (clango: drink more coffee!)
From: [identity profile] oniugnip.livejournal.com
The URLs are hackable! Try to find the other secret modes! :)
