May I present: kompressr
Jun. 22nd, 2010 01:56 am
kompressr: make text shorter harnessing the power of acronyms (MTSHTPOA)
Try it out, let me know what you think and if it breaks :) Should be useful for making long papers shorter, by automatically extracting acronyms and using them wherever possible.
Public beta!™
(running on App Engine, with NLTK!)
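For the curious, here is a minimal sketch of the general idea in Python. It is not kompressr's actual code: the function names, the fixed phrase length `n`, and the whitespace tokenizing are all assumptions made for illustration. A phrase that repeats gets its acronym in parentheses the first time, and only the acronym after that.

```python
from collections import Counter


def acronym(words):
    """Build an acronym from the first letter of each word."""
    return "".join(w[0].upper() for w in words)


def kompress(text, n=3):
    """Hypothetical sketch: shorten repeated n-word phrases to acronyms."""
    words = text.split()
    # Count every n-word window, case-insensitively.
    grams = Counter(
        tuple(w.lower() for w in words[i:i + n])
        for i in range(len(words) - n + 1)
    )
    repeated = {g for g, count in grams.items() if count > 1}

    out, seen, i = [], set(), 0
    while i < len(words):
        gram = tuple(w.lower() for w in words[i:i + n])
        if len(gram) == n and gram in repeated:
            if gram in seen:
                # Later occurrence: use the acronym alone.
                out.append(acronym(gram))
            else:
                # First occurrence: keep the phrase and introduce the acronym.
                seen.add(gram)
                out.extend(words[i:i + n])
                out.append("(" + acronym(gram) + ")")
            i += n
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)


print(kompress("space exploration is hard but space exploration is fun", n=2))
# space exploration (SE) is hard but SE is fun
```

The sketch ignores punctuation and handles only one phrase length at a time, so it is a toy compared with anything you would run on a real paper.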
no subject
Date: 2010-06-22 07:04 am (UTC)
no subject
Date: 2010-06-22 05:22 pm (UTC)
Unintended benefit! That's awesome.
no subject
Date: 2010-06-23 01:31 am (UTC)
Not breaking along phrase boundaries is both easier to implement and what I was going for (i.e., kind of senseless, but following a pattern that's intuitively clear).
The alternative would require some light parsing -- maybe chunking for noun phrases...
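A rough sketch of what that chunking could look like, in plain Python. The part-of-speech tags are written in by hand here and stand in for a real tagger's output; the function name and the simple "determiner, adjectives, nouns" pattern are assumptions for illustration, not anything kompressr does.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases.

    Pattern (an assumption): optional determiner (DT), any number of
    adjectives (JJ*), then one or more nouns (NN*).
    """
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":
            j += 1
        while j < len(tagged) and tagged[j][1].startswith("JJ"):
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # Found at least one noun, so tokens i..k form a chunk.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks


# Hand-tagged example sentence (tags are illustrative, Penn Treebank style).
tagged = [
    ("the", "DT"), ("early", "JJ"), ("era", "NN"),
    ("was", "VBD"), ("driven", "VBN"), ("by", "IN"),
    ("a", "DT"), ("space", "NN"), ("race", "NN"),
]
print(chunk_noun_phrases(tagged))
# [['the', 'early', 'era'], ['a', 'space', 'race']]
```

Only phrases that come out as chunks would then be candidates for acronyms, which gives up the senseless charm in exchange for acronyms that line up with phrases.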
no subject
Date: 2010-06-22 03:21 pm (UTC)
Also, you might want to keep it from acronyming its acronyms:
"Various criticisms of SE (OSE) are sometimes made. SE has often been used as a proxy competition for geopolitical rivalries such as the (AT) (SAT) Cold War. The early era OSE was driven by a "Space Race" between the Soviet Union and the United States (US); the launch of the (OT) first (TF) (OTF) (TLOTF) man-made object to orbit the Earth..."
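One way to get that behaviour, sketched in Python under assumptions: candidate phrases are taken from a sliding window of words, and any window containing a token that already looks like an acronym (two or more capitals, optionally in parentheses) is skipped. The names `is_acronym_token` and `eligible_phrases` are made up for this example.

```python
import re

# Matches tokens like "SE", "(OSE)" or "US;" (assumed acronym shape).
ACRONYM_TOKEN = re.compile(r"^\(?[A-Z]{2,}\)?[.,;:]?$")


def is_acronym_token(token):
    """Return True if the token looks like an acronym."""
    return bool(ACRONYM_TOKEN.match(token))


def eligible_phrases(words, n):
    """Yield n-word windows that contain no acronym-like token."""
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if not any(is_acronym_token(w) for w in window):
            yield tuple(window)


words = "criticisms of SE (OSE) are sometimes made".split()
print(list(eligible_phrases(words, 2)))
# [('criticisms', 'of'), ('are', 'sometimes'), ('sometimes', 'made')]
```

The trade-off is that acronyms already present in the input, like "US", are skipped too, so phrases containing them never get shortened.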
no subject
Date: 2010-06-22 05:18 pm (UTC)
no subject
Date: 2010-06-23 01:32 am (UTC)
no subject
Date: 2010-06-23 01:37 am (UTC)
This isn't my area, but I'd think the problem of identifying important common phrases particular to a piece of work (like SE), while weeding out generally common phrases and things that are a concatenation of the two, might be paper-worthy.
It's a good intuition that that's an important problem, but it's way been done. There's a whole literature on finding sequences that commonly appear together (say, in a set of documents, or characteristic of particular documents)... words that NLP/corpus linguistics people say when they're discussing such a thing include "collocation (http://en.wikipedia.org/wiki/Collocation)" and "TF/IDF (http://en.wikipedia.org/wiki/Tf-idf)", if you're interested.
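To make the TF/IDF idea concrete, here is a small Python sketch that scores two-word phrases across a few toy documents. The function names and the example documents are invented for illustration, and this is the plain textbook weighting (term frequency times log of inverse document frequency), not a tuned variant.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the list of adjacent lowercase word pairs in text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def tf_idf(docs):
    """Return one {bigram: score} dict per document."""
    counts = [Counter(bigrams(d)) for d in docs]
    # Document frequency: in how many documents each bigram appears.
    df = Counter(g for c in counts for g in c)
    n = len(docs)
    scores = []
    for c in counts:
        total = sum(c.values()) or 1
        scores.append(
            {g: (k / total) * math.log(n / df[g]) for g, k in c.items()}
        )
    return scores


docs = [
    "space exploration of the moon",
    "history of the moon",
    "cost of the war",
]
scores = tf_idf(docs)
# "of the" is in every document, so it scores zero; "space exploration"
# is particular to the first document, so it scores highest there.
print(max(scores[0], key=scores[0].get))
```

Phrases that are common everywhere sink to the bottom, which is the weeding-out step described in the comment above.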
no subject
Date: 2010-06-22 05:20 pm (UTC)
Bug report!: if I paste in text with newlines, it doesn't recognize previously acronym'd phrases that straddle the line breaks.
no subject
Date: 2010-06-23 01:39 am (UTC)
Working on reproducing...
no subject
Date: 2010-06-23 01:57 am (UTC)
"Where in the code does this happen?" appears twice, so there ought to be a "WITCDTH?" the second time. But because it appears across a line break the second time, it doesn't get fully acronym'd even once. (Somewhat excitingly, I know the answer to both questions now!)
no subject
Date: 2010-06-23 02:35 am (UTC)
There seem to be carriage returns passed in from the form? I certainly didn't expect that.
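A guess at the kind of fix that helps here, sketched in Python: collapse every run of whitespace, carriage returns included, into a single space before looking for phrases. The function name is hypothetical, and whether this matches the actual bug in kompressr is an assumption.

```python
import re


def normalize_whitespace(text):
    """Collapse CR, LF, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


pasted = "Where in the code\r\ndoes this happen?"

# Splitting on a literal space leaves the line break glued inside a token.
print(pasted.split(" "))
# ['Where', 'in', 'the', 'code\r\ndoes', 'this', 'happen?']

print(normalize_whitespace(pasted).split(" "))
# ['Where', 'in', 'the', 'code', 'does', 'this', 'happen?']
```

This throws away paragraph breaks, so output that should keep its line structure would need the breaks recorded and put back afterwards.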
no subject
Date: 2010-06-23 02:44 am (UTC)
".dll?BUSINESS_LOGIC=BUSINESS_LOGIC_SHORTEN.ACTION"
Date: 2010-06-23 02:06 am (UTC)
Re: ".dll?BUSINESS_LOGIC=BUSINESS_LOGIC_SHORTEN.ACTION"
Date: 2010-06-23 02:45 am (UTC)
Re: ".dll?BUSINESS_LOGIC=BUSINESS_LOGIC_SHORTEN.ACTION"
Date: 2010-07-14 03:04 am (UTC)