Find root keyword from a bunch of longtails

mattseh

import this
Apr 6, 2009
A ~= A
Code:
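import collections

# Assumes each line looks like: some text {long tail one|long tail two|...}
for line in open('data.txt'):
    # Grab the spintax body, drop the closing brace, split into phrases
    parts = line.split('\n')[0].split(' {')[1].rstrip('}').split('|')

    # Count every leading run of words (prefix) across the phrases
    counter = collections.Counter()
    for part in parts:
        words = part.split(' ')
        for i in range(1, len(words)+1):
            counter[tuple(words[:i])] += 1

    # Keep the prefixes tied for the highest count
    most_common = [words for words, count in counter.most_common(10) if count == counter.most_common(1)[0][1]]

    # The longest of those is the root keyword
    root_keyword = ' '.join(sorted(most_common, key=len)[-1])
    print(root_keyword)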
Can anyone simplify?
 


Code:
import virtual assistants

for every correct extraction
     write answer in gdocs
     paypal_api ( send 0.50 to paypal_email )

sleep 60
profit




^^ dat for you matt
 
If you really want an automated solution, look up word2vec on Google. It's a free, open-source algorithm that vectorizes words according to semantic meaning. You'll also need a stats package that does k-means clustering.

Pseudocode

for PhraseID, phrase in phraselist
    vectorlist.append(PhraseID, word2vec(iterwords(phrase)))

-->

k-meansclustering(vectorlist)

-->

for PhraseID, phrase in phraselist
    if len(phrase) <= 2
        k-meansscore(vectorlist.getvector(PhraseID))
        if k-meansscore >= benchmarkscore
            possiblerootwordslist.append(phraselist.getphrase(PhraseID))
        else
            Pass
    (Next iter)
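
For what it's worth, here is a rough Python sketch of the pseudocode above. Big assumptions: the TOY_VECTORS table is made up and only stands in for a trained word2vec model, a phrase vector is taken as the average of its word vectors, the k-means is a bare-bones version rather than a stats package, and "score" is read as distance to the nearest cluster centre (lower is better, so the comparison is flipped). The names phrase_vector, kmeans, candidate_roots and max_distance are mine, not from any library.

```python
import math

# Hypothetical 2-D toy vectors standing in for real word2vec embeddings.
TOY_VECTORS = {
    "dog": (0.9, 0.1), "training": (0.8, 0.2), "puppy": (1.0, 0.0), "tips": (0.7, 0.3),
    "car": (0.1, 0.9), "insurance": (0.0, 1.0), "cheap": (0.2, 0.8), "quotes": (0.3, 0.7),
}

def phrase_vector(phrase):
    """Embed a phrase as the average of its word vectors (an assumption)."""
    vecs = [TOY_VECTORS[word] for word in phrase.split()]
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k points so runs repeat."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: math.dist(point, centroids[c]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

def candidate_roots(phrases, k, max_distance):
    """Return phrases of at most two words that sit close to a cluster centre."""
    vectors = {phrase: phrase_vector(phrase) for phrase in phrases}
    centroids = kmeans(list(vectors.values()), k)
    roots = []
    for phrase, vec in vectors.items():
        if len(phrase.split()) <= 2:
            score = min(math.dist(vec, c) for c in centroids)
            if score <= max_distance:
                roots.append(phrase)
    return roots

EXAMPLE_PHRASES = [
    "dog training", "dog training tips", "puppy training tips",
    "car insurance", "cheap car insurance", "cheap car insurance quotes",
]
print(candidate_roots(EXAMPLE_PHRASES, k=2, max_distance=0.1))
```

With real data you would swap the toy table for vectors from a trained model and pick k and the distance cutoff by trial and error, so treat this as an outline of the idea only.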
 
Code:
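#!/usr/bin/perl
use strict;
use warnings;
use File::Glob qw(bsd_glob); # bsd_glob can unroll spintax via brace expansion
open(my $data, '<', 'data.txt') or die "data.txt: $!";
while (<$data>) {
   chomp; tr/|/,/;   # {a|b} becomes {a,b} so the braces expand
   my (%C, $i);      # reset the word counts and print limit for each line
   $C{$_}++ for map { split(/ +/, $_) } bsd_glob($_);
   for (reverse(sort { $C{$a} <=> $C{$b} } keys %C)) {
       print "$_ \t=> $C{$_}\n"; last if ++$i > 9;   # top 10 words only
   }
}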
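
For anyone who'd rather stay in Python, here is a rough equivalent of the Perl above: unroll the spintax, split every variant into words, and list the most frequent ones. It assumes flat (non-nested) {a|b|c} groups, and expand_spintax and top_words are names made up for this sketch.

```python
import collections
import itertools
import re

def expand_spintax(text):
    """Return every variant of text, picking one option per {a|b|c} group."""
    # Splitting on a capturing group alternates literal text and group bodies.
    pieces = re.split(r'\{([^{}]*)\}', text)
    choices = [piece.split('|') if i % 2 else [piece] for i, piece in enumerate(pieces)]
    return [''.join(combo) for combo in itertools.product(*choices)]

def top_words(line, n=10):
    """Count words across all spintax variants of one line; return the top n."""
    counter = collections.Counter()
    for variant in expand_spintax(line.strip()):
        counter.update(variant.split())
    return counter.most_common(n)

for word, count in top_words("buy {red shoes|red boots}"):
    print(word, "=>", count)
```

Like the Perl version, this counts single words rather than phrases, so the root keyword still has to be read off the top of the list by eye.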