I am interested in finding the n-grams of a string x = "A T G C C G C G T". I use the ngram R package to get the n-grams. I use the following lines to get my job done.
I don't know about ngram, but you can produce the output like this:
x= "A T G C C G C G T"
strsplit(gsub("(\\S)(?=\\s(\\S))|\\s+\\S$", "\\1\\2", x, perl=TRUE), " ")[[1]]
#  "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
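The regex trick above amounts to a sliding window over the space-separated tokens. Here is a minimal Python sketch of that idea, in case it helps to see it spelled out. The function name token_ngrams and its parameters are made up for illustration and are not part of the ngram R package.

```python
def token_ngrams(text, n=2, sep=""):
    """Return all n-grams over the whitespace-separated tokens of text."""
    tokens = text.split()
    # Slide a window of width n over the tokens; the range is empty
    # when there are fewer than n tokens, so the result is [].
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = token_ngrams("A T G C C G C G T", n=2)
print(bigrams)
# ['AT', 'TG', 'GC', 'CC', 'CG', 'GC', 'CG', 'GT']
```

Changing n gives trigrams and so on, and passing sep=" " keeps the tokens separated inside each n-gram.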
I would recommend using Pig! Pig makes things like this very easy and straightforward. Here's a sample Pig script that does pretty much what you need:
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
filtered = FILTER raw BY year >= 1980;
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');
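If you just want to check the logic on a small sample before running it on a cluster, the same filter / group / sum / sort / limit steps can be sketched in plain Python. This is only an illustration under assumptions: the function name top_ngrams and the sample rows are invented, the input is assumed to be tab-separated with the five columns from the LOAD statement, and everything is held in memory, so it is not a substitute for Pig on large data.

```python
from collections import Counter


def top_ngrams(lines, min_year=1980, limit=10000):
    """Sum counts per ngram for rows with year >= min_year, highest first."""
    totals = Counter()
    for line in lines:
        # Columns assumed: ngram, year, count, pages, books (tab-separated).
        ngram, year, count, _pages, _books = line.rstrip("\n").split("\t")
        if int(year) >= min_year:          # FILTER raw BY year >= 1980
            totals[ngram] += int(count)    # GROUP ... / SUM(filtered.count)
    return totals.most_common(limit)       # ORDER ... DESC / LIMIT


# Invented sample rows, for illustration only.
sample = [
    "foo\t1979\t5\t1\t1",
    "foo\t1985\t3\t1\t1",
    "bar\t1990\t10\t2\t2",
    "foo\t2000\t4\t1\t1",
]
for ngram, total in top_ngrams(sample):
    print(f"{ngram}\t{total}")
```

To run it on a real file, pass an open file handle instead of the sample list, for example top_ngrams(open(path, encoding="utf-8")) with your own path.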
How to get the array of all ngrams in Perl Text::Ngrams
You can't get all the different sizes of n-grams in a single call, but you can get them all using multiple calls to get_ngrams. There is an undocumented parameter n to get_ngrams that sets the size of the n-grams you want listed. In your code, if you say
my @ngramsarray = $ng3->get_ngrams(
    n         => 1,
    orderby   => 'frequency',
    onlyfirst => 10,
    normalize => 0);
you get a flat list of n-grams and their frequencies, for example:
('T', 8, 'E', 4, 'X', 2, '_', 2, 'S', 2)
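The "one call per size" idea is not specific to Perl. As a rough sketch in Python (this is not the Text::Ngrams API; the function names char_ngram_counts and ngrams_by_size are hypothetical, and it does none of the case folding or whitespace handling that Text::Ngrams applies), you could loop over the sizes you want and keep the most frequent n-grams of each:

```python
from collections import Counter


def char_ngram_counts(text, n):
    """Count every character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngrams_by_size(text, sizes=(1, 2, 3), top=10):
    """Make one counting pass per size and keep the top entries of each."""
    return {n: char_ngram_counts(text, n).most_common(top) for n in sizes}


result = ngrams_by_size("ATGCCGCGT", sizes=(1, 2), top=3)
for n, pairs in result.items():
    print(n, pairs)
```

Each value is a list of (ngram, frequency) pairs ordered by frequency, which plays a similar role to orderby => 'frequency' and onlyfirst in the call above.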
Is it possible to maintain order of ngrams in the output of textcnt function in R?