Text preprocessing and topic modelling using text2vec package

Text preprocessing and topic modelling using text2vec package

Content Index :

Text preprocessing and topic modelling using text2vec package
Tag : r , By : Naveen
Date : November 23 2020, 03:01 PM

this one helps. Updated answer in response to your comment:
First question: The issue of synonym replacement has already been answered here: Replace words in text2vec efficiently. Check the answer of count in partiular. Patterns and replacements may be ngrams (multi word phrases). Please note that the second answer of Dmitriy Selivanov uses word_tokenizer() and does not cover the case of ngram replacement in the form presented.

#function proposed by @count
mgsub <- function(pattern,replacement,x) {
  if (length(pattern) != length(replacement)){
    stop("Pattern not equal to Replacment")
  for (v in 1:length(pattern)) {
    x  <- gsub(pattern[v],replacement[v],x, perl = TRUE)
  return(x )

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is boiling like lava",
          "the coffee is frozen",
          "the coffee is perfect",
          "the coffee is warm almost hot"

synonyms <- data.frame(mainword = c("warm", "cold")
                       ,syn1 = c("hot", "frozen")
                       ,syn2 = c("boiling like lava", "")
                       ,stringsAsFactors = FALSE)

synonyms[synonyms == ""] <- NA

synonyms <- reshape2::melt(synonyms
                           ,id.vars = "mainword"
                           ,value.name = "synonym"
                           ,na.rm = TRUE)

synonyms <- synonyms[, c("mainword", "synonym")]

prep_fun <- tolower
tok_fun <- word_tokenizer
tokens <- docs %>% 
  #here is where you might replace synonyms directly in the docs
  #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], . ) } %>%
  prep_fun %>% 
it <- itoken(tokens, 
             progressbar = FALSE)

v <- create_vocabulary(it,
                       sep_ngram = "_",
                       ngram = c(ngram_min = 1L
                                 #allow for ngrams in dtm
                                 ,ngram_max = max(stri_count_fixed(unlist(synonyms), " "))

vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer)

#ngrams in dtm

#ensure that ngrams in synonym replacement table have the same format as ngrams in dtm
synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x))

colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm))

#only zeros/ones in dtm since none of the docs specified in my example
#contains duplicate terms
#7 24

#workaround to aggregate colnames in dtm
#I think there is no function `colsum` that allows grouping
#therefore, a workaround based on rowsum
#not elegant because you have to transpose two times, 
#convert to matrix and reconvert to sparse matrix
dtm <- 
      rowsum(t(as.matrix(dtm)), group = colnames(dtm))
    , sparse = T)

#synonyms in columns replaced
#7 20

No Comments Right Now !

Boards Message :
You Must Login Or Sign Up to Add Your Comments .

Share : facebook icon twitter icon

Topic Modelling - Assign human readable labels to topic

Tag : development , By : Bruce
Date : March 29 2020, 07:55 AM
it helps some times See this paper by Jey Han Lau. He describes how to automatically generate labels using different sources and features.

How to get topic probability table from text2vec LDA

Tag : r , By : Ram
Date : March 29 2020, 07:55 AM
Hope that helps doc_topic_distr is a matrix which contains number of how many times words from document were assigned to particular topic. So you need just to normalize each row by number of words (also you can add doc_topic_prior before normalization).
tokens = movie_review$review %>% 
  tolower %>% 
# turn off progressbar because it won't look nice in rmd
it = itoken(tokens, ids = movie_review$id, progressbar = FALSE)
v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "lda_c")
doc_topic_prior = 0.1
lda_model = 
  LDA$new(n_topics = 10, vocabulary = v, 
          doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)

doc_topic_distr = 
  lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01, 
                          check_convergence_every_n = 10)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#5814_8   16   18    0   34    0   16   49    0   20    23
#2381_9    4    0    6   20    0    0    6    6    0    28
#7759_3   21   39    7    0    3   47    0   25   21    17
#3630_4   18    7   22   14   19    0   18    0    2    35
#9495_8    4    0   13   17   13   78    3    2   28    25
#8196_8    0    0    0   11    0    8    0    8    8     0
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
# or add norm first and normalize :
# doc_topic_prob = normalize(doc_topic_distr + doc_topic_prior, norm = "l1")

In R text2vec package - LDA model can show the topic distribution for each tokens in document?

Tag : r , By : greggerz
Date : March 29 2020, 07:55 AM
I wish this helpful for you Unfortunately it is impossible to get distribution of topics for each token in a given document. Document-topic counts are calculated/aggregated "on the fly", so document-token-topic distribution is not stored anywhere.

Normalized topic document probabilities text2vec R

Tag : r , By : Adrian Codrington
Date : March 29 2020, 07:55 AM
Does that help According to the up to date documentation LDA fit_transform returns topic probabilities.

LDA topic model using R text2vec package and LDAvis in shinyApp

Tag : r , By : Morbo
Date : March 29 2020, 07:55 AM
wish helps you Private fields are private for a purpose - they are specifically hided for a user and not part of public API (can be easily changed in future or removed). The correct way of embedding LDAvis to a shiny app is to store LDAvis json on disk and then open it in a shiny app. Something like should work:
lda_model$plot(out.dir = "SOME_DIR", open.browser = FALSE)
output$myChart <- renderVis(readLines("SOME_DIR/lda.json"))
Related Posts Related QUESTIONS :
  • Deparse and (un)escape quotes
  • Regression table with clustered standard errors in R jupyter notebook?
  • Disaggregate quarterly data to daily data in R keeping values?
  • How to save output to console and file simultaneously in RStudio server?
  • Why does data.table j have a different environment when directly calling mget() vs calling mget() inside another functio
  • scale_fill_viridis_c color bar on a log scale
  • How to change the lab name corresponding to function in ggplot
  • R, filtering for an element in a list in a dataframe cell
  • Extracting only bottom temperature from 4d NetCDF file
  • How to add/wrap lines of text to .tex with .sh script
  • R - building new variables from sequenced data
  • Sum rows values one after the other
  • Nesting ifelse inside summarytools
  • How best to divide different levels of a factor by one another in dataframe in R?
  • Why does my code run multiple times before I type data into the table? How do I make an action button that creates a tab
  • How to impute missing values not at random?
  • Set the y limits of an added average line of a plotly plot
  • how to calculate a new column after grouping with dplyr
  • Extract data from rows creating new columns using R
  • Create a filled area line plot with plotly
  • When do I need parentheses around an if statement to control the sequence of a formula in R?
  • my graph in ggplot2 contains an "e" character in y-axis
  • Making variables immutable in R
  • R: Difference between the subsequent ranks of a item group by date
  • Match data within multiple time-frames with dplyr
  • Conditional manipulation and extension of rows in data.table also considering previous extensions without for-loop
  • Conditional formula referring to preview row in DF not working
  • Set hoverinfo text in plotly scatterplot
  • Histogram of Sums from Categorical/Binary Data
  • Efficiently find set differences and generate random sample
  • Find closest points from data set B to point in data set A, using lat long in R
  • dplyr join on column A OR column B
  • Replace all string if row starts with (within a column)
  • Is there a possibility to combine position_stack and nudge_x in a stacked bar chart in ggplot2?
  • How can I extract bounding boxes in a row-wise manner using R?
  • How do I easily sum up values in different columns?
  • Reading numeric Date value from CSV file to data.frame in "R"
  • R programming: creating a stacked bar graph, with variable colors for each stacked bar
  • How to identify all columns that contain binary representation
  • Filter different groups by different factor levels
  • Saving .xlsx file to disc, form http post request
  • Add an "all" option under the filter that selects the number of rows displayed in a datatable
  • How to select second column of every xts in list
  • Generate a frequency dataframe out of an input dataframe
  • Why manual autocorrelation does not match acf() results?
  • Merge 3 dataframes which are different to each other
  • remove adjacent duplicates from string
  • How to change the position of stacked stacked bar chart in ggplot in R?
  • How to divide each of a range a variables by a second range of variables in R
  • Why do I need to assemble vector before scaling in Spark?
  • How to select individuals which appear in multiple groups?
  • How can I fill columns based on values in another column?
  • 32 bit R and 64 bit R: output differs
  • Remove a single backslash in paste0 output
  • ggplot2 different label for the first break
  • TSP in R, with given distances
  • How to find the given value from the range of values?
  • Solution on R group by issue _ multiple combination
  • Transform multiple columns with a function that uses different arguments per column
  • How can I parse a string with the format "1/16/2019 1:24:51" into a POSIXct or other date variable?
  • shadow
    Privacy Policy - Terms - Contact Us © scrbit.com