
Is there an easy way to separate categorical vs continuous variables into two datasets in R?


Tag : r , By : picamiolo
Date : November 24 2020, 05:44 AM



Summarizing a dataset with continuous and categorical variables


Tag : r , By : user98832
Date : March 29 2020, 07:55 AM
I think you're looking for the function describe() in the package 'Hmisc'. See the documentation for details.

Cluster analysis on a dataset which consists of categorical and continuous data?


Tag : r , By : Tim
Date : March 29 2020, 07:55 AM
Cluster analysis is all about distance, so you need a distance measure that can handle both categorical and continuous columns.
You can solve your problem in a few steps. Suppose your data frame df looks like this:
a       b      c      d
frog    lamp   llama  7.8
frog    onion  cat    4.3
frog    lamp   soup   1.3
monkey  onion  cat    8.1
dragon  onion  llama  3.6
library(cluster)

# make the distance matrix; daisy() needs the categorical columns stored as
# factors and, for mixed column types, uses Gower's dissimilarity
d <- daisy(df, metric = "gower")

# make a hierarchical cluster model
model <- hclust(d)

# plot the hierarchy (dendrogram)
plot(model)

# cut the tree at your chosen level, here into 3 clusters
clustmember <- cutree(model, 3)

# add the cluster membership as a column to your data
df1 <- data.frame(df, cluster = clustmember)
a       b      c      d    cluster
frog    lamp   llama  7.8  1
frog    onion  cat    4.3  2
frog    lamp   soup   1.3  1
monkey  onion  cat    8.1  2
dragon  onion  llama  3.6  3

Using ggplot2 facet grid to explore large dataset with continuous and categorical variables


Tag : r , By : evegter
Date : March 29 2020, 07:55 AM
Exploring our data is arguably the most interesting and intellectually challenging part of our research, so I encourage you to do some more reading on this topic.
Visualisation is of course important. @Parfait has suggested reshaping your data to long format, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to fret! On the contrary, you will find that most questions require a specific shape of your data, and in most cases you will not find a "one fits all" shape.
``` r
library(tidyverse)

# Gather the numeric columns into long format. ID is dropped first because
# it is numeric but not a measurement. [I'd recommend against numeric IDs!!]
data_num <-
  mydf %>%
  select(-ID) %>%
  pivot_longer(cols = where(is.numeric), names_to = "key", values_to = "value")

# No need to use facet here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting categorical columns is a bit more tricky in this example,
# because your group is also categorical.
# One way:
# first convert all categorical (factor) columns to character,
# then turn your "group" back into a factor,
# then gather the character columns.
# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which equals geom_bar(stat = "identity").
# There may be neater ways, but this is pretty straightforward.
data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate(across(where(is.factor), as.character)) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = where(is.character), names_to = "key", values_to = "value") %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always ungroup after my data manipulations, in order to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
```

cut continuous variables to categorical variables in r (with separate values as groups)


Tag : r , By : Tom
Date : March 29 2020, 07:55 AM

Calculating Conditional Probabilities for Categorical and Continuous variables in python? P(categorical|continuous)


Tag : python , By : user106284
Date : March 29 2020, 07:55 AM
If A and B are numeric variables that can only take the values listed in the table, you can compute conditional probabilities directly from the table by treating it as the whole population.
Take the following example probability table, with a few more rows for better understanding:
import pandas as pd
df = pd.read_csv('prob.txt', sep=' ') # let the dataframe df store the probability table 
df

# the probability table 
     A   B   C
0   2.0 1.0 foo
1   2.2 1.2 bar
2   1.0 1.5 foo
3   2.0 3.0 bar
4   2.0 2.0 foo
5   3.2 1.2 foo
# Pr(C='foo'| A=2.0) = Pr(C='foo' & A=2.0) / Pr(A=2.0)

df[(df.C=='foo') & (df.A==2.0)] # Pr(C='foo' & A=2.0), we have 2 such rows
#    A   B   C
# 0 2.0 1.0 foo
# 4 2.0 2.0 foo

df[(df.A==2.0)]    # Pr(A=2.0), we have 3 such rows 
#    A   B   C
# 0 2.0 1.0 foo
# 3 2.0 3.0 bar
# 4 2.0 2.0 foo

# the required probability Pr(C='foo'| A=2.0)
df[(df.C=='foo') & (df.A==2.0)].shape[0] / df[(df.A==2.0)].shape[0]  # 2 / 3
# 0.6666666666666666   
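The counting steps above can be wrapped in a small reusable function. This is only a sketch: `cond_prob` is a hypothetical helper name, not part of pandas, and the table is rebuilt inline (same six rows) so the example runs without `prob.txt`.

```python
import pandas as pd

# Same six rows as the probability table above, built inline so the
# sketch does not depend on prob.txt.
df = pd.DataFrame({
    "A": [2.0, 2.2, 1.0, 2.0, 2.0, 3.2],
    "B": [1.0, 1.2, 1.5, 3.0, 2.0, 1.2],
    "C": ["foo", "bar", "foo", "bar", "foo", "foo"],
})

def cond_prob(table, target, given):
    """Pr(target | given), treating the rows of `table` as the population.

    `target` and `given` are dicts mapping column names to required values.
    Hypothetical helper for illustration, not a pandas function.
    """
    # Rows where the conditioning event holds
    given_mask = pd.Series(True, index=table.index)
    for col, val in given.items():
        given_mask &= table[col] == val
    if not given_mask.any():
        raise ValueError("conditioning event never occurs in the table")
    # Rows where both the conditioning event and the target event hold
    joint_mask = given_mask.copy()
    for col, val in target.items():
        joint_mask &= table[col] == val
    return joint_mask.sum() / given_mask.sum()

p = cond_prob(df, {"C": "foo"}, {"A": 2.0})
print(p)  # 2 of the 3 rows with A == 2.0 have C == 'foo'
```

Exact equality on floats works here only because the values are matched literally against the table; for measured, truly continuous data this approach breaks down.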
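If A and B are continuous, exact matching on their values is not practical. One option is to fit a probabilistic classifier on the data and read the conditional probability from predict_proba():
import pandas as pd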
# load your data in dataframe df here
df.head()
#        A         B      C
# 0.161729  0.814335    foo
# 0.862661  0.517964    foo
# 0.814303  0.337391    foo
# 1.898132  1.530963    bar
# 2.124829  0.289176    bar

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
X, y = df[['A','B']], df['C']

# fit the classifier on the training dataset
clf.fit(X, y)

# predict the Pr(C = 'bar' | A, B) with predict_proba() 
print(clf.predict_proba([[1,1]])[:,0])   # Pr(C='bar'|A=1.0, B=1.0)
# [ 0.86871233]

import numpy as np
import matplotlib.pyplot as plt

# grid covering the observed range of A and B
X1, X2 = np.meshgrid(np.linspace(X['A'].min(), X['A'].max(), 10),
                     np.linspace(X['B'].min(), X['B'].max(), 10))
plt.figure(figsize=(10,6))
# plot the probability surface Pr(C='bar' | A, B)
plt.contourf(X1, X2, clf.predict_proba(np.c_[X1.ravel(), X2.ravel()])[:,0].reshape(X1.shape), cmap='jet', alpha=.8)
plt.colorbar()
cols = {'foo':'green', 'bar':'red'}
plt.scatter(X['A'], X['B'], c=[cols[c] for c in y.tolist()], s=50)
plt.show()
clf.predict_proba([[1,1]])[:,0]  # Pr(C='bar'|A=1.0, B=1.0)
# [ 0.67028318]
clf.predict_proba([[1,1]])[:,0]  # Pr(C='bar'|A=1.0, B=1.0)
# [ 1.0]
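To see roughly what a Gaussian naive Bayes classifier such as the GaussianNB used above estimates, here is a minimal standard-library sketch. It is not scikit-learn's implementation: the function names `fit_gaussian_nb` and `posterior`, the small variance constant, and the toy data are all assumptions made for illustration, so the numbers will not match the outputs shown above.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means, and feature variances per class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Population variance plus a tiny constant to avoid division by zero
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(y), means, variances)
    return model

def posterior(model, x):
    """Return {label: Pr(label | x)}, assuming independent Gaussian features."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_joint[label] = ll
    # Normalise in log space for numerical stability
    top = max(log_joint.values())
    weights = {k: math.exp(v - top) for k, v in log_joint.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Made-up toy data (not the data from the answer above)
X = [[0.2, 0.8], [0.9, 0.5], [0.8, 0.3], [1.9, 1.5], [2.1, 0.3], [2.4, 1.1]]
y = ["foo", "foo", "foo", "bar", "bar", "bar"]
model = fit_gaussian_nb(X, y)
probs = posterior(model, [1.0, 1.0])
print(probs)  # dict with Pr(C='foo' | A, B) and Pr(C='bar' | A, B)
```

The posterior is the class prior times the product of per-feature Gaussian densities, normalised over the classes, which is why a continuous A and B pose no problem here.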