Is there an easy way to separate categorical vs continuous variables into two datasets in R?

Date : November 24 2020, 05:44 AM

Summarizing a dataset with continuous and categorical variables

Tag : r , By : user98832
Date : March 29 2020, 07:55 AM
I think you're looking for the function describe() in the package 'Hmisc'. See the documentation for details.

Cluster analysis on a dataset which consists of categorical and continuous data?

Tag : r , By : Tim
Date : March 29 2020, 07:55 AM
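Cluster analysis is all about distance, so for mixed categorical and continuous data you need a distance measure that handles both types.
You can solve your problem in a few steps. Given data like this (here called df):
a      b     c     d
frog   lamp  llama 7.8
frog   onion cat   4.3
frog   lamp  soup  1.3
monkey onion cat   8.1
dragon onion llama 3.6

# NOTE: the calls below are a reconstruction of the steps named in the comments.
# Assumption: Gower distance via daisy() from the 'cluster' package.
library(cluster)
df[c("a", "b", "c")] <- lapply(df[c("a", "b", "c")], factor)  # daisy() needs factors, not character

# make the distance matrix
d <- daisy(df, metric = "gower")

# make a hierarchical cluster model
fit <- hclust(d, method = "complete")

# plot the hierarchy
plot(fit)

# cut the tree at your decided level (here: 3 clusters)
groups <- cutree(fit, k = 3)

# add the cluster membership as a column to your data
df$cluster <- groups
a      b     c     d   cluster
frog   lamp  llama 7.8 1
frog   onion cat   4.3 2
frog   lamp  soup  1.3 1
monkey onion cat   8.1 2
dragon onion llama 3.6 3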
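
For intuition about what "distance" can mean when columns are both categorical and numeric, here is a minimal sketch of Gower's dissimilarity, one common choice for mixed data. It is written in plain Python purely as an illustration; the function name gower_distance and the numeric_ranges argument are hypothetical and not part of any library.

```python
def gower_distance(row_a, row_b, numeric_ranges):
    """Mean per-variable dissimilarity between two records (Gower's coefficient).

    numeric_ranges maps the index of each numeric column to its range
    (max - min) over the whole dataset; every other column is treated
    as categorical (0 if equal, 1 if different).
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(row_a, row_b)):
        if i in numeric_ranges:
            total += abs(x - y) / numeric_ranges[i]   # scaled to [0, 1]
        else:
            total += 0.0 if x == y else 1.0           # simple mismatch
    return total / len(row_a)


# The five example records from the table above.
rows = [
    ("frog", "lamp", "llama", 7.8),
    ("frog", "onion", "cat", 4.3),
    ("frog", "lamp", "soup", 1.3),
    ("monkey", "onion", "cat", 8.1),
    ("dragon", "onion", "llama", 3.6),
]
d_values = [r[3] for r in rows]
ranges = {3: max(d_values) - min(d_values)}  # only column d is numeric

print(round(gower_distance(rows[1], rows[3], ranges), 3))  # rows 2 and 4: 0.39
print(round(gower_distance(rows[2], rows[3], ranges), 3))  # rows 3 and 4: 1.0
```

Rows 2 and 4 share two of three categories and come out close together, while rows 3 and 4 differ in every column and sit at the maximum distance of 1.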

Using ggplot2 facet grid to explore large dataset with continuous and categorical variables

Tag : r , By : evegter
Date : March 29 2020, 07:55 AM
Exploring our data is arguably the most interesting and intellectually challenging part of our research, so I encourage you to do some more reading into this topic.
Visualisation is of course important. @Parfait has suggested reshaping your data to long format, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to fret! On the contrary, you will find that most questions require a specific shape of your data, and in most cases you will not find a "one size fits all" shape.

``` r
library(dplyr)
library(tidyr)
library(ggplot2)

# Gather the numeric columns into long format (without ID, which is numeric).
# [I'd recommend against numeric IDs!]
data_num <- 
  mydf %>% 
  select(-ID) %>% 
  pivot_longer(cols = which(sapply(., is.numeric)), names_to = 'key', values_to = 'value')

# No need to use facets here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# Selecting categorical columns is a bit more tricky in this example,
# because your group is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn your "group" back into a factor,
# then gather the character columns.
#
# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col(), which equals geom_bar(stat = "identity").
# There may be neater ways, but this is pretty straightforward.
data_cat <- 
  mydf %>% select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to = 'value') %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always ungroup after my data manipulations, in order to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
```

cut continuous variables to categorical variables in r (with separate values as groups)

Tag : r , By : Tom
Date : March 29 2020, 07:55 AM

Calculating Conditional Probabilities for Categorical and Continuous variables in python? P(categorical|continuous)

Tag : python , By : user106284
Date : March 29 2020, 07:55 AM
A and B are numeric variables. Here the conditional probabilities are computed from the table alone, treating it as the whole population.
Let us assume that A and B can only take the values that appear in the table, and take the following example probability table with a few more rows (for better understanding):
import pandas as pd
df = pd.read_csv('prob.txt', sep=' ')  # the dataframe df stores the probability table

# the probability table
#      A   B   C
# 0   2.0 1.0 foo
# 1   2.2 1.2 bar
# 2   1.0 1.5 foo
# 3   2.0 3.0 bar
# 4   2.0 2.0 foo
# 5   3.2 1.2 foo

# Pr(C='foo' | A=2.0) = Pr(C='foo' & A=2.0) / Pr(A=2.0)

df[(df.C == 'foo') & (df.A == 2.0)]  # rows with C='foo' and A=2.0: we have 2 such rows
#    A   B   C
# 0 2.0 1.0 foo
# 4 2.0 2.0 foo

df[(df.A == 2.0)]  # rows with A=2.0: we have 3 such rows
#    A   B   C
# 0 2.0 1.0 foo
# 3 2.0 3.0 bar
# 4 2.0 2.0 foo

# the required probability Pr(C='foo' | A=2.0) is the ratio of the two row counts
df[(df.C == 'foo') & (df.A == 2.0)].shape[0] / df[(df.A == 2.0)].shape[0]  # 2 / 3
# 0.6666666666666666
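
The counting above generalises to any condition on the table. As a rough sketch, assuming the table is small enough to hold as a list of dictionaries, the same ratio can be computed without pandas; the helper name conditional_probability is hypothetical, not a library function.

```python
def conditional_probability(rows, target, given):
    """Estimate Pr(target | given) by counting rows of a table.

    rows   : list of dicts, one per table row
    target : dict of column -> value for the event of interest
    given  : dict of column -> value for the conditioning event
    """
    matches_given = [r for r in rows
                     if all(r[k] == v for k, v in given.items())]
    if not matches_given:
        raise ValueError("no rows satisfy the condition")
    matches_both = [r for r in matches_given
                    if all(r[k] == v for k, v in target.items())]
    return len(matches_both) / len(matches_given)


# The same six-row probability table as above.
table = [
    {"A": 2.0, "B": 1.0, "C": "foo"},
    {"A": 2.2, "B": 1.2, "C": "bar"},
    {"A": 1.0, "B": 1.5, "C": "foo"},
    {"A": 2.0, "B": 3.0, "C": "bar"},
    {"A": 2.0, "B": 2.0, "C": "foo"},
    {"A": 3.2, "B": 1.2, "C": "foo"},
]

p = conditional_probability(table, target={"C": "foo"}, given={"A": 2.0})
print(p)  # 0.6666666666666666
```

Raising an error when no row matches the condition avoids a silent division by zero for values of A that never occur in the table.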
When A and B are treated as continuous variables, one option is to fit a classifier such as Gaussian naive Bayes and read the conditional probability from predict_proba():
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

# load your data in dataframe df here
#        A         B      C
# 0.161729  0.814335    foo
# 0.862661  0.517964    foo
# 0.814303  0.337391    foo
# 1.898132  1.530963    bar
# 2.124829  0.289176    bar

clf = GaussianNB()
X, y = df[['A', 'B']], df['C']

# fit the classifier on the training dataset
clf.fit(X, y)

# predict Pr(C='bar' | A, B) with predict_proba();
# column 0 corresponds to clf.classes_[0], which is 'bar' (classes are sorted)
print(clf.predict_proba([[1, 1]])[:, 0])   # Pr(C='bar' | A=1.0, B=1.0)
# [ 0.86871233]

# plot the probability surface over a 10 x 10 grid spanning the data
X1, X2 = np.meshgrid(np.linspace(X['A'].min(), X['A'].max(), 10),
                     np.linspace(X['B'].min(), X['B'].max(), 10))
plt.contourf(X1, X2, clf.predict_proba(np.c_[X1.ravel(), X2.ravel()])[:, 0].reshape(X1.shape), cmap='jet', alpha=.8)
cols = {'foo': 'green', 'bar': 'red'}
plt.scatter(X['A'], X['B'], c=[cols[c] for c in y.tolist()], s=50)
plt.show()
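# The same query gives different probabilities when the classifier is fitted
# on different data (presumably other datasets, which are not shown here):
clf.predict_proba([[1,1]])[:,0]  # Pr(C='bar'|A=1.0, B=1.0)
# [ 0.67028318]
clf.predict_proba([[1,1]])[:,0]  # Pr(C='bar'|A=1.0, B=1.0)
# [ 1.0]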
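
For intuition about what a Gaussian naive Bayes model computes, here is a minimal plain-Python sketch of the posterior Pr(C | A, B). It assumes independent normal likelihoods per feature with maximum-likelihood variances; it is not scikit-learn's implementation (which also applies a small variance smoothing), and the names fit_gaussian_nb and posterior are hypothetical. It uses only the five sample rows shown above, so its numbers will not match the outputs quoted for the full dataset.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and the per-feature mean and variance."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        params[label] = (n / len(y), means, variances)
    return params


def posterior(params, x):
    """Pr(class | x) via Bayes' rule, computed in log space."""
    log_joint = {}
    for label, (prior, means, variances) in params.items():
        log_p = math.log(prior)
        for value, m, var in zip(x, means, variances):
            # log of the normal density N(value; m, var)
            log_p += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        log_joint[label] = log_p
    # normalise with the log-sum-exp trick to avoid underflow
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / total for label, v in log_joint.items()}


# The five sample rows (A, B) and their labels C.
X = [(0.161729, 0.814335), (0.862661, 0.517964), (0.814303, 0.337391),
     (1.898132, 1.530963), (2.124829, 0.289176)]
y = ["foo", "foo", "foo", "bar", "bar"]

params = fit_gaussian_nb(X, y)
probs = posterior(params, (0.6, 0.6))  # a point near the 'foo' rows
print(probs)
```

Working in log space keeps the computation stable when a point lies many standard deviations from a class mean, where the raw density would be vanishingly small.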
