Summarizing a dataset with continuous and categorical variables
Date : March 29 2020, 07:55 AM
I think you're looking for the describe() function in the 'Hmisc' package. See its documentation for details.

Cluster analysis on a dataset which consists of categorical and continuous data?
Date : March 29 2020, 07:55 AM
Cluster analysis is all about distance. You can solve your problem in a few steps. Suppose your data frame df looks like this:
a       b      c      d
frog    lamp   llama  7.8
frog    onion  cat    4.3
frog    lamp   soup   1.3
monkey  onion  cat    8.1
dragon  onion  llama  3.6
library(cluster)
# make the distance matrix
# (daisy() handles mixed column types; categorical columns should be factors)
dist <- daisy(df)
# make a hierarchical cluster model
model <- hclust(dist)
# plot the hierarchy
plot(model)
# cut the tree at your chosen level (here: 3 clusters)
clustmember <- cutree(model, 3)
# add the cluster membership as a column to your data
df1 <- data.frame(df, cluster = clustmember)
a       b      c      d    cluster
frog    lamp   llama  7.8  1
frog    onion  cat    4.3  2
frog    lamp   soup   1.3  1
monkey  onion  cat    8.1  2
dragon  onion  llama  3.6  3
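
To make the "all about distance" point concrete, here is a minimal Python sketch of the same idea. It assumes a Gower-style dissimilarity (0/1 mismatch for categorical columns, range-scaled absolute difference for numeric ones), which is what daisy() uses for mixed column types, and complete linkage, which is the default method of hclust(). The function names are made up for this illustration, and the clustering loop is deliberately naive rather than efficient.

```python
from itertools import combinations

# Same five rows as the example table: three categorical columns, one numeric.
rows = [
    ("frog", "lamp", "llama", 7.8),
    ("frog", "onion", "cat", 4.3),
    ("frog", "lamp", "soup", 1.3),
    ("monkey", "onion", "cat", 8.1),
    ("dragon", "onion", "llama", 3.6),
]


def numeric_ranges(rows):
    """Range (max - min) of every numeric column, keyed by column index."""
    ranges = {}
    for col in range(len(rows[0])):
        values = [r[col] for r in rows]
        if not isinstance(values[0], str):
            ranges[col] = max(values) - min(values)
    return ranges


def gower_distance(x, y, ranges):
    """Average per-column dissimilarity between two rows.

    Categorical columns contribute 0 (equal) or 1 (different); numeric
    columns contribute |x - y| / range. Assumes every range is non-zero.
    """
    parts = []
    for col, (xv, yv) in enumerate(zip(x, y)):
        if isinstance(xv, str):
            parts.append(0.0 if xv == yv else 1.0)
        else:
            parts.append(abs(xv - yv) / ranges[col])
    return sum(parts) / len(parts)


def complete_linkage(rows, k):
    """Merge the two clusters whose farthest members are closest, until k remain."""
    ranges = numeric_ranges(rows)
    dist = {
        frozenset((i, j)): gower_distance(rows[i], rows[j], ranges)
        for i, j in combinations(range(len(rows)), 2)
    }
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(
                dist[frozenset((i, j))]
                for i in clusters[p[0]]
                for j in clusters[p[1]]
            ),
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected
    return clusters


clusters = complete_linkage(rows, k=3)
print(clusters)  # lists of 0-based row indices, one list per cluster
```

On these five rows the sketch groups the first and third rows together, the second and fourth together, and leaves the fifth on its own, which is the same partition as the cluster column above.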

Using ggplot2 facet grid to explore large dataset with continuous and categorical variables
Date : March 29 2020, 07:55 AM
Exploring our data is arguably the most interesting and intellectually challenging part of our research, so I encourage you to do some more reading on this topic. Visualisation is of course important. @Parfait has suggested shaping your data long, which makes plotting easier. Your mix of continuous and categorical data is a bit tricky. Beginners often try very hard to avoid reshaping their data, but there is no need to fret! On the contrary, you will find that most questions require a specific shape of your data, and in most cases you will not find a "one size fits all" shape.
``` r
library(tidyverse)
# gathering numeric columns (without ID, which is numeric).
# [I'd recommend against numeric IDs!!]
data_num <-
  mydf %>%
  select(-ID) %>%
  pivot_longer(cols = which(sapply(., is.numeric)), names_to = 'key', values_to = 'value')

# No need to use facet here
ggplot(data_num) +
  geom_boxplot(aes(key, value, color = group))

# selecting categorical columns is a bit more tricky in this example,
# because your group is also categorical.
# One way:
# first convert all categorical columns to character,
# then turn your "group" into a factor,
# then gather the character columns.
# I use simple count() and mutate() to create a summary data frame with the
# proportions, and geom_col, which equals geom_bar(stat = 'identity').
# There may be neater ways, but this is pretty straightforward.
data_cat <-
  mydf %>%
  select(-ID) %>%
  mutate_if(.predicate = is.factor, .funs = as.character) %>%
  mutate(group = factor(group)) %>%
  pivot_longer(cols = which(sapply(., is.character)), names_to = 'key', values_to = 'value') %>%
  count(group, key, value) %>%
  group_by(group, key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() # I always 'ungroup' after my data manipulations, in order to avoid unexpected effects

ggplot(data_cat) +
  geom_col(aes(group, percent, fill = key)) +
  facet_grid(~ value)
```
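
For readers working in Python, the reshape-then-summarise step for the categorical columns can be sketched with pandas. This is only an illustration: the small `mydf` below and its `colour` and `size` columns are made up, since the original data frame is not shown. It melts the categorical columns into key/value pairs, counts each value, and divides by the total within each group and key.

```python
import pandas as pd

# Hypothetical stand-in for mydf: an ID, a grouping column, two categorical columns.
mydf = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "group": ["a", "a", "b", "b"],
    "colour": ["red", "blue", "red", "red"],
    "size": ["S", "S", "L", "S"],
})

# Long format: one row per (ID, group, key, value).
data_cat = mydf.melt(id_vars=["ID", "group"], var_name="key", value_name="value")

# Count each value per group and key ...
summary = (
    data_cat.groupby(["group", "key", "value"])
    .size()
    .reset_index(name="n")
)

# ... then turn the counts into proportions within each group/key combination.
summary["percent"] = summary["n"] / summary.groupby(["group", "key"])["n"].transform("sum")

print(summary)
```

The resulting `summary` frame has one row per group, key and value, so it can be passed directly to a bar-chart function of whatever plotting library you prefer.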

cut continuous variables to categorical variables in r (with separate values as groups)
Date : March 29 2020, 07:55 AM

Calculating Conditional Probabilities for Categorical and Continuous variables in python? P(categorical | continuous)
Tag : python , By : user106284
Date : March 29 2020, 07:55 AM
A and B are numeric variables. The first approach computes conditional probabilities from the table alone, treating it as the whole population. Let us assume that A and B can only take the values that appear in the table, and take the following example probability table, with a few more rows for better understanding:
import pandas as pd
df = pd.read_csv('prob.txt', sep=' ') # let the dataframe df store the probability table
df
# the probability table
     A    B    C
0  2.0  1.0  foo
1  2.2  1.2  bar
2  1.0  1.5  foo
3  2.0  3.0  bar
4  2.0  2.0  foo
5  3.2  1.2  foo
# Pr(C='foo' | A=2.0) = Pr(C='foo' & A=2.0) / Pr(A=2.0)
df[(df.C=='foo') & (df.A==2.0)] # Pr(C='foo' & A=2.0), we have 2 such rows
# A B C
# 0 2.0 1.0 foo
# 4 2.0 2.0 foo
df[(df.A==2.0)] # Pr(A=2.0), we have 3 such rows
# A B C
# 0 2.0 1.0 foo
# 3 2.0 3.0 bar
# 4 2.0 2.0 foo
# the required probability Pr(C='foo' | A=2.0)
df[(df.C=='foo') & (df.A==2.0)].shape[0] / df[(df.A==2.0)].shape[0] # 2 / 3
# 0.6666666666666666
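
The same calculation can be wrapped in a small helper so that any event and any condition can be plugged in. This is a sketch under the same assumption as above (the table is the whole population); `conditional_probability` is a name invented for this example, and the table is rebuilt inline instead of being read from prob.txt.

```python
import pandas as pd

# The example probability table, rebuilt inline.
df = pd.DataFrame({
    "A": [2.0, 2.2, 1.0, 2.0, 2.0, 3.2],
    "B": [1.0, 1.2, 1.5, 3.0, 2.0, 1.2],
    "C": ["foo", "bar", "foo", "bar", "foo", "foo"],
})


def conditional_probability(event, given):
    """Pr(event | given) for two boolean Series defined over the same rows."""
    if not given.any():
        raise ValueError("the conditioning event never occurs in the table")
    return (event & given).sum() / given.sum()


p = conditional_probability(df["C"] == "foo", df["A"] == 2.0)
print(p)  # 2 of the 3 rows with A == 2.0 have C == 'foo'

# The conditional distribution of C for every observed value of A at once:
# each row of the result sums to 1.
table = pd.crosstab(df["A"], df["C"], normalize="index")
print(table)
```

Comparing floats with `==` works here only because the values are typed in exactly; for measured data, binning A first (for example with `pd.cut`) is usually the safer choice.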
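If A and B are treated as continuous instead, one option is to fit a probabilistic classifier and read off its predicted class probabilities, for example with Gaussian naive Bayes:
import pandas as pd
# load your data in dataframe df here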
df.head()
# A B C
# 0.161729 0.814335 foo
# 0.862661 0.517964 foo
# 0.814303 0.337391 foo
# 1.898132 1.530963 bar
# 2.124829 0.289176 bar
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
X, y = df[['A','B']], df['C']
# fit the classifier on the training dataset
clf.fit(X, y)
# predict Pr(C='bar' | A, B) with predict_proba()
# (column 0 is 'bar' because clf.classes_ is sorted alphabetically)
print(clf.predict_proba([[1,1]])[:,0]) # Pr(C='bar' | A=1.0, B=1.0)
# [ 0.86871233]
import numpy as np
import matplotlib.pyplot as plt
# grid covering the observed range of A and B
X1, X2 = np.meshgrid(np.linspace(X['A'].min(), X['A'].max(), 10), np.linspace(X['B'].min(), X['B'].max(), 10))
plt.figure(figsize=(10,6))
# plot the probability surface Pr(C='bar' | A, B)
plt.contourf(X1, X2, clf.predict_proba(np.c_[X1.ravel(), X2.ravel()])[:,0].reshape(X1.shape), cmap='jet', alpha=.8)
plt.colorbar()
# overlay the training points, coloured by class
cols = {'foo':'green', 'bar':'red'}
plt.scatter(X['A'], X['B'], c=[cols[c] for c in y.tolist()], s=50)
plt.show()
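# The same call is repeated below with different results, presumably because
# clf was refit with other classifiers; those fitting steps are not shown here.
clf.predict_proba([[1,1]])[:,0] # Pr(C='bar' | A=1.0, B=1.0)
# [ 0.67028318]
clf.predict_proba([[1,1]])[:,0] # Pr(C='bar' | A=1.0, B=1.0)
# [ 1.0]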
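
To see where a Gaussian naive Bayes probability comes from, the posterior can be worked out by hand with Bayes' rule: class prior times one normal density per feature, normalised over the classes. The sketch below uses only the five rows shown by df.head() above, so its numbers will not match the outputs quoted earlier, which came from the full dataset. The `fit` and `posterior` helpers are written for this illustration and are not scikit-learn functions; scikit-learn also adds a small variance smoothing term, so its results may differ slightly.

```python
import math
from statistics import fmean, pvariance

# The five rows shown by df.head(); the full dataset is not available here.
data = [
    (0.161729, 0.814335, "foo"),
    (0.862661, 0.517964, "foo"),
    (0.814303, 0.337391, "foo"),
    (1.898132, 1.530963, "bar"),
    (2.124829, 0.289176, "bar"),
]


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit(data):
    """Per-class prior plus (mean, variance) of each feature within the class."""
    model = {}
    for label in sorted({row[-1] for row in data}):
        rows = [row[:-1] for row in data if row[-1] == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(data),
            "stats": [(fmean(col), pvariance(col)) for col in columns],
        }
    return model


def posterior(model, point):
    """Pr(class | point), treating the features as independent Gaussians."""
    joint = {}
    for label, params in model.items():
        likelihood = math.prod(
            gaussian_pdf(x, mean, var)
            for x, (mean, var) in zip(point, params["stats"])
        )
        joint[label] = params["prior"] * likelihood
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


model = fit(data)
probs = posterior(model, (1.0, 1.0))
print(probs)  # maps each class label to Pr(C = label | A=1.0, B=1.0)
```

With so few rows the fitted variances are tiny, so the posterior is close to 0 or 1 almost everywhere; that is a property of the sample size, not of the method.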

