How to plot stacked bar chart to summarise each categorical column for proportion of values
Date : March 29 2020, 07:55 AM
Get rid of the irrelevant columns, then map every value to one of 'Missing', 'Unknown' or 'Other'. Call value_counts on each column. The count will be NaN instead of 0 when a value does not occur in a column, so use fillna(0) at the end. After that you already have the data you need; just plot it.
result = (df[['action', 'action_type', 'action_detail']]
          .where(df.isin(('Missing', 'Unknown')), 'Other')
          .apply(lambda x: x.value_counts(normalize=True))
          .fillna(0))
print(result)
         action  action_type  action_detail
Missing       0          0.5            0.5
Other         1          0.5            0.5
result.T.plot(kind='bar', stacked=True)
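The steps above can be tried end to end with a small frame. This is only a sketch: the data below is made up, and plot_proportions is a hypothetical helper that imports matplotlib lazily so the table can be built without it.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
df = pd.DataFrame({
    "action": ["click", "view", "click", "view"],
    "action_type": ["Missing", "a", "Unknown", "b"],
    "action_detail": ["Missing", "Missing", "x", "y"],
    "user_id": [1, 2, 3, 4],  # an irrelevant column that gets dropped
})

cols = ["action", "action_type", "action_detail"]
sub = df[cols]

result = (
    sub.where(sub.isin(["Missing", "Unknown"]), "Other")   # collapse everything else
       .apply(lambda col: col.value_counts(normalize=True))  # proportions per column
       .fillna(0)                                            # absent labels become 0
)
print(result)


def plot_proportions(table):
    """Draw one stacked bar per original column (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; needed only for plotting

    ax = table.T.plot(kind="bar", stacked=True)
    ax.set_ylabel("proportion")
    plt.show()
```

Calling plot_proportions(result) should then draw one bar per column, each stacked up to 1.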
|
Summarise data between given value of a categorical variable
Date : March 29 2020, 07:55 AM
You could use cumsum() to build the groups and then summarise within them. Each "A" row starts a new group, and only the "B" rows are kept:
df %>%
  mutate(Agroups = cumsum(categoriesVector == "A")) %>%
  filter(categoriesVector == "B") %>%
  group_by(Agroups) %>%
  summarise(propertyStart = min(propertyVector),
            propertyEnd = max(propertyVector),
            dataTotal = sum(dataVector))
# A tibble: 3 x 4
  Agroups propertyStart propertyEnd dataTotal
    <int>         <dbl>       <dbl>     <dbl>
1       2             3           3       700
2       3             5           7      1200
3       4             9           9       100
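For readers working in pandas, the same cumsum idea might look roughly like this. The column names are taken from the R code above; the data is invented for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical data; column names mirror the R answer above.
df = pd.DataFrame({
    "categoriesVector": ["A", "B", "B", "A", "B", "A", "B", "B"],
    "propertyVector":   [1, 2, 3, 4, 5, 6, 7, 8],
    "dataVector":       [10, 20, 30, 40, 50, 60, 70, 80],
})

summary = (
    df.assign(Agroups=(df["categoriesVector"] == "A").cumsum())  # new group at each "A"
      .loc[lambda d: d["categoriesVector"] == "B"]               # keep only the "B" rows
      .groupby("Agroups")
      .agg(propertyStart=("propertyVector", "min"),
           propertyEnd=("propertyVector", "max"),
           dataTotal=("dataVector", "sum"))
)
print(summary)
```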
|
dplyr, summarise categorical variable
Tag : r , By : Denis Chaykovskiy
Date : March 29 2020, 07:55 AM
You have at least two options. The first is to add the Category column to your group_by:
small %>%
  group_by(Video.ID, cat = Category) %>%
  summarise(sumr = sum(Partner.Revenue),
            len = mean(Video.Duration..sec.))
# A tibble: 1 x 4
# Groups: Video.ID [?]
# Video.ID cat sumr len
# <chr> <chr> <dbl> <dbl>
# 1 ---0zh9uzSE gadgets 0 1184
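Alternatively, keep the grouping as it is and add Category inside summarise (this assumes each Video.ID has a single Category):
small %>%
  group_by(Video.ID) %>%
  summarise(sumr = sum(Partner.Revenue),
            len = mean(Video.Duration..sec.),
            cat = unique(Category))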
# A tibble: 1 x 4
# Video.ID sumr len cat
# <chr> <dbl> <dbl> <chr>
# 1 ---0zh9uzSE 0 1184 gadgets
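As a rough pandas counterpart, named aggregation can carry the category along with the numeric summaries. The frame below is hypothetical apart from the column names and the one row shown in the output above, and taking the first Category per group is an assumption that each Video.ID has only one.

```python
import pandas as pd

# Hypothetical data; column names follow the R answer above.
small = pd.DataFrame({
    "Video.ID": ["---0zh9uzSE", "---0zh9uzSE", "abc"],
    "Category": ["gadgets", "gadgets", "music"],
    "Partner.Revenue": [0.0, 0.0, 2.5],
    "Video.Duration..sec.": [1184, 1184, 200],
})

summary = (
    small.groupby("Video.ID", as_index=False)
         .agg(sumr=("Partner.Revenue", "sum"),
              len=("Video.Duration..sec.", "mean"),
              cat=("Category", "first"))  # assumes one Category per Video.ID
)
print(summary)
```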
|
Create new column filled with random elements based on a categorical column
Tag : python , By : codelurker
Date : March 29 2020, 07:55 AM
I tried to find a vectorised solution but could not. This one iterates over the index and computes New1 and New2 row by row: New1 is a random ID from another row in the same category, and New2 is a random ID from a different category. I believe this gives the result you are looking for.
for i in df.index:
    # Grab the category variable for each row.
    cat = df.loc[i, 'Cat']
    # Set column New1
    mask1 = df['Cat'] == cat
    mask2 = df.index != i
    df.at[i, 'New1'] = df[mask1 & mask2]["ID"].sample().iloc[0]
    # Set column New2
    mask3 = df['Cat'] != cat
    df.at[i, 'New2'] = df[mask3]["ID"].sample().iloc[0]
Output from two separate runs (the samples are random, so the values differ):
   ID Cat  New1  New2
0  87   A  56.0  76.0
1  56   A  87.0  36.0
2  67   A  56.0  76.0
3  76   D  36.0  87.0
4  36   D  76.0  87.0

   ID Cat  New1  New2
0  87   A  67.0  36.0
1  56   A  87.0  36.0
2  67   A  87.0  76.0
3  76   D  36.0  67.0
4  36   D  76.0  67.0
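A self-contained variant of the loop, sketched here with a seeded random state so that repeated runs give the same draws. The frame reuses the IDs and categories from the output above; the pick helper and the seed are assumptions added for illustration. Note that a category with only one row would leave nothing to sample for New1, and sample() would raise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [87, 56, 67, 76, 36],
                   "Cat": ["A", "A", "A", "D", "D"]})

rng = np.random.RandomState(0)  # assumed seed, only for reproducibility


def pick(ids, rng):
    """Return one randomly chosen value from a Series (hypothetical helper)."""
    return ids.sample(n=1, random_state=rng).iloc[0]


new1, new2 = [], []
for i in df.index:
    cat = df.loc[i, "Cat"]
    same_cat_other_row = (df["Cat"] == cat) & (df.index != i)
    other_cat = df["Cat"] != cat
    new1.append(pick(df.loc[same_cat_other_row, "ID"], rng))
    new2.append(pick(df.loc[other_cat, "ID"], rng))

# Assign whole columns at the end so they keep an integer dtype.
df["New1"] = new1
df["New2"] = new2
print(df)
```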
|
Summarise based on categorical runs
Tag : r , By : Tim Coffman
Date : March 29 2020, 07:55 AM
We can create groups using lag and cumsum and then calculate statistics for each group.
library(dplyr)
test %>%
  group_by(group = cumsum(fruit != lag(fruit, default = first(fruit)))) %>%
  summarise(fruit = first(fruit),
            duration = n(),
            mean_temp = mean(temp)) %>%
  select(-group)
# fruit duration mean_temp
# <fct> <int> <dbl>
#1 apple 2 91
#2 banana 3 101
#3 guava 4 94.8
#4 apple 3 92
#5 banana 1 92
#6 guava 1 101
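The grouping step can also be written with data.table::rleid:
group_by(group = data.table::rleid(fruit))
or, in base R, with rle:
group_by(group = with(rle(as.character(fruit)), rep(seq_along(values), lengths)))
A version that uses data.table throughout:
library(data.table)
setDT(test)[, .(duration = .N, fruit = fruit[1L],
                mean_temp = mean(temp)), by = rleid(fruit)]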
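In pandas the lag-and-cumsum trick is usually spelled with shift() and cumsum(). The sketch below uses invented data with the same column names as the R code; it is not the asker's test data.

```python
import pandas as pd

# Hypothetical data; column names follow the R answer above.
test = pd.DataFrame({
    "fruit": ["apple", "apple", "banana", "banana", "banana", "apple"],
    "temp":  [90, 92, 100, 102, 101, 95],
})

# A new run starts wherever the value differs from the previous row.
run_id = (test["fruit"] != test["fruit"].shift()).cumsum().rename("run")

summary = (
    test.groupby(run_id)
        .agg(fruit=("fruit", "first"),
             duration=("fruit", "size"),
             mean_temp=("temp", "mean"))
        .reset_index(drop=True)
)
print(summary)
```

The second "apple" run is kept separate from the first, which a plain groupby on fruit would not do.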
|