Regex to count English words as single char inside char count of Asian words
Tag : regex , By : Pavel K.
Date : March 29 2020, 07:55 AM
Hope this helps fix the issue. Whatever you are trying to achieve, this should help: to count only Hiragana + Katakana + Kanji (Japanese) characters (excluding punctuation marks):
var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫
x.match(/\w+/g).length; //Result: 1 : "Kitty"
function myCount(str) {
return str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13
alert(myCount("これは犬です。DogとPuppyもOKですね!")); //14
["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]
To cover a wider range of CJK characters (including Hangul), the character class can be extended:
function myCount(str) {
return str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ヲ-ン]|\w+/g).length;
}
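For comparison, here is a rough Python port of the first myCount above (not part of the original answer). Note that Python's \w matches Unicode word characters, including kana and kanji, so an explicit ASCII class is used for the Latin words instead of \w+:
import re

JP_OR_WORD = re.compile(r"[ぁ-ゖァ-ヺー一-龯々]|[A-Za-z0-9_]+")

def my_count(s):
    # Each Japanese character counts as 1; each run of ASCII word characters counts as 1.
    return len(JP_OR_WORD.findall(s))

print(my_count("これは猫です、けどKittyも大丈夫。"))   # 13
print(my_count("これは犬です。DogとPuppyもOKですね!"))  # 14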
Python Bag of Words clustering
Date : March 29 2020, 07:55 AM
Hope this helps you fix your problem. Just figured it out thanks to the OpenCV forums: instead of collecting descriptors in a separate list (I used one called descriptors above), add the descriptors you find directly to your bag with BOW.add(dsc):
import cv2

sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds

dictionarySize = 5
BOW = cv2.BOWKMeansTrainer(dictionarySize)

for p in training_paths:  # training_paths: list of image file paths
    image = cv2.imread(p)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # cvtColor takes a COLOR_* code, not an imread flag
    kp, dsc = sift.detectAndCompute(gray, None)
    BOW.add(dsc)

# dictionary created
dictionary = BOW.cluster()
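As a follow-up sketch (not from the original answer, and assuming a recent OpenCV build): once the vocabulary exists, OpenCV's BOWImgDescriptorExtractor can turn each image into a bag-of-words histogram over the 5 clusters.
matcher = cv2.BFMatcher(cv2.NORM_L2)
bow_extract = cv2.BOWImgDescriptorExtractor(sift, matcher)
bow_extract.setVocabulary(dictionary)  # vocabulary built by BOW.cluster() above

test_gray = cv2.cvtColor(cv2.imread(training_paths[0]), cv2.COLOR_BGR2GRAY)
keypoints = sift.detect(test_gray, None)
histogram = bow_extract.compute(test_gray, keypoints)  # shape: (1, dictionarySize)
print(histogram)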
Python: clustering similar words based on word2vec
Tag : python , By : Andrew Mattie
Date : March 29 2020, 07:55 AM
No, not really. For reference, common word2vec models trained on Wikipedia (in English) are built from around 3 billion words of text. You can use KNN (or something similar). Gensim has the most_similar function to get the closest words, and with a dimensionality reduction (like PCA or t-SNE) you can get yourself a nice cluster. (Not sure whether gensim has a t-SNE module, but sklearn does, so you can use that.) By the way, you are referring to some image, but it is not available.
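A minimal sketch of that suggestion (the vectors file path and the word list below are placeholders, not from the original answer):
import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

# Load pre-trained vectors in word2vec binary format (hypothetical path).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Closest words to a query term.
print(vectors.most_similar("cat", topn=5))

# Project a handful of words to 2-D with t-SNE to eyeball clusters.
words = ["cat", "dog", "kitten", "puppy", "car", "truck"]
X = np.array([vectors[w] for w in words])
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)
for w, (x, y) in zip(words, coords):
    print(f"{w}: ({x:.2f}, {y:.2f})")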
Clustering inside clustering, i.e. nested clustering of a data table (multiclass clustering)
Tag : python , By : user171555
Date : March 29 2020, 07:55 AM
It helps sometimes. You will need to carefully balance thresholds in textual similarity and in numerical similarity. There won't be an easy solution, and unless you have really huge data, a manual approach may be best. Textual similarity of short strings is highly unreliable.
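A rough sketch of that balancing idea (the field names, weight, and threshold below are hypothetical and would need tuning on real data):
from difflib import SequenceMatcher

def combined_distance(a, b, text_weight=0.7, numeric_scale=100.0):
    # Textual distance: 1 minus the similarity ratio of the lower-cased strings.
    text_dist = 1.0 - SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # Numerical distance, normalized by an assumed scale of the column.
    num_dist = abs(a["value"] - b["value"]) / numeric_scale
    return text_weight * text_dist + (1.0 - text_weight) * num_dist

# Records whose combined distance falls under this (hypothetical) threshold
# would be grouped into the same nested cluster.
THRESHOLD = 0.3
a = {"name": "Acme Corp", "value": 120.0}
b = {"name": "ACME Corporation", "value": 118.0}
print(combined_distance(a, b) < THRESHOLD)  # True for this pair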