sklearn.decomposition.LatentDirichletAllocation

bro's coding

givemebro 2020. 4. 28. 17:58

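Document clustering (topic modeling)

Topic modeling: the unsupervised learning task of assigning documents to topics.

LDA (Latent Dirichlet Allocation): finds the word components that make up the documents, that is, groups of words that tend to appear together (similar to PCA in that it decomposes the data into components).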
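To make the decomposition idea concrete, here is a minimal sketch on a tiny made-up corpus (the `docs` list and the choice of 2 topics are assumptions for illustration, not part of the IMDb example). It only shows the shapes involved: a document-term count matrix goes in, and LDA produces a document-topic matrix plus a topic-word matrix in `components_`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, only for illustration
docs = [
    "the actor gave a great performance in this role",
    "the director and the cast did great work",
    "a scary horror film with a killer in the house",
    "the killer hides in the house in this horror story",
]

vect = CountVectorizer()
X = vect.fit_transform(docs)            # document-term counts: (n_docs, n_words)

lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topics = lda.fit_transform(X)       # document-topic weights: (n_docs, n_topics)

print(X.shape, doc_topics.shape, lda.components_.shape)
print(np.round(doc_topics.sum(axis=1), 3))  # each document's topic weights sum to 1
```

Each row of `doc_topics` describes one document as a mixture of topics, and each row of `lda.components_` describes one topic as weights over the vocabulary.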

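import numpy as np

# Assumes imdb.npy stores a pickled (train, test) pair; NumPy >= 1.16.3 needs allow_pickle=True for that
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)

# Decode the raw bytes to str and strip the HTML line breaks
text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
y_train = imdb_train.target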

from sklearn.feature_extraction.text import CountVectorizer

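# Drop words that occur in more than 15% of the documents, then keep at most the 10,000 most frequent words
vect = CountVectorizer(max_features=10000, max_df=0.15)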

X = vect.fit_transform(text_train)

from sklearn.decomposition import LatentDirichletAllocation

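# Learn 10 topics; the 'batch' method uses all documents in every update
lda = LatentDirichletAllocation(n_components=10, learning_method='batch',
                                max_iter=25, random_state=2020)
# topics has shape (n_documents, 10): one topic mixture per document
topics = lda.fit_transform(X)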

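# Find the word with the largest weight in each topic
# fn was not defined in the original; it is the vocabulary as an array
# (use vect.get_feature_names() on scikit-learn < 1.0)
fn = np.array(vect.get_feature_names_out())
for i in range(10):
    print(fn[np.argmax(lda.components_[i])])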

role
action
show
world
horror
director
didn
family
original
black

# Find the 10 words with the largest weights in each topic

# Sort each topic's word indices by weight, largest first
best = np.argsort(lda.components_, axis=1)[:, ::-1]
best

array([[ 230, 5050, 3797, ..., 4108, 4321, 3807],
       [9896, 2646, 9474, ..., 6849, 4113, 9374],
       [9680, 9902, 9474, ..., 3457, 2892, 1241],
       ...,
       [3386, 9971, 3422, ..., 5614, 1354,  780],
       [9875, 3879, 4439, ..., 9616, 7996, 4183],
       [8095, 7946, 3767, ..., 5490, 6119, 3807]], dtype=int64)

for i in range(10):
    print(i, fn[best[i, :10]])

0 ['action' 'kids' 'game' 'animation' 'disney' 'fun' 'original' 'music' 're'
 'children']
1 ['work' 'director' 'us' 'interesting' 'without' 'feel' 'may' 'between'
 'real' 'seems']
2 ['war' 'world' 'us' 'years' 'our' 'american' 'new' 'fi' 'sci'
 'documentary']
3 ['horror' 'didn' 'thing' 'worst' 'nothing' 'actually' 'minutes' 'pretty'
 're' 'going']
4 ['john' 'cast' 'role' 'performance' 'jack' 'michael' 'horror' 'actor'
 'director' 'plays']
5 ['book' 'comedy' 'funny' 'actors' 'cast' 'script' 'role' 'actor' 'read'
 'performance']
6 ['version' 'black' 'white' 'american' 'production' 'set' 'english' 'early'
 'dvd' 'hollywood']
7 ['family' 'young' 'father' 'mother' 'son' 'woman' 'performance' 'wife'
 'girl' 'role']
8 ['woman' 'gets' 'house' 'guy' 'killer' 'wife' 'girl' 'police' 'down'
 'goes']
9 ['show' 'series' 'funny' 'tv' 'episode' 'years' 'comedy' 'old' 'now' 'saw']
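Word lists alone can be ambiguous, so it often helps to look at the documents that load most heavily on a topic. The sketch below does this on a made-up toy corpus (the `docs` list, the 2-topic setting, and the `topic` index are assumptions for illustration); the same `np.argsort` pattern should carry over to the `topics` matrix from the IMDb example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, only for illustration
docs = [
    "the actor gave a great performance in this role",
    "the director and the cast did great work",
    "a scary horror film with a killer in the house",
    "the killer hides in the house in this horror story",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topics = lda.fit_transform(X)       # (n_docs, n_topics)

topic = 0                               # topic to inspect (arbitrary choice)
# Document indices sorted by their weight on this topic, largest first
ranked = np.argsort(doc_topics[:, topic])[::-1]

for idx in ranked[:2]:
    print(round(float(doc_topics[idx, topic]), 3), docs[idx])
```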
