
bro's coding

sklearn.decomposition.LatentDirichletAllocation

givemebro 2020. 4. 28. 17:58

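Document clustering (topic modeling)

Topic modeling: an unsupervised learning task that assigns each document to one or more topics.

LDA (Latent Dirichlet Allocation): finds groups of words (topics) that tend to appear together across documents and describes each document as a mixture of those topics. Like PCA, it is a decomposition method that extracts components from the data.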
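
To make the decomposition idea concrete, here is a minimal sketch on a tiny corpus. The four sentences and the `toy_*` names are invented for illustration and are not part of the IMDb example; the point is only to show the two matrices LDA produces: topic-word weights in `components_` and a per-document topic distribution from `fit_transform`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for illustration only.
toy_docs = [
    "the actor gave a great performance in this role",
    "the director and the actor worked on the script",
    "the war documentary shows the world at war",
    "a documentary about the american war years",
]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_docs)        # (n_docs, n_words) word counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_X)   # (n_docs, n_topics)

print(toy_X.shape)                  # documents x vocabulary
print(toy_lda.components_.shape)    # topics x vocabulary
print(toy_doc_topics.sum(axis=1))   # each row is a distribution over topics
```

With a corpus this small the topics themselves are not meaningful; only the shapes and the row normalization carry over to the real data.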

 

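import numpy as np

# Load the pickled IMDb train/test splits; object arrays need allow_pickle=True (NumPy >= 1.16.3)
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)

# Decode the byte strings and strip the HTML line breaks from each review
text_train = [s.decode().replace('<br />', '') for s in imdb_train.data]
y_train = imdb_train.target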

 

from sklearn.feature_extraction.text import CountVectorizer

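# Drop words that appear in more than 15% of the documents, then keep the 10,000 most frequent of the rest
vect = CountVectorizer(max_features=10000, max_df=0.15)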
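
The effect of `max_df` is easier to see on a small example. The sketch below uses four made-up documents (not the IMDb data) and a threshold of 0.5 instead of 0.15, so the word that occurs in every document is dropped from the vocabulary while the rarer words are kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for illustration: "movie" occurs in all of them.
docs = [
    "movie good plot",
    "movie bad acting",
    "movie great music",
    "movie dull script",
]

all_words = CountVectorizer().fit(docs)
rare_only = CountVectorizer(max_df=0.5).fit(docs)  # ignore words in more than 50% of docs

print(sorted(all_words.vocabulary_))
print(sorted(rare_only.vocabulary_))
```

Removing very common words this way keeps generic terms from dominating every topic.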

 

X = vect.fit_transform(text_train)

 

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, learning_method='batch',
                                max_iter=25, random_state=2020)
topics = lda.fit_transform(X)

 

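# Find the highest-weighted word for each topic
fn = np.array(vect.get_feature_names_out())  # vocabulary; use get_feature_names() on scikit-learn < 1.0
for i in range(10):
    print(fn[np.argmax(lda.components_[i])])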
role
action
show
world
horror
director
didn
family
original
black

 

# Find the 10 highest-weighted words for each topic

# Word indices for each topic, sorted by descending weight
best = np.argsort(lda.components_, axis=1)[:, ::-1]
best
array([[ 230, 5050, 3797, ..., 4108, 4321, 3807],
       [9896, 2646, 9474, ..., 6849, 4113, 9374],
       [9680, 9902, 9474, ..., 3457, 2892, 1241],
       ...,
       [3386, 9971, 3422, ..., 5614, 1354,  780],
       [9875, 3879, 4439, ..., 9616, 7996, 4183],
       [8095, 7946, 3767, ..., 5490, 6119, 3807]], dtype=int64)
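
The `argsort(...)[:, ::-1]` idiom can be checked on a small matrix. In this sketch the `weights` array and the four-word vocabulary are made up and stand in for `lda.components_` and the real feature names.

```python
import numpy as np

# Hypothetical 2-topic, 4-word weight matrix standing in for lda.components_.
weights = np.array([
    [0.1, 3.0, 0.5, 2.0],
    [4.0, 0.2, 1.0, 0.3],
])
words = np.array(["alpha", "beta", "gamma", "delta"])  # made-up vocabulary

# argsort sorts ascending, so reversing each row puts the largest weight first.
order = np.argsort(weights, axis=1)[:, ::-1]

print(order)                 # column indices per topic, largest weight first
print(words[order[:, :2]])   # top 2 words per topic
```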

 

for i in range(10):
    print(i, fn[best[i, :10]])
0 ['action' 'kids' 'game' 'animation' 'disney' 'fun' 'original' 'music' 're'
 'children']
1 ['work' 'director' 'us' 'interesting' 'without' 'feel' 'may' 'between'
 'real' 'seems']
2 ['war' 'world' 'us' 'years' 'our' 'american' 'new' 'fi' 'sci'
 'documentary']
3 ['horror' 'didn' 'thing' 'worst' 'nothing' 'actually' 'minutes' 'pretty'
 're' 'going']
4 ['john' 'cast' 'role' 'performance' 'jack' 'michael' 'horror' 'actor'
 'director' 'plays']
5 ['book' 'comedy' 'funny' 'actors' 'cast' 'script' 'role' 'actor' 'read'
 'performance']
6 ['version' 'black' 'white' 'american' 'production' 'set' 'english' 'early'
 'dvd' 'hollywood']
7 ['family' 'young' 'father' 'mother' 'son' 'woman' 'performance' 'wife'
 'girl' 'role']
8 ['woman' 'gets' 'house' 'guy' 'killer' 'wife' 'girl' 'police' 'down'
 'goes']
9 ['show' 'series' 'funny' 'tv' 'episode' 'years' 'comedy' 'old' 'now' 'saw']
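
The document-topic matrix returned by `lda.fit_transform(X)` can also be used to group the documents themselves. The following is a rough sketch of one way to do that, using a small hand-written matrix (`topics_demo`, a hypothetical stand-in for `topics`) so that it runs on its own: each document is labeled with its highest-weighted topic, and the documents most strongly tied to a topic are found by sorting one column.

```python
import numpy as np

# Hypothetical 5-document, 3-topic matrix standing in for `topics`;
# each row sums to 1, like the output of lda.fit_transform(X).
topics_demo = np.array([
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.20, 0.20, 0.60],
    [0.70, 0.25, 0.05],
    [0.10, 0.15, 0.75],
])

# Cluster label of each document = index of its highest-weighted topic.
doc_labels = np.argmax(topics_demo, axis=1)

# Number of documents assigned to each topic.
topic_sizes = np.bincount(doc_labels, minlength=topics_demo.shape[1])

# The two documents most strongly associated with topic 0, strongest first.
top_docs_topic0 = np.argsort(topics_demo[:, 0])[::-1][:2]

print(doc_labels)
print(topic_sizes)
print(top_docs_topic0)
```

On the real data, the same indexing applied to `topics` and `text_train` would let you read the reviews that best represent a given topic.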