bro's coding

sklearn.feature_extraction.text.CountVectorizer

givemebro 2020. 4. 27. 15:28

BOW (Bag of Words): building a vocabulary

from sklearn.feature_extraction.text import CountVectorizer

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man']  # two sample documents
vect = CountVectorizer()
vect.fit(ss)  # learn the vocabulary from the documents
'''
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
'''

 

vect.vocabulary_  # single-character words and punctuation are dropped; text is lowercased
# {'am': 0, 'he': 1, 'is': 2, 'man': 3, 'me': 4, 'tom': 5}
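The vocabulary above follows from the default `token_pattern` shown in the repr, `(?u)\b\w\w+\b`, which only matches runs of two or more word characters. The following is a rough sketch of that behaviour using only the standard library; the helper names `tokenize` and `build_vocabulary` are made up for illustration, and this is not scikit-learn's actual implementation.

```python
import re

# Default token pattern taken from the CountVectorizer repr above:
# a token is two or more word characters between word boundaries.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text, then extract tokens (hypothetical helper)."""
    return re.findall(TOKEN_PATTERN, text.lower())

def build_vocabulary(docs):
    """Map each distinct token to an index in alphabetical order (hypothetical helper)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man']
print(tokenize(ss[0]))       # 'I' and the punctuation do not appear
print(build_vocabulary(ss))  # same mapping as vect.vocabulary_ for this input
```

For these two sentences the sketch yields the same mapping as `vect.vocabulary_`, which makes it easier to see why `I` and `a` are missing.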

 

# swap keys and values to list (index, word) pairs in index order
voca = vect.vocabulary_
sorted([(v, k) for k, v in voca.items()])
# [(0, 'am'), (1, 'he'), (2, 'is'), (3, 'man'), (4, 'me'), (5, 'tom')]

 

vect.transform(ss)  # sparse matrix: stores only the indices and counts of the words that actually occur
# <2x6 sparse matrix of type '<class 'numpy.int64'>'
# 	with 8 stored elements in Compressed Sparse Row format>

 

vect.transform(ss).toarray()
# array([[1, 0, 1, 0, 1, 2],
#       [0, 2, 2, 1, 0, 1]], dtype=int64)
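To see where these numbers come from, the counting step can be reproduced by hand. This is a minimal sketch assuming the vocabulary printed earlier and the default token pattern; `count_matrix` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Vocabulary copied from the vect.vocabulary_ output above.
vocabulary = {'am': 0, 'he': 1, 'is': 2, 'man': 3, 'me': 4, 'tom': 5}

def count_matrix(docs, vocabulary):
    """Return one row of word counts per document (hypothetical helper)."""
    rows = []
    for doc in docs:
        counts = Counter(re.findall(r"(?u)\b\w\w+\b", doc.lower()))
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            if word in vocabulary:  # words outside the vocabulary are ignored
                row[vocabulary[word]] = n
        rows.append(row)
    return rows

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man']
dense = count_matrix(ss, vocabulary)

# Number of non-zero cells, i.e. what a sparse format would need to store.
stored = sum(1 for row in dense for n in row if n != 0)
print(dense)
print(stored)
```

Each row is one document and each column is one vocabulary index. The non-zero count matches the "8 stored elements" reported for the sparse matrix.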

 

vect.get_feature_names_out()  # returns an ndarray (vocabulary_ is a dict); get_feature_names() was removed in scikit-learn 1.2
# array(['am', 'he', 'is', 'man', 'me', 'tom'], dtype=object)
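The feature names are mostly useful for labelling the columns of the count matrix. A small sketch, with the names and counts copied from the outputs above; `label_counts` is a hypothetical helper.

```python
# Values copied from the outputs shown above.
feature_names = ['am', 'he', 'is', 'man', 'me', 'tom']
dense = [[1, 0, 1, 0, 1, 2],
         [0, 2, 2, 1, 0, 1]]

def label_counts(names, rows):
    """Pair each non-zero count with its word, one dict per document (hypothetical helper)."""
    return [{name: n for name, n in zip(names, row) if n > 0} for row in rows]

labeled = label_counts(feature_names, dense)
for doc_index, counts in enumerate(labeled):
    print(doc_index, counts)
```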