sklearn.feature_extraction.text.CountVectorizer

bro's coding

givemebro 2020. 4. 27. 15:28

BOW (Bag of Words): building a vocabulary

from sklearn.feature_extraction.text import CountVectorizer

ss=['I am Tom. Tom is me!','He is Tom. He is a man']
vect=CountVectorizer()
vect.fit(ss)

'''
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
'''

vect.vocabulary_ # single-character words ('I', 'a') and punctuation are dropped; everything is lowercased

# {'am': 0, 'he': 1, 'is': 2, 'man': 3, 'me': 4, 'tom': 5}
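The missing 'i' and 'a' follow from the default token_pattern shown in the repr above, which only matches tokens of two or more word characters. A minimal sketch using only the standard `re` module illustrates this; the second pattern, `KEEP_SINGLE_CHARS`, is a hypothetical alternative introduced here for comparison and is not part of the original post.

```python
import re

# Default token_pattern of CountVectorizer, as printed in the repr.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative that also keeps one-character tokens.
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"

# lowercase=True means the text is lowercased before tokenizing.
text = "I am Tom. Tom is me!".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(KEEP_SINGLE_CHARS, text)

print(default_tokens)  # 'i' is dropped
print(all_tokens)      # 'i' is kept
```

Passing a pattern like the second one as `token_pattern=` to CountVectorizer should keep one-character words in the vocabulary, if that is what the task needs.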

# swap keys and values, then sort by column index
voca = vect.vocabulary_
sorted((v, k) for k, v in voca.items())

# [(0, 'am'), (1, 'he'), (2, 'is'), (3, 'man'), (4, 'me'), (5, 'tom')]

vect.transform(ss) # sparse matrix: stores only the indices and values of the nonzero entries

# <2x6 sparse matrix of type '<class 'numpy.int64'>'
# 	with 8 stored elements in Compressed Sparse Row format>

vect.transform(ss).toarray()

# array([[1, 0, 1, 0, 1, 2],
#       [0, 2, 2, 1, 0, 1]], dtype=int64)
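To see where these counts come from, the same matrix can be rebuilt by hand. The following is a rough sketch with the standard library only, assuming the default settings (lowercasing, the default token pattern, and an alphabetically sorted vocabulary); it is not how scikit-learn implements the transform internally, and the names `tokenized`, `vocab_terms`, and `matrix` are illustrative.

```python
import re
from collections import Counter

ss = ['I am Tom. Tom is me!', 'He is Tom. He is a man']

# Tokenize: lowercase, then keep tokens of two or more word characters.
pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [pattern.findall(s.lower()) for s in ss]

# Vocabulary: each distinct token mapped to a column index, in sorted order.
vocab_terms = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {term: idx for idx, term in enumerate(vocab_terms)}

# One row per document, one column per vocabulary term.
matrix = []
for doc in tokenized:
    counts = Counter(doc)  # missing terms count as 0
    matrix.append([counts[term] for term in vocab_terms])

print(vocabulary)
print(matrix)
```

Each row is one sentence and each column is one word, so the 2 in the last column of the first row is the two occurrences of 'tom' in the first sentence.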

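vect.get_feature_names_out() # returns an array of feature names (vocabulary_ is a dict); replaces get_feature_names(), which was removed in scikit-learn 1.2

# array(['am', 'he', 'is', 'man', 'me', 'tom'], dtype=object)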
