Posts in the 'All categories' category (page 49)

All categories (688)

bro's coding

sklearn.TfidfVectorizer(tokenizer=twitter_tag.morphs).LogisticRegression

from konlpy.tag import Twitter, Okt
from sklearn.feature_extraction.text import TfidfVectorizer
# prepare the data
tfidf=TfidfVectorizer(tokenizer=twitter_tag.morphs,min_df=3)
X_train=tfidf.fit_transform(text_train)
X_test=tfidf.transform(text_test)
# model
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train,y_train)
model.score(X_test,y_test)
# weights
w=model.co..

[AI]/python.sklearn 2020. 4. 29. 12:47

konlpy.tag.Twitter, Okt.basic

# phrases
twitter_tag.phrases('겨울이 가고 어느덧 봄이 오는지 어제는 봄비가 많이 내렸습니다')
# ['겨울', '어제', '봄비']
from konlpy.tag import Twitter, Okt
# morphological analysis
twitter_tag.morphs('홍길동은 조선시대 사람이죠?')
# ['홍길동', '은', '조선시대', '사람', '이', '죠', '?']
# nouns
twitter_tag.nouns('겨울이 가고 어느덧 봄이 오는지 어제는 봄비가 많이 내렸습니다')
# ['겨울', '봄', '어제', '봄비']
# part-of-speech types
twitter_tag.pos('겨울이 가고 어느덧 봄이 오는지 어제는 봄비가 많이 내렸습니다')
'''[('겨울', 'Noun'), ('이', 'Josa'), ('가고', '..

[IT]/python 2020. 4. 29. 12:03

KoNLPy (pronounced "ko-en-el-pie", a Korean morphological analyzer).install

https://konlpy.org/ko/latest/install/
Installation - KoNLPy 0.5.2 documentation
Ubuntu. Supported: Xenial(16.04.3 LTS), Bionic(18.04.3 LTS), Disco(19.04), Eoan(19.10)
Install dependencies
# Install Java 1.8 or up
$ sudo apt-get install g++ openjdk-8-jdk python3-dev python3-pip curl
Install KoNLPy
$ python3 -m pip install --upgrade pip
$ p
konlpy.org
# test code
from konlpy.tag import Kkma
Kkma_pos = Kkma() ..

[IT]/python 2020. 4. 29. 10:20

sklearn.decomposition.LatentDirichletAllocation

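Document clustering (topic modeling)
Topic modeling: an unsupervised learning task that assigns documents to topics.
LDA (Latent Dirichlet Allocation): finds the word components that make up the documents (similar to PCA).
import numpy as np
# the saved objects are pickled, so NumPy 1.16.3 and later need allow_pickle=True
imdb_train, imdb_test = np.load('imdb.npy', allow_pickle=True)
text_train = [s.decode().replace(' ', '') for s in imdb_train.data]
y_train = imdb_train.target
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=10000, max_df=0.15)
X = vect.fit_tr..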

[AI]/python.sklearn 2020. 4. 28. 17:58
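
For readers who want the model behind the LDA post above, the usual generative description can be written as follows. This is only a sketch in standard textbook notation; the symbols (alpha, beta, theta, phi, z, w) do not appear in the post and are introduced here for illustration.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the $n$-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Fitting LDA amounts to estimating the per-document mixtures and per-topic word distributions from the observed word counts, which is why the input in the post is a CountVectorizer matrix.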

sklearn.feature_extraction.text.CountVectorizer.ngram.LogisticRegression.print only the two-word features

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(ngram_range=(1,2))
X_train=vect.fit_transform(text_train)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train,y_train)
# extract the features made up of two words
fn=np.array(vect.get_feature_names())
mask=np.array([s.find(' ')>=0 for s in..

[AI]/python.sklearn 2020. 4. 28. 16:24

sklearn.feature_extraction.text.CountVectorizer.applying ngram_range

n-gram: a vocabulary built from sequences of n consecutive words.
e.g. s='I am Tam' gives the 2-grams [I am], [am Tam]
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(ngram_range=(1,2))
X_train=vect.fit_transform(text_train)
len(vect.get_feature_names())
1522634
# quick check of the entries made up of two words #1
# count=0
# for key,value in vect.vocabulary_.items():
#     # value=vect.vocabulary_[key]
#     if len(key.split())>1:
#         print(key,value..

[AI]/python.sklearn 2020. 4. 28. 15:23
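
The 2-gram example in the post above ('I am Tam' giving [I am], [am Tam]) can be reproduced without scikit-learn. The following is a minimal sketch in plain Python; the helper name `word_ngrams` is hypothetical, and it splits on whitespace only, whereas CountVectorizer by default also lowercases the text and drops single-character tokens such as 'I'.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

s = "I am Tam"
unigrams_and_bigrams = word_ngrams(s, 1, 2)
# same idea as the posts' mask: an entry with a space is made of two words
bigrams_only = [g for g in unigrams_and_bigrams if " " in g]

print(unigrams_and_bigrams)  # ['I', 'am', 'Tam', 'I am', 'am Tam']
print(bigrams_only)          # ['I am', 'am Tam']
```

Passing the range (1, 2) mirrors `ngram_range=(1,2)`: single words and word pairs end up in the same vocabulary, which is why the feature count grows so quickly.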

sklearn.feature_extraction.text.TfidfTransformer.applying LogisticRegression

from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer(min_df=5)
vect.fit(text_train)
X_train=vect.transform(text_train)
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train,y_train)
X_test=vect.transform(text_test)
display(model.score(X_test,y_test),model.coef_)
w=model.coef_[0]
index_small=np.argsort(w)[:20]
index_big=np.arg..

[AI]/python.sklearn 2020. 4. 28. 12:54
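
The excerpt above sorts the model coefficients to find the features with the smallest weights. As a rough illustration of that idea, here is a sketch in plain Python that picks both ends of the ordering. The function name `extreme_features` and the feature names and weights are made up for the example; they are not taken from the post or from a fitted model.

```python
def extreme_features(names, weights, k):
    """Return the k features with the smallest and the k with the largest weights."""
    # indices ordered by ascending weight, like np.argsort(weights)
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    smallest = [names[i] for i in order[:k]]
    largest = [names[i] for i in order[-k:]]
    return smallest, largest

# hypothetical vocabulary and coefficients, for illustration only
names = ["awful", "boring", "plot", "great", "excellent"]
weights = [-2.1, -1.4, 0.1, 1.8, 2.3]

smallest, largest = extreme_features(names, weights, 2)
print(smallest)  # ['awful', 'boring']
print(largest)   # ['great', 'excellent']
```

In a binary sentiment model, the most negative weights pull a review toward one class and the most positive toward the other, so these two lists are a quick sanity check on what the classifier learned.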

sklearn.feature_extraction.text.TfidfTransformer

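https://ko.wikipedia.org/wiki/Tf-idf
tf-idf - Wikipedia, the free encyclopedia
TF-IDF (Term Frequency - Inverse Document Frequency) is a weight used in information retrieval and text mining: a statistic that indicates how important a word is within a particular document, given a collection of documents. It can be used to extract a document's key terms, to rank search engine results, or to measure how similar documents are. TF (term frequency,
ko.wikipedia.org
# apply tf-idf
# apply term frequency and inverse document frequency
# A word appears many times in one document. There must be a reason for that..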

[AI]/python.sklearn 2020. 4. 28. 11:22
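
To make the TF-IDF definition quoted above concrete, here is a small sketch in plain Python. It assumes the textbook variant (raw term count times log(N / document frequency)); the function name `tf_idf` and the sample sentences are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf weights with tf = raw count and idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# made-up mini corpus
docs = ["the movie was great great", "the movie was awful", "the plot was thin"]
w = tf_idf(docs)
for term in ("the", "movie", "great"):
    print(term, round(w[0][term], 3))
```

A word that occurs in every document ("the") gets weight 0, while a word repeated in one document and absent elsewhere ("great") gets the highest weight, which matches the intuition in the post.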

sklearn.feature_extraction.text.CountVectorizer.applying stop_words

# apply stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
num_of_words=[]
scores_BernoulliNB=[]
vect=CountVectorizer(stop_words='english')
vect.fit(text_train)
num_of_words.append(len(vect.get_feature_names()))
X_train=vect.transform(text_train)
X_test=vect.transform(text_test)
model=BernoulliNB()
model.fit(X_train,y_train)
scores_Ber..

[AI]/python.sklearn 2020. 4. 28. 10:33

sklearn.feature_extraction.text.CountVectorizer.observing the effect of varying max_df

Applying stop words: articles, demonstrative pronouns, and other words used habitually, such as where, when, the, it, etc.
max_df: drops terms that appear in too many documents (given as a ratio)
stop_words: specifies the list of stop words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
num_of_words=[]
scores_BernoulliNB=[]
max_df=np.arange(0.1,1,0.1)
for df in max_df:
    vect=CountVectorizer(max_df=df)
    vect.fit(text_train)
    num_of_words.append(len(vect.get_fe..

[AI]/python.sklearn 2020. 4. 28. 10:25
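
The effect of max_df described in the post above can be seen on a toy corpus. This is a minimal sketch in plain Python, not scikit-learn's implementation; the function name `filter_by_max_df` and the sentences are made up for illustration. It keeps a term only if the share of documents containing it does not exceed the given ratio.

```python
from collections import Counter

def filter_by_max_df(docs, max_df):
    """Return the sorted vocabulary of terms found in at most max_df (a ratio) of the documents."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter()
    for tokens in tokenized:
        df.update(tokens)
    return sorted(t for t, c in df.items() if c <= max_df * n_docs)

# made-up mini corpus
docs = [
    "the film was great",
    "the film was awful",
    "the plot was thin",
    "the acting was great",
]
for ratio in (0.25, 0.5, 1.0):
    vocab = filter_by_max_df(docs, ratio)
    print(ratio, len(vocab), vocab)
```

Lowering the ratio removes the words shared by most documents first ("the", "was"), which is the same effect a stop word list aims for, but derived from the corpus itself.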
