R - Crawling Practice ②

해피밀세트 2020. 4. 27. 15:54

1. Positive/negative review analysis based on movie ratings (Les Misérables)

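# Load the packages used throughout this post
library(rvest)       # read_html(), html_nodes(), html_text(), html_table()
library(stringr)     # str_match()
library(KoNLP)       # SimplePos22(), SimplePos09()
library(wordcloud2)  # wordcloud2()

# Extract the rating, review text, author, and date
text <- c()
p <- c()
name <- c()
time <- c()
for (i in 1:10) {
  url <- paste0("https://movie.naver.com/movie/point/af/list.nhn?st=mcode&sword=89755&target=after&page=", i)
  html <- read_html(url, encoding = 'cp949')

  # Review text: strip whitespace characters and the fixed labels around it
  comment <- html_nodes(html, ".title") %>% html_text()
  comment <- gsub('\n', '', comment)
  comment <- gsub('\t', '', comment)
  comment <- gsub('레미제라블별점 - 총 10점 중', '', comment)  # title and "out of 10" label
  comment <- gsub('[[:digit:]]{1}', '', comment)            # removes every digit, including digits in the review
  comment <- gsub('신고', '', comment)                       # "report" link text
  text <- c(text, comment)

  # Rating
  point <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em') %>% html_text()
  p <- c(p, point)

  # Author id
  id <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[3]/a') %>% html_text()
  name <- c(name, id)

  # Date written
  date <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[3]/text()') %>% html_text()
  time <- c(time, date)
}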
# Split into positive reviews (po) and negative reviews (ne)
df <- data.frame('point' = p, 'comment' = text, 'id' = name, 'date' = time, stringsAsFactors = FALSE)
str(df)
df$point <- as.integer(df$point)
po <- df[df$point >= 8, 'comment']
ne <- df[df$point < 8, 'comment']
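The split above depends on converting the scraped ratings to integers first. While `point` is still text, R compares it as a string, so "10" would not count as greater than or equal to 8. A minimal sketch of the same step on a toy data frame (the rows are invented for illustration):

```r
# Toy stand-in for the crawled data; these rows are invented for illustration.
df <- data.frame(
  point   = c("10", "7", "8", "3"),   # scraped ratings arrive as text
  comment = c("great songs", "too long", "moving story", "boring"),
  stringsAsFactors = FALSE
)

# Convert the text ratings to integers before comparing them numerically.
df$point <- as.integer(df$point)

# Ratings of 8 or higher count as positive, everything else as negative.
po <- df[df$point >= 8, "comment"]
ne <- df[df$point <  8, "comment"]

print(po)
print(ne)
```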
# Show the results as word clouds
# Positive reviews
po22 <- SimplePos22(po)
po22
word_po22 <- table(as.vector(na.omit(str_match(po22, "([A-Za-z가-힣]+)/NC")[,2])))
po09 <- SimplePos09(po)
po09
word_po09 <- table(as.vector(na.omit(str_match(po09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_po22)
wordcloud2(word_po09)

# Negative reviews
ne22 <- SimplePos22(ne)
ne22
word_ne22 <- table(as.vector(na.omit(str_match(ne22, "([A-Za-z가-힣]+)/NC")[,2])))
ne09 <- SimplePos09(ne)
ne09
word_ne09 <- table(as.vector(na.omit(str_match(ne09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_ne22)
wordcloud2(word_ne09)
Positive reviews (rating of 8 or higher)
Negative reviews (rating below 8)
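The noun counts behind these word clouds come from a regular expression that captures the letters directly before a noun tag. A rough sketch of that step in base R, without KoNLP or stringr; the tagged strings below are hand-written in a "word/TAG" style for illustration and are not real KoNLP output:

```r
# Hand-written strings in a "word/TAG" style; invented for illustration,
# not real KoNLP output.
tagged <- c("영화/NC+가/JC", "감동/NC+적/XS", "영화/NC+를/JC", "보/PV+다/EF")

# Capture the run of letters directly before "/NC".
m <- regmatches(tagged, regexec("([A-Za-z가-힣]+)/NC", tagged))

# Keep the capture group (second element) wherever a match exists.
nouns <- unlist(lapply(m, function(x) if (length(x) >= 2) x[2] else NULL))

# Count how often each noun occurs; the post passes a table like this
# to wordcloud2().
freq <- table(nouns)
print(freq)
```

Like `str_match()`, `regexec()` keeps only the first match in each string, so a string that contains several tagged nouns contributes one word.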

2. Stock analysis reports

2-1 Crawl the table as-is (all stocks)

# Switch the locale to English while crawling
Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:100) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))
}
# Restore the system default locale
Sys.setlocale("LC_ALL")
stock
View(stock)
# Drop the attachment column (column 4)
stock <- stock[-4]
View(stock)

# Remove the truncation ellipsis (...) from the titles
stock$제목 <- gsub("\\.{2,}", "", stock$제목)
View(stock)

# Show the words in the titles as a word cloud
stock22 <- SimplePos22(stock$제목)
stock22
word_stock22 <- table(as.vector(na.omit(str_match(stock22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_stock22)
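The pattern `\\.{2,}` used on the titles matches two or more consecutive dots, so a single dot inside a title is left alone. A small check on invented titles:

```r
# Invented titles for illustration; two end with a truncation ellipsis.
titles <- c("Earnings recovery expected...", "1Q20 review..", "Version 1.2 released")

# Remove runs of two or more dots; the single dot in "1.2" is kept.
cleaned <- gsub("\\.{2,}", "", titles)
print(cleaned)
```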

2-2 Extract only the rows whose stock name is LG화학 (LG Chem)

# Crawl 1,000 pages this time
Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:1000) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))
}
Sys.setlocale("LC_ALL")
stock <- stock[-4]
stock$제목 <- gsub("\\.{2,}", "", stock$제목)

# Keep only the titles of reports on LG화학
lg <- stock[stock$종목명 == 'LG화학', '제목']

# Tag the text and show it as a word cloud
lg22 <- SimplePos22(lg)
word_lg22 <- table(as.vector(na.omit(str_match(lg22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_lg22)
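With 1,000 pages, growing `stock` with `rbind()` on every iteration copies the accumulated table each time. One common alternative is to store each page in a list and bind once at the end. A sketch of that pattern, where `fetch_page()` is a hypothetical stand-in that fabricates a small table instead of downloading anything:

```r
# Hypothetical stand-in for "download and parse one page"; it fabricates a
# small table instead of touching the network.
fetch_page <- function(i) {
  data.frame(page = i, title = paste("report", i), stringsAsFactors = FALSE)
}

# Store each page's table in a pre-sized list instead of calling rbind()
# inside the loop.
pages <- vector("list", 5)
for (i in seq_along(pages)) {
  pages[[i]] <- fetch_page(i)
  # Sys.sleep(0.5)  # optional pause between real requests
}

# Bind all pages together in a single step.
stock <- do.call(rbind, pages)
print(nrow(stock))
```

In the real crawl, the body of `fetch_page()` would be the `read_html()` and `html_table()` calls from the loop above.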

2-3 Word cloud for any stock you choose

# Only name needs to change
name <- '삼성전자'

Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:1000) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))
}
Sys.setlocale("LC_ALL")

stock <- stock[-4]
stock$제목 <- gsub("\\.{2,}", "", stock$제목)
View(stock)

# Report titles for the chosen stock
x <- stock[stock$종목명 == name, '제목']

# Word cloud from the 22-tag analysis
x22 <- SimplePos22(x)
word_x22 <- table(as.vector(na.omit(str_match(x22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_x22)

# Word cloud from the 9-tag analysis
x09 <- SimplePos09(x)
word_x09 <- table(as.vector(na.omit(str_match(x09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_x09)
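Since only the stock name changes between runs, the lookup could be wrapped in a small function so that one crawled table is reused for several stocks. A sketch on a toy table; `titles_for()` is a hypothetical helper, and the English column names and rows are invented stand-ins for 종목명 and 제목:

```r
# Toy stand-in for the crawled table. The English column names and all rows
# are invented for illustration; the crawled table uses 종목명 and 제목.
stock <- data.frame(
  stock_name = c("Samsung Electronics", "LG Chem", "Samsung Electronics"),
  title      = c("Memory prices recover", "Battery outlook", "Foundry update"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the report titles for one stock name.
# which() drops rows where the comparison is NA.
titles_for <- function(stock, name) {
  stock[which(stock$stock_name == name), "title"]
}

samsung <- titles_for(stock, "Samsung Electronics")
lg      <- titles_for(stock, "LG Chem")
print(samsung)
print(lg)
```

Each result can then go through the same tagging and `wordcloud2()` steps without crawling again.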
