R - Crawling Practice ②

해피밀세트 2020. 4. 27. 15:54

1. Analyzing positive/negative reviews by movie rating (Les Misérables)

# Packages used throughout this post
library(rvest)       # read_html(), html_nodes(), html_text(), html_table(), %>%
library(stringr)     # str_match()
library(KoNLP)       # SimplePos22(), SimplePos09()
library(wordcloud2)  # wordcloud2()

# Extract the rating, review text, author, and date
text <- c()
p <- c()
name <- c()
time <- c()
for (i in 1:10) {
  url <- paste0("https://movie.naver.com/movie/point/af/list.nhn?st=mcode&sword=89755&target=after&page=", i)
  html <- read_html(url, encoding = 'cp949')

  # Review text: strip newlines, tabs, the rating label, digits, and the "신고" (report) link text
  comment <- html_nodes(html, ".title") %>% html_text()
  comment <- gsub('\n', '', comment)
  comment <- gsub('\t', '', comment)
  comment <- gsub('레미제라블별점 - 총 10점 중', '', comment)
  comment <- gsub('[[:digit:]]{1}', '', comment)
  comment <- gsub('신고', '', comment)
  text <- c(text, comment)

  # Rating
  point <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[2]/div/em') %>% html_text()
  p <- c(p, point)

  # Author id
  id <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[3]/a') %>% html_text()
  name <- c(name, id)

  # Date written
  date <- html_nodes(html, xpath = '//*[@id="old_content"]/table/tbody/tr/td[3]/text()') %>% html_text()
  time <- c(time, date)
}
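To make the cleanup steps in the loop easier to follow, here is a minimal sketch that applies the same gsub() chain to a single string. The value of raw is made up for illustration and only approximates what html_text() might return for one review cell. Note that removing every digit also strips digits that belong to the review itself.

```r
# Hypothetical raw text for one review cell (an assumption, not real page output)
raw <- "\n\t\t레미제라블별점 - 총 10점 중9\n\t\t정말 감동적인 영화 신고\n\t"

clean <- gsub("[\n\t]", "", raw)                       # drop newlines and tabs in one pass
clean <- gsub("레미제라블별점 - 총 10점 중", "", clean)  # drop the fixed rating label
clean <- gsub("[[:digit:]]", "", clean)                # drop digits (including any inside the review)
clean <- gsub("신고", "", clean)                        # drop the "report" link text
clean <- trimws(clean)                                 # trim leftover leading/trailing spaces

print(clean)
```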
# Split into positive reviews (po) and negative reviews (ne)
df <- data.frame('point' = p, 'comment' = text, 'id' = name, 'date' = time, stringsAsFactors = FALSE)
str(df)
df$point <- as.integer(df$point)
po <- df[df$point >= 8, 'comment']
ne <- df[df$point < 8, 'comment']
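The split on a rating of 8 can be checked on a tiny data frame before running it on crawled data. This is only a sketch; the ratings and comments below are invented for illustration.

```r
# Toy data standing in for the crawled reviews (values are made up)
df <- data.frame(
  point   = c("10", "7", "8", "3"),
  comment = c("great", "so-so", "good", "boring"),
  stringsAsFactors = FALSE
)

# Ratings are scraped as text, so convert them before comparing
df$point <- as.integer(df$point)

po <- df[df$point >= 8, "comment"]  # positive: rating 8 or higher
ne <- df[df$point < 8, "comment"]   # negative: rating below 8

print(po)
print(ne)
```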
# Draw the word clouds
# Positive reviews: tag with SimplePos22 / SimplePos09, keep the nouns, count them
po22 <- SimplePos22(po)
po22
word_po22 <- table(as.vector(na.omit(str_match(po22, "([A-Za-z가-힣]+)/NC")[,2])))
po09 <- SimplePos09(po)
po09
word_po09 <- table(as.vector(na.omit(str_match(po09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_po22)
wordcloud2(word_po09)

# Negative reviews: same steps
ne22 <- SimplePos22(ne)
ne22
word_ne22 <- table(as.vector(na.omit(str_match(ne22, "([A-Za-z가-힣]+)/NC")[,2])))
ne09 <- SimplePos09(ne)
ne09
word_ne09 <- table(as.vector(na.omit(str_match(ne09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_ne22)
wordcloud2(word_ne09)
Positive reviews (rating 8 or higher)
Negative reviews (rating below 8)
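The noun-counting step can be tried without KoNLP by feeding the pattern some hand-written tokens. This is a rough sketch in base R: the "word/TAG" strings below are hypothetical and real SimplePos22() output may be shaped differently, and regexec()/regmatches() are used here in place of str_match() so that the snippet needs no packages. A range like [A-z] also matches the ASCII punctuation between Z and a (such as _), so the sketch spells out [A-Za-z].

```r
# Hypothetical tagged tokens (an assumption about the shape, not actual KoNLP output)
tagged <- c("영화/NC", "음악/NC", "보/PV", "영화/NC", "good_movie/NC")

# Capture the letters or Hangul syllables directly before the /NC tag
pattern <- "([A-Za-z가-힣]+)/NC"
m <- regmatches(tagged, regexec(pattern, tagged))

# Take the capture group where there is a match, NA otherwise
nouns <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_, character(1))

# Frequency table of the extracted nouns, as passed to wordcloud2() in the post
freq <- table(as.vector(na.omit(nouns)))
print(freq)
```

With [A-Za-z] the last token contributes "movie"; with [A-z] the underscore would be accepted and the whole "good_movie" would be captured instead.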

2. Stock analysis reports

2-1 Crawling the table as it appears (all stocks)

# Switch the locale to English while the tables are parsed; it is reset after the loop
Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:100) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))  # the first table on the page holds the report list
}
Sys.setlocale("LC_ALL")
stock
View(stock)

# Drop the attachment (첨부) column
stock <- stock[-4]
View(stock)

# Remove the truncation dots (...) from the titles (제목)
stock$제목 <- gsub("\\.{2,}", "", stock$제목)
View(stock)

# Show the words in the titles as a word cloud
stock22 <- SimplePos22(stock$제목)
stock22
word_stock22 <- table(as.vector(na.omit(str_match(stock22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_stock22)
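The pattern "\\.{2,}" used on the titles matches runs of two or more dots, so a single period inside a title is left alone. A small sketch with made-up titles shows the difference:

```r
# Made-up report titles for illustration
titles <- c("실적 개선 기대...", "Ver. 2 전망..", "목표주가 상향")

# Remove runs of two or more dots; a lone "." is kept
cleaned <- gsub("\\.{2,}", "", titles)
print(cleaned)
```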

2-2 Extracting only the rows whose stock name (종목명) is LG화학

Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:1000) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))
}
Sys.setlocale("LC_ALL")
stock <- stock[-4]
stock$제목 <- gsub("\\.{2,}", "", stock$제목)

# Keep only the titles of reports on LG화학
lg <- stock[stock$종목명 == 'LG화학', '제목']

# Clean the text and draw the word cloud
lg22 <- SimplePos22(lg)
word_lg22 <- table(as.vector(na.omit(str_match(lg22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_lg22)
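The row filter on 종목명 can be tried on a small hand-made data frame first. The rows below are invented; only the column names 종목명 and 제목 come from the crawled table.

```r
# Toy stand-in for the crawled table (rows are made up)
stock <- data.frame(
  종목명 = c("LG화학", "삼성전자", "LG화학"),
  제목 = c("배터리 성장 기대", "반도체 회복", "실적 개선"),
  stringsAsFactors = FALSE
)

# Keep the 제목 values of the rows whose 종목명 equals "LG화학"
lg <- stock[stock$종목명 == "LG화학", "제목"]
print(lg)
```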

2-3 Word cloud for any stock you choose

# Only the value of name needs to change
name <- '삼성전자'

Sys.setlocale("LC_ALL", "English")
stock <- NULL
for (i in 1:1000) {
  html <- read_html(paste0("https://finance.naver.com/research/company_list.nhn?&page=", i), encoding = "cp949")
  t <- html_nodes(html, "table")
  stock <- rbind(stock, html_table(t[[1]]))
}
Sys.setlocale("LC_ALL")

# Drop the attachment column and the truncation dots in the titles
stock <- stock[-4]
stock$제목 <- gsub("\\.{2,}", "", stock$제목)
View(stock)

# Titles of the reports on the chosen stock
x <- stock[stock$종목명 == name, '제목']

# Word cloud from the SimplePos22 tags
x22 <- SimplePos22(x)
word_x22 <- table(as.vector(na.omit(str_match(x22, "([A-Za-z가-힣]+)/NC")[,2])))
wordcloud2(word_x22)

# Word cloud from the SimplePos09 tags
x09 <- SimplePos09(x)
word_x09 <- table(as.vector(na.omit(str_match(x09, "([A-Za-z가-힣]+)/N")[,2])))
wordcloud2(word_x09)
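Since only name changes between runs, the crawl-and-filter part could be wrapped in a function. The sketch below is one possible shape, not the code used in this post: fetch_page() is a hypothetical stub that returns fake rows in place of the read_html() and html_table() calls, and titles_for() is a name made up here. It also collects the pages in a list and binds them once with do.call(rbind, ...) instead of growing stock inside the loop.

```r
# Hypothetical stand-in for read_html() + html_table(): returns one fake page
fetch_page <- function(i) {
  data.frame(
    종목명 = c("삼성전자", "LG화학"),
    제목 = paste0(c("반도체 전망 ", "배터리 전망 "), i),
    stringsAsFactors = FALSE
  )
}

# Collect the given pages and return the titles for one stock name
titles_for <- function(name, pages, fetch = fetch_page) {
  tables <- lapply(pages, fetch)   # one data frame per page
  stock <- do.call(rbind, tables)  # bind all pages in a single step
  stock[stock$종목명 == name, "제목"]
}

x <- titles_for("삼성전자", 1:3)
print(x)
```

In real use, fetch would be replaced by a function that downloads and parses one page, and the returned titles would go on to SimplePos22() and wordcloud2() as above.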
