R - Crawling Practice ③

해피밀세트 2020. 5. 4. 21:08

1. Crawling the deep learning part of WikiDocs

# Load the required libraries

library(RSelenium)
library(rvest)
library(stringr)
library(KoNLP)
library(wordcloud2)

# Connect to WikiDocs with RSelenium (a Selenium server must already be listening on port 4445)

remdr <- remoteDriver(remoteServerAddr = 'localhost', port = 4445L,
                      browserName = 'chrome')
remdr$open()
remdr$navigate("https://wikidocs.net/book/2155")

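# Get the nodes for the deep learning part

source <- remdr$getPageSource()[[1]]
html <- read_html(source)
# Collect the attributes of every link in the table of contents
x <- html_nodes(html, '.list-group.list-group-toc > a') %>%
  html_attrs()
df <- data.frame(x)    # one column per link, one row per attribute
df <- t(df)            # transpose: one row per link
rownames(df) <- NULL
df <- df[, 'class']    # keep only the 'class' column
df <- data.frame(df)
df <- df[-1, ]         # drop the first entry (this returns a plain vector)
df <- df[47:56]        # keep entries 47 to 56, the ten deep learning pages
View(df)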

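# Keep only the numeric part

# Convert to a character vector, then pull out each run of digits.
# simplify = TRUE returns a character matrix with one row per input string.
deep <- as.character(df)
deep <- str_extract_all(deep, "[[:digit:]]+", simplify = TRUE)
deep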
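
As an aside, here is a minimal, self-contained sketch of the same idea (read link attributes, then keep only the digits) that runs without a browser. The HTML snippet, the `href` format, and the names `sample_html`, `links`, `hrefs`, and `page_ids` are all made up for illustration; the actual WikiDocs markup may well differ.

```r
library(rvest)
library(stringr)

# Hypothetical markup: the real WikiDocs table of contents may look different.
sample_html <- '
<div class="list-group list-group-toc">
  <a class="list-group-item" href="javascript:page(22647)">Chapter 1</a>
  <a class="list-group-item" href="javascript:page(22882)">Chapter 2</a>
</div>'

# Same CSS selector as in the post, applied to the inline snippet
links <- read_html(sample_html) %>%
  html_nodes(".list-group.list-group-toc > a")

# html_attr() returns one character value per node for the named attribute
hrefs <- html_attr(links, "href")

# str_extract() keeps the first run of digits in each string
page_ids <- str_extract(hrefs, "[[:digit:]]+")
page_ids
```

Because `str_extract()` returns a plain character vector rather than a matrix, the result can be indexed as `page_ids[i]`.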

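# Visit each deep learning page using the numbers extracted above and collect its text

text <- c()
# deep is a matrix, so loop over its rows rather than over length(deep)
for (i in seq_len(nrow(deep))) {
  remdr$navigate(paste0("https://wikidocs.net/", deep[i, 1]))
  source <- remdr$getPageSource()[[1]]
  html <- read_html(source)
  # Paragraphs inside the page body
  x <- html_nodes(html, '#load_content > .page-content > p') %>%
    html_text()
  text <- c(text, x)
}
text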

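# Clean the text and display it as a word cloud

# Remove every occurrence of the characters '위' and '의'
text <- gsub('위', '', text)
text <- gsub('의', '', text)
# Tag parts of speech with KoNLP; flatten the result so each token is one string
deep09 <- unlist(SimplePos09(text))
# Keep the word in front of each noun tag (/N). [A-Za-z] is used instead of [A-z],
# because A-z also matches the punctuation between 'Z' and 'a'.
word_deep09 <- table(as.vector(na.omit(str_match(deep09, "([A-Za-z가-힣]+)/N")[, 2])))
wordcloud2(word_deep09)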
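
To see what the noun-extraction step does without installing KoNLP, here is a small sketch that runs on hand-written tokens. The tokens in `tagged` are hypothetical examples in a "word/TAG" style, not real `SimplePos09` output, and the names `nouns` and `noun_freq` are introduced only for this illustration. The sketch writes the Latin range as `A-Za-z`, since `A-z` also spans a few punctuation characters.

```r
library(stringr)

# Hypothetical tagged tokens in a "word/TAG" style (not real SimplePos09 output)
tagged <- c("딥러닝/N+은/J", "신경망/N+을/J", "사용/N+하/X+ㄴ다/E",
            "딥러닝/N+의/J", "매우/M")

# Column 2 of str_match() holds the first capture group; it is NA where nothing matched
nouns <- str_match(tagged, "([A-Za-z가-힣]+)/N")[, 2]

# Drop the non-matches and count how often each noun appears
noun_freq <- table(nouns[!is.na(nouns)])
sort(noun_freq, decreasing = TRUE)
```

A frequency table like `noun_freq` is the kind of input that `wordcloud2()` receives in the post.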
