hmk run dev
Crawling a Naver blog post body with jsoup (iframe)
I was trying to crawl a Naver blog with jsoup when I found that the page is nothing but an iframe tag, as shown below...
<html lang="ko"><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta name="robots" content="noindex,follow">
<meta name="referrer" content="always">
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico?3">
<link rel="alternate" type="application/rss+xml" href="https://rss.blog.naver.com/jdhrg.xml" title="RSS feed for jdhrg Blog">
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="https://blog.naver.com/NBlogWlwLayout.naver?blogId=jdhrg">
<title>서울 관광지 국립중앙박물관 혼자놀기 여행 : 네이버 블로그</title>
<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-584146299_https.js" charset="UTF-8"></script><script type="text/javascript" charset="UTF-8">
var photoContent="";
var postContent="";
var videoId = "";
var thumbnail = "";
var inKey = "";
var movieFileSize = "";
var playTime = "";
var screenSize = "";
var blogId = 'jdhrg';
var blogURL = 'https://blog.naver.com';
var eventCnt = '';
var g_ShareObject = {};
g_ShareObject.referer = "https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0";
jsMVC.setController("framesetTitleController", FramesetTitleController);
jsMVC.setController("framesetUrlController", FramesetUrlController);
jsMVC.setController("framesetMusicController", FramesetMusicController);
var oFramesetTitleController = jsMVC.getController("framesetTitleController");
var oFramesetUrlController = jsMVC.getController("framesetUrlController");
var oFramesetMusicController = jsMVC.getController("framesetMusicController");
var sTitle = document.title;
var topFrameAlert = function(message){
alert(message);
};
var topFrameConfirm = function(message){
if(confirm(message)){
return true;
} else {
return false;
}
};
</script><style type="text/css">
html{width:100%;height:100%;}
body{width:100%;height:100%;margin:0;padding:0;font-size:0;}
#mainFrame{width:100%;height:100%;margin:0;padding:0;border:0;}
#hiddenFrame{width:0;height:0;margin:0;padding:0;border:0;}
</style></head>
<body>
<iframe id="mainFrame" name="mainFrame" allowfullscreen="true" src="/PostView.naver?blogId=jdhrg&logNo=222563821741&redirect=Dlog&widgetTypeCall=true&topReferer=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0&directAccess=false" scrolling="auto" onload="oFramesetTitleController.start(self.frames['mainFrame'], self, sTitle);oFramesetTitleController.onLoadFrame();oFramesetUrlController.start(self.frames['mainFrame']);oFramesetUrlController.onLoadFrame()"></iframe>
</body></html>
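I looked into all sorts of approaches, Selenium included, but the solution turned out to be surprisingly simple.
If you look at the src attribute of the iframe tag, it holds the real address of the blog post, so you can crawl the page through that address!
Prepend "http://blog.naver.com" to it and you have the real address!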
src="/PostView.naver?blogId=jdhrg&logNo=222563821741&redirect=Dlog&widgetTypeCall=true&topReferer=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0&directAccess=false"
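The step above (turning the relative src into a full address) can be sketched with the standard library alone. This is a minimal sketch, not the code used in this post: the class name `IframeSrcResolver` and the method `resolve` are hypothetical names made up for illustration, and the base address is the one quoted above.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of jsoup or the original post.
final class IframeSrcResolver {
    // Base address of the outer frameset page, as quoted in the post.
    static final String BASE = "http://blog.naver.com";

    // Resolves an iframe src against the base address.
    // A relative src such as "/PostView.naver?..." gets the base's scheme and host;
    // a src that is already absolute is returned unchanged.
    static String resolve(String src) {
        return URI.create(BASE + "/").resolve(src).toString();
    }

    public static void main(String[] args) {
        String src = "/PostView.naver?blogId=jdhrg&logNo=222563821741";
        System.out.println(resolve(src));
    }
}
```

Compared with plain string concatenation, resolving through `URI` also copes with a src that already contains a host. When working with jsoup directly, `element.absUrl("src")` does a similar resolution against the document's base URI.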
Naver blog crawling code
I only needed the blog thumbnail (og:image), so I wrote the code as follows!
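import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void getMeta(String url) throws IOException {
    // Fetch the document at the Naver blog URL (the outer frameset page)
    Document doc = Jsoup.connect(url).get();
    // Read the real post address from the iframe tag
    Elements iframes = doc.select("iframe#mainFrame");
    String src = iframes.attr("src");
    if (src.isEmpty()) {
        // attr() returns "" when no iframe#mainFrame was found
        throw new IOException("iframe#mainFrame not found: " + url);
    }
    // Fetch the document at the real post address
    String url2 = "http://blog.naver.com" + src;
    Document doc2 = Jsoup.connect(url2).get();
    System.out.println("URL check: " + url2);
    //System.out.println("doc2 : " + doc2);
    // Extract the post number (logNo); this assumes logNo is the second query parameter
    String[] blog_logNo = src.split("&");
    String[] logNo_split = blog_logNo[1].split("=");
    String logNo = logNo_split[1];
    // Select the body of the target post (not used further, since only the thumbnail is needed)
    String real_blog_addr = "div#post-view" + logNo;
    Elements blog_element = doc2.select(real_blog_addr);
    // Get the blog thumbnail; attr() on Elements returns "" instead of throwing when nothing matches
    String og_image = doc2.select("meta[property=og:image]").attr("content");
    System.out.println("og_image : " + og_image);
}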
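The code above takes logNo by position, which only works while it is the second parameter in src. As a hedged alternative, the value can be looked up by name. This is a minimal sketch using only the standard library; `QueryParams`, `parse`, and `postViewSelector` are hypothetical names introduced for illustration, and values are kept raw (not URL-decoded).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of jsoup or the original post.
final class QueryParams {
    // Splits the query part of a src string into name/value pairs.
    // Values are kept as-is (no URL decoding); the first occurrence of a name wins.
    static Map<String, String> parse(String src) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = src.indexOf('?');
        String query = q >= 0 ? src.substring(q + 1) : src;
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            params.putIfAbsent(key, value);
        }
        return params;
    }

    // Builds the same "div#post-view" + logNo selector as the post's code,
    // or returns null when src carries no logNo parameter.
    static String postViewSelector(String src) {
        String logNo = parse(src).get("logNo");
        return logNo == null ? null : "div#post-view" + logNo;
    }

    public static void main(String[] args) {
        String src = "/PostView.naver?blogId=jdhrg&logNo=222563821741&redirect=Dlog";
        System.out.println(postViewSelector(src));
    }
}
```

Looking the parameter up by name keeps working if the parameter order in src changes, and the null return makes a missing logNo explicit instead of failing with an array index error.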