hmk run dev

Crawling a Naver Blog with jsoup (iframe)

java

hmk run dev 2021. 12. 4. 21:07

 

I was trying to crawl a Naver blog with jsoup when I found that the page I fetched was only a wrapper around an iframe tag, as shown below.

<html lang="ko"><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta name="robots" content="noindex,follow">
<meta name="referrer" content="always">
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico?3">
<link rel="alternate" type="application/rss+xml" href="https://rss.blog.naver.com/jdhrg.xml" title="RSS feed for jdhrg Blog">
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="https://blog.naver.com/NBlogWlwLayout.naver?blogId=jdhrg">




<title>서울 관광지 국립중앙박물관 혼자놀기 여행 : 네이버 블로그</title>
<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-584146299_https.js" charset="UTF-8"></script><script type="text/javascript" charset="UTF-8">
var photoContent="";
var postContent="";

var videoId 	  = "";
var thumbnail 	  = "";
var inKey 		  = "";
var movieFileSize = "";
var playTime 	  = "";
var screenSize 	  = "";

var blogId = 'jdhrg';
var blogURL = 'https://blog.naver.com';
var eventCnt = '';

var g_ShareObject = {};
g_ShareObject.referer = "https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0";


jsMVC.setController("framesetTitleController", FramesetTitleController);
jsMVC.setController("framesetUrlController", FramesetUrlController);
jsMVC.setController("framesetMusicController", FramesetMusicController);
var oFramesetTitleController = jsMVC.getController("framesetTitleController");
var oFramesetUrlController = jsMVC.getController("framesetUrlController");
var oFramesetMusicController = jsMVC.getController("framesetMusicController");
var sTitle = document.title;

var topFrameAlert = function(message){
	alert(message);
};

var topFrameConfirm = function(message){
	if(confirm(message)){
		return true;
	} else {
		return false;
	}
};
</script><style type="text/css">
    html{width:100%;height:100%;}
    body{width:100%;height:100%;margin:0;padding:0;font-size:0;}
    #mainFrame{width:100%;height:100%;margin:0;padding:0;border:0;}
    #hiddenFrame{width:0;height:0;margin:0;padding:0;border:0;}
</style></head>




<body>
    <iframe id="mainFrame" name="mainFrame" allowfullscreen="true" src="/PostView.naver?blogId=jdhrg&amp;logNo=222563821741&amp;redirect=Dlog&amp;widgetTypeCall=true&amp;topReferer=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0&amp;directAccess=false" scrolling="auto" onload="oFramesetTitleController.start(self.frames['mainFrame'], self, sTitle);oFramesetTitleController.onLoadFrame();oFramesetUrlController.start(self.frames['mainFrame']);oFramesetUrlController.onLoadFrame()"></iframe>

</body></html>

 

I looked into all sorts of approaches, Selenium included, but the solution turned out to be surprisingly simple.

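The src attribute of the iframe tag holds the real address of the blog post, so you can crawl the page by requesting that address instead.

The src value is a relative path; prefixing it with "https://blog.naver.com" gives the real address.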

src="/PostView.naver?blogId=jdhrg&amp;logNo=222563821741&amp;redirect=Dlog&amp;widgetTypeCall=true&amp;topReferer=https%3A%2F%2Fsearch.naver.com%2Fsearch.naver%3Fwhere%3Dnexearch%26sm%3Dtop_sug.pre%26fbm%3D0%26acr%3D2%26acq%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%26qdt%3D0%26ie%3Dutf8%26query%3D%25EC%2584%259C%25EC%259A%25B8%2B%25ED%2598%25BC%25EC%259E%2590%2B%25EB%2586%2580%25EA%25B8%25B0&amp;directAccess=false"
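The prefixing step can also be done by resolving the relative src against the URL of the outer page rather than concatenating strings. Below is a minimal sketch using only `java.net.URI`. The class name `IframeSrcResolver`, the method `resolve`, the outer page URL, and the shortened src value are all assumptions made for illustration and are not part of the original code. Note that the raw HTML shows `&amp;`, while the value jsoup returns from `attr("src")` already has it decoded to `&`.

```java
import java.net.URI;

// Hypothetical helper: turns the iframe's relative src into an absolute URL.
class IframeSrcResolver {

    // pageUrl: URL of the outer blog page (assumed here for illustration).
    // src: value of the iframe's src attribute, already entity-decoded
    //      (i.e. containing "&" rather than "&amp;").
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "https://blog.naver.com/jdhrg/222563821741"; // assumed outer URL
        String src = "/PostView.naver?blogId=jdhrg&logNo=222563821741"; // shortened src
        System.out.println(resolve(pageUrl, src));
    }
}
```

When the document was fetched with `Jsoup.connect(url).get()`, jsoup's own `absUrl("src")` on the iframe element should give the same kind of result, since the document's base URI is set to the fetched location.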

 

Naver blog crawling code

I only needed the blog thumbnail (og:image), so I wrote the code as follows.

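import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void getMeta(String url) throws IOException {
    // Fetch the document at the Naver blog URL (the outer wrapper page)
    Document doc = Jsoup.connect(url).get();

    // Get the real post address from the iframe tag
    Element iframe = doc.selectFirst("iframe#mainFrame");
    if (iframe == null) {
        throw new IOException("iframe#mainFrame not found at " + url);
    }
    String src = iframe.attr("src");

    // Fetch the document at the real post address
    String url2 = "https://blog.naver.com" + src;
    Document doc2 = Jsoup.connect(url2).get();
    System.out.println("URL check : " + url2);
    //System.out.println("doc2 : " + doc2);

    // Extract the post number; this assumes logNo is the second query parameter in src
    String[] blog_logNo = src.split("&");
    String[] logNo_split = blog_logNo[1].split("=");
    String logNo = logNo_split[1];

    // Select the body of the target post (unused below, since only the thumbnail is needed)
    String real_blog_addr = "div#post-view" + logNo;
    Elements blog_element = doc2.select(real_blog_addr);

    // Get the blog thumbnail from the og:image meta tag
    Element ogMeta = doc2.selectFirst("meta[property=og:image]");
    if (ogMeta == null) {
        throw new IOException("og:image meta tag not found at " + url2);
    }
    String og_image = ogMeta.attr("content");
    System.out.println("og_image : " + og_image);
}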
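
The code above takes logNo from the second `&`-separated piece of src, which only works as long as the parameters stay in that order. As a small sketch of looking the parameter up by name instead, using only the standard library (the class `QueryParam` and its `get` method are hypothetical names introduced for illustration, and the src value is shortened):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: reads one query parameter by name from a URL or path.
class QueryParam {

    // Returns the decoded value of the first parameter called `name`,
    // or null if the URL has no query string or no such parameter.
    static String get(String url, String name) {
        int q = url.indexOf('?');
        if (q < 0) {
            return null;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv[0].equals(name)) {
                return kv.length > 1 ? decode(kv[1]) : "";
            }
        }
        return null;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String src = "/PostView.naver?blogId=jdhrg&logNo=222563821741&redirect=Dlog";
        System.out.println(get(src, "logNo"));
    }
}
```

With a helper like this, `String logNo = QueryParam.get(src, "logNo");` could stand in for the three split lines, at the cost of having to handle a null result when the parameter is missing.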
