Our company's R&D department cannot access the Internet, but the company wants its R&D staff to follow the news, keep track of technology hot topics, and stay current. So a Discuz forum was set up, but it lacked content. Fortunately, the forum server sits on both networks and can reach the Internet, so I decided to write a crawler that pulls news through RSS and posts it to the company forum.
It is now implemented and crawls several websites at the same time (IT Home, Huxiu, and others). For now the data is text only, without pictures; I will try to capture images later. The results so far are pretty good. There are plenty of articles online about reading RSS, but few about crawling the body text that the links in an RSS feed point to. Since I happened to work on such a project recently, I am posting the design ideas and key code here. The source code has been tidied up and uploaded, and there is a link at the bottom of the article.
I. A brief introduction to RSS
1. What is RSS? RSS (Really Simple Syndication) is a web content syndication format. RSS is written in XML and must conform to the XML 1.0 specification. It is used to subscribe to blogs and news.
2. RSS version history: http://blogs.law.harvard.edu/tech/rssVersionHistory. There are many versions of RSS: 0.90, 0.91, 0.92, 0.93, 0.94, 1.0 and 2.0. The alternative to RSS is Atom. RSS 2.0 is mainly used in China, and Atom 0.3 is mainly used abroad. The split into two camps has caused some confusion. The RSS 2.0 specification is defined and frozen by Harvard University. Address: http://blogs.law.harvard.edu/tech/rss
3. Parsing library (jar): ROME: http://wiki.java.net/bin/view/Javawsxml/Rome
More RSS introduction: http://www.360doc.com/content/10/1224/09/1007797_80869844.shtml
II. Implementation ideas
1. An HTTP request fetches the RSS content. RSS feeds are all XML, and the ROME jar assembles them into objects.
First define a bean; the requested RSS content is assembled into this object. Choose the fields based on your project's requirements.
import java.util.Date;

public class RSSItemBean {
    private String title;
    private String author;
    private String uri;
    private String link;
    private String description;
    private Date pubDate;
    private String type;
    private String content;
    // getters and setters omitted
}
The HTTP request code follows. It uses the classes under com.sun.syndication.feed.synd (import com.sun.syndication.feed.synd.*;), so add the ROME jar to your project.
// Also needs: com.sun.syndication.io.SyndFeedInput, com.sun.syndication.io.XmlReader,
// java.net.URL, java.util.ArrayList, java.util.List

/**
 * @param url RSS feed address, such as http://www.ithome.com/rss/
 * @return all article objects
 * @throws Exception
 */
public List<RSSItemBean> getRss(String url) throws Exception {
    URL feedUrl = new URL(url);
    // SyndFeedInput converts the remotely read XML into a SyndFeedImpl instance
    SyndFeedInput input = new SyndFeedInput();
    // ROME builds an RSS or Atom instance depending on the feed type;
    // SyndFeed is the interface implemented by SyndFeedImpl
    SyndFeed feed = input.build(new XmlReader(feedUrl));
    @SuppressWarnings("unchecked")
    List<SyndEntryImpl> entries = feed.getEntries();
    List<RSSItemBean> rssItemBeans = new ArrayList<RSSItemBean>();
    for (SyndEntryImpl entry : entries) {
        RSSItemBean item = new RSSItemBean();
        item.setTitle(entry.getTitle().trim());
        item.setType(feed.getTitleEx().getValue().trim());
        item.setUri(entry.getUri());
        item.setPubDate(entry.getPublishedDate());
        item.setAuthor(entry.getAuthor());
        rssItemBeans.add(item);
    }
    return rssItemBeans;
}
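To make concrete what the parser pulls out of a feed, here is a minimal sketch that reads the same kind of fields using only the JDK's built-in DOM parser instead of ROME. It is a swapped-in technique for illustration, not the project's code: the class RssDomSketch and its methods are hypothetical names, it only handles RSS 2.0 style item elements, and it ignores Atom and namespaces, which ROME takes care of.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration; not part of the original project. */
class RssDomSketch {

    /** Returns a {title, link} pair for every item element in an RSS 2.0 document. */
    static List<String[]> parseItems(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new String[] {childText(item, "title"), childText(item, "link")});
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Text of the first descendant element with the given name, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up sample feed; a real feed would come from the HTTP response.
        String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First post</title><link>http://example.com/1</link></item>"
                + "</channel></rss>";
        for (String[] item : parseItems(sample)) {
            System.out.println(item[0] + " -> " + item[1]);
        }
    }
}
```

The link value of each item is the address that the body crawler has to visit next. For real feeds ROME is still the more robust choice, since it normalizes the many RSS and Atom variants listed earlier.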
RSS parsing is now complete: we have the news headline, article address, and summary. Simple, isn't it? But here is the question: where is the body of the news? The body is not in the RSS feed. Like Baidu News, a feed is only an aggregation; clicking a news link still takes you to the source page of the news.
Sometimes we need to grab the body of the news as well. It is not difficult; read on.
String url = entry.getUri();
item.setUri(url);
This link points to the page that holds the news body. What we need to do is fetch the page source from that URL. Here we use Java to issue an HTTP request, which draws on Java networking basics; brush up on them if you have forgotten. Websites use different encodings, some GBK, some UTF-8, and this is where Chinese text gets tricky: garbled characters appear often. Hence this line of code:
htmlContent = new String(bytes, pageEncoding);
The value of pageEncoding can be found by reading the page source, for example:
content="text/html; charset=gb2312"
Of course, I found it by reading the source of each target website by eye. gb2312 corresponds to GB2312 and utf-8 to UTF-8. There is not much more to say about it, but garbled characters are the main pitfall here.
/**
 * Issues an HTTP request and returns the page source.
 * @param surl URL of the article body page
 * @return page source
 */
public String getStaticPage(String surl) {
    String htmlContent = "";
    java.net.HttpURLConnection connection = null;
    try {
        java.net.URL url = new java.net.URL(surl);
        connection = (java.net.HttpURLConnection) url.openConnection();
        connection.connect();
        java.io.InputStream inputStream = connection.getInputStream();
        // Accumulate the response in a growable buffer so that pages of any size fit
        java.io.ByteArrayOutputStream buffer = new java.io.ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int count;
        while ((count = inputStream.read(chunk)) != -1) {
            buffer.write(chunk, 0, count);
        }
        inputStream.close();
        // pageEncoding is assumed to be a field holding the site's charset name,
        // e.g. "GB2312" or "UTF-8"
        htmlContent = new String(buffer.toByteArray(), pageEncoding);
    } catch (Exception ex) {
        ex.printStackTrace();
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
    return htmlContent.trim();
}
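The pageEncoding value above was found by eye. As a possible refinement, the charset declaration can be sniffed from the downloaded bytes before decoding them. The following is only a sketch under stated assumptions: CharsetSniffSketch, detectCharset and decode are hypothetical names that do not appear in the original project, the declaration is assumed to sit in the first 4096 bytes, and the page is assumed to use an ASCII-compatible encoding (which holds for GB2312, GBK and UTF-8, but not for UTF-16).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the original project. */
class CharsetSniffSketch {

    // Matches charset=gb2312, charset="utf-8" or charset='GBK' inside a meta tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared near the top of the page, or the fallback if none is usable. */
    static Charset detectCharset(byte[] pageBytes, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // stays readable whatever the real encoding turns out to be.
        int headLength = Math.min(pageBytes.length, 4096);
        String head = new String(pageBytes, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher matcher = CHARSET.matcher(head);
        if (matcher.find()) {
            try {
                return Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the page with the sniffed charset, defaulting to UTF-8. */
    static String decode(byte[] pageBytes) {
        return new String(pageBytes, detectCharset(pageBytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detectCharset(page, StandardCharsets.ISO_8859_1));
    }
}
```

With a helper like this, one crawler instance could handle sites with different encodings without a hand-maintained pageEncoding per site. The Content-Type response header is another place where a charset may be declared; this sketch does not look at it.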
Now we have the entire source of the web page, which is not what we want: I only need the body text and nothing else.
Let’s see how I did it.
I used IT Home for testing. There is a "Subscribe" button in the upper right corner of the website; click it.
This jumps to the RSS directory, which is the content I grabbed. The URL marked in red in the content points to the news body. Copy the link and open it in a browser, as shown below.
The content in the red box is not needed, only the content in the green box. The page source we grabbed is just a string, so once we find where the article begins and ends, we can cut out the body. On the same website, the article body sits in the same position on every page. Once we know the parent tag of the article, we can find the beginning of the text, and the end of the text is mostly the same from page to page. See below.
Use Firebug to inspect where the text sits in the page. The beginning of the text that I took:
End of text:
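The start-and-end-marker idea described above can be sketched as a plain substring search. This is a minimal illustration, not the project's actual code: BodyExtractSketch and extractBetween are hypothetical names, and the marker strings in the demo are made-up placeholders rather than the ones taken from IT Home.

```java
/** Hypothetical helper for illustration; not part of the original project. */
class BodyExtractSketch {

    /**
     * Returns the text between the first occurrence of startMarker and the next
     * occurrence of endMarker, or "" if either marker is missing.
     */
    static String extractBetween(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return "";
        }
        start += startMarker.length();
        // Search for the end marker only after the start marker.
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return "";
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Made-up page and markers; real markers must be read from the target site's source.
        String html = "<div class=\"nav\">menu</div>"
                + "<div id=\"article\"><p>Body text</p></div><!-- article end -->"
                + "<div class=\"footer\">footer</div>";
        System.out.println(extractBetween(html, "<div id=\"article\">", "</div><!-- article end -->"));
    }
}
```

Because the markers are specific to each website's page layout, one pair has to be configured per site, and a layout change on the site makes the method return an empty string, which is easy to detect and log.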