Multi-site RSS news text crawling, import into a Discuz forum, automatic posting (1)

The company's R&D department cannot access the Internet, but the company wants its R&D colleagues to follow the news, understand the hot topics in science and technology, and keep up with the times. So a Discuz forum was built, but it lacked content. Fortunately, the forum's server sits on both networks and can reach the Internet. So I decided to write a crawler that grabs news content through RSS and posts it to the company forum.

It is now implemented and crawls several websites at the same time (IT Home, Huxiu, etc.). So far it captures only text without pictures; I will try capturing images later. The results so far are pretty good. There are plenty of articles online about reading RSS, but relatively few about crawling the article text that the RSS links point to. I happened to have such a project these days, so I am posting the design ideas and key code to share. The source code has been tidied up and uploaded; the link is at the bottom of the article.

I. A brief introduction to RSS


1: What is RSS? RSS (Really Simple Syndication) is a web content syndication format. RSS is written in XML and must conform to the XML 1.0 specification.
What RSS is used for: subscribing to blogs and subscribing to news.
2: RSS version history:
http://blogs.law.harvard.edu/tech/rssVersionHistory
There are many versions of RSS: 0.90, 0.91, 0.92, 0.93, 0.94, 1.0 and 2.0. The main alternative to RSS is Atom.
RSS 2.0 is mainly used in China, and Atom 0.3 is mainly used abroad.
The split of RSS into two camps has caused some confusion. The RSS 2.0 specification is defined and frozen by Harvard University.
Address: http://blogs.law.harvard.edu/tech/rss

3: The parsing jar, Rome: http://wiki.java.net/bin/view/Javawsxml/Rome
Rome is an open source project on java.net, and the current version is 1.0. Why is it called Rome? According to its introduction, the name comes from "all roads lead to Rome", with the roads here being RSS feeds. Rome may have been spun off from one of Sun's own sub-projects; its package and class naming feels as orderly as the J2SDK's. It supports all versions of RSS as well as Atom 0.3 (Atom is a content syndication format similar to RSS). Rome itself provides both the API and the implementation.

More RSS introduction: http://www.360doc.com/content/10/1224/09/1007797_80869844.shtml
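
To make the XML structure concrete, here is a minimal sketch that parses a tiny RSS 2.0 document with the JDK's built-in DOM parser. It is not how Rome works internally, and the feed content, the class name RssStructureDemo and the example.com URLs are all made up for illustration. It only shows that a feed is a channel holding item elements, each with a title and a link.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RssStructureDemo {
    // A made-up RSS 2.0 feed: one <channel> holding two <item> elements.
    static final String SAMPLE_FEED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<rss version=\"2.0\"><channel>"
        + "<title>Example News</title>"
        + "<link>http://news.example.com/</link>"
        + "<description>Hypothetical feed</description>"
        + "<item><title>First story</title>"
        + "<link>http://news.example.com/1.html</link>"
        + "<description>Summary of the first story</description></item>"
        + "<item><title>Second story</title>"
        + "<link>http://news.example.com/2.html</link>"
        + "<description>Summary of the second story</description></item>"
        + "</channel></rss>";

    // Returns "title -> link" for every <item> in the feed.
    static List<String> listItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = item.getElementsByTagName("title").item(0).getTextContent();
                String link = item.getElementsByTagName("link").item(0).getTextContent();
                result.add(title + " -> " + link);
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or parser setup failure
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        for (String line : listItems(SAMPLE_FEED)) {
            System.out.println(line);
        }
    }
}
```

Note that each item carries only a title, a link and a short description. The full article body is normally not part of the feed.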

II. Implementation ideas

1. Fetch the RSS content with an HTTP request. RSS feeds are all XML, and the Rome jar assembles them into objects.

First define a bean; the fetched RSS content is assembled into this object. Choose the fields according to your project's requirements.

    import java.util.Date;

    public class RSSItemBean {
        private String title;
        private String author;
        private String uri;
        private String link;
        private String description;
        private Date pubDate;
        private String type;
        private String content;
        // getters and setters omitted
    }

The HTTP request code follows. It uses the classes in com.sun.syndication.feed.synd.*, so add the Rome jar to the project.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import com.sun.syndication.feed.synd.SyndEntry;
    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;

    /**
     * @param url RSS feed address, such as http://www.ithome.com/rss/
     * @return all article objects in the feed
     * @throws Exception if the feed cannot be fetched or parsed
     */
    public List<RSSItemBean> getRss(String url) throws Exception {
        URL feedUrl = new URL(url);
        // SyndFeedInput converts the remotely read XML into a SyndFeed instance
        SyndFeedInput input = new SyndFeedInput();
        // Rome builds an RSS or Atom instance depending on the feed type;
        // SyndFeed is the interface implemented by SyndFeedImpl
        SyndFeed feed = input.build(new XmlReader(feedUrl));
        List<?> entries = feed.getEntries();
        List<RSSItemBean> rssItemBeans = new ArrayList<RSSItemBean>();
        for (Object o : entries) {
            SyndEntry entry = (SyndEntry) o;
            RSSItemBean item = new RSSItemBean();
            item.setTitle(entry.getTitle().trim());
            item.setType(feed.getTitleEx().getValue().trim());
            item.setUri(entry.getUri());
            item.setPubDate(entry.getPublishedDate());
            item.setAuthor(entry.getAuthor());
            rssItemBeans.add(item);
        }
        return rssItemBeans;
    }

The RSS parsing is now complete: the news headline, article address and summary are all available. Simple, isn't it? But here is the problem: where is the body of the news? The body is not in the RSS feed. Like Baidu News, a feed is only an aggregation; clicking a news link still takes you to the source web page.

Sometimes we need to grab the body of the news as well. It is not difficult; read on.



String url = entry.getUri();
item.setUri(url);
 
This link points to the page that holds the body of the news. What we need to do is fetch the page source from that URL. Here we use Java to issue an HTTP request, which draws on Java network programming; brush up on it if you have forgotten. Websites use different encodings, some GBK and some UTF-8, so Chinese text easily comes out garbled. That is the reason for this line of code:

 htmlContent = new String(bytes, pageEncoding);

The pageEncoding can be found by reading the page source, for example

content="text/html; charset=gb2312"

I found it by eye in the source of each website I crawl. gb2312 corresponds to GB2312 and utf-8 corresponds to UTF-8. Garbled characters are the main pitfall at this step.
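
Reading the charset by eye could also be automated. The following is only a sketch of one possible approach, not part of the original project: it searches the first few kilobytes of the raw page for a charset=... declaration and falls back to a default when none is found. The class name CharsetSniffer, the method detectEncoding and the 4096-byte window are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=... as it appears in a <meta> tag or a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name in upper case,
    // or the fallback if nothing usable is declared.
    static String detectEncoding(byte[] pageBytes, String fallback) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so the ASCII markup stays searchable whatever the real encoding is.
        int length = Math.min(pageBytes.length, 4096);
        String head = new String(pageBytes, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1).toUpperCase(Locale.ROOT);
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A made-up page; \u65b0\u95fb is the Chinese word for "news".
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\"></head>"
                + "<body>\u65b0\u95fb</body></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);

        String pageEncoding = detectEncoding(bytes, "UTF-8");
        System.out.println(pageEncoding);

        // Decoding with the declared charset keeps the Chinese text intact;
        // decoding with a mismatched charset garbles it.
        String good = new String(bytes, Charset.forName(pageEncoding));
        String bad = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(good.contains("\u65b0\u95fb"));
        System.out.println(bad.contains("\u65b0\u95fb"));
    }
}
```

A page that declares no charset, or declares the wrong one, still needs a manually chosen fallback, so checking by eye remains a reasonable starting point for a small number of sites.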

    /**
     * Fetch the source code of a page with an HTTP request.
     *
     * @param surl URL of the article page
     * @return page source code, or "" if the request fails
     */
    public String getStaticPage(String surl) {
        String htmlContent = "";
        try {
            java.net.URL url = new java.net.URL(surl);
            java.net.HttpURLConnection connection =
                    (java.net.HttpURLConnection) url.openConnection();
            connection.connect();
            java.io.InputStream inputStream = connection.getInputStream();
            // Read the response in chunks until the stream ends
            java.io.ByteArrayOutputStream buffer = new java.io.ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int count;
            while ((count = inputStream.read(chunk)) != -1) {
                buffer.write(chunk, 0, count);
            }
            inputStream.close();
            // pageEncoding is the site's charset, e.g. "GB2312" or "UTF-8"
            htmlContent = new String(buffer.toByteArray(), pageEncoding);
            connection.disconnect();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return htmlContent.trim();
    }
 
 
Now we have the entire source code of the page, which is not what we want: I only need the article text and nothing else.

Let's see how I did it.
 
 
 
I tested with IT Home. There is a "subscribe" button in the upper right corner of the website; click it

and you jump to the RSS directory. These are the contents I grabbed. The URL marked in red in the content points to the news text. Copy the link and open it in the browser, as shown below.
 
 
 
The content in the red box is not needed, only the content in the green box. The page source we grabbed is just a string, so once we find where the article begins and ends, we can cut out the text. On the same website, the article body sits in the same place on every page: if we know the parent tag that wraps the article, we can find the beginning of the text, and the end of the text is marked the same way on most pages. See below.

 
Use Firebug to view the position of the text. The beginning of the text I took:

 

 
End of text: 


 
 
In code, this is simply a substring search, as follows:

    /**
     * Get the body text of an article from its URL.
     *
     * @param url URL of the article page
     * @return source code of the body, or "" if the markers are not found
     */
    public String getContent(String url) {
        String src = getStaticPage(url); // full page source
        int startIndex = src.indexOf(startTag); // position of the start marker
        // Search for the end marker only after the start marker
        int endIndex = startIndex == -1 ? -1 : src.indexOf(endTag, startIndex);
        // System.out.println(startIndex + "\t" + endIndex);
        if (startIndex != -1 && endIndex != -1) {
            return src.substring(startIndex, endIndex);
        }
        return "";
    }
In this way, the body of the news is captured. The startTag and endTag differ from website to website, so each site has to be handled separately.
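
One way to keep a separate startTag/endTag pair for each website is a lookup table keyed by host name. The sketch below is an illustration only, not the project's actual code. The class SiteBodyExtractor, the host names and the marker strings are all hypothetical; real markers have to be read from each site's page source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SiteBodyExtractor {
    // Start and end markers for one site.
    static final class TagPair {
        final String startTag;
        final String endTag;

        TagPair(String startTag, String endTag) {
            this.startTag = startTag;
            this.endTag = endTag;
        }
    }

    private final Map<String, TagPair> tagsByHost = new LinkedHashMap<>();

    // Registers the markers that delimit the article body on one site.
    void register(String host, String startTag, String endTag) {
        tagsByHost.put(host, new TagPair(startTag, endTag));
    }

    // Returns the source from the start marker up to (not including) the end
    // marker, or "" if the host is unknown or a marker is missing.
    String extract(String host, String pageSource) {
        TagPair tags = tagsByHost.get(host);
        if (tags == null) {
            return "";
        }
        int startIndex = pageSource.indexOf(tags.startTag);
        if (startIndex == -1) {
            return "";
        }
        // Look for the end marker only after the start marker
        int endIndex = pageSource.indexOf(tags.endTag, startIndex + tags.startTag.length());
        if (endIndex == -1) {
            return "";
        }
        return pageSource.substring(startIndex, endIndex);
    }

    public static void main(String[] args) {
        SiteBodyExtractor extractor = new SiteBodyExtractor();
        // Hypothetical hosts and markers, one pair per site
        extractor.register("news.example.com", "<div class=\"post\">", "<!-- end post -->");
        extractor.register("tech.example.org", "<article>", "</article>");

        String page = "<html><body><div id=\"nav\">menu</div>"
                + "<article><p>Body text</p></article>"
                + "<div>footer</div></body></html>";
        System.out.println(extractor.extract("tech.example.org", page));
    }
}
```

Adding a new website then means registering one more pair of markers rather than changing the extraction code.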
 
That is how to grab the text behind an RSS feed. In the next article I will introduce text-grabbing strategies for different websites and different RSS feeds, as a summary of my work.
Download the source code if you want to study it closely. This post covers only the key code; contact me directly if you have any questions.

 
This project is built with Maven; add Maven support or import the corresponding jars.
Source code download: http://download.csdn.net/detail/a442180673/6511981 (simple version: text capture only, nothing imported into the database)
Full version: http://download.csdn.net/detail/a442180673/6523263 (implements importing into the Discuz database)

					
							

								
									