In the previous article I introduced how to grab news text via RSS. This article covers crawling the main text of multiple RSS sites at the same time, as well as capturing the images embedded in that text.
My RSS crawler does not grab the entire page of a site, only the required body text; ads, comments, and the like are excluded.
Part One: Crawling multiple sites at the same time. First, look at my site configuration:
<websites>
    <website>
        <name>IT Home</name>
        <url>http://www.ithome.com/rss/</url>
        <startTag><![CDATA[...]]></startTag>
        <endTag><![CDATA[...]]></endTag>
        <encoding>GB2312</encoding>
        <open>true</open>
    </website>
    <website>
        <name>Huxiu.com</name>
        <url>http://www.huxiu.com/rss/0.xml</url>
        <startTag><![CDATA[...]]></startTag>
        <endTag><![CDATA[...]]></endTag>
        <encoding>UTF-8</encoding>
        <open>true</open>
    </website>
</websites>
These two sites are the ones I need to crawl. url is the RSS address; startTag and endTag mark the beginning and end of the article body; encoding is the site's character encoding; open indicates whether the site should be crawled. If anything is unclear, please see http://www.voidcn.com/article/p-ovaqymau-bky.html
With the sites to crawl ready, let's start parsing the configuration. I use dom4j, so please add the related jars; I am used to managing them with Maven:
<dependency>
    <groupId>dom4j</groupId>
    <artifactId>dom4j</artifactId>
    <version>1.6.1</version>
</dependency>
The bean class for a site:
public class Website {
    private String name;
    private String url;
    private String startTag;
    private String endTag;
    private String encoding;
    private String open;
    // getters and setters omitted
}
The parsing code; after parsing it builds and returns a list of Website objects:
import java.io.File;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

/**
 * @author hongliang.dinghl
 * Dom4j: generates and parses XML documents.
 */
public class Dom4jUtil {

    public List<Website> parserXml(String fileName) {
        SAXReader saxReader = new SAXReader();
        List<Website> list = new ArrayList<Website>();
        try {
            URL url = getClass().getResource("/");
            String path = url.getFile().replace("%20", " ") + fileName;
            Document document = saxReader.read(new File(path));
            Element websites = document.getRootElement();
            // Each child of the root is one <website> element
            for (Iterator i = websites.elementIterator(); i.hasNext(); ) {
                Element websiteElement = (Element) i.next();
                Website website = new Website();
                // Map every child node to the matching setter via reflection
                for (Iterator j = websiteElement.elementIterator(); j.hasNext(); ) {
                    Element node = (Element) j.next();
                    String name = node.getName();
                    String methodName = "set" + name.substring(0, 1).toUpperCase() + name.substring(1);
                    Method method = website.getClass().getMethod(methodName, String.class);
                    method.invoke(website, node.getText());
                }
                list.add(website);
            }
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (NoSuchMethodException e) {
            e.printStackTrace();
        } catch (InvocationTargetException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        return list;
    }
}
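For clarity, here is a minimal usage sketch of the parser. The configuration file name websites.xml is my assumption; use whatever name your classpath file actually has:

import java.util.List;

public class Dom4jUtilDemo {
    public static void main(String[] args) {
        // "websites.xml" is an assumed file name on the classpath
        List<Website> websites = new Dom4jUtil().parserXml("websites.xml");
        for (Website site : websites) {
            // Only crawl sites whose open flag is set to true
            if ("true".equals(site.getOpen())) {
                System.out.println(site.getName() + " -> " + site.getUrl());
            }
        }
    }
}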
That completes the multi-site parsing. Then traverse the list of sites, visit each URL, and grab the body text; for details please see my previous article.
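The FindHtml class used below comes from that previous article; roughly, it downloads a page and cuts out the body between startTag and endTag. A minimal sketch of the idea follows (the constructor and method signatures match how FeedReader calls it, but the internals here are my assumption):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FindHtml {
    private final String startTag;
    private final String endTag;
    private final String encoding;

    public FindHtml(String startTag, String endTag, String encoding) {
        this.startTag = startTag;
        this.endTag = endTag;
        this.encoding = encoding;
    }

    /** Download the page and return the text between startTag and endTag. */
    public String getContent(String link) throws Exception {
        URLConnection conn = new URL(link).openConnection();
        StringBuilder html = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), encoding));
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append('\n');
        }
        reader.close();
        int begin = html.indexOf(startTag);
        int end = html.indexOf(endTag, begin);
        if (begin != -1 && end != -1) {
            // Keep only the article body between the two markers
            return html.substring(begin + startTag.length(), end);
        }
        return "";
    }
}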
Part Two: Grabbing the images in the RSS body and removing the <a> links. Let's go straight to the code; it is commented throughout. The full class is at the bottom.
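The FeedReader class below relies on the Rome library for RSS/Atom parsing. If you also manage it with Maven, a dependency along these lines works (the version is my assumption; pick whatever your project uses):

<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>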
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileOutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.imageio.ImageIO;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class FeedReader {

    private String CLASS_PATH;     // local directory where images are saved
    private String relative_path;  // path prefix written back into the html

    public FeedReader() {
        Properties properties = PropertiesUtil.getInstance().getProerties();
        CLASS_PATH = properties.getProperty("image_path");
        relative_path = properties.getProperty("relative_path");
    }

    /**
     * @param url RSS feed address, e.g. http://www.ithome.com/rss/
     * @return all article objects
     */
    public List<RSSItemBean> getRss(String url) throws Exception {
        URL feedUrl = new URL(url);
        URLConnection conn = feedUrl.openConnection();
        // Pretend to be a browser, otherwise some servers refuse the request
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        // SyndFeedInput converts the remote XML into a SyndFeed instance;
        // rome builds SyndFeed implementations for both RSS and Atom
        SyndFeedInput input = new SyndFeedInput();
        SyndFeed feed = input.build(new XmlReader(conn));
        List<SyndEntry> entries = feed.getEntries();
        List<RSSItemBean> rssItemBeans = new ArrayList<RSSItemBean>();
        for (SyndEntry entry : entries) {
            RSSItemBean item = new RSSItemBean();
            item.setTitle(entry.getTitle().trim());
            item.setType(feed.getTitleEx().getValue().trim());
            item.setUri(entry.getUri());
            item.setPubDate(entry.getPublishedDate());
            item.setAuthor(entry.getAuthor());
            rssItemBeans.add(item);
        }
        return rssItemBeans;
    }

    /**
     * Get the news body from the html.
     *
     * @param website website object, my own definition
     * @return the RSS item list with the news body filled in
     */
    public List<RSSItemBean> getContent(Website website) throws Exception {
        String content;
        List<RSSItemBean> rssList = getRss(website.getUrl());
        FindHtml findHtml = new FindHtml(website.getStartTag(),
                website.getEndTag(), website.getEncoding());
        for (RSSItemBean rsItem : rssList) {
            String link = rsItem.getUri();
            content = findHtml.getContent(link); // key method: get the news body
            content = processImages(content);    // download and relink images
            rsItem.setContent(content);
            // fid comes from the site config (not shown in the Website bean above)
            rsItem.setFid(Integer.parseInt(website.getFid()));
        }
        return rssList;
    }

    /**
     * Remove the <a> links in the article.
     */
    private String removeLinks(String input) {
        String output = input;
        // Regular expression for the opening <a ...> tag
        String regEx = "<a[^>]*>";
        Pattern p = Pattern.compile(regEx, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        output = m.replaceAll("");
        // Regular expression for the closing </a> tag
        regEx = "</a>";
        p = Pattern.compile(regEx, Pattern.CASE_INSENSITIVE);
        m = p.matcher(output);
        output = m.replaceAll("");
        return output;
    }

    public static void main(String[] args) {
        // Quick check of the UUID values used as image file names
        UUID uuid = UUID.randomUUID();
        System.out.println(uuid.toString());
        System.out.println(uuid.toString());
    }

    /**
     * Process the pictures in the article.
     */
    private String processImages(String input) {
        String output = input;
        String regEx = "<img[^>]*>";
        Pattern p = Pattern.compile(regEx, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        List<String> imgs = new ArrayList<String>();
        // Read all <img> tags
        while (m.find()) {
            imgs.add(m.group());
        }
        // Save each image locally and replace the src value of the tag
        for (String img : imgs) {
            int begin = -1;
            int end = -1;
            String path = "";
            if (img.indexOf("src=\"") != -1) {
                begin = img.indexOf("src=\"");
                path = img.substring(begin + 5);
                end = path.indexOf("\"");
                if (end != -1) {
                    path = path.substring(0, end);
                } else {
                    path = "";
                }
            }
            if (img.indexOf("src='") != -1) {
                begin = img.indexOf("src='");
                path = img.substring(begin + 5);
                end = path.indexOf("'");
                if (end != -1) {
                    path = path.substring(0, end);
                } else {
                    path = "";
                }
            }
            if (!path.equals("")) {
                // String filepath = this.writeImageToServer(path);
                String filepath = writeToFile(path);
                while (filepath.indexOf('\\') != -1) {
                    filepath = filepath.replace('\\', '/');
                }
                output = output.replaceAll(path, filepath);
            }
        }
        return output;
    }

    /**
     * Write the picture to the local file system.
     *
     * @param path original picture url
     * @return local picture path
     */
    public String writeToFile(String path) {
        String dirName = "";
        String fileName = "";
        File directory = null;
        File file = null;
        try {
            // Take the image format from the extension
            int begin = path.lastIndexOf(".");
            String suffix = path.substring(begin + 1);
            // Some sites append a size marker, e.g. jyijaktkyzkk.jpg!292x420
            if (suffix.contains("!")) {
                int index = suffix.indexOf("!");
                suffix = suffix.substring(0, index);
            }
            // Read the image
            URL url = new URL(path);
            BufferedImage image = ImageIO.read(url);
            dirName = CLASS_PATH; // file directory
            directory = new File(dirName);
            if (!directory.exists()) {
                directory.mkdirs();
            }
            if (directory.exists()) {
                String name = UUID.randomUUID() + "." + suffix;
                fileName = dirName + name; // the real file name
                file = new File(fileName);
                FileOutputStream fos = new FileOutputStream(file);
                ImageIO.write(image, suffix, fos);
                fos.close();
                return relative_path + name;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }
}
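Putting the two parts together, a minimal driver might look like the sketch below. The config file name and the assumption that RSSItemBean exposes a getTitle() getter are mine:

import java.util.List;

public class CrawlerDemo {
    public static void main(String[] args) throws Exception {
        // "websites.xml" is an assumed classpath file name
        List<Website> websites = new Dom4jUtil().parserXml("websites.xml");
        FeedReader reader = new FeedReader();
        for (Website site : websites) {
            if (!"true".equals(site.getOpen())) {
                continue; // skip sites switched off in the config
            }
            // Fetch the feed, grab each article body, download its images
            for (RSSItemBean item : reader.getContent(site)) {
                System.out.println(item.getTitle());
            }
        }
    }
}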
There will also be a third article: grabbing the RSS body text and saving it into Discuz.