The previous article gave a rough analysis of the fingerprint recognition part of Spaghetti. This article gives a rough analysis of the crawler part.
Let's first look at urlextract.py in the extractor directory:
```python
class UrlExtract:

    @staticmethod
    def run(content):
        try:
            urls = re.findall(r'href=[\'"]?([^\'" >]+)|Allow: (/.*)|Disallow: (/.*)|<loc>(.+?)</loc>', content)
            return urls
        except Exception, e:
            pass
```
The class defined in this file mainly parses URL links out of the response body, as well as the paths listed in the well-known files robots.txt and sitemap.xml.
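To get a feel for what this regex pulls out, here is a minimal standalone sketch (Python 3); the sample HTML, robots.txt and sitemap fragments are made up for illustration, and the `<loc>` alternative is assumed to be how the sitemap entries are matched:

```python
import re

# href attributes, robots.txt Allow/Disallow paths, and sitemap <loc>
# entries each end up in one of the four capture groups.
pattern = r'href=[\'"]?([^\'" >]+)|Allow: (/.*)|Disallow: (/.*)|<loc>(.+?)</loc>'

sample = '''<a href="/login.php?next=/">login</a>
Disallow: /admin/
<loc>http://example.com/post?id=1</loc>'''

for groups in re.findall(pattern, sample):
    # Each match is a 4-tuple; only one group is non-empty per match.
    print([g for g in groups if g])
# ['/login.php?next=/']
# ['/admin/']
# ['http://example.com/post?id=1']
```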
The source code in forms.py focuses on processing forms; BeautifulSoup is used for the parsing.
```python
try:
    params = zip(*[iter(params)]*2)
    data = urllib.unquote(urllib.urlencode(params))
    if method == []:
        method = ['get']
    method = method[0]
    if method.upper() == "GET":
        return data
except Exception, e:
    pass
```
The combination of zip and iter in the code above is a trick worth knowing: it groups a flat list into pairs. The urlencode that follows encodes those pairs into the key=value&key=value form of a URL query string; in fact, as long as the sequence consists of two-element tuples, it can be encoded successfully.
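A minimal sketch of the two tricks (Python 3 here, so urllib.parse replaces the Python 2 urllib used by the tool):

```python
from urllib.parse import urlencode, unquote

# A flat list of alternating names and values, e.g. as pulled out of a form.
params = ['user', 'admin', 'pass', '123']

# zip(*[iter(params)]*2): the same iterator object is consumed twice per
# step, so the flat list is grouped into (name, value) pairs.
pairs = list(zip(*[iter(params)] * 2))
print(pairs)                      # [('user', 'admin'), ('pass', '123')]

# urlencode turns the two-element tuples into key=value&key=value form;
# unquote undoes any percent-encoding, mirroring the forms.py snippet.
print(unquote(urlencode(pairs)))  # user=admin&pass=123
```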
Next, let's analyze the crawler implementation file crawler.py. Look at the code of the run function:
```python
def run(self):
    links_list = []
    try:
        for path in ('', 'robots.txt', 'sitemap.xml', 'spaghetti'):
            url = self.ucheck.path(self.url, path)
            resp = self.request.send(url, cookies=self.cookie)
            links = self.extract.run(resp.content)
            if links == None: links = []
            forms = self.forms.run(resp.content, self.url)
            if forms == None: forms = []
            links_list += links
            links_list += forms
        return self.get(self.parse(links_list))
    except Exception, e:
        pass
```

This function takes the specified URL, i.e. the target address, also tries the robots.txt and sitemap.xml files in the root directory, and assembles the list of URL addresses that need to be crawled.
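For illustration, here is a rough standalone approximation of that seed-URL stage, in Python 3; urljoin stands in for the tool's own ucheck.path helper and is not its actual implementation:

```python
from urllib.parse import urljoin

def seed_urls(base_url):
    # Same path tuple as in run(): the target itself, the two well-known
    # files, and the literal 'spaghetti' path from the original code.
    paths = ('', 'robots.txt', 'sitemap.xml', 'spaghetti')
    return [urljoin(base_url.rstrip('/') + '/', p) for p in paths]

print(seed_urls('http://example.com'))
# ['http://example.com/', 'http://example.com/robots.txt',
#  'http://example.com/sitemap.xml', 'http://example.com/spaghetti']
```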
Look at the get function:
```python
def get(self, lista1):
    lista = []
    for i in lista1:
        if re.search('=', i, re.I):
            lista.append(i)
    return lista
```
This function picks out the URLs that contain an equals sign. Presumably this is for the subsequent XSS, SQL injection and other tests: a URL with an equals sign is a URL with key=value parameters, i.e. a place with an input point.
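The same filter can be written as a one-line comprehension, and parse_qs then shows the actual input points; a small sketch, not part of the tool:

```python
from urllib.parse import urlparse, parse_qs

links = ['http://example.com/about',
         'http://example.com/item.php?id=3&sort=asc']

# Equivalent of get(): keep only URLs that carry key=value parameters.
with_params = [u for u in links if '=' in u]
print(with_params)            # ['http://example.com/item.php?id=3&sort=asc']

# Each parameter is a candidate injection point for the later tests.
print(parse_qs(urlparse(with_params[0]).query))
# {'id': ['3'], 'sort': ['asc']}
```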
Look at the parse function; there are a few pieces worth examining:
```python
for link in links:
    for i in link:
        if i == '':
            pass
        else:
            if i not in flinks:
                tlinks.append(i)
blacklist = ['.png', '.jpg', '.jpeg', '.mp3', '.mp4', '.avi', '.gif', '.svg',
             '.pdf', '.js', '.zip', '.css', '.doc', 'mailto']
for link in tlinks:
    for bl in blacklist:
        if bl in link:
            pblacklist.append(link)
for link in tlinks:
    for bl in pblacklist:
        if bl == link:
            index = tlinks.index(bl)
            del tlinks[index]
```
This piece mainly filters out resource links such as images, videos, css, js, and so on. I feel it is not well written. The author's idea is to find all the links matching the blacklist, save them in pblacklist, and then run a nested loop over tlinks and pblacklist to remove from tlinks the links that appear in pblacklist. However, the unwanted links could be dropped directly in the first pass that builds pblacklist; there is no need to traverse everything again. If I were writing it, I would simply skip any link containing a blacklisted fragment and keep the compliant links in a separate list.
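A minimal sketch of the simplification suggested above: filter in a single pass instead of building pblacklist and deleting afterwards.

```python
blacklist = ['.png', '.jpg', '.jpeg', '.mp3', '.mp4', '.avi', '.gif',
             '.svg', '.pdf', '.js', '.zip', '.css', '.doc', 'mailto']

def filter_links(tlinks):
    # Keep a link only if it contains none of the blacklisted fragments;
    # one pass, no second list of links to delete.
    return [link for link in tlinks
            if not any(bl in link for bl in blacklist)]

print(filter_links(['http://example.com/a.php?id=1',
                    'http://example.com/logo.png']))
# ['http://example.com/a.php?id=1']
```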
Continuing with the rest of parse:
```python
for link in tlinks:
    if link.startswith('./'):
        link = link.split('.')[1]
    if link.startswith('http://') or link.startswith('https://'):
        if link not in dlinks:
            dlinks.append(link)
    elif link.startswith('www.'):
        link = 'http://' + link
        if link not in dlinks:
            dlinks.append(link)
    elif link.startswith('/'):
        link = self.ucheck.path(self.url, link)
        if link not in dlinks:
            dlinks.append(link)
    else:
        link = self.ucheck.path(self.url, link)
        if link not in dlinks:
            dlinks.append(link)
for link in dlinks:
    if not link.startswith('http'):
        pass
    elif self.parser.host() not in link:
        pass
    elif link.startswith('http://http://') or link.startswith('https://http://'):
        link = 'http' + link.split('http')[2]
        complete.append(link)
    else:
        complete.append(link)
for i in complete:
    i = urllib.unquote(i)
    i = i.replace('&amp;', '&')
    if i not in deflinks:
        deflinks.append(i)
return deflinks
```
This piece splices the links into standard, absolute URLs. Many of the URLs obtained earlier are relative addresses or point at hosts other than the target, and those get filtered out here.
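For comparison, much of this splicing can be expressed with the standard library's urljoin/urlparse; a rough Python 3 sketch, not the tool's code:

```python
from urllib.parse import urljoin, urlparse

def normalize(base, link, host):
    # Resolve relative paths ('./x', '/x', 'x') against the base URL and
    # give scheme-less 'www.' links a default scheme.
    if link.startswith('www.'):
        link = 'http://' + link
    absolute = urljoin(base, link)
    # Keep only links that stay on the target host, as parse() does.
    return absolute if urlparse(absolute).netloc.endswith(host) else None

print(normalize('http://example.com/', './shop/item.php?id=1', 'example.com'))
# http://example.com/shop/item.php?id=1
print(normalize('http://example.com/', 'http://other.com/x', 'example.com'))
# None
```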
thread_run function:
```python
def thread_run(self):
    links = self.run()
    links_list = []
    if '--crawler' in sys.argv:
        self.output.info('Starting deep crawler for %s' % self.parser.host())
        try:
            for link in links:
                resp = self.request.send(link, cookies=self.cookie)
                links_extract = self.extract.run(resp.content)
                if links_extract == None: links_extract = []
                forms_extract = self.forms.run(resp.content, self.url)
                if forms_extract == None: forms_extract = []
                links_list += links_extract
                links_list += forms_extract
            links_list = self.get(self.parse(links_list))
            links_list = links_list + links
            return links_list
        except Exception, e:
            pass
    return links
```

As you can see, this simply combines the functions above and finally produces the list of links that meet the requirements.
Overall, this crawler does not handle dynamic pages that rely on Ajax. Judging from the source code, it only crawls one level of links, and URL de-duplication is not really achieved.
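If deeper crawling and proper de-duplication were wanted, the usual shape is a queue plus a visited set. A rough sketch under assumptions of my own: it uses the third-party requests library rather than Spaghetti's request module, a fixed depth limit, and a same-host check:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import re
import requests

def crawl(start, max_depth=2):
    host = urlparse(start).netloc
    visited, queue, found = set(), deque([(start, 0)]), []
    while queue:
        url, depth = queue.popleft()
        # The visited set gives de-duplication; depth caps the crawl level.
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        found.append(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for href in re.findall(r'href=[\'"]?([^\'" >]+)', html):
            link = urljoin(url, href)
            if urlparse(link).netloc == host and link not in visited:
                queue.append((link, depth + 1))
    return found
```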
That's probably it.