Python 2.7 - Why doesn't my function remove objects from final_list?
Hey guys :) I'm sorry for the long code, but I feel it's necessary for you to see everything.
I tried everything: hidden prints in the code, debugging ten times, triple-checking the built-in methods, and still the .crawl() method doesn't remove any object from final_list.
The object of my assignment is to build 2 classes:
web_page: holds the data of a web page (the pages come in the form of HTML files saved in a folder on my desktop).
crawler: compares between the pages and holds a list of the unique pages ---> final_list
    import re
    import os

    def remove_html_tags(s):
        tag = False
        quote = False
        out = ""
        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c
        return out

    def lev(s1, s2):
        return lev_iter(s1, s2, dict())

    def lev_iter(s1, s2, mem):
        (i,j) = (len(s1), len(s2))
        if (i,j) in mem:
            return mem[(i,j)]
        s1_low = s1.lower()
        s2_low = s2.lower()
        if len(s1_low) == 0 or len(s2_low) == 0:
            return max(len(s1_low), len(s2_low))
        d1 = lev_iter(s1_low[:-1], s2_low, mem) + 1
        d2 = lev_iter(s1_low, s2_low[:-1], mem) + 1
        last = 0 if s1_low[-1] == s2_low[-1] else 1
        d3 = lev_iter(s1_low[:-1], s2_low[:-1], mem) + last
        result = min(d1, d2, d3)
        mem[(i,j)] = result
        return result

    def merge_spaces(content):
        return re.sub('\s+', ' ', content).strip()

    """ A class that holds data on a web page """
    class webpage:
        def __init__(self, filename):
            self.filename = filename

        def process(self):
            f = open(self.filename,'r')
            line_lst = f.readlines()
            self.info = {}
            for i in range(len(line_lst)):
                line_lst[i] = line_lst[i].strip(' \n\t')
                line_lst[i] = remove_html_tags(line_lst[i])
            lines = line_lst[:]
            for line in lines:
                if len(line) == 0:
                    line_lst.remove(line)
            self.body = ' '.join(line_lst[1:])
            self.title = line_lst[0]
            f.close()

        def __str__(self):
            return self.title + '\n' + self.body

        def __repr__(self):
            return self.title

        def __eq__(self,other):
            n = lev(self.body,other.body)
            k = len(self.body)
            m = len(other.body)
            return float(n)/max(k,m) <= 0.15

        def __lt__(self,other):
            return self.title < other.title

    """ A class that crawls the web """
    class crawler:
        def __init__(self, directory):
            self.folder = directory

        def crawl(self):
            pages = [f for f in os.listdir(self.folder) if f.endswith('.html')]
            final_list = []
            for i in range(len(pages)):
                pages[i] = webpage(self.folder + '\\' + pages[i])
                pages[i].process()
                for k in range(len(final_list)+1):
                    if k == len(final_list):
                        final_list.append(pages[i])
                    elif pages[i] == final_list[k]:
                        if pages[i] < final_list[k]:
                            final_list.append(pages[i])
                            final_list.remove(final_list[k])
                            break
            print final_list
            self.pages = final_list

Everything works fine besides the freaking line final_list.remove(final_list[k]). Help, please? What's wrong here?
I'm not sure why your code doesn't work; it's difficult to test because I don't know what kind of input should end up calling remove().
I suggest following these steps:
- Make sure that remove() is called at some point.
- remove() relies on your __eq__() method to find the item to remove, so make sure that __eq__() isn't the culprit.
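To make the second point concrete, here is a minimal sketch of how list.remove() interacts with a "fuzzy" __eq__(). The Fuzzy class and its values are hypothetical stand-ins for webpage, not code from the question. remove() deletes the first element that compares equal to its argument, which is not necessarily the element that was passed in, whereas del removes by position.

```python
class Fuzzy(object):
    """Hypothetical stand-in for webpage: equal when values differ by at most 1."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        return abs(self.value - other.value) <= 1

    def __repr__(self):
        return self.name


a, b, c = Fuzzy('a', 1), Fuzzy('b', 2), Fuzzy('c', 3)

items = [a, b, c]
items.remove(items[1])   # asks for b, but a == b is True and a comes first
removed_by_value = [x.name for x in items]

items = [a, b, c]
del items[1]             # removes exactly the element at index 1
removed_by_index = [x.name for x in items]

print(removed_by_value)
print(removed_by_index)
```

Whether this is what happens in the question's crawl() depends on the actual pages, so treat it as something to check rather than a diagnosis.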
As a side note, you probably want to replace this:

    self.folder + '\\' + pages[i]

with:

    import os.path
    # ...
    os.path.join(self.folder, pages[i])

This simple change should make your script work on all operating systems, rather than on Windows only. (GNU/Linux, Mac OS and other Unix-like OS use “/” as the path separator.)
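As a small self-contained illustration of os.path.join(), assuming a hypothetical folder name and file name (neither comes from the question):

```python
import os.path

folder = 'pages'          # hypothetical folder name
filename = 'index.html'   # hypothetical file name

# join() inserts the separator used by the current platform
path = os.path.join(folder, filename)
print(path)
```

On Windows this should print the path with a backslash, and on Unix-like systems with a forward slash.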
Please also consider replacing all loops of the form:

    for i in range(len(sequence)):
        # do something with sequence[i]

with:

    for item in sequence:
        # do something with item

If you need the item index, use enumerate():

    for i, item in enumerate(sequence):
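A rough sketch of that loop advice, using made-up sample data that mimics the stripping step in process() (the lines list is an assumption, not data from the question):

```python
lines = ['  <title> ', '', ' body text\n']   # hypothetical sample data

# Index-based loop, in the style used in the question
stripped_by_index = lines[:]
for i in range(len(stripped_by_index)):
    stripped_by_index[i] = stripped_by_index[i].strip(' \n\t')

# Same result with enumerate(): the index and the item arrive together
stripped = lines[:]
for i, line in enumerate(stripped):
    stripped[i] = line.strip(' \n\t')

# When building a new list is acceptable, no index is needed at all
non_empty = [line for line in stripped if line]
```

The comprehension on the last line also avoids calling remove() while filtering out empty lines.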