python 2.7 - Why doesn't my function remove objects from final_list?


Hey guys :) I'm sorry about the amount of code, but I feel it's necessary for you to see everything.

I tried everything: hidden prints in the code, debugging ten times, triple-checking the built-in methods, and still the .crawl() method doesn't remove the object from final_list.

The goal of the assignment is to build two classes. webpage holds the data of a web page (the pages come in the form of HTML files saved in a folder on the desktop). crawler compares the pages and holds a list of unique pages: final_list.

import re
import os

def remove_html_tags(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out


def lev(s1, s2):
    return lev_iter(s1, s2, dict())

def lev_iter(s1, s2, mem):

    (i,j) = (len(s1), len(s2))
    if (i,j) in mem:
        return mem[(i,j)]

    s1_low = s1.lower()
    s2_low = s2.lower()
    if len(s1_low) == 0 or len(s2_low) == 0:
        return max(len(s1_low), len(s2_low))
    d1 = lev_iter(s1_low[:-1], s2_low, mem) + 1
    d2 = lev_iter(s1_low, s2_low[:-1], mem) + 1
    last = 0 if s1_low[-1] == s2_low[-1] else 1
    d3 = lev_iter(s1_low[:-1], s2_low[:-1], mem) + last
    result = min(d1, d2, d3)

    mem[(i,j)] = result

    return result

def merge_spaces(content):
    return re.sub('\s+', ' ', content).strip()


""" class that holds the data of a web page """
class webpage:

    def __init__(self, filename):
        self.filename = filename

    def process(self):
        f = open(self.filename,'r')
        line_lst = f.readlines()
        self.info = {}

        for i in range(len(line_lst)):
            line_lst[i] = line_lst[i].strip(' \n\t')
            line_lst[i] = remove_html_tags(line_lst[i])
        lines = line_lst[:]
        for line in lines:
            if len(line) == 0:
                line_lst.remove(line)
        self.body = ' '.join(line_lst[1:])
        self.title = line_lst[0]
        f.close()

    def __str__(self):
        return self.title + '\n' + self.body

    def __repr__(self):
        return self.title

    def __eq__(self,other):
        n = lev(self.body,other.body)
        k = len(self.body)
        m = len(other.body)
        return float(n)/max(k,m) <= 0.15

    def __lt__(self,other):
        return self.title < other.title


""" class that crawls the web """
class crawler:

    def __init__(self, directory):
        self.folder = directory

    def crawl(self):
        pages = [f for f in os.listdir(self.folder) if f.endswith('.html')]
        final_list = []

        for i in range(len(pages)):
            pages[i] = webpage(self.folder + '\\' + pages[i])
            pages[i].process()

            for k in range(len(final_list)+1):
                if k == len(final_list):
                    final_list.append(pages[i])
                elif pages[i] == final_list[k]:
                    if pages[i] < final_list[k]:
                        final_list.append(pages[i])
                        final_list.remove(final_list[k])
                        break

        print final_list
        self.pages = final_list

Everything works fine except the freaking line final_list.remove(final_list[k]). Help, please? What's wrong here?

I'm not sure why your code doesn't work, and it's difficult to test because I don't know what kind of input should end up calling remove().

I suggest following these steps:

  • Make sure remove() is actually called at some point.
  • remove() relies on the __eq__() method to find the item to remove, so make sure __eq__() isn't the culprit.
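
The second check can be tried in isolation. The sketch below uses a hypothetical stand-in class, FuzzyPage, whose __eq__() is deliberately loose; it is not the webpage class from the question, and the "same body length" rule is an assumption made only for this demo. It shows that list.remove() deletes the first element that compares equal to its argument, which is not necessarily the object that was passed in.

```python
class FuzzyPage(object):
    """Hypothetical stand-in for the question's webpage class."""

    def __init__(self, title, body):
        self.title = title
        self.body = body

    def __eq__(self, other):
        # Deliberately loose (demo assumption): pages are "equal"
        # when their bodies have the same length.
        return len(self.body) == len(other.body)

    def __ne__(self, other):
        return not self == other

    def __repr__(self):
        return self.title


a = FuzzyPage('a', 'xxxx')
b = FuzzyPage('b', 'yyyy')  # compares equal to a under the loose __eq__
c = FuzzyPage('c', 'zz')

pages = [a, c, b]
pages.remove(b)  # removes the FIRST element equal to b, which is a
remaining = [p.title for p in pages]
```

If the same thing happens with a fuzzy comparison such as the one in the question, remove() may take out a different page than the one at index k, so printing the list just before and just after the call is a cheap way to see what was actually removed.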

As a side note, you may want to replace this:

self.folder + '\\' + pages[i] 

with:

import os.path
# ...
os.path.join(self.folder, pages[i])

This simple change should make your script work on all operating systems, rather than on Windows only. (GNU/Linux, Mac OS and other Unix-like OSes use "/" as the path separator.)
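
As a quick check of this suggestion, here is a minimal sketch showing that os.path.join() produces the same string as manual concatenation with the platform's own separator, os.sep. The folder and file names are made up for illustration.

```python
import os.path

folder = 'pages'         # hypothetical directory name
filename = 'index.html'  # hypothetical file name

# os.path.join() inserts the separator of the current platform.
path = os.path.join(folder, filename)

# os.sep is '\\' on Windows and '/' on Unix-like systems.
manual = folder + os.sep + filename
```

Both variables hold the same path on any platform, whereas a hard-coded '\\' only gives a valid path on Windows.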

Please also consider replacing loops of the form:

for i in range(len(sequence)):
    # sequence[i]

with:

for item in sequence:
    # item

If you need the item index, use enumerate():

for i, item in enumerate(sequence): 
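
To make the comparison concrete, here is a small sketch of the three loop styles side by side. The list of file names is hypothetical and not taken from the question.

```python
names = ['b.html', 'a.html', 'c.html']  # hypothetical file names

# Index-based loop, in the style of the question's crawl() method.
by_index = []
for i in range(len(names)):
    by_index.append((i, names[i]))

# Same result with enumerate(): no manual indexing needed.
by_enumerate = []
for i, name in enumerate(names):
    by_enumerate.append((i, name))

# When the index is not needed at all, iterate over the items directly.
stems = []
for name in names:
    stems.append(name[:-len('.html')])
```

The first two loops build identical lists, and the third one never needs the index at all.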
