How to convert a PDF into a text file using the iText library


The code below is meant to convert a PDF file into a text file. The code runs, but it doesn't generate the resulting text file (sample.txt). Can anyone shed light on this? The code is partly based on an example from the first iText in Action book...

    import java.io.*;

    import com.lowagie.text.*;
    import com.lowagie.text.pdf.*;

    public class ConvertPdfToText {
        public static void main(String[] args) throws IOException {
            try {
                Document document = new Document();
                document.open();
                PdfReader reader = new PdfReader("data dictinary a4.pdf");
                // Read the content stream of page 1 and tokenize it
                PdfDictionary dictionary = reader.getPageN(1);
                PRIndirectReference reference = (PRIndirectReference)
                        dictionary.get(PdfName.CONTENTS);
                PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
                byte[] bytes = PdfReader.getStreamBytes(stream);
                PRTokeniser tokenizer = new PRTokeniser(bytes);
                FileOutputStream fos = new FileOutputStream("sample.txt");
                StringBuffer buffer = new StringBuffer();
                // Keep only the string tokens; all operators are ignored
                while (tokenizer.nextToken()) {
                    if (tokenizer.getTokenType() == PRTokeniser.TK_STRING) {
                        buffer.append(tokenizer.getStringValue());
                    }
                }
                String test = buffer.toString();
                StringReader stReader = new StringReader(test);
                int t;
                while ((t = stReader.read()) > 0)
                    fos.write(t);
                document.add(new Paragraph(".."));
                document.close();
            }
            catch (Exception e) {} // any exception is silently ignored here
        }
    }

Which example are you using? If it is the one on page 575, read the following:

"What we have here is a poor man's text extractor. It works for this example, but it won't work for all PDF files that can be found in the wild. Many aspects should be taken into account if you want to use iText as a text-extraction library."

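To illustrate why collecting string tokens is only a "poor man's" approach, here is a minimal sketch in plain Java that does not use iText at all. The class `NaiveStringExtractor`, its method `extractStrings`, and the sample content stream are made up for illustration. It gathers every literal string `(...)` from a content stream and ignores the operators, roughly like the token loop in the question. In a real PDF the strings may also be split across `TJ` arrays, positioned by operators such as `Td`, or encoded with a font-specific encoding, none of which this sketch handles.

```java
public class NaiveStringExtractor {

    // Collects the contents of every literal string "(...)" in a content stream
    // and ignores all operators and numbers outside of strings.
    static String extractStrings(String content) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // nesting level of parentheses; 0 means "outside a string"
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (depth > 0 && c == '\\' && i + 1 < content.length()) {
                // Simplification: keep the escaped character as-is,
                // without decoding sequences such as \n or octal codes.
                out.append(content.charAt(++i));
            } else if (c == '(') {
                if (depth > 0) out.append(c); // balanced parenthesis inside a string
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
                if (depth > 0) out.append(c);
            } else if (depth > 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical content stream: "Hello" is split by a kerning adjustment,
        // and "World" is placed on the next line by the Td operator.
        String content = "BT /F1 12 Tf [(Hel) -20 (lo)] TJ 0 -14 Td (World) Tj ET";
        System.out.println(extractStrings(content)); // prints "HelloWorld"
    }
}
```

The output runs the two words together because the line break exists only in the `Td` operator, which the extractor never looks at.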
The next chapter is named "Why iText doesn't do text extraction". iText in this version is limited when it comes to text extraction. In the end you have two possibilities:

  1. Upgrade to a new version of iText that provides better text extraction capabilities.

  2. If you must stick with version 2.1.7, have a look at PdfTextExtractor.java instead of what you are doing. Here is the code I found in this post:

        PdfReader reader = new PdfReader(yourInputStream);
        PdfTextExtractor extractor = new PdfTextExtractor(reader);
        int pageNumber = reader.getNumberOfPages();

        // Page numbers in iText start at 1
        for (int i = 1; i <= pageNumber; i++) {
            System.out.println("============page number " + i + "=============");
            String line = extractor.getTextFromPage(i);
            System.out.println(line);
        }

    But as you can see in the other post, depending on the PDF, the extraction doesn't work in this version...
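Since the question asks for a text file rather than console output, the per-page strings could be written to sample.txt along these lines. This is only a sketch using the standard library: the class `PageTextWriter` and the method `writePages` are hypothetical names, and the page strings are placeholders standing in for whatever `extractor.getTextFromPage(i)` returns. It closes the writer with try-with-resources and reports I/O errors instead of swallowing them, so a failure does not go unnoticed.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class PageTextWriter {

    // Writes a header line followed by the text of each page to the target file.
    // try-with-resources flushes and closes the writer even if a write fails.
    static void writePages(List<String> pages, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (int i = 0; i < pages.size(); i++) {
                writer.write("============ Page " + (i + 1) + " ============");
                writer.newLine();
                writer.write(pages.get(i));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the extracted page text.
        List<String> pages = Arrays.asList("text of page 1", "text of page 2");
        try {
            writePages(pages, Paths.get("sample.txt"));
        } catch (IOException e) {
            // Report the failure instead of ignoring it.
            e.printStackTrace();
        }
    }
}
```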

