How to extract parameter from pdf file using java code & pdfbox





I am writing a Java program that extracts parameters from PDF files. I would like to process a PDF and get parameters like these:



parameter:



parameter



I would like to get the output shown in the picture below:



convert text





So you want to extract the text from the PDF, and then count the occurrences?
– notyou
Jul 2 at 9:41





@notyou Yes. Do you know how?
– charlsalad
Jul 2 at 9:53





@notyou I am able to do it using pdfid in Kali Linux, but I have no idea how to do it in Java for my program.
– charlsalad
Jul 2 at 9:54





First off, what you call "parameters" is a mixture of syntactical elements (e.g. obj and endobj enveloping an indirect object) and PDF names (e.g. Pages for the type of an inner page tree node). Furthermore, it is not clear where you want to search for these texts: only in the raw file, or also inside encrypted or compressed streams?
– mkl
Jul 2 at 10:38







You mention the pdfid tool. It is meant to help identify malicious PDFs. Its author says "it will also generate false positives"... I'd say it will predominantly create false positives for common documents produced nowadays.
– mkl
Jul 2 at 10:52




1 Answer



Going by the comment above ("So you want to extract the text from the PDF, and then count the occurrences?"), you can do as follows:



Read the PDF file in:


import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// PDFBox 2.x API; the enclosing method must handle or declare IOException
String[] words = new String[0];
try (PDDocument document = PDDocument.load(new File("C:\\path\\to\\file.pdf"))) {
    if (!document.isEncrypted()) {
        PDFTextStripper tStripper = new PDFTextStripper();
        tStripper.setSortByPosition(true);
        String pdfFileInText = tStripper.getText(document);
        // Split on runs of whitespace to get the individual words
        words = pdfFileInText.trim().split("\\s+");
    }
}



And then print the occurrences of words:


Arrays.stream(words)
.collect(Collectors.groupingBy(s -> s))
.forEach((k, v) -> System.out.println(k + " " + v.size()));



You may need to tweak this slightly to your own needs.
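If what you are after is closer to what pdfid reports, i.e. counts of raw tokens such as obj, endobj or /Pages rather than words of the page text, text extraction will not give you that. Here is a minimal sketch that scans the raw bytes of the file instead, without PDFBox. The class name RawKeywordCounter, the keyword list, and the token-boundary rule are my own assumptions for illustration, not pdfid's actual logic. It also does not look inside compressed or encrypted streams, so anything stored in object streams will be missed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawKeywordCounter {

    // Illustrative keyword list (an assumption); extend it with the tokens you need.
    public static final String[] KEYWORDS =
            {"obj", "endobj", "stream", "endstream", "/Page", "/Pages"};

    public static Map<String, Integer> count(byte[] raw, String... keywords) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // Plain keywords must not be preceded by a letter or digit, so "obj"
            // is not counted inside "endobj". Names start with "/" and may
            // directly follow another name, so they get no such check.
            String before = keyword.startsWith("/") ? "" : "(?<![A-Za-z0-9])";
            // Nothing alphanumeric may follow, so "/Page" does not match "/Pages".
            Pattern pattern = Pattern.compile(
                    before + Pattern.quote(keyword) + "(?![A-Za-z0-9])");
            Matcher matcher = pattern.matcher(text);
            int n = 0;
            while (matcher.find()) {
                n++;
            }
            counts.put(keyword, n);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RawKeywordCounter <file.pdf>");
            return;
        }
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        count(raw, KEYWORDS).forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

Run it with the path to the PDF as the only argument. Because it matches bytes without understanding PDF syntax, it will also count tokens that merely appear inside uncompressed content or string data, so treat the numbers as rough indicators.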





Considering the nature of the search strings (syntactical PDF elements and PDF names), I doubt the OP wants to do text extraction.
– mkl
Jul 2 at 10:40





@mkl I thought that and I thought I got clarification from the OP in the comments. It may appear not, though.
– notyou
Jul 2 at 10:41





I think the OP himself actually is a bit unsure what he is about to do... The tool he mentions in a comment, pdfid, explicitly has been designed not to understand PDF syntax, so I wonder why the OP tries to reproduce its working using a library that does...
– mkl
Jul 2 at 10:57





