Following this solution, I am trying to get all the words inside a TextChunk along with the coordinates of each one (page number, top, bottom, left, right).
Since a TextChunk can be a phrase, a single word, or anything else, I tried to do this manually, relying on the last word's rectangle and cutting it each time. This manual approach turned out to be error-prone (I would have to account for special characters by hand, and so on), so I wonder whether iTextSharp provides an easier way to do this.
My Chunk class and my class derived from LocationTextExtractionStrategy are the following:
    public class Chunk
    {
        public Guid Id { get; set; }
        public Rectangle Rect { get; set; }
        public TextRenderInfo Render { get; set; }
        public BaseFont BF { get; set; }
        public string Text { get; set; }
        public int FontSize { get; set; }

        public Chunk(Rectangle rect, TextRenderInfo renderInfo)
        {
            this.Rect = rect;
            this.Render = renderInfo;
            this.Text = Render.GetText();
            Initialize();
        }

        public Chunk(Rectangle rect, TextRenderInfo renderInfo, string text)
        {
            this.Rect = rect;
            this.Render = renderInfo;
            this.Text = text;
            Initialize();
        }

        private void Initialize()
        {
            this.Id = Guid.NewGuid();
            this.BF = Render.GetFont();
            this.FontSize = ObtainFontSize();
        }

        //Estimate the font size from the rendered width of a single space
        private int ObtainFontSize()
        {
            return Convert.ToInt32(this.Render.GetSingleSpaceWidth() * 12 / this.BF.GetWidthPoint(" ", 12));
        }
    }

    public class LocationTextExtractionPersonalizada : LocationTextExtractionStrategy
    {
        //Save each coordinate
        public List<Chunk> ChunksInPage = new List<Chunk>();

        //Automatically called on each chunk on PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            if (renderInfo == null)
                return;

            //Get chunk Vectors
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create Rectangle based on previous Vectors
            var rect = new Rectangle(
                bottomLeft[Vector.I1],
                bottomLeft[Vector.I2],
                topRight[Vector.I1],
                topRight[Vector.I2]);

            if (rect == null)
                return;

            //Add each chunk with its coordinates
            ChunksInPage.Add(new Chunk(rect, renderInfo));
        }
    }

Once I have opened the file, I process it this way:
    private void ProcessContent()
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            var strategy = new LocationTextExtractionPersonalizada();
            var currentPageText = PdfTextExtractor.GetTextFromPage(
                pdfReader,
                page,
                strategy);

            //Here is where I want to get each word with its coordinates
            var chunksWords = ChunkRawToWord(strategy.ChunksInPage);
        }
    }

    private List<Chunk> ChunkRawToWord(IList<Chunk> chunks)

Afterwards, I commented on Mkl's solution and got the reply "use getCharacterRenderInfos()". I now call it and get every single character as a list of TextRenderInfo objects.
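To make the idea concrete, here is a rough sketch of how I understand the per-character approach: once every character has its own box, consecutive non-whitespace characters can be merged into one word whose rectangle is the union of their boxes. The names BoxedText, WordGrouper and GroupIntoWords are hypothetical and not part of iTextSharp; in the real code each box would presumably be built from one item of GetCharacterRenderInfos(), using the same descent-line/ascent-line points as in RenderText above. The sketch assumes the characters arrive in reading order on a single line, and it only splits on whitespace characters, so a PDF that positions words without emitting spaces would need an additional gap check.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper type (not an iTextSharp class): a piece of text
// together with its bounding box in page coordinates.
public sealed class BoxedText
{
    public string Text { get; }
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public BoxedText(string text, float left, float bottom, float right, float top)
    {
        Text = text;
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }
}

public static class WordGrouper
{
    // Merges consecutive non-whitespace characters into words.
    // Assumption: the characters are in reading order and on one line.
    public static List<BoxedText> GroupIntoWords(IEnumerable<BoxedText> chars)
    {
        var words = new List<BoxedText>();
        var text = new StringBuilder();
        float left = 0, bottom = 0, right = 0, top = 0;

        foreach (var c in chars)
        {
            if (string.IsNullOrWhiteSpace(c.Text))
            {
                // Whitespace closes the word collected so far, if any.
                if (text.Length > 0)
                {
                    words.Add(new BoxedText(text.ToString(), left, bottom, right, top));
                    text.Clear();
                }
                continue;
            }

            if (text.Length == 0)
            {
                // First character of a new word: start from its box.
                left = c.Left;
                bottom = c.Bottom;
                right = c.Right;
                top = c.Top;
            }
            else
            {
                // Later characters: grow the box to the union of both.
                left = Math.Min(left, c.Left);
                bottom = Math.Min(bottom, c.Bottom);
                right = Math.Max(right, c.Right);
                top = Math.Max(top, c.Top);
            }

            text.Append(c.Text);
        }

        // Flush the last word when the input does not end with whitespace.
        if (text.Length > 0)
            words.Add(new BoxedText(text.ToString(), left, bottom, right, top));

        return words;
    }
}
```

Keeping the grouping independent of the PDF types would at least let me test the word splitting on its own, with hand-made boxes, before wiring it into the extraction strategy.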
I'm sorry, but I'm starting to mix up concepts and the different ways of applying that solution, and my head is spinning.
I would really appreciate a hand here. Thanks in advance.