Branding and Brand Building: c# - Extract coordinates of each separate word into a TextChunk in a pdf file

Wednesday, 12 Mehr 1396

Following this solution, I am trying to get every word inside a TextChunk along with its coordinates (page, top, bottom, left, right).

Since a TextChunk can be a phrase, a single word, or anything in between, I first tried to do this manually: starting from the last word's rectangle and cutting it down each time. This manual method turned out to be error-prone (I would have to account for special characters by hand, and so on), so I wondered whether iTextSharp provides an easier way to do it.

My Chunk class and my class inheriting from LocationTextExtractionStrategy are the following:

// Namespaces used by the two classes below
using System;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class Chunk
{
    public Guid Id { get; set; }
    public Rectangle Rect { get; set; }
    public TextRenderInfo Render { get; set; }
    public BaseFont BF { get; set; }
    public string Text { get; set; }
    public int FontSize { get; set; }

    public Chunk(Rectangle rect, TextRenderInfo renderInfo)
    {
        this.Rect = rect;
        this.Render = renderInfo;
        this.Text = Render.GetText();
        Initialize();
    }

    public Chunk(Rectangle rect, TextRenderInfo renderInfo, string text)
    {
        this.Rect = rect;
        this.Render = renderInfo;
        this.Text = text;
        Initialize();
    }

    private void Initialize()
    {
        this.Id = Guid.NewGuid();
        this.BF = Render.GetFont();
        this.FontSize = ObtainFontSize();
    }

    //Estimate the font size by comparing the rendered width of a space
    //with the width the same font gives a space at 12 pt
    private int ObtainFontSize()
    {
        return Convert.ToInt32(this.Render.GetSingleSpaceWidth() * 12 / this.BF.GetWidthPoint(" ", 12));
    }
}

public class LocationTextExtractionPersonalizada : LocationTextExtractionStrategy
{
    //Save each coordinate
    public List<Chunk> ChunksInPage = new List<Chunk>();

    //Automatically called on each chunk on PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        if (renderInfo == null)
            return;

        //Let the base strategy collect the text so GetTextFromPage still returns it
        base.RenderText(renderInfo);

        //Get chunk Vectors
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create Rectangle based on previous Vectors
        var rect = new Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        //Add each chunk with its coordinates
        ChunksInPage.Add(new Chunk(rect, renderInfo));
    }
}

Once the file is loaded, I process it as follows:

private void ProcessContent()
{
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        var strategy = new LocationTextExtractionPersonalizada();

        var currentPageText = PdfTextExtractor.GetTextFromPage(
            pdfReader,
            page,
            strategy);

        //Here is where I want to get each word with its coordinates
        var chunksWords = ChunkRawToWord(strategy.ChunksInPage);
    }
}

private List<Chunk> ChunkRawToWord(IList<Chunk> chunks)
{
    //This is the part I am stuck on: splitting each chunk into words
    throw new NotImplementedException();
}

Afterwards, I commented on mkl's solution and was told to "use getCharacterRenderInfos()". I now call it and get every single character as a list of TextRenderInfo objects.

I'm sorry, but I'm starting to mix up concepts, and I can't work out how to apply that suggestion.

I would really appreciate a hand here. Thanks in advance.
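
For illustration, here is a minimal sketch of one way the per-character results could be merged back into words. It is not taken from mkl's solution and it does not call iTextSharp at all: CharBox, WordBox, WordGrouper and the gapTolerance parameter are hypothetical names made up for this example. It assumes the characters arrive in reading order and belong to a single line of text, and that each CharBox was filled from a character's descent-line start point and ascent-line end point, the same way the rectangle is built in RenderText above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one character and its bounding box.
public sealed class CharBox
{
    public string Text;
    public float Left, Bottom, Right, Top;

    public CharBox(string text, float left, float bottom, float right, float top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

// Hypothetical result type: one word and the union of its character boxes.
public sealed class WordBox
{
    public string Text = "";
    public float Left, Bottom, Right, Top;
}

public static class WordGrouper
{
    // Merges consecutive characters into words. A word ends at a whitespace
    // character or when the horizontal gap to the next character exceeds
    // gapTolerance. Assumes reading order and a single line of text.
    public static List<WordBox> Group(IEnumerable<CharBox> chars, float gapTolerance)
    {
        var words = new List<WordBox>();
        WordBox current = null;

        foreach (var c in chars)
        {
            bool isSpace = string.IsNullOrWhiteSpace(c.Text);
            bool bigGap = current != null && c.Left - current.Right > gapTolerance;

            // Close the word in progress at a space or a large gap.
            if ((isSpace || bigGap) && current != null)
            {
                words.Add(current);
                current = null;
            }

            if (isSpace)
                continue;

            if (current == null)
            {
                // Start a new word with this character's box.
                current = new WordBox
                {
                    Text = c.Text,
                    Left = c.Left, Bottom = c.Bottom, Right = c.Right, Top = c.Top
                };
            }
            else
            {
                // Extend the word and grow its box to cover the new character.
                current.Text += c.Text;
                current.Left = Math.Min(current.Left, c.Left);
                current.Bottom = Math.Min(current.Bottom, c.Bottom);
                current.Right = Math.Max(current.Right, c.Right);
                current.Top = Math.Max(current.Top, c.Top);
            }
        }

        if (current != null)
            words.Add(current);

        return words;
    }
}
```

With this shape, the body of ChunkRawToWord could build one CharBox per character of each chunk and pass the list to WordGrouper.Group. A suitable gapTolerance is a guess that depends on the document; a fraction of the single-space width might be a reasonable starting point. Text that spans several lines would need an extra check on the vertical position, which this sketch leaves out.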
