
12 Mehr 1396, Wednesday

c# - Extract coordinates of each separate word in a TextChunk in a pdf file




Following this actual solution, I am trying to get all the words inside a TextChunk together with the coordinates of each one (actual page, top, bottom, left, right).



Since a TextChunk can be a phrase, a single word or anything else, I tried to do this manually, starting from the last word's rectangle and cutting it down each time. I noticed that this manual method is very error-prone (I would have to account for special characters by hand, and so on), so I wondered whether iTextSharp provides an easier way to do this.



My Chunk class and my class derived from LocationTextExtractionStrategy are the following:



public class Chunk
{
    public Guid Id { get; set; }
    public Rectangle Rect { get; set; }
    public TextRenderInfo Render { get; set; }
    public BaseFont BF { get; set; }
    public string Text { get; set; }
    public int FontSize { get; set; }

    public Chunk(Rectangle rect, TextRenderInfo renderInfo)
    {
        this.Rect = rect;
        this.Render = renderInfo;
        this.Text = Render.GetText();
        Initialize();
    }

    public Chunk(Rectangle rect, TextRenderInfo renderInfo, string text)
    {
        this.Rect = rect;
        this.Render = renderInfo;
        this.Text = text;
        Initialize();
    }

    private void Initialize()
    {
        this.Id = Guid.NewGuid();
        this.BF = Render.GetFont();
        this.FontSize = ObtainFontSize();
    }

    //Estimate the font size from the rendered width of a single space
    private int ObtainFontSize()
    {
        return Convert.ToInt32(this.Render.GetSingleSpaceWidth() * 12 / this.BF.GetWidthPoint(" ", 12));
    }
}


public class LocationTextExtractionPersonalizada : LocationTextExtractionStrategy
{
    //Save each coordinate
    public List<Chunk> ChunksInPage = new List<Chunk>();

    //Automatically called on each chunk on PDF
    public override void RenderText(TextRenderInfo renderInfo)
    {
        if (renderInfo == null)
            return;

        //Get chunk Vectors
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create Rectangle based on previous Vectors
        var rect = new Rectangle(
            bottomLeft[Vector.I1],
            bottomLeft[Vector.I2],
            topRight[Vector.I1],
            topRight[Vector.I2]);

        if (rect == null)
            return;

        //Add each chunk with its coordinates
        ChunksInPage.Add(new Chunk(rect, renderInfo));
    }
}


Once I have opened the file, I proceed this way:



private void ProcessContent()
{
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        var strategy = new LocationTextExtractionPersonalizada();

        var currentPageText = PdfTextExtractor.GetTextFromPage(
            pdfReader,
            page,
            strategy);

        //Here is where I want to get each word with its coordinates
        var chunksWords = ChunkRawToWord(strategy.ChunksInPage);
    }
}

private List<Chunk> ChunkRawToWord(IList<Chunk> chunks)
{
    //Not implemented yet: this is where I am stuck
}



Afterwards, I left a comment on Mkl's solution and was told to "use getCharacterRenderInfos()". I now use it, and I get every single character as a list of TextRenderInfo objects.
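
To make the per-character idea concrete, here is a minimal sketch of how such a character list might be grouped into words. It deliberately does not call iTextSharp. CharBox, WordBox and WordGrouper are hypothetical names introduced only for this illustration, and the sketch assumes the characters arrive in reading order and that words are separated by explicit whitespace characters, which is not guaranteed for every PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one character and its box. With iTextSharp the
// values would presumably be read from each per-character TextRenderInfo
// (descent line start point and ascent line end point, as in RenderText).
public class CharBox
{
    public string Text;
    public float Left, Bottom, Right, Top;

    public CharBox(string text, float left, float bottom, float right, float top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

// Hypothetical result type: one word plus the union of its character boxes.
public class WordBox
{
    public string Text;
    public float Left, Bottom, Right, Top;
}

public static class WordGrouper
{
    // Splits a run of characters into words at whitespace and
    // accumulates the bounding box of each word.
    public static List<WordBox> GroupIntoWords(IEnumerable<CharBox> chars)
    {
        var words = new List<WordBox>();
        var text = new StringBuilder();
        float left = 0, bottom = 0, right = 0, top = 0;

        foreach (var c in chars)
        {
            if (string.IsNullOrWhiteSpace(c.Text))
            {
                // Whitespace closes the current word (if any).
                Flush(words, text, left, bottom, right, top);
                continue;
            }

            if (text.Length == 0)
            {
                // First character of a new word: start from its box.
                left = c.Left; bottom = c.Bottom; right = c.Right; top = c.Top;
            }
            else
            {
                // Later characters: grow the box to cover them.
                left = Math.Min(left, c.Left);
                bottom = Math.Min(bottom, c.Bottom);
                right = Math.Max(right, c.Right);
                top = Math.Max(top, c.Top);
            }
            text.Append(c.Text);
        }

        // Close the last word if the run did not end with whitespace.
        Flush(words, text, left, bottom, right, top);
        return words;
    }

    private static void Flush(List<WordBox> words, StringBuilder text,
                              float left, float bottom, float right, float top)
    {
        if (text.Length == 0) return;
        words.Add(new WordBox
        {
            Text = text.ToString(),
            Left = left, Bottom = bottom, Right = right, Top = top
        });
        text.Clear();
    }
}
```

Under these assumptions, RenderText would build one CharBox per character and pass the list to GroupIntoWords. The sketch does not handle a word that is split across several chunks, or text whose spacing comes from positioning rather than from space characters. Both cases would need an extra check on the horizontal gap between neighbouring boxes.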



I'm sorry, but I'm starting to mix up the concepts and the different ways of applying that solution, and it is making my head spin.



I would really appreciate a hand here. Thanks in advance.




