How to Extract Text Contents from PDF (part 1/3)

Uploaded by yusukeshinyama on Mar 22, 2010

Demonstrates how to extract the text contents of a PDF by hand, using only basic UNIX tools.

PDFMiner (PDF extraction tool in Python):
http://www.unixuser.org/~euske/python/pdfminer/
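
To give a rough idea of what extracting text "by hand" involves, here is a minimal Python sketch. It is not the procedure shown in the video and it is not how PDFMiner works. It assumes the text lives in Flate-compressed content streams and is shown with the `Tj` operator as plain literal strings. The function name `extract_text_from_pdf_bytes` and the sample data are made up for illustration.

```python
import re
import zlib


def extract_text_from_pdf_bytes(data):
    """Return the literal strings shown by Tj operators in Flate streams.

    Assumptions (hypothetical, for illustration only): streams are
    compressed with FlateDecode, strings contain no escaped or nested
    parentheses, and fonts use a simple single-byte encoding.
    """
    texts = []
    # A stream body sits between the "stream" and "endstream" keywords.
    for match in re.finditer(rb"stream\r?\n(.*?)\nendstream", data, re.DOTALL):
        try:
            content = zlib.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or not readable this way
        # "(Hello) Tj" shows the literal string "Hello".
        for literal in re.findall(rb"\((.*?)\)\s*Tj", content):
            texts.append(literal.decode("latin-1"))
    return texts


# Build a tiny stand-in for one PDF object so the sketch runs without a file.
sample_stream = zlib.compress(b"BT /F1 12 Tf (Hello, PDF) Tj ET")
sample_pdf = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(sample_stream)
    + sample_stream
    + b"\nendstream\nendobj\n"
)

print(extract_text_from_pdf_bytes(sample_pdf))
```

Real PDFs often use other filters, the `TJ` operator, and custom font encodings, so a sketch like this only covers the simplest files.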

Category: Education

License: Standard YouTube License

Uploader Comments (yusukeshinyama)

  • How do I extract lines or circles instead of text?

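  • @linus19741018 The page's content stream contains commands such as m and S. These are the drawing instructions for shapes, so extracting them is all you need to do. PDF uses the same drawing model as PostScript and does not distinguish between straight lines, curves, and circles. All of these are represented as cubic Bézier curves. However, there are also color settings, line widths, clipping, and so on, so actually rendering the image from the extracted data is quite hard.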
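
The reply above names `m` and `S` as examples of drawing commands in a page's content stream. As a rough sketch of pulling such commands out, the Python below scans an already-decompressed content stream. The operators `l` (straight segment), `c` (cubic Bézier segment), and `h` (close path) are taken from the PDF specification, not from the comment. The function name `extract_path_commands` and the sample stream are hypothetical, and the sketch assumes the stream holds only whitespace-separated numbers and operator names.

```python
# Operand counts for a small subset of PDF path operators.
PATH_OPERATORS = {
    "m": 2,  # x y m : begin a new subpath at (x, y)
    "l": 2,  # x y l : straight segment to (x, y)
    "c": 6,  # x1 y1 x2 y2 x3 y3 c : cubic Bezier segment
    "h": 0,  # close the current subpath
    "S": 0,  # stroke the path
}


def extract_path_commands(content):
    """Return (operator, operands) pairs for the path operators above.

    Assumption: tokens are separated by whitespace and there are no
    strings, arrays, or dictionaries in the stream.
    """
    commands = []
    operands = []
    for token in content.split():
        if token in PATH_OPERATORS:
            count = PATH_OPERATORS[token]
            args = operands[-count:] if count else []
            commands.append((token, args))
            operands = []
        else:
            try:
                operands.append(float(token))
            except ValueError:
                operands = []  # some other operator: drop its operands
    return commands


# Hypothetical content stream: set a stroke color, then draw a line and a curve.
sample = "1 0 0 RG 0 0 m 100 0 l 100 50 150 100 200 100 c S"
for operator, args in extract_path_commands(sample):
    print(operator, args)
```

This only recovers the path geometry. As the reply notes, color, line width, and clipping state would also have to be tracked to draw the result.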

All Comments (7)

  • Can you please use this to fix my automobile insurance papers? My insurer forgot to send me the current ones and the date has expired. I need them to get out of my ticket. I have to go to court in six hours.

  • @66georgs

    oh, did PDF mean fun to you until you watched this vid?

  • Thanks for demonstration! What keyboard are you using? :)

  • @yusukeshinyama

    Sorry for the late thanks. So they are stored as Bézier curves. I wish there were a parser, or a tool that converts them to a text-based vector format.

  • very boring video..
