File search

About

File search indexes the text inside files (.txt, .sql, Excel, .pptx, .docx, .pdf, ...) in a flexible way, for:

  • Document classification.
  • File and folder content search, e.g. finding old code, sheet names in Excel files, columns used in a database, attachments in .msg files, or comments written in .pptx files.
  • Preparing a text corpus selectively (e.g. extracting only the top headings of PowerPoint slides) for LLM training.

Structure

File search scans the major file types and extracts their text into a corpus, giving you control over which parts of that corpus are used for further analysis.
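To make the scanning idea concrete, here is a minimal sketch of a corpus builder. It is not the project's implementation: the names `build_corpus`, `EXTRACTORS`, and `read_plain_text` are hypothetical, and only plain-text types (.txt, .sql) are handled. The assumption is that richer formats could be supported by registering one extractor function per extension.

```python
from pathlib import Path


def read_plain_text(path: Path) -> str:
    """Read a file as UTF-8 text, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry: file extension -> function that returns the file's text.
# Extractors for .docx, .pptx, .pdf, ... would be added here (assumption).
EXTRACTORS = {
    ".txt": read_plain_text,
    ".sql": read_plain_text,
}


def build_corpus(root, extractors=EXTRACTORS):
    """Walk `root` recursively and return one record per supported file."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        extractor = extractors.get(ext)
        if extractor is None:
            continue  # unsupported type: skip it rather than guess
        records.append({"path": str(path), "ext": ext, "text": extractor(path)})
    return records
```

Keeping the extractors in a dictionary means the choice of which file types (and which parts of each file) enter the corpus stays in one place.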

Demo

  • example\fs_build.py: Demo that builds a text corpus from major file types.
  • example\fs_query.py: Functions to explore the text corpus.
  • example\fs_classification.py: Demo that parses .sql code from the corpus and combines it with the database information schema to classify and organize a .sql codebase.
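As a rough illustration of the query and classification ideas, the sketch below searches a corpus with a regular expression and tags a SQL script with the known tables it mentions. It does not reproduce the demo scripts: `search_corpus` and `classify_sql` are hypothetical names, the corpus is assumed to be a list of dictionaries with `path` and `text` keys, and `schema_tables` stands in for table names read from a database information schema. The matching is naive token matching, not a SQL parser.

```python
import re


def search_corpus(records, pattern):
    """Return the paths of records whose text matches `pattern` (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [r["path"] for r in records if regex.search(r["text"])]


def classify_sql(sql_text, schema_tables):
    """Return the sorted known table names that appear as identifiers in `sql_text`.

    `schema_tables` is assumed to come from the database information schema.
    Identifiers are matched as whole tokens, ignoring case.
    """
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql_text.lower()))
    return sorted(t for t in schema_tables if t.lower() in tokens)


# Example usage with a tiny in-memory corpus (illustrative data only).
records = [
    {"path": "a.sql", "text": "SELECT * FROM Orders o JOIN customers c ON o.cid = c.id"},
    {"path": "b.txt", "text": "meeting notes"},
]
matches = search_corpus(records, r"\bjoin\b")
tables = classify_sql(records[0]["text"], {"orders", "customers", "invoices"})
```

Grouping scripts by the tables they reference is one simple way to organize a .sql codebase; a real classifier would likely need to handle comments, string literals, and quoted identifiers.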