File search¶
About¶
To index text inside files (.txt, .sql, .excel, .pptx, .docx, .pdf...) in a flexible way, for:
- Document classification.
- File and folder content search, e.g. finding old code, names of sheets in excel files, columns used in database, attachement in .msg files, comments written in .pptx files.
- Preparing text corpus in a selective manner (e.g. extract only the top headings of powerpoint slides...) for LLM training
Structure¶
File search is a scan of major file types to give control over text corpus usage for further analysis.
Demo¶
example\fs_build.py: Demo to build a text corpus of major file types.example\fs_query.py: Functions to explore the text corpus.
Latest updates¶
Currently no major plans to make new changes due to busy schedule. The code here is relatively latest with fixes for effeciency compared to the analytics_tasks PyPi pacakge.