Doctor is a microservice for extracting text from documents and converting audio files.
As part of building CourtListener, we have spent years optimizing our document extraction and audio conversion pipelines. Doctor is the culmination of this work, and its functionality includes:
- Extracting text from documents, including WPD, PDF, DOC, DOCX, RTF, and more.
- Completing optimized OCR extraction on image-based PDFs.
- Getting page counts from different document types.
- Converting audio files from WMA, OGG, WAV, and others to MP3.
- Making a PDF from images.
- Creating thumbnails from PDFs.

Doctor is designed to scale while providing performant, high-quality results. It can be scaled horizontally via a multi-worker or orchestrated single-worker model.
The code in Doctor has processed tens of millions of documents and over 2.5 million minutes of audio.
Organization Type: | Non-profit / charity / foundation |
---|---|
Status: | Active |
Parent Organization: | Free Law Project |
Open Source: | Yes |
Last Modified: | 11/19/2024 |
Added on: | 11/19/2024 |