Skip to content

2. Creating a Custom Extractor

If you need to support a programming language not yet implemented by CodableLLM, you can easily extend the framework by creating your own custom extractor.

All custom extractors must subclass Extractor, an abstract base class that defined the interface for parsing source files and returning extracted functions.

Methods to Implement

When extending Extractor, you must implement two methods:

  1. extract: Parses a given source code file and returns a sequence of SourceFunction instances.
  2. get_extractable_files: Given a directory or file path, returns all files that the extractor can process. This typically involves filtering by file extensions.

Example: Creating a Custom Language Extractor

Below is a template for defining your own extractor. The implementation of each method will depend on how you choose to parse and extract functions for your language (e.g. using Tree-sitter, regex, or another parser).

Creating Your Custom Extractor Class

from pathlib import Path
from typing import Sequence

from codablellm.core.extractor import Extractor
from codablellm.core.function import SourceFunction

class CustomLanguageExtractor(Extractor):
    '''
    Extractor for CustomLanguage source code files.
    '''

    def extract(self, file_path: Path | str, repo_path: Path | str | None = None) -> Sequence[SourceFunction]:
        # Implementation goes here: parse the file and return SourceFunction objects
        pass

    def get_extractable_files(self, path: Path | str) -> Sequence[Path]:
        # Implementation goes here: return a list of extractable source files (e.g., based on file extensions)
        pass

Registering Your Custom Extractor

Once your extractor is implemented, you must register it with CodableLLM's extractor:

from codablellm import extractor

extractor.add_extractor(
    'CustomLanguage',
        'classpath.to.CustomLanguageExtractor'  # Replace with your actual import path
)