
4. Configuration Overview

CodableLLM is configured through dataclasses whose fields have default values, so you only need to set the options you want to change. These configurations control how repositories are built, how source functions are extracted, and how binaries are decompiled and mapped.

This section provides a high-level overview of the key configuration classes and examples of how to use them.
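
To make the "dataclasses with default values" pattern concrete, here is a minimal sketch using only the standard library. The classes `ExampleExtractConfig` and `ExampleDatasetConfig` and their fields are hypothetical stand-ins chosen for illustration; they are not CodableLLM's actual definitions.

```python
from dataclasses import dataclass, field, replace
from typing import Optional, Set


@dataclass(frozen=True)
class ExampleExtractConfig:
    """Hypothetical stand-in, not a CodableLLM class."""
    exclude_subpaths: Set[str] = field(default_factory=set)
    max_workers: Optional[int] = None


@dataclass(frozen=True)
class ExampleDatasetConfig:
    """Hypothetical stand-in that nests another config object."""
    strip: bool = False
    extract_config: ExampleExtractConfig = field(
        default_factory=ExampleExtractConfig
    )


# No arguments: every field falls back to its default.
defaults = ExampleDatasetConfig()

# Override only the fields that differ from the defaults.
custom = ExampleDatasetConfig(
    strip=True,
    extract_config=ExampleExtractConfig(exclude_subpaths={'tests'}),
)

# dataclasses.replace() copies a config while changing selected fields.
variant = replace(custom, strip=False)
```

Nesting one config inside another, as sketched here, mirrors how the examples in this section pass an `ExtractConfig` or `DecompileConfig` to `DecompiledCodeDatasetConfig`.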

ManageConfig

ManageConfig defines settings for managing the build and optional cleanup process of a repository.

from codablellm import compile_dataset, ManageConfig

dataset = compile_dataset(
    'path/to/demo-c-repo',              # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                  # Binaries that will be built and decompiled
    'make',                             # Command/script path used to build the repository
    manage_config=ManageConfig(
        cleanup_command='make clean',   # Command/script path used to clean the repository after build
        cleanup_error_handling='ignore' # How to handle errors during cleanup (e.g., ignore or none)
    )
)

SourceCodeDatasetConfig

SourceCodeDatasetConfig controls how source code functions are extracted and how the dataset is generated. It also lets you specify the DatasetGenerationMode, which determines whether to use the repository directly or work from a temporary copy:

from codablellm import create_source_dataset, SourceCodeDatasetConfig

dataset = create_source_dataset(
    'path/to/demo-c-repo',          # Path to the repository
    config=SourceCodeDatasetConfig(
        generation_mode='path'      # Generates the dataset directly from the local repository path
    )
)

DecompiledCodeDatasetConfig

DecompiledCodeDatasetConfig controls how binaries are decompiled and how decompiled functions are mapped to source functions. You can also define a custom Mapper function to control matching behavior:

from codablellm import compile_dataset, DecompiledFunction, \
    DecompiledCodeDatasetConfig, SourceFunction

def custom_mapper(decompiled: DecompiledFunction,
                  source: SourceFunction) -> bool:
    # Example: case-insensitive function name matching
    return decompiled.name.casefold() == source.name.casefold()

dataset = compile_dataset(
    'path/to/demo-c-repo',                      # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                          # Binaries that will be built and decompiled
    'make',                                     # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        strip=True,                             # Strip symbols from decompiled functions
        mapping=custom_mapper                   # Custom mapping logic between decompiled and source functions
    )
)
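
Because a mapper is just a function that returns a bool, its logic can be checked in isolation before a full build and decompile run. The sketch below does that with `FakeFunction`, a hypothetical stand-in that exposes only the `name` attribute used by the mapper above; `normalized_mapper` is an illustrative variation, not part of CodableLLM.

```python
from dataclasses import dataclass


@dataclass
class FakeFunction:
    """Hypothetical stand-in exposing only the `name` attribute."""
    name: str


def custom_mapper(decompiled: FakeFunction, source: FakeFunction) -> bool:
    # Same logic as the mapper above: case-insensitive name matching
    return decompiled.name.casefold() == source.name.casefold()


def normalized_mapper(decompiled: FakeFunction, source: FakeFunction) -> bool:
    # Variation: also ignore leading underscores, which some toolchains
    # prepend to symbol names
    def normalize(name: str) -> str:
        return name.lstrip('_').casefold()

    return normalize(decompiled.name) == normalize(source.name)


print(custom_mapper(FakeFunction('Parse_Args'), FakeFunction('parse_args')))       # True
print(normalized_mapper(FakeFunction('_parse_args'), FakeFunction('parse_args')))  # True
print(custom_mapper(FakeFunction('_parse_args'), FakeFunction('parse_args')))      # False
```

A stricter or looser `normalize` changes how many decompiled functions pair with a source function, so it is worth trying a mapper against a few representative names first.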

ExtractConfig

ExtractConfig controls how source code functions are extracted from repositories. It lets you include or exclude certain paths, manage checkpointing, and apply custom transformations.

from codablellm import compile_dataset, DecompiledCodeDatasetConfig, ExtractConfig

dataset = compile_dataset(
    'path/to/demo-c-repo',                      # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                          # Binaries that will be built and decompiled
    'make',                                     # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(
            exclude_subpaths={'tests'}          # Exclude specific subpaths during source extraction
        )
    )
)
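
As a rough illustration of what excluding a subpath means, the sketch below filters a list of file paths by directory prefix. The helper `filter_excluded` is hypothetical, and resolving each subpath relative to the repository root is an assumption made for this example; CodableLLM's own handling of `exclude_subpaths` may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def filter_excluded(files: Iterable[str], repo_root: str,
                    exclude_subpaths: Set[str]) -> List[str]:
    """Drop files located under any excluded subpath of repo_root."""
    root = PurePosixPath(repo_root)
    excluded = [root / sub for sub in exclude_subpaths]
    kept = []
    for f in files:
        path = PurePosixPath(f)
        # Skip the file when it is, or lies inside, an excluded directory
        if any(ex == path or ex in path.parents for ex in excluded):
            continue
        kept.append(f)
    return kept


files = [
    'repo/src/main.c',
    'repo/tests/test_main.c',
    'repo/tests/unit/helper.c',
    'repo/src/tests_util.c',
]
print(filter_excluded(files, 'repo', {'tests'}))
# ['repo/src/main.c', 'repo/src/tests_util.c']
```

The comparison works on whole path components, so `repo/src/tests_util.c` is kept even though its file name starts with `tests`.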

DecompileConfig

DecompileConfig controls how binaries are decompiled, including concurrency and timeout settings.

from codablellm import compile_dataset, DecompileConfig, DecompiledCodeDatasetConfig

dataset = compile_dataset(
    'path/to/demo-c-repo',                      # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                          # Binaries that will be built and decompiled
    'make',                                     # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        decompile_config=DecompileConfig(
            max_workers=1                       # Limit parallel decompilation to one binary at a time
        )
    )
)