4. Configuration Overview
CodableLLM is configured through dataclasses with default values, so you only need to set the options you want to change. These configurations control how repositories are built, how source functions are extracted, and how binaries are decompiled and mapped.
This section provides a high-level overview of the key configuration classes and examples of how to use them.
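To make the "dataclasses with default values" idea concrete, here is a minimal sketch of the pattern in plain Python. `ExampleConfig` and its fields are hypothetical and are not part of CodableLLM; they only illustrate how defaults and overrides behave.

```python
from dataclasses import dataclass, field, replace
from typing import Set


@dataclass(frozen=True)
class ExampleConfig:
    """Hypothetical config class; not a CodableLLM class."""
    max_workers: int = 4
    exclude_subpaths: Set[str] = field(default_factory=set)


# Fields you do not set keep their defaults.
default_cfg = ExampleConfig()
# Override only the field you care about.
custom_cfg = ExampleConfig(max_workers=1)
# replace() copies a config while changing selected fields.
tuned_cfg = replace(custom_cfg, exclude_subpaths={'tests'})
```

The real configuration classes described below are used the same way: construct one, pass only the fields you want to change, and hand it to the relevant function.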
ManageConfig
ManageConfig defines settings for managing the build and optional cleanup process of a repository.
from codablellm import compile_dataset, ManageConfig
dataset = compile_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],  # Binaries that will be built and decompiled
    'make',  # Command/script path used to build the repository
    manage_config=ManageConfig(
        cleanup_command='make clean',  # Command/script path used to clean the repository after build
        cleanup_error_handling='ignore'  # How to handle errors during cleanup (e.g., ignore or none)
    )
)
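The difference between ignoring and not ignoring cleanup errors can be sketched without CodableLLM at all. The helper below is hypothetical and is not the library's implementation. It also assumes that 'none' means "no special handling", so a failed cleanup raises an error; the library may define it differently.

```python
import subprocess
import sys
from typing import List


def run_cleanup(command: List[str], error_handling: str = 'none') -> bool:
    """Hypothetical helper: run a cleanup command and return True on success."""
    result = subprocess.run(command, capture_output=True)
    if result.returncode == 0:
        return True
    if error_handling == 'ignore':
        # Swallow the failure so the caller can continue.
        return False
    # Assumed meaning of 'none': let the failure propagate.
    raise RuntimeError(f'cleanup failed with exit code {result.returncode}')


# A stand-in "cleanup command" that always fails, used only for demonstration.
failing = [sys.executable, '-c', 'import sys; sys.exit(3)']
ignored = run_cleanup(failing, error_handling='ignore')
```

Ignoring cleanup errors is convenient when the dataset has already been produced and a failing `make clean` should not discard it.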
SourceCodeDatasetConfig
SourceCodeDatasetConfig controls how source code functions are extracted and how the dataset is generated. It also lets you specify the DatasetGenerationMode, which determines whether to use the repository directly or work from a temporary copy:
from codablellm import create_source_dataset, SourceCodeDatasetConfig
dataset = create_source_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    config=SourceCodeDatasetConfig(
        generation_mode='path'  # Generates the dataset directly from the local repository path
    )
)
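The trade-off between working in place and working from a temporary copy can be illustrated with the standard library. The `working_tree` helper below is a hypothetical sketch, not CodableLLM code. Only the 'path' value comes from the example above; the copy branch stands in for whatever temporary-copy mode the library provides.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def working_tree(repo: Path, mode: str = 'path') -> Iterator[Path]:
    """Hypothetical helper: yield the directory that extraction should read."""
    if mode == 'path':
        # Work on the repository in place.
        yield repo
        return
    # Any other mode: copy the repository and work on the disposable copy.
    with tempfile.TemporaryDirectory() as tmp:
        copy = Path(tmp) / repo.name
        shutil.copytree(repo, copy)
        yield copy
```

Working in place avoids the cost of a copy, while a temporary copy protects the original tree from anything the process modifies.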
DecompiledCodeDatasetConfig
DecompiledCodeDatasetConfig controls how binaries are decompiled and how decompiled functions are mapped to source functions. You can also define a custom Mapper function to control matching behavior:
from codablellm import (compile_dataset, DecompiledFunction,
                        DecompiledCodeDatasetConfig, SourceFunction)
def custom_mapper(decompiled: DecompiledFunction,
                  source: SourceFunction) -> bool:
    # Example: case-insensitive function name matching
    return decompiled.name.casefold() == source.name.casefold()
dataset = compile_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],  # Binaries that will be built and decompiled
    'make',  # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        strip=True,  # Strip symbols from decompiled functions
        mapping=custom_mapper  # Custom mapping logic between decompiled and source functions
    )
)
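To see what a mapper predicate like `custom_mapper` decides, it can help to run the same rule on stand-in objects. In the sketch below, `Fn` and `pair_up` are hypothetical and exist only for illustration. They are not CodableLLM types, and the library's actual pairing logic may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Fn:
    """Hypothetical stand-in exposing only a name attribute."""
    name: str


def casefold_mapper(decompiled: Fn, source: Fn) -> bool:
    # Same rule as custom_mapper: case-insensitive name equality.
    return decompiled.name.casefold() == source.name.casefold()


def pair_up(decompiled: List[Fn], sources: List[Fn],
            mapper: Callable[[Fn, Fn], bool]) -> List[Tuple[str, str]]:
    """Return (decompiled name, source name) for every pair the mapper accepts."""
    return [(d.name, s.name)
            for d in decompiled for s in sources if mapper(d, s)]


pairs = pair_up([Fn('MAIN'), Fn('FUN_00101149')],
                [Fn('main'), Fn('helper')],
                casefold_mapper)
# pairs == [('MAIN', 'main')]
```

Note that a name-based mapper has nothing to match on when the decompiler assigns placeholder names, as the unmatched `FUN_00101149` entry shows.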
ExtractConfig
ExtractConfig controls how source code functions are extracted from repositories. This allows you to include or exclude certain paths, manage checkpointing, and apply custom transformations.
from codablellm import (compile_dataset, DecompiledCodeDatasetConfig,
                        ExtractConfig)
dataset = compile_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],  # Binaries that will be built and decompiled
    'make',  # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(
            exclude_subpaths={'tests'}  # Exclude specific subpaths during source extraction
        )
    )
)
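Subpath exclusion can be pictured as a filter over file paths. The `is_excluded` helper below is a hypothetical sketch. It assumes excluded subpaths are interpreted relative to the repository root, which the text above does not confirm.

```python
from pathlib import PurePosixPath
from typing import Set


def is_excluded(path: PurePosixPath, repo: PurePosixPath,
                excluded: Set[str]) -> bool:
    """Hypothetical helper: True if path lies at or under an excluded subpath."""
    rel = path.relative_to(repo)
    for sub in excluded:
        sub_path = PurePosixPath(sub)
        # Match the subpath itself or anything nested below it.
        if rel == sub_path or sub_path in rel.parents:
            return True
    return False


repo = PurePosixPath('repo')
files = [
    PurePosixPath('repo/src/main.c'),
    PurePosixPath('repo/tests/test_main.c'),
    PurePosixPath('repo/tests_util.c'),
]
kept = [str(f) for f in files if not is_excluded(f, repo, {'tests'})]
# kept == ['repo/src/main.c', 'repo/tests_util.c']
```

Comparing whole path components rather than string prefixes keeps a file such as `tests_util.c` from being excluded by accident.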
DecompileConfig
DecompileConfig controls how binaries are decompiled, including concurrency and timeout settings.
from codablellm import (compile_dataset, DecompileConfig,
                        DecompiledCodeDatasetConfig)
dataset = compile_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],  # Binaries that will be built and decompiled
    'make',  # Command/script path used to build the repository
    dataset_config=DecompiledCodeDatasetConfig(
        decompile_config=DecompileConfig(
            max_workers=1  # Limit parallel decompilation to one binary at a time
        )
    )
)
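The effect of a worker limit can be sketched with Python's `concurrent.futures`. This is an illustration only: `fake_decompile` and `decompile_all` are hypothetical, and the text above does not say whether CodableLLM uses threads, processes, or something else internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_decompile(binary: str) -> str:
    # Placeholder for real decompilation work.
    return f'decompiled:{binary}'


def decompile_all(binaries: List[str], max_workers: int = 1) -> List[str]:
    """Hypothetical helper: process binaries with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order regardless of completion order.
        return list(pool.map(fake_decompile, binaries))


results = decompile_all(['main_app', 'tool'], max_workers=1)
# results == ['decompiled:main_app', 'decompiled:tool']
```

With `max_workers=1` the binaries are handled one after another. A larger value trades memory and CPU for throughput, which matters because decompilers tend to be resource-hungry.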