Skip to content

5. Transforming Extracted Functions

One of the main features of CodableLLM is the ability to apply custom transformation to extracted source code functions using a Transform. This allows you to preprocess, augment, or modify functions before they are included in a dataset. These transformations are applied through the transform field in the ExtractConfig.

The transform Field

The transform field accepts a callable function that receives each extracted SourceFunction and returns a modified version:

from codablellm import compile_dataset, ExtractConfig, SourceFunction

def add_comment(source: SourceFunction) -> SourceFunction:
    declaration, *definition = source.definition.splitlines()
    return source.with_definition('\n'.join([declaration,
                                            '// Transformed by CodableLLM',
                                            *definition]))

dataset = compile_dataset(
    'path/to/demo-c-repo',
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],
    'make',
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(
            transform=add_comment
        )
    )
)

Common uses for transform:

  • Injecting custom code or comments
  • Redacting or removing certain patterns
  • Normalizing formatting and style
  • Applying static modifications across an entire dataset

Integrating LLMs for Automated Code Modification

The transform field can also be used to integrate LLMs into your workflow. For example, you could send each extracted function to an LLM prompt to rewrite comments, rename variables, or enhance documentation before including it in your dataset.

⚠️ Note: LLM-based transforms can be resource-intensive and are recommended for smaller, curated datasets or offline batch jobs.

Fine-Grained AST Transformations

For advanced use cases, CodableLLM provides an ASTEditor utility. This allows you to perform precise structural modifications to source code using a Tree-sitter AST parser — without relying on fragile text-based modifications:

from codablellm import compile_dataset, ExtractConfig, SourceFunction
from codablellm.languages.c import CEXtractor
from codablellm.core.utils import ASTEditor

GET_COMMENTS_QUERY = '(comment) @comment'

def remove_comments(source: SourceFunction) -> SourceFunction:
    editor = ASTEditor(CEXtractor.PARSER, source.definition)
    editor.match_and_edit(GET_COMMENTS_QUERY,
                          {'comment': ''})
    return source.with_definition(editor.source_code)

dataset = compile_dataset(
    'path/to/demo-c-repo',
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],
    'make',
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(
            transform=remove_comments
        )
    )
)

Why This Matters

Being able to transform extracted functions opens up advanced dataset creation workflows, allowing you to:

  • Clean, standardize, and enrich functions before using them for training
  • Pre-process legacy code for consistency
  • Create datasets that compare original vs. modified functions for LLM fine-tuning
  • Perform structural edits without writing manual regexes or string manipulation logic