Skip to content

3. Methods for Creating Datasets

CodableLLM provides three primary methods for creating datasets, each tailored for different workflows. This section offers a high-level overview of these methods and guidance on when to use each one.

create_source_dataset

Use this function when you only want to extract source code functions from a repository, without dealing with binaries or decompilation.

import codablellm

dataset = codablellm.create_source_dataset('path/to/demo-c-repo')

create_decompiled_dataset

Use this function when you already have compiled binaries and want to create a dataset that maps decompiled functions to source functions. Unlike compile_dataset, this function does not run a build command — it assumes that the repository has already been built and that the binaries are ready:

import codablellm

dataset = codablellm.create_decompiled_dataset(
    'path/to/demo-c-repo',              # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                  # Binaries that will be decompiled
)

compile_dataset

This is the most comprehensive function and the one that was used in the Quickstart guide — it automates the process of building a repository, extracting source functions, decompiling binaries, and mapping decompiled functions to their possible source code function matches:

import codablellm

dataset = codablellm.compile_dataset(
    'path/to/demo-c-repo',              # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],                                  # Binaries that will be built and decompiled
    'make',                             # Command/script path used to build the repository
)