3. Methods for Creating Datasets
CodableLLM provides three primary methods for creating datasets, each tailored for different workflows. This section offers a high-level overview of these methods and guidance on when to use each one.
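As a rough decision aid, the guidance in this section can be sketched as a small function. The helper `choose_dataset_method` and its two parameters are hypothetical names used only for illustration; they are not part of CodableLLM.

```python
def choose_dataset_method(want_decompiled: bool, already_built: bool) -> str:
    """Return the name of the CodableLLM function that fits a workflow.

    Hypothetical helper for illustration only; not part of CodableLLM.
    """
    if not want_decompiled:
        # Only source functions are needed: no binaries, no decompilation.
        return "create_source_dataset"
    if already_built:
        # Binaries already exist, so no build command has to run.
        return "create_decompiled_dataset"
    # The repository still has to be built before decompiling.
    return "compile_dataset"
```

For example, `choose_dataset_method(want_decompiled=True, already_built=True)` returns `"create_decompiled_dataset"`.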
create_source_dataset
Use this function when you only want to extract source code functions from a repository, without dealing with binaries or decompilation.
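A minimal sketch of a source-only workflow might look like the following. It assumes that `create_source_dataset` accepts the repository path as its first positional argument, mirroring the `create_decompiled_dataset` call shown in this section; check the API reference for the exact signature. The wrapper name `extract_source_functions` is hypothetical.

```python
def extract_source_functions(repo_path: str):
    """Build a source-only dataset for the repository at repo_path.

    Hypothetical wrapper; assumes create_source_dataset takes the
    repository path as its first positional argument.
    """
    # Imported lazily so this sketch can be loaded without CodableLLM installed.
    import codablellm

    return codablellm.create_source_dataset(repo_path)
```

With CodableLLM installed, calling `extract_source_functions('path/to/demo-c-repo')` would then return the source dataset for the demo repository.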
create_decompiled_dataset
Use this function when you already have compiled binaries and want to create a dataset that maps decompiled functions to source functions. Unlike compile_dataset, this function does not run a build command. It assumes that the repository has already been built and that the binaries are ready:
import codablellm

dataset = codablellm.create_decompiled_dataset(
    'path/to/demo-c-repo',  # Path to the repository
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],  # Binaries that will be decompiled
)
compile_dataset
This is the most comprehensive function and the one used in the Quickstart guide. It automates building a repository, extracting source functions, decompiling binaries, and mapping decompiled functions to their possible source code function matches: