About
CodableLLM
CodableLLM is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It supports source code and decompiled code extraction, with a flexible architecture for handling multiple languages and integration with custom LLM prompts.
Installation
PyPI
Install CodableLLM directly from PyPI:
pip install codablellm
Docker Compose (Recommended)
CodableLLM uses Prefect for orchestration and parallel processing. Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.
Run an example extraction using Docker Compose:
docker compose run --rm app \
codablellm \
--url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
/tmp/demo-c-repo \
./demo-c-repo.csv \
/tmp/demo-c-repo \
--strip \
--transform my_transform.transform \
--generation-mode temp-append \
--build make
This command does the following:
- Downloads and extracts a compressed C project archive from the given --url to /tmp/demo-c-repo.
- Uses /tmp/demo-c-repo as both the source of extracted code and the location of compiled binaries.
- Outputs a dataset to ./demo-c-repo.csv (relative to your host machine).
- Runs the build command (make) inside the extracted repo directory to generate binaries.
- Applies transformations using the function defined in my_transform.py (i.e., my_transform.transform).
- Uses --generation-mode temp-append, which appends transformed outputs to the original dataset, preserving both.
This uses the app service defined in docker-compose.yml, giving you access to the full environment, including Prefect and PostgreSQL, which are required for managing flows and task state.
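As a rough illustration of the transform step in the command above, here is a minimal sketch of what my_transform.py could contain. It assumes the transform receives one extracted function and returns a modified copy. FunctionRecord and temp_append are hypothetical stand-ins written for this example, not CodableLLM types, so check the documentation for the actual signature. The row-wise append in temp_append is likewise an assumption about how temp-append keeps both versions.

```python
# my_transform.py
# Sketch only: FunctionRecord is a hypothetical stand-in, not a CodableLLM type.
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical stand-in for one extracted function."""
    name: str
    definition: str


# Matches C block comments and line comments.
# Comment markers inside string literals are not handled.
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def transform(function: FunctionRecord) -> FunctionRecord:
    """Return a copy of the function with C comments removed."""
    stripped = _COMMENT.sub("", function.definition)
    # Drop lines left empty after comment removal and trim trailing spaces.
    lines = [line.rstrip() for line in stripped.splitlines() if line.strip()]
    return replace(function, definition="\n".join(lines))


def temp_append(dataset: list[FunctionRecord]) -> list[FunctionRecord]:
    """Keep every original record and add its transformed copy after them."""
    return dataset + [transform(function) for function in dataset]
```

In this sketch, a dataset of N records becomes 2N records after temp_append: the N originals followed by their N comment-free copies.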
Features
- Extracts functions and methods from source code repositories using tree-sitter.
- Integrates easily with LLMs to refine or augment extracted code (e.g., by renaming variables or inserting comments).
- Language-agnostic design with support for plugin-based extractor and decompiler extensions.
- Extensible API for building your own workflows and datasets.
- Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.
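The fan-out pattern behind parallel extraction can be sketched with the standard library alone. This is not how CodableLLM is implemented: ThreadPoolExecutor stands in for Prefect, a deliberately naive regular expression stands in for tree-sitter, and extract_function_names and extract_all are hypothetical names used only for this example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Naive stand-in for a tree-sitter query: finds "name(args) {" headers
# that begin at the start of a line. Real C parsing needs a real parser.
_HEADER = re.compile(
    r"^[A-Za-z_][\w \t\*]*?\b([A-Za-z_]\w*)[ \t]*\([^;{}()]*\)\s*\{",
    re.MULTILINE,
)
_KEYWORDS = {"if", "for", "while", "switch"}


def extract_function_names(source: str) -> list[str]:
    """Return the names of simple, unindented C function definitions."""
    return [name for name in _HEADER.findall(source) if name not in _KEYWORDS]


def extract_all(sources: dict[str, str], max_workers: int = 4) -> dict[str, list[str]]:
    """Run one extraction task per file and collect the results by path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip() pairs them correctly.
        results = pool.map(extract_function_names, sources.values())
        return dict(zip(sources.keys(), results))
```

Each file is an independent unit of work, which is what makes the extraction step easy to distribute; an orchestrator such as Prefect adds scheduling, retries, and persisted task state on top of the same idea.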
Documentation
Complete documentation is available on Read the Docs.
Citation
If you use this tool in your research, please cite the paper associated with it:
@misc{manuel2025codablellmautomatingdecompiledsource,
title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation},
author={Dylan Manuel and Paul Rad},
year={2025},
eprint={2507.22066},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2507.22066},
}
Contributing
We welcome contributions from the community! See CONTRIBUTING.md for guidelines, development setup, and how to get started.