CodableLLM

CodableLLM is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It extracts both source code and decompiled code, and its flexible architecture handles multiple languages and integrates with custom LLM prompts.

Installation

PyPI

Install CodableLLM directly from PyPI:

pip install codablellm
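
To confirm that the install succeeded, one option is to ask the standard library for the installed distribution's version. This is a minimal sketch: the helper name `installed_version` is ours for illustration and is not part of CodableLLM.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper, not a CodableLLM API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("codablellm")
    print(f"codablellm {found}" if found else "codablellm is not installed")
```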

CodableLLM uses Prefect for orchestration and parallel processing. Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.

Run an example extraction using Docker Compose:

docker compose run --rm app \
  codablellm \
  --url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
  /tmp/demo-c-repo \
  ./demo-c-repo.csv \
  /tmp/demo-c-repo \
  --strip \
  --transform my_transform.transform \
  --generation-mode temp-append \
  --build make

This command does the following:

  • Downloads and extracts a compressed C project archive from the given --url to /tmp/demo-c-repo.
  • Uses /tmp/demo-c-repo as both the source of extracted code and the location of compiled binaries.
  • Outputs a dataset to ./demo-c-repo.csv (relative to your host machine).
  • Runs the build command (make) inside the extracted repo directory to generate binaries.
  • Applies transformations using the function defined in my_transform.py (i.e., my_transform.transform).
  • Uses --generation-mode temp-append, which appends transformed outputs to the original dataset, preserving both.

The command runs in the app service defined in docker-compose.yml, which provides the full environment, including Prefect and PostgreSQL. Both are required for managing flows and task state.
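
The --transform flag in the command above points at a function in my_transform.py. As a rough sketch of what such a module could contain, the example below strips comments from C source text. It assumes the function receives and returns source code as a plain string; the callback signature CodableLLM actually expects is not shown in this README, so check the project documentation before adapting it. The `append_transformed` helper is hypothetical and only illustrates the append behavior described for --generation-mode temp-append, where original and transformed entries are both kept.

```python
import re
from typing import List

# Matches C block comments (/* ... */) and line comments (// ...).
# Deliberately naive: comment markers inside string literals are not handled.
_C_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def transform(source: str) -> str:
    """Strip C comments, trailing whitespace, and blank lines from source text.

    ASSUMPTION: operates on a plain string; the real CodableLLM transform
    callback may receive a richer object.
    """
    without_comments = _C_COMMENT.sub("", source)
    lines = [line.rstrip() for line in without_comments.splitlines()]
    return "\n".join(line for line in lines if line)


def append_transformed(originals: List[str]) -> List[str]:
    """Hypothetical illustration of append-style generation.

    Keeps every original entry and adds its transformed counterpart after them.
    """
    return originals + [transform(src) for src in originals]


if __name__ == "__main__":
    sample = "int add(int a, int b) {\n    /* sum */\n    return a + b; // result\n}"
    for entry in append_transformed([sample]):
        print(entry)
        print("---")
```

Keeping the transform a pure function of its input makes it easy to test on its own before running a full extraction.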

Features

  • Extracts functions and methods from source code repositories using tree-sitter.
  • Integrates easily with LLMs to refine or augment extracted code (e.g., renaming variables or inserting comments).
  • Language-agnostic design with support for plugin-based extractor and decompiler extensions.
  • Extensible API for building your own workflows and datasets.
  • Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.

Documentation

Complete documentation is available on Read the Docs.

Citation

If you use this tool in your research, please cite the paper associated with it:

@misc{manuel2025codablellmautomatingdecompiledsource,
      title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation}, 
      author={Dylan Manuel and Paul Rad},
      year={2025},
      eprint={2507.22066},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2507.22066}, 
}

Contributing

We welcome contributions from the community! See CONTRIBUTING.md for guidelines, development setup, and how to get started.