Vulnerability Detection
Overview
In this example, we use CodableLLM to generate a synthetic vulnerability dataset focused on buffer overflows.
Real-world vulnerability datasets are significantly more complex and often require advanced static analysis, taint tracking, and manual verification. This demonstration only illustrates how CodableLLM can be used to rapidly create training data for LLMs that specialize in vulnerability detection.
Creating the Datasets
To support this use case, we generate two datasets:
- A dataset of the original source code and decompiled functions (benign)
- A dataset of the transformed versions with synthetic buffer overflow vulnerabilities (vulnerable)
These can later be merged or aligned using metadata to train models that distinguish between secure and insecure patterns.
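To make "synthetic buffer overflow" concrete, here is a minimal sketch of the kind of rewrite such a transform might perform. It works on a plain C source string rather than on CodableLLM's `SourceFunction` type, and the function name `bump_array_indices` and the regex approach are illustrative assumptions, not part of CodableLLM. A regex cannot reliably tell declarations from accesses, so a real transform would more likely walk a syntax tree.

```python
import re

# Matches `name[index]` where the index contains no nested brackets.
_ARRAY_ACCESS = re.compile(r'\b([A-Za-z_]\w*)\[([^\[\]]+)\]')


def bump_array_indices(c_source: str) -> str:
    """Hypothetical helper: shift every non-literal array index by one.

    Indices that are plain integer literals are left alone, since they are
    most often array sizes in declarations (e.g. `char buf[16];`).
    """
    def rewrite(match: re.Match) -> str:
        name, index = match.group(1), match.group(2).strip()
        if index.isdigit():
            return match.group(0)  # likely a declaration size; keep as-is
        # Parenthesize the original index so operator precedence is preserved.
        return f'{name}[({index}) + 1]'

    return _ARRAY_ACCESS.sub(rewrite, c_source)


benign = 'char buf[16];\nfor (int i = 0; i < 16; i++) buf[i] = src[i];'
vulnerable = bump_array_indices(benign)
print(vulnerable)
```

On the final loop iteration the rewritten code writes to `buf[16]`, one element past the end of the 16-byte buffer, which is the off-by-one pattern the vulnerable dataset is meant to contain.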
```python
from codablellm import compile_dataset, DecompiledCodeDatasetConfig, ExtractConfig
# Import path assumed; adjust if SourceFunction lives elsewhere in your version.
from codablellm.core.function import SourceFunction


def add_one_to_array_access(source: SourceFunction) -> SourceFunction:
    ...  # Injects +1 into array accesses to simulate a buffer overflow


# Original (benign) dataset
benign_dataset = compile_dataset(
    'path/to/demo-c-repo',
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],
    'make',
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(),
        generation_mode='path'  # No transformation
    )
)

# Transformed (vulnerable) dataset
vulnerable_dataset = compile_dataset(
    'path/to/demo-c-repo',
    [
        'path/to/demo-c-repo/main_app',
        'path/to/demo-c-repo/tool',
    ],
    'make',
    dataset_config=DecompiledCodeDatasetConfig(
        extract_config=ExtractConfig(
            transform=add_one_to_array_access
        ),
        generation_mode='temp'  # Applies transform in a temp clone
    )
)

# Save both datasets for later merging or alignment
benign_dataset.save_as('benign_dataset.csv')
vulnerable_dataset.save_as('vulnerable_dataset.csv')
```
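Once both CSV files exist, the merge or alignment step mentioned earlier might look like the following sketch. It uses only the standard library. The column names `name` and `definition`, the `label` column, and all helper names are assumptions for illustration, so check the actual headers of the saved files before relying on them.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List

Row = Dict[str, str]


def read_rows(path: Path) -> List[Row]:
    """Load a CSV file into a list of row dictionaries."""
    with path.open(newline='', encoding='utf-8') as handle:
        return list(csv.DictReader(handle))


def pair_by_key(benign: Iterable[Row], vulnerable: Iterable[Row],
                key: str = 'name') -> List[Dict[str, Row]]:
    """Align rows that share the same key (assumed: a function name column).

    Benign rows without a vulnerable counterpart are dropped.
    """
    vulnerable_by_key = {row[key]: row for row in vulnerable}
    return [
        {'benign': row, 'vulnerable': vulnerable_by_key[row[key]]}
        for row in benign
        if row[key] in vulnerable_by_key
    ]


def label_rows(benign: Iterable[Row], vulnerable: Iterable[Row],
               label_column: str = 'label') -> List[Row]:
    """Concatenate both datasets with a class label (0 = benign, 1 = vulnerable)."""
    labeled = [{**row, label_column: '0'} for row in benign]
    labeled += [{**row, label_column: '1'} for row in vulnerable]
    return labeled


# Tiny in-memory example with made-up rows standing in for the CSV contents.
benign_rows = [
    {'name': 'copy_buf', 'definition': 'buf[i] = src[i];'},
    {'name': 'only_benign', 'definition': 'return 0;'},
]
vulnerable_rows = [
    {'name': 'copy_buf', 'definition': 'buf[(i) + 1] = src[(i) + 1];'},
]

pairs = pair_by_key(benign_rows, vulnerable_rows)
labeled = label_rows(benign_rows, vulnerable_rows)
```

Pairing suits contrastive or before/after training setups, while the labeled concatenation suits a plain binary classifier. For the saved files, the in-memory rows would be replaced by `read_rows(Path('benign_dataset.csv'))` and `read_rows(Path('vulnerable_dataset.csv'))`.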