# Profanity Detection from Speech (TAPAD + Clean Speech, ONNX)
This repository contains a speech-based profanity detector trained on the
TAPAD (Tagged Profanity Audio Dataset)
together with additional clean and non-speech audio.
The model is designed to:
- Detect whether an audio clip contains profanity (binary classification).
- Optionally identify the specific profanity word (multi-class classification).
The core model is exported to ONNX for easy use in other languages and runtimes.
## Features

- Multi-task CNN with two heads:
  - Binary head: `clean` vs `profane`.
  - Word head: specific profanity word from TAPAD's vocabulary.
- Training code in PyTorch with torchaudio-based preprocessing.
- Support for additional clean datasets (Google Speech Commands, LibriSpeech, ESC-50, UrbanSound8K, etc.).
- ONNX export for deployment in Python, Go, C#, Rust, Node, or any ONNX Runtime environment.
- CLI inference helper in Python for quick testing.
## Getting the ONNX Model from Releases

If you only want to use the model (not train it):

1. Go to the releases page:
   https://git.macco.dev/insidiousfiddler/profanity-detection/releases
2. Download the latest assets, for example:
   - `profanity_detection.onnx`
   - `labels.json` (mapping from output indices to profanity words), if provided
3. Place them somewhere in your project, or use them directly from disk.

You can use the ONNX model in any environment that supports ONNX Runtime.
## ONNX Model Interface

The exported model expects log-Mel spectrograms as input, not raw waveforms.

### Input

- Name: `input`
- Shape: `(batch_size, 1, 64, 128)`
- Type: `float32`

Where each sample is:

1. Sampled at 24 kHz.
2. Center-cropped or padded to exactly 1.0 second of audio.
3. Transformed with `MelSpectrogram` using:
   - `sample_rate = 24000`
   - `n_fft = 1024`
   - `hop_length = 256`
   - `n_mels = 64`
4. Converted to decibels via `AmplitudeToDB`.
In Python (PyTorch + torchaudio), preprocessing looks like:

```python
import torch
import torchaudio
import torch.nn.functional as F

SAMPLE_RATE = 24000
CLIP_SECONDS = 1.0
CLIP_SAMPLES = int(SAMPLE_RATE * CLIP_SECONDS)
N_MELS = 64
MAX_FRAMES = 128

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=256,
    n_mels=N_MELS,
)
amp_to_db = torchaudio.transforms.AmplitudeToDB()


def load_waveform(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, and resample to 24 kHz."""
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    return wav


def center_chunk(wav: torch.Tensor) -> torch.Tensor:
    """Zero-pad short clips, or center-crop long ones, to exactly 1 second."""
    num_samples = wav.shape[1]
    if num_samples < CLIP_SAMPLES:
        pad = CLIP_SAMPLES - num_samples
        return F.pad(wav, (0, pad))
    start = max(0, (num_samples - CLIP_SAMPLES) // 2)
    return wav[:, start:start + CLIP_SAMPLES]


def waveform_to_fixed_logmel(wav: torch.Tensor) -> torch.Tensor:
    """Compute a log-Mel spectrogram padded/cropped to 128 time frames."""
    mel = mel_transform(wav)
    mel_db = amp_to_db(mel)
    time_dim = mel_db.shape[2]
    if time_dim < MAX_FRAMES:
        mel_db = F.pad(mel_db, (0, MAX_FRAMES - time_dim))
    else:
        mel_db = mel_db[:, :, :MAX_FRAMES]
    return mel_db  # (1, 64, 128)
```
You then add a batch dimension and convert to NumPy for ONNX Runtime.
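That last step can be sketched as follows, using a zero array as a stand-in for a real log-Mel spectrogram:

```python
import numpy as np

# Stand-in for the (1, 64, 128) log-Mel output of the preprocessing above
# (a real pipeline would convert the torch tensor via .numpy()).
mel_db = np.zeros((1, 64, 128), dtype=np.float32)

# Add a batch dimension -> (1, 1, 64, 128), the shape ONNX Runtime expects
mel_batch = mel_db[np.newaxis, ...]
print(mel_batch.shape)  # (1, 1, 64, 128)
```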
### Outputs

The model has two outputs:

- `profanity_logits`
  - Shape: `(batch_size, 2)`
  - Interpretation:
    - Index `0` -> `clean`
    - Index `1` -> `profane`
- `word_logits`
  - Shape: `(batch_size, N)`, where `N` is the number of profanity word classes (depends on training; often ~300+).
  - Each index corresponds to a profanity token; see `labels.json` or `label_to_word` from the checkpoint.
Typical decoding:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("profanity_detection.onnx")

# mel_batch: (B, 1, 64, 128) float32 NumPy array
outputs = session.run(
    ["profanity_logits", "word_logits"],
    {"input": mel_batch},
)
profanity_logits, word_logits = outputs


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the row-wise max for numerical stability
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


profanity_probs = softmax(profanity_logits)
word_probs = softmax(word_logits)

# Example for a single sample (index 0)
p_clean, p_profane = profanity_probs[0]
word_idx = int(word_probs[0].argmax())
word_conf = float(word_probs[0, word_idx])
```

You then compare `p_profane` and `word_conf` against thresholds to decide the final label.
## Using the Python Inference Helper

If you clone this repository and have Python installed, you can use the helper script:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Ensure dist/profanity_detection.onnx and profanity_detection.pt exist
python infer.py path/to/audio.wav
```
Example output:

```
Result:
  is_profane: True
  profanity_confidence: 0.9998
  clean_confidence: 0.0002
  word: fuck
  word_confidence: 0.5834
  top5_words:
    fuck: 0.5834
    fucked: 0.2504
    fucks: 0.0269
    fag: 0.0235
    fuckup: 0.0169
```

The helper script:

- Handles waveform loading and preprocessing.
- Runs the ONNX model.
- Applies thresholds to decide:
  - `is_profane`
  - `word` (or `__unknown__` / `__none__`)
## Training Your Own Model

If you want to retrain or fine-tune:

1. Place TAPAD under `data/profanity/TAPAD/audio`:

   ```
   data/profanity/TAPAD/audio/<lang>/<files>.wav
   data/profanity/TAPAD/audio/en-gb/arse.mp3
   data/profanity/TAPAD/audio/en-gb/fuck.wav
   ...
   ```

2. Download clean speech / noise datasets:

   ```bash
   chmod +x download_clean_data.sh

   # Baseline: Speech Commands only (~1 GB)
   ./download_clean_data.sh

   # With additional environmental noise and small LibriSpeech splits:
   ./download_clean_data.sh \
     --with-esc50 \
     --with-urbansound8k \
     --with-librispeech-small
   ```

3. Train:

   ```bash
   python -m venv venv
   source venv/bin/activate
   pip install -r requirements.txt

   # Example: use GPU 0, limit CPU threads, tune num-workers
   export CUDA_VISIBLE_DEVICES=0
   taskset -c 0-11 python train.py \
     --device cuda \
     --cpu-threads 12 \
     --num-workers 4
   ```
The training script will:

1. Build a dataset from:
   - TAPAD audio as `profane` samples (with word labels).
   - `clean_data/**` as `clean` samples (no word labels).
2. Train the multi-task CNN for several epochs.
3. Save the best checkpoint to `dist/profanity_detection.pt`.
4. Export the ONNX model to `dist/profanity_detection.onnx`.
You can then upload the ONNX (and label mapping file) to releases for others to consume.
## Thresholds and Decision Logic

The Python inference script uses:

- `PROFANITY_THRESHOLD` (on `p_profane`)
- `WORD_THRESHOLD` (on `max(word_probs)`)

Typical interpretation:

- If `p_profane >= PROFANITY_THRESHOLD`:
  - `is_profane = True`
  - If `max(word_probs) >= WORD_THRESHOLD`: `word` = that profanity token
  - Else: `word = "__unknown__"`
- Else:
  - `is_profane = False`
  - `word = "__none__"`
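A minimal sketch of this decision logic, assuming softmaxed probabilities; the threshold values and the `label_to_word` mapping here are illustrative stand-ins, not necessarily the defaults in `infer.py`:

```python
import numpy as np

PROFANITY_THRESHOLD = 0.5  # assumed value for illustration
WORD_THRESHOLD = 0.3       # assumed value for illustration


def decide(profanity_probs, word_probs, label_to_word):
    """Map per-sample probabilities to (is_profane, word)."""
    p_profane = float(profanity_probs[1])
    if p_profane >= PROFANITY_THRESHOLD:
        word_idx = int(np.argmax(word_probs))
        word_conf = float(word_probs[word_idx])
        word = label_to_word[word_idx] if word_conf >= WORD_THRESHOLD else "__unknown__"
        return True, word
    return False, "__none__"


# Toy example with a 3-word vocabulary
label_to_word = {0: "fuck", 1: "damn", 2: "arse"}
is_profane, word = decide(
    np.array([0.1, 0.9]),
    np.array([0.7, 0.2, 0.1]),
    label_to_word,
)
print(is_profane, word)  # True fuck
```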
You can tune these thresholds based on your tolerance for false positives vs false negatives.