# cTensor

A lightweight neural network library written in C11 for embedded systems.

## Overview
cTensor is a compact tensor computation library designed for small client-side devices, such as mobile phones and microcontrollers. The library implements automatic differentiation and dynamic compute graph functionality, allowing for efficient training and deployment of neural networks on resource-constrained devices.
This library was developed as part of GSoC 2025 and has been successfully validated on ARM Cortex-M3 microcontrollers, achieving 90% classification accuracy on the Iris dataset in a bare-metal environment.
## Features

### Core Infrastructure
- Lightweight C11 Implementation: Minimal dependencies for wide compatibility
- Automatic Differentiation Framework: Complete gradient computation with backward pass
- Dynamic Compute Graph: Efficient computation flow with gradient tracking
- Pool-based Memory Management: Efficient memory allocation system for embedded devices
### Tensor Operations
- Basic Arithmetic: add, subtract, multiply, divide, power (both tensor-tensor and tensor-scalar)
- Unary Operations: negation, absolute value, square, reciprocal
- Matrix Operations: matrix multiplication, transpose
- Mathematical Functions: logarithm, exponential, sine, cosine, tangent
- Shape Operations: unsqueeze, detach
- Broadcasting: Element-wise broadcasting for operations on tensors with different shapes (see the sketch after this list)
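
For example, an element-wise op can combine a `{2, 3}` tensor with a `{1, 3}` tensor by repeating the smaller one along the missing dimension. A minimal sketch, assuming (per this list) that the arithmetic operators broadcast internally; the pool id and shapes are illustrative:

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    // a is 2x3, b is 1x3; b's single row is broadcast across a's rows,
    // so every element of c equals 1 + 1 = 2.
    Tensor a = Tensor_ones((TensorShape){2, 3}, false);
    Tensor b = Tensor_ones((TensorShape){1, 3}, false);
    Tensor c = Tensor_add(a, b);
    (void)c;

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```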
### Reduction Operations
- Sum: All elements or along specific dimension
- Mean: All elements or along specific dimension
- Max/Min: All elements or along dimension with indices
- Argmax: Find indices of maximum values
### Neural Network Components
- Layers: Linear (fully connected) layer
- Activation Functions: ReLU, Sigmoid, Tanh, ELU, SELU, Softmax
- Loss Functions: Cross-entropy, Softmax Cross-entropy, MSE, MAE, Huber Loss
- Weight Initialization: Glorot/Xavier initialization
### Optimizers
- SGD: Stochastic Gradient Descent with momentum
- Adam: Adaptive moment estimation
- RMSProp: Root Mean Square Propagation
- AdaGrad: Adaptive Gradient Algorithm
- Weight Decay: Supported by all optimizers
### Training Utilities
- Gradient Clipping: By norm, value, range, positive/negative values
- Evaluation Mode: Disable gradient computation for inference
- Dataset Utilities: Normalization, shuffling
## Validation
cTensor has been successfully deployed and tested on:
- ARM Cortex-M3 (STM32F103ZE) using Keil MDK simulation
- Task: Neural network classification on Iris dataset
- Result: 90% accuracy matching desktop performance
- Complete validation project: cTensor_Cortex_SIM
## Getting Started

### Prerequisites
- C Compiler with C11 support (GCC, Clang)
- CMake (3.10+) for build configuration
- Math library (automatically linked on non-Windows systems)
### Building with CMake

On Windows:

```bat
mkdir build
cd build
cmake ..
cmake --build .
cd ..
```

On Linux/macOS:

```bash
mkdir -p build && cd build
cmake ..
cmake --build .
cd ..
```
### Building with Direct Compilation

On Linux/macOS:

```bash
gcc -std=c11 -Iinclude -O0 -Wfatal-errors -g -DDEBUG \
    src/nn.c src/operator.c src/basic.c src/iris_dataset.c src/context.c \
    src/pool.c src/utils.c src/common/vector.c src/optimizer/sgd.c \
    src2/main.c -o main -lm
```

On Windows with GCC:

```bash
gcc -std=c11 -Iinclude -O0 -Wfatal-errors -g -DDEBUG src/nn.c src/operator.c src/basic.c src/iris_dataset.c src/context.c src/pool.c src/utils.c src/common/vector.c src/optimizer/sgd.c src2/main.c -o main -lm
```

Then run the resulting `main` (`main.exe` on Windows) from the repository root.
### Testing the Library

cTensor uses a custom test framework. To run the tests:

```bash
# Build the test executable with CMake
mkdir -p build && cd build
cmake ..
cmake --build .

# Run the tests
./cten_exe
```

For detailed testing information, refer to the Testing Documentation.
## Usage Example
Here's a complete example of training a neural network to predict sine wave values with noise:
```c
#include "cten.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Define memory pools
enum MemoryPoolIds {
    PoolId_Default = 0,
    PoolId_Model = 1,
    PoolId_Optimizer = 2,
};

// Define the model structure
typedef struct {
    Tensor w1, b1;
    Tensor w2, b2;
    Tensor w3, b3;
} Model;

// Forward pass for the model
Tensor Model_forward(Model* model, Tensor x) {
    x = nn_linear(x, model->w1, model->b1);
    x = nn_elu(x, 1.0f);
    x = nn_linear(x, model->w2, model->b2);
    x = nn_elu(x, 1.0f);
    x = nn_linear(x, model->w3, model->b3);
    return x;
}

int main() {
    cten_initilize();

    // Generate sine wave data
    int n_samples = 2048;
    float* x_data = malloc(n_samples * sizeof(float));
    float* y_data = malloc(n_samples * sizeof(float));
    // ... (data generation logic) ...

    // Create model and allocate in its own memory pool
    Model model;
    cten_begin_malloc(PoolId_Model);
    model.w1 = Glorot_init((TensorShape){1, 64}, true);
    model.b1 = Tensor_zeros((TensorShape){1, 64}, true);
    model.w2 = Glorot_init((TensorShape){64, 32}, true);
    model.b2 = Tensor_zeros((TensorShape){1, 32}, true);
    model.w3 = Glorot_init((TensorShape){32, 1}, true);
    model.b3 = Tensor_zeros((TensorShape){1, 1}, true);
    cten_end_malloc();

    // Create optimizer
    float learning_rate = 0.01f;
    cten_begin_malloc(PoolId_Optimizer);
    optim_adam* optimizer = optim_adam_new(6, (Tensor*)&model, learning_rate,
                                           0.9f, 0.999f, 1e-8f, 0.0f);
    cten_end_malloc();

    // Training loop
    int batch_size = 64;
    for (int epoch = 0; epoch < 200; epoch++) {
        // ... (training logic with batching, loss calculation, backpropagation) ...
        cten_begin_malloc(PoolId_Default);  // for temporary tensors in each step

        Tensor input, y_true;
        // ... create input and y_true tensors from x_data/y_data batches ...

        optim_adam_zerograd(optimizer);
        Tensor y_pred = Model_forward(&model, input);

        // Combined loss: Huber + 0.3 * MAE
        Tensor huber = nn_huber_loss(y_true, y_pred, 1.0f);
        Tensor mae = nn_mae_loss(y_true, y_pred);
        Tensor loss = Tensor_add(huber, Tensor_mulf(mae, 0.3f));

        Tensor_backward(loss, Tensor_ones((TensorShape){1}, false));

        // Gradient clipping
        cten_clip_grad_norm((Tensor*)&model, 6, 5.0f);
        optim_adam_step(optimizer);

        cten_end_malloc();
        cten_free(PoolId_Default);  // free temporary tensors
    }

    // Evaluate model
    cten_begin_eval();
    // ... (evaluation logic) ...
    cten_end_eval();

    // Free memory pools
    cten_free(PoolId_Optimizer);
    cten_free(PoolId_Model);
    cten_finalize();
    return 0;
}
```
## API Overview

### Tensor Creation and Management
```c
// Basic tensor creation
Tensor Tensor_new(TensorShape shape, bool requires_grad);
Tensor Tensor_zeros(TensorShape shape, bool requires_grad);
Tensor Tensor_ones(TensorShape shape, bool requires_grad);

// Tensor manipulation
Tensor Tensor_transpose(Tensor self);
Tensor Tensor_detach(Tensor self);
Tensor Tensor_unsqueeze(Tensor self, int dim);

// Element access
float Tensor_get(Tensor self, int i, int j, int k, int l);
void Tensor_set(Tensor self, int i, int j, int k, int l, float value);

// Backpropagation
void Tensor_backward(Tensor self, Tensor grad);
```
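
A minimal sketch of the create/backward flow, assuming the pool bracketing described under Memory Management below; the pool id and shapes are illustrative:

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    // y = x^2 for a 1x1 tensor of ones, so dy/dx = 2x = 2.
    Tensor x = Tensor_ones((TensorShape){1, 1}, true);  // track gradients
    Tensor y = Tensor_square(x);

    // Seed the backward pass with a gradient of ones, as in the
    // training example above.
    Tensor_backward(y, Tensor_ones((TensorShape){1, 1}, false));

    // Tensor_detach returns a tensor cut off from the autograd graph.
    Tensor x_frozen = Tensor_detach(x);
    (void)x_frozen;

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```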
### Basic Operations
```c
// Element-wise operations with tensors
Tensor Tensor_add(Tensor self, Tensor other);
Tensor Tensor_sub(Tensor self, Tensor other);
Tensor Tensor_mul(Tensor self, Tensor other);
Tensor Tensor_div(Tensor self, Tensor other);
Tensor Tensor_pow(Tensor self, Tensor other);

// Element-wise operations with scalars
Tensor Tensor_addf(Tensor self, float other);
Tensor Tensor_subf(Tensor self, float other);
Tensor Tensor_mulf(Tensor self, float other);
Tensor Tensor_divf(Tensor self, float other);
Tensor Tensor_powf(Tensor self, float other);

// Matrix operations
Tensor Tensor_matmul(Tensor self, Tensor other);

// Unary operations
Tensor Tensor_neg(Tensor self);
Tensor Tensor_abs(Tensor self);
Tensor Tensor_square(Tensor self);
Tensor Tensor_reciprocal(Tensor self);
```
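
A few of these composed, with the same illustrative pool id and shapes as the sketch above:

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    Tensor a = Tensor_ones((TensorShape){2, 2}, false);
    Tensor b = Tensor_mulf(a, 3.0f);   // tensor-scalar: all elements 3
    Tensor c = Tensor_add(a, b);       // tensor-tensor: all elements 4
    Tensor d = Tensor_matmul(a, c);    // 2x2 matrix product: all elements 8
    (void)d;

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```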
### Mathematical Functions
```c
// Logarithmic and exponential
Tensor nn_log(Tensor self);
Tensor nn_exp(Tensor self);

// Trigonometric functions
Tensor nn_sin(Tensor self);
Tensor nn_cos(Tensor self);
Tensor nn_tan(Tensor self);
```
### Reduction Operations
```c
// Reduction operations (with macro dispatch)
Tensor Tensor_sum(Tensor self);                       // Sum all elements
Tensor Tensor_sum(Tensor self, int dim);              // Sum along dimension
Tensor Tensor_mean(Tensor self);                      // Mean of all elements
Tensor Tensor_mean(Tensor self, int dim);             // Mean along dimension
Tensor Tensor_max(Tensor self);                       // Max of all elements
TensorMaxMinResult Tensor_max(Tensor self, int dim);  // Max along dimension
Tensor Tensor_min(Tensor self);                       // Min of all elements
TensorMaxMinResult Tensor_min(Tensor self, int dim);  // Min along dimension

// Argmax operation
void Tensor_argmax(Tensor self, int* out);
```
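
At a call site, the macro dispatch selects the overload by argument count. An illustrative sketch (pool id and shapes assumed):

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    Tensor t = Tensor_ones((TensorShape){4, 3}, false);

    Tensor total = Tensor_sum(t);        // one argument: sum of all 12 elements
    Tensor col_mean = Tensor_mean(t, 0); // two arguments: mean along dim 0

    // Per-dimension max/min return values together with their indices.
    TensorMaxMinResult row_max = Tensor_max(t, 1);
    (void)total; (void)col_mean; (void)row_max;

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```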
### Neural Network Functions
```c
// Neural network layers
Tensor nn_linear(Tensor input, Tensor weight, Tensor bias);

// Activation functions
Tensor nn_relu(Tensor input);
Tensor nn_sigmoid(Tensor input);
Tensor nn_tanh(Tensor input);
Tensor nn_elu(Tensor self, float alpha);
Tensor nn_selu(Tensor self);
Tensor nn_softmax(Tensor input, int dim);

// Loss functions
Tensor nn_crossentropy(Tensor y_true, Tensor y_pred);
Tensor nn_softmax_crossentropy(Tensor y_true, Tensor logits);
Tensor nn_mse_loss(Tensor y_true, Tensor y_pred);
Tensor nn_mae_loss(Tensor y_true, Tensor y_pred);
Tensor nn_huber_loss(Tensor y_true, Tensor y_pred, float delta);

// Weight initialization
Tensor Glorot_init(TensorShape shape, bool requires_grad);
```
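
A compact classification sketch using these primitives. `nn_softmax_crossentropy` takes raw logits per the signature above; the shapes, the zero-filled placeholder labels, and the pool id are our assumptions:

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    // Batch of 4 samples with 3 features, mapped to 3 classes.
    Tensor x = Tensor_ones((TensorShape){4, 3}, false);
    Tensor w = Glorot_init((TensorShape){3, 3}, true);
    Tensor b = Tensor_zeros((TensorShape){1, 3}, true);
    // Placeholder labels; in real code this would hold one-hot rows.
    Tensor y_true = Tensor_zeros((TensorShape){4, 3}, false);

    Tensor logits = nn_linear(x, w, b);
    Tensor loss = nn_softmax_crossentropy(y_true, logits);
    Tensor_backward(loss, Tensor_ones((TensorShape){1}, false));

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```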
### Optimizers
```c
// SGD Optimizer
optim_sgd* optim_sgd_new(int n_params, Tensor* params, float weight_decay);
void optim_sgd_config(optim_sgd* self, float lr, float momentum);
void optim_sgd_zerograd(optim_sgd* self);
void optim_sgd_step(optim_sgd* self);

// Adam Optimizer
optim_adam* optim_adam_new(int n_params, Tensor* params, float lr,
                           float beta1, float beta2, float eps,
                           float weight_decay);
void optim_adam_zerograd(optim_adam* self);
void optim_adam_step(optim_adam* self);

// RMSProp Optimizer
optim_rmsprop* optim_rmsprop_new(int n_params, Tensor* params, float lr,
                                 float beta, float eps, float weight_decay);
void optim_rmsprop_zerograd(optim_rmsprop* self);
void optim_rmsprop_step(optim_rmsprop* self);

// AdaGrad Optimizer
optim_adagrad* optim_adagrad_new(int n_params, Tensor* params, float lr,
                                 float eps, float weight_decay);
void optim_adagrad_zerograd(optim_adagrad* self);
void optim_adagrad_step(optim_adagrad* self);
```
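
A minimal sketch of the SGD call sequence. Collecting the parameters in a plain `Tensor` array mirrors how the Usage Example passes `(Tensor*)&model`; the shapes and hyperparameters are illustrative:

```c
#include "cten.h"

int main(void) {
    cten_initilize();
    cten_begin_malloc(0);  // illustrative pool id

    Tensor w = Glorot_init((TensorShape){3, 1}, true);
    Tensor b = Tensor_zeros((TensorShape){1, 1}, true);
    Tensor params[2] = {w, b};

    optim_sgd* opt = optim_sgd_new(2, params, 0.0f);  // no weight decay
    optim_sgd_config(opt, 0.01f, 0.9f);               // lr, momentum

    // One training step: clear grads, run forward/backward, then update.
    optim_sgd_zerograd(opt);
    Tensor x = Tensor_ones((TensorShape){1, 3}, false);
    Tensor loss = Tensor_mean(Tensor_square(nn_linear(x, w, b)));
    Tensor_backward(loss, Tensor_ones((TensorShape){1}, false));
    optim_sgd_step(opt);

    cten_end_malloc();
    cten_free(0);
    cten_finalize();
    return 0;
}
```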
### Gradient Clipping
```c
// Gradient clipping functions
void cten_clip_grad_norm(Tensor* params, int n_params, float max_norm);
void cten_clip_grad_value(Tensor* params, int n_params, float max_value);
void cten_clip_grad_value_range(Tensor* params, int n_params, float min_value, float max_value);
void cten_clip_grad_positive(Tensor* params, int n_params, float max_value);
void cten_clip_grad_negative(Tensor* params, int n_params, float min_value);
```
### Utility Functions
```c
// TensorShape utilities
int TensorShape_numel(TensorShape shape);
int TensorShape_dim(TensorShape shape);
int TensorShape_asdim(TensorShape shape, int dim);
int TensorShape_tostring(TensorShape shape, char* buf, int size);

// Dataset utilities
int load_iris_dataset(const float (**X)[4], const int** y);
void Tensor_normalize_dataset(const float (*X)[4], float (*X_norm)[4],
                              int n_samples, int n_train_samples, int n_features);
void Tensor_shuffle_dataset(const float (*X)[4], const int* y,
                            float (*X_shuffled)[4], int* y_shuffled,
                            int n_samples, int n_features);

// Evaluation mode
void cten_begin_eval();
bool cten_is_eval();
void cten_end_eval();

// Broadcasting
bool cten_elemwise_broadcast(Tensor* a, Tensor* b);
Tensor reduce_gradient_for_broadcasting(Tensor grad, TensorShape original_shape,
                                        TensorShape broadcasted_shape);
```
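
A sketch chaining the Iris helpers. Treating the `int` return value of `load_iris_dataset` as the sample count, and sizing buffers for the 150-sample Iris dataset, are assumptions on our part:

```c
#include "cten.h"

int main(void) {
    cten_initilize();

    const float (*X)[4];
    const int* y;
    int n_samples = load_iris_dataset(&X, &y);  // assumed to return the sample count

    // Shuffle features and labels together, then normalize using statistics
    // from the first 120 rows (the assumed training split).
    float X_shuffled[150][4];
    int y_shuffled[150];
    Tensor_shuffle_dataset(X, y, X_shuffled, y_shuffled, n_samples, 4);

    float X_norm[150][4];
    Tensor_normalize_dataset((const float (*)[4])X_shuffled, X_norm, n_samples, 120, 4);

    cten_finalize();
    return 0;
}
```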
## Memory Management
cTensor uses a pool-based memory management system to efficiently handle tensor allocations:
```c
void cten_initilize();
void cten_finalize();
void cten_begin_malloc(PoolId id);
void cten_end_malloc();
void cten_free(PoolId id);
```
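
The intended lifecycle, as the Usage Example demonstrates: initialize once, bracket each group of allocations with begin/end so new tensors land in the named pool, then release entire pools in one call. A minimal sketch with an arbitrary pool id:

```c
#include "cten.h"

enum { PoolId_Scratch = 0 };  // illustrative pool id

int main(void) {
    cten_initilize();

    // Every tensor created between begin/end is owned by PoolId_Scratch.
    cten_begin_malloc(PoolId_Scratch);
    Tensor a = Tensor_ones((TensorShape){8, 8}, false);
    Tensor b = Tensor_mulf(a, 2.0f);
    (void)b;
    cten_end_malloc();

    // Free the whole pool at once; no per-tensor bookkeeping.
    cten_free(PoolId_Scratch);

    cten_finalize();
    return 0;
}
```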
## Project Structure
```
cTensor/
├── include/          # Header files defining the API
│   └── cten.h        # Complete API header
├── src/              # Core implementation files
│   ├── basic.c       # Basic tensor operations
│   ├── nn.c          # Neural network primitives
│   ├── operator.c    # Mathematical operators
│   ├── context.c     # Memory management
│   ├── utils.c       # Utility functions
│   ├── optimizer/    # Optimizer implementations
│   └── ...
├── src2/             # Example applications
│   └── main.c        # Sine regression example
└── tests/            # Test suite
```
## Implemented Features Summary
| Category | Components | Status |
|---|---|---|
| Core Structs | Tensor, GradNode, TensorMaxMinResult | ✅ |
| Autograd | Tensor_backward, requires_grad, detach | ✅ |
| Tensor Creation | Tensor_new, zeros, ones, Glorot_init | ✅ |
| Binary Operations | add, sub, mul, div, pow, matmul | ✅ |
| Unary Operations | neg, abs, square, reciprocal | ✅ |
| Math Functions | log, exp, sin, cos, tan | ✅ |
| Aggregations | sum, mean, max, min (with indices) | ✅ |
| Search/Sort | argmax | ✅ |
| Shape Operations | transpose, unsqueeze | ✅ |
| NN Layers | nn_linear | ✅ |
| Activations | ReLU, Sigmoid, Tanh, ELU, SELU, Softmax | ✅ |
| Loss Functions | CrossEntropy, MSE, MAE, Huber | ✅ |
| Optimizers | SGD, Adam, RMSProp, AdaGrad | ✅ |
| Training Utils | Gradient Clipping, Evaluation Mode, Weight Decay | ✅ |
## Contributing
Contributions to cTensor are welcome! Key areas for contribution include:
- Performance Optimization: Benchmarking and SIMD implementations
- Advanced Layers: Convolutional and recurrent neural network layers
- Documentation: Examples, tutorials, and API documentation improvements
- Testing: Expanding test coverage and validation on different platforms
## GSoC 2025 Acknowledgments
This project was developed during Google Summer of Code 2025 by Advait Gaur under the mentorship of PrimedErwin, Anurag Bhat, and blueloveTH. The project successfully transformed cTensor from a basic prototype into a functional deep learning framework suitable for embedded applications.
## License
This project is licensed under the MIT License - see the LICENSE file for details.