Porter Stemmer for Go
This is a straightforward port of Martin Porter's C implementation of the Porter stemming algorithm. The C version this port is based on is available for download here: http://tartarus.org/~martin/PorterStemmer/c_thread_safe.txt
The original algorithm is described in the paper:
M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), pp. 130-137.
Features
- Thread-safe implementation
- Multiple APIs: simple string API and zero-allocation byte-slice API
- Command-line tool for batch processing
- Comprehensive test suite
- Benchmarked and optimized
- No external dependencies
Installation
Library
go get github.com/a2800276/porter
CLI Tool
go install github.com/a2800276/porter/cmd/porter@latest
Usage
As a Library
    package main

    import (
    	"fmt"
    	"log"

    	"github.com/a2800276/porter"
    )

    func main() {
    	// Simple string API (with allocations)
    	stemmed, err := porter.Stem("running")
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(stemmed) // Output: run

    	// Efficient byte-slice API (zero allocations)
    	word := []byte("running")
    	stemmedBytes, err := porter.StemBytes(word)
    	if err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(string(stemmedBytes)) // Output: run
    }
As a CLI Tool
Install the command-line tool:
go install github.com/a2800276/porter/cmd/porter@latest
Use it to stem words:
    # Stem words from arguments
    $ porter running jumped easily
    run
    jump
    easili

    # Stem words from stdin
    $ echo -e "running\njumped\neasily" | porter
    run
    jump
    easili

    # Process a file
    $ cat words.txt | porter > stemmed.txt

    # Count unique stems
    $ cat corpus.txt | porter | sort | uniq -c | sort -rn
API
The package provides two functions for different use cases:
Stem(word string) (string, error)
The simplest API that takes a string and returns a stemmed string. Handles case conversion automatically. Returns an error if stemming fails (though this is rare in normal use).
StemBytes(b []byte) ([]byte, error)
Zero-allocation API that stems the byte slice in-place and returns the stemmed portion as a slice. The input is converted to lowercase. Best for high-performance scenarios. Returns an error if stemming fails.
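Because StemBytes works in place, the caller's buffer is overwritten and the returned slice is a view into it. The sketch below illustrates that calling convention with a hypothetical stand-in, toyStemInPlace, which only lowercases ASCII letters and drops a trailing "ing". It is not the Porter algorithm and not this package's code; stemCopy is likewise an illustrative helper for callers who need to keep the original bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// toyStemInPlace is a hypothetical stand-in for an in-place stemmer.
// It lowercases ASCII letters in b and drops a trailing "ing".
// It is NOT the Porter algorithm; it only mimics the calling convention.
func toyStemInPlace(b []byte) ([]byte, error) {
	for i, c := range b {
		if c >= 'A' && c <= 'Z' {
			b[i] = c + ('a' - 'A') // overwrites the caller's buffer
		}
	}
	if len(b) > 3 && bytes.HasSuffix(b, []byte("ing")) {
		b = b[:len(b)-3] // shorter view of the same backing array
	}
	return b, nil
}

// stemCopy preserves the caller's input by stemming a copy instead.
// This costs one allocation per call.
func stemCopy(word []byte) ([]byte, error) {
	buf := append([]byte(nil), word...)
	return toyStemInPlace(buf)
}

func main() {
	word := []byte("Jumping")
	stem, _ := toyStemInPlace(word)
	fmt.Printf("%s %s\n", stem, word) // prints "jump jumping": word was modified

	orig := []byte("Jumping")
	stem2, _ := stemCopy(orig)
	fmt.Printf("%s %s\n", stem2, orig) // prints "jump Jumping": orig is untouched
}
```

If the original spelling is needed after stemming, copying first (as stemCopy does) trades the zero-allocation property for safety.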
Performance
The implementation is highly optimized:
String API (convenient, with allocations)
BenchmarkStem-24 14064384 77.29 ns/op 16 B/op 2 allocs/op
Byte-Slice API (fastest, zero allocations)
BenchmarkStemBytes-24 23443530 51.85 ns/op 0 B/op 0 allocs/op
The byte-slice API (StemBytes) takes about 33% less time per call (51.85 ns vs. 77.29 ns) and performs zero allocations,
making it well suited to high-throughput applications.
Note: Error handling adds minimal overhead (~2ns) but provides explicit feedback on failures.
Limitations
- The algorithm operates on English words only. Input is automatically converted to lowercase.
- For the Stem() function, strings are converted to byte slices internally. For zero-copy operation, use StemBytes().
- Unicode handling: The algorithm is designed for ASCII English text. Non-ASCII characters should be handled by the caller before stemming.
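Handling non-ASCII input on the caller's side can be as simple as checking a word before passing it to the stemmer. The sketch below shows one possible policy, assuming that only words made up entirely of ASCII letters should be stemmed and everything else is passed through or skipped. The helper isASCIILetters is hypothetical and not part of this package; rejecting digits and punctuation is a choice made for this example.

```go
package main

import "fmt"

// isASCIILetters reports whether s is non-empty and consists only of
// ASCII letters (a-z, A-Z). Hypothetical helper, not part of the package.
func isASCIILetters(s string) bool {
	if s == "" {
		return false
	}
	// Index by byte: any byte of a multi-byte UTF-8 sequence is >= 0x80
	// and therefore fails the letter check.
	for i := 0; i < len(s); i++ {
		c := s[i]
		isLower := c >= 'a' && c <= 'z'
		isUpper := c >= 'A' && c <= 'Z'
		if !isLower && !isUpper {
			return false
		}
	}
	return true
}

func main() {
	for _, w := range []string{"running", "Jumped", "café", "x86"} {
		if isASCIILetters(w) {
			fmt.Println(w, "-> safe to stem")
		} else {
			fmt.Println(w, "-> skip or normalize first")
		}
	}
}
```

Callers who want to stem accented words would need to transliterate them to ASCII first; how to do that is outside the scope of this package.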
Development
Building
    make build      # Build the CLI tool
    make install    # Install CLI to $GOPATH/bin
Running Tests
    make test       # Run tests
    make coverage   # Generate coverage report
    make bench      # Run benchmarks
Linting and Formatting
    make fmt        # Format code
    make vet        # Run go vet
    make lint       # Run golangci-lint (requires installation)
Contributing
Contributions are welcome! Please ensure:
- Tests pass: make test
- Code is formatted: make fmt
- No linting errors: make lint
License
MIT licensed. See LICENSE file for details.