This is a experimental translator of C (ISO/IEC 9899:2018) programs to EO programs.
How to Use
Assuming, you are on Ubuntu 22.04+:
$ apt update
$ apt install -y software-properties-common
$ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F7C91591CC543ECA
$ add-apt-repository 'deb http://c2eo.polystat.org/debian/ c2eo-rep non-free main contrib'
$ apt-get install -y clang
$ apt-get install -y c2eoThen, just run:
$ c2eo <path-to-c-file-name> <eo-file-name>.eo
You can also use yegor256/c2eo image via Docker:
$ docker run -v $(pwd):/eo yegor256/c2eo:<tag> hello.c hello.eo
Assuming you have hello.c in the current directory, the hello.eo will be created next to it.
We do not support the utility for other distributions and operating systems yet. However, you can try to build the project from source at your own risk.
How to Contribute
Again, we recommend Ubuntu 22.04+ and you will need wget 1.21+, tar 1.30+, git 2.32.+, cmake 3.18+, gcc 11.2.+, g++ 11.2.+, ninja-build 1.10.1+, clang 14.0.0+ and python3 3.10.0+. You will also need requirements for the EO project (Maven 3.3+ and Java 8+)
Then, you need to install GTest 1.12.1+
$ apt install libgtest-dev googletest $ cd /usr/src/googletest $ cmake . $ make $ lib $ cp *.a /usr/local/lib
After that, you need to install LLVM/Clang 12.0.1 or you may use an alternative way below this code:
$ wget https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-12.0.1.tar.gz $ tar -xvf llvmorg-12.0.1.tar.gz $ mv ./llvm-project-llvmorg-12.0.1 ./llvm-clang $ cd llvm-clang $ mkdir build && cd $_ $ cmake --no-warn-unused-cli -DBUILD_SHARED_LIBS:STRING=ON -DLLVM_TARGETS_TO_BUILD:STRING=X86 -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=TRUE "-DLLVM_ENABLE_PROJECTS:STRING=clang;compiler-rt" -DCMAKE_BUILD_TYPE:STRING=Debug -DLLVM_OPTIMIZED_TABLEGEN:STRING=ON -DLLVM_USE_SPLIT_DWARF:STRING=ON -DLLVM_USE_LINKER:STRING=gold ../llvm -G Ninja $ cmake --build . --config Debug --target all -j 10 -- -j1 -l 2 $ cd ../..
You may also try our own pre-packaged archive:
$ apt install megatools
$ megadl 'https://mega.nz/#!cZ9WQCqB!z713CuC-GNFQAXIxZwZxI05zOH4FAOpwYHEElgOZflA'
$ tar -xvf llvm-clang.tar.gzIt is assumed that the llvm-clang dir is located in the c2eo dir. If your llvm-clang is in different place, set the path in that line.
Formally speaking, this is where the preparation can be completed. However, in order to fully work with the project, testing and executing the translated code, you need to study the EO compiler project and fulfill its necessary requirements. After that, it will be possible to proceed with further steps.
Making changes
All sources files of transpiler are located in project/src/transpiler. The transpiler's work begins with the code from the source file project/src/transpiler/main.cpp. Аfter making changes in these files, we will need to rebuild the executable file c2eo. To do this, you need to go to the project dir. For the first time, create the build folder:
then go to the build folder and run the following commands:
As you have already noticed, the project is being built in the project/build folder. The result of this build is the c2eo file in project/bin. Now you have a transpiler and you can convert programs from C to EO. Just run:
$ ./c2eo <path-to-c-file-name> <eo-file-name>.eo # ./c2eo ../some_dir/example.c example.eo
Checking before creating PR
Your PR will pass the following checks, so before creating PR run these locally to make sure everything is ok:
$ clang-format project/src/transpiler/*.(cpp|h) -i
$ cpplint --filter=-runtime/references,-runtime/string,-build/c++11 project/src/transpiler/** $ cd project/scripts
$ python3 clang_tidy.py$ cd project/scripts $ python3 transpile.py <your_path_to_the_folder>/gcc.c-torture -s gcc -n
$ cd project/scripts $ python3 test.py -p <your_path_to_the_folder>/c-testcuite -s testcuite -n
- test
$ cd project/scripts $ python3 test.py -s test
- unit-tests
$ cd project/scripts $ python3 build_c2eo.py $ project/bin/ $ ./unit_tests --gtest_filter=*
How to release
From project/scripts/ directory:
$ python3 update-release.py -h usage: update-release.py [-h] [--branch BRANCH] [--version VERSION] Release maker optional arguments: -h, --help show this help message and exit --version VERSION specify the new version
Example
$ python3 update-release.py --version=0.1.1
To use this script, make sure you have the following packages installed:
$ pip3 install git_config pgpy s3cmd
$ apt install md5deep reprepro gcc cmake dpkg wget tar s3cmd -y
# for the latest version of the cmake package, try:
$ pip3 install cmakeNotes:
- Use
.as a version delimiter. - This script uses the current date, time, and time zone. Make sure they are configured correctly.
- This script extracts your name and email from
git config. Make sure you have them.
This script will write automatically generated merges to the changelog file. You can view an approximate list of changes by running the following command in the terminal:
$ git log $(git describe --tags --abbrev=0)..HEAD --merges --oneline --format=" * %h %s by %an <%aE>"
Algorithm:
- Build the executable file.
- Create a deb file (basic: HABR)
- Create a repository (basic: UNIXFORUM)
- Upload a repository tree into the bucket's virtual 'directory'.
The following files will be generated
$ tree
.
├── c2eo-X.X.X
│ ├── DEBIAN
│ │ ├── changelog
│ │ ├── control
│ │ ├── copyright
│ │ └── md5sums
│ └── usr
│ ├── bin
│ │ └── c2eo
│ └── lib
│ ├── libclangAnalysis.so
│ ├── libclangAnalysis.so.12
│ ├── ...
│ └── libLLVMTransformUtils.so.12
├── c2eo-X.X.X.deb
├── readme.md
├── repository
│ ├── conf
│ │ └── distributions
│ ├── db
│ │ ├── checksums.db
│ │ ├── contents.cache.db
│ │ ├── packages.db
│ │ ├── references.db
│ │ ├── release.caches.db
│ │ └── version
│ ├── dists
│ │ └── c2eo-rep
│ │ ├── contrib
│ │ │ ├── binary-amd64
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── debian-installer
│ │ │ │ ├── binary-amd64
│ │ │ │ │ ├── Packages
│ │ │ │ │ └── Packages.gz
│ │ │ │ └── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ └── Packages.gz
│ │ │ └── source
│ │ │ ├── Release
│ │ │ └── Sources.gz
│ │ ├── InRelease
│ │ ├── main
│ │ │ ├── binary-amd64
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── debian-installer
│ │ │ │ ├── binary-amd64
│ │ │ │ │ ├── Packages
│ │ │ │ │ └── Packages.gz
│ │ │ │ └── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ └── Packages.gz
│ │ │ └── source
│ │ │ ├── Release
│ │ │ └── Sources.gz
│ │ ├── non-free
│ │ │ ├── binary-amd64
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ ├── Packages.gz
│ │ │ │ └── Release
│ │ │ ├── debian-installer
│ │ │ │ ├── binary-amd64
│ │ │ │ │ ├── Packages
│ │ │ │ │ └── Packages.gz
│ │ │ │ └── binary-i386
│ │ │ │ ├── Packages
│ │ │ │ └── Packages.gz
│ │ │ └── source
│ │ │ ├── Release
│ │ │ └── Sources.gz
│ │ ├── Release
│ │ └── Release.gpg
│ └── pool
│ └── main
│ └── c
│ └── c2eo
│ └── c2eo_X.X.X_all.deb
├── todo.sh
└── update-release.py
35 directories, 120 files
Then you have to upload ./repository/dists and ./repository/pool to c2eo.polystat.org/debian/.
Principles of Transpilation from C to EO
C is a system-level procedural programming language with direct access to the underlying hardware architecture elements, such as memory and registers. EO, on the other hand is a high-level object-oriented language. There are a number of non-trivial mechanisms of translating constructs from the former to the latter, which are explained below:
✔️ Implemented:
- basic data types: double, int, bool, char, string
- const
- arrays
- structures
- unions
- functions
- function call operators
- multiple return
- pointers
- external links
- if-else
- ternary operator
- while
- do-while
- for
- break
- continue
- switch case default
- operators
✔️ Implemented:
Direct memory access for basic data types
Let's take the following C code as an example:
In EO, we represent the global memory space as a copy of ram object, which we call global. Thus, the variable z would be accessed as a block of 8 bytes inside ram at the very beginning, since it's the first variable seen. For example, to change the value of z we write 8 bytes to the 0th position of global:
ram > global global.write 0 (3.14.as-bytes)
Const
We transform const like ordinary variable.
const int a = 3; if (a == 10) { ... }
a.write-as-int32 3 // only once if a.read-as-int32.eq 10 seq ... True
Enums
We can work with enumerated types as well as with constants and substitute numeric values instead of names.
enum State {Working = 1, Failed = 0}; if (10 == Working) { ... }
if 10.eq 1 seq ... True seq True
Arrays
If we have fixed-size arrays we can work like with one-dimension array and calculate bias from start for any element and dimensions. In this example, we use a special object address, which makes it more convenient to read and write information from memory from a certain position.
int a[2] = { 5, 6 }; ╭─────┬─────╮ | 5 │ 6 │ ├─────┼─────┤ | 0th │ 4th │ ╰─────┴─────╯
address global-ram 0 > a a.write (4.mul 0) (5.as-bytes) a.write (4.mul 1) (6.as-bytes)
Structures
We know the size of structures so we generate additional objects that store the bias of the fields of the structure and allow access to them. For nested structures and other types, we can also calculate bias and generate corresponding objects.
struct Rectangle {int x; int y;} rect; rect.x = 5; ╭───────┬───────╮ | int x │ int y │ ├───────┼───────┤ | 0th │ 4th │ ╰───────┴───────╯
address global-ram 0 > rect 0 > x 4 > y (rect.add x).write 5
Unions
The size of the union is determined by the nested object with the maximum size. The main feature is that internal objects are located at the beginning of the same address. We do the same with nested structures.
union { int a; int b; } u; u.a = 5; ╭───────┬───────╮ | int a │ int b │ ├───────┼───────┤ | 0th │ 0th │ ╰───────┴───────╯
address global-ram 0 > u 0 > a 0 > b (u.add a).write 5
Functions
In a similar way we deal with function call, we calculate the necessary space for arguments (param-start and param-size) and local variables in global for each function call. The variable r will be "pushed" to global and accessible by the code inside the function foo by the 0th position with local offset. The local variable x will also be pushed to the global and will be accessible by the 4th with local offset, because the length of int is four.
Also we use separate copy of ram named return for storing function return result. Here, we are trying to simulate the bevaviour of a typical C compiler. The declaration of foo and its execution may look like this:
double pi = 3.14; void circle(int r) { double x = 2 * pi * r; return x; } circle(10); ╭──────────┬───────┬──────────╮ | double z │ int r │ double x │ // variables in global ├──────────┼───────┼──────────┤ | 0th │ 8th │ 12th │ // start position in global ╰──────────┴───────┴──────────╯
address global-ram 0 > pi [param-start param-size] > circle global.read param-start > r global.read (add param-start 4) > x seq > @ x.write (2.mul (pi.mul r)) return.write x seq pi.write 0 3.14 global.write 8 10 // write 10 to circle arguments stack circle 8 4 // arguments stack start from 8th byte and have 4 bytes for r
Function call operators
The function has input variables and local variables. To determine the amount of memory for input variables, we use two parameters in the function description. For the convenience of accessing local variables, we use the bias local-start of the local position. To indicate a free position, we use empty-local-position. We divide the nested function call into several consecutive calls, the result of which is passed to subsequent calls.
long long func1(long long x) { return x - 111; } long long func2(long long x) { return x - 10; } void main() { long long a; a = func1(func2(5)); printf("%lld\n", a); }
[param-start param-size] > func1 add param-start param-size > local-start add local-start 0 > empty-local-position address global-ram (param-start.add 0) > x seq > @ return.write (x.sub 111) TRUE [param-start param-size] > func2 add param-start param-size > local-start add local-start 0 > empty-local-position address global-ram (param-start.add 0) > x seq > @ return.write (x.sub 10) TRUE [] > main seq > @ a.write // write func1 return in a seq write // write func2 return in temp place address global-ram (add empty-local-position 0) seq write // write 5 to func2 arguments stack address global-ram (add empty-local-position 0) 5 ^.func2 empty-local-position 8 return ^.func1 empty-local-position 8 return printf "%d\n" a
Multiple return
We generate a record of the result in a separate ram memory object. Further, other functions can read the result from there. To solve the multiple return problem, we can use the goto object in eo. By wrapping the entire function in a similar object, we can interrupt its execution at any time. To do this, you just need to generate a g.forward call for each return.
function { ... return <result_1>; ... return <result_2>; ... return <result_3>; }
[] > function goto > @ [g] seq > @ ... return.write <result_1> g.forward TRUE ... return.write <result_2> g.forward TRUE ... return.write <result_3> g.forward TRUE
Pointers
C code may get an address of a variable, which is either in stack or in global memory:
int f = 7; void bar() { int t = 42; int* p = &t; // local scope *p = 500; // write from local scope to local p = &f; // global scope *p = 500; // write from local scope to global } ╭───────┬───────┬────────╮ | int f │ int t │ int* p │ // variables in global ├───────┼───────┼────────┤ | 0th │ 4th │ 8th │ // start position in global ╰───────┴───────┴────────╯
However, as in C, our variables are located in global and have absolute address.
The object param-start provided as an argument to EO object bar is a calculated offset in global addressing the beginning of the frame for function call. Thus, &t would return param-start + 0, while &f would be just 0:
[param-start] > bar global.write 8 // int* p param-start // &t -> function offset position in global space global.write 8 0 // &f -> address of f in global seq > @ bar 4
External links
To compile files with any external links, we use the following solution:
In the file where the external call is used, we generate the following alias
#include <string> strncpy(str2, str1, 8);
+alias c2eo.external.strcpy strncpy str2 st1 8
Сreating a file of the same name by the specified alias with an empty implementation
+package c2eo.external [args...] > strncpy TRUE > @
If-else
In EO, we have analogues of if-else and if objects, so we just convert without any significant changes.
if (condition) { ... } else { ... } // ----------------- if (condition) { ... }
if-else condition seq ... TRUE seq // else ... TRUE // ----------------- if condition seq ... TRUE
Ternary operator
We can turn the ternary operator into the same if-else, only seq must be without True at the end, because its return value will be used.
if-else condition seq a seq // else b
While
We can generate of C while on the EO by using goto, conditional operator and analogs for break and continue.
while (condition) { ... }
goto [while-loop-label] while-loop-label.backward > continue while-loop-label.forward TRUE > break if > @ condition seq body continue TRUE
Do-while
We can generate an analog of C do-while on EO by using nested goto for further checking by a conditional operator and analogs for break and continue.
do { body } while (condition)
goto [do-while-loop-label-1] do-while-loop-label-1.forward TRUE > break seq > @ goto [do-while-loop-label-2] do-while-loop-label-2.forward TRUE > continue body > @ if condition do-while-loop-label-1.backward TRUE
For
We can generate an analog of C for on EO using the nested goto to execute loop-expression after executing the body of the loop, conditional operator and analogs for break and continue.
for(init;condition;loop-expression) { body }
init goto [for-loop-label-1] for-loop-label-1.forward TRUE > break if > @ condition seq goto [for-loop-label-2] for-loop-label-2.forward TRUE > continue body > @ loop-expression for-loop-label-1.backward TRUE
Break
With goto object we can transofrm any number of breaks in cycle to g.forward TRUE call.
while (condition) { ... break; ... }
goto [while-loop-label] while-loop-label.backward > continue while-loop-label.forward TRUE > break if > @ condition seq ... break ... TRUE
Continue
With goto object we can transofrm any number of continue in cycle to g.backward call.
while (condition) { ... continue; ... }
goto [while-loop-label] while-loop-label.backward > continue while-loop-label.forward TRUE > break if > @ condition seq ... continue ... TRUE
Switch case default
We can convert such simple switch statement to goto object.
switch (x): { case 1: op1; break; case 2: case 3: op2; break; case 4: op3; case 5: op4; break; case 6: default: op6: break; }
memory > flag goto > @ [end] seq > @ write flag 0 if or (eq x 1) flag seq write flag 1 op1 end.forward TRUE TRUE if or (eq x 2) flag seq write flag 1 TRUE if or (eq x 3) flag seq write flag 1 op2 end.forward TRUE TRUE if or (eq x 4) flag seq write flag 1 op3 TRUE if or (eq x 5) flag seq write flag 1 op4 end.forward TRUE TRUE if or (eq x 6) flag seq write flag 1 TRUE op6 end.forward TRUE TRUE
Operators
The table of all C operators and similar objects in the EO.
| С | EO |
|---|---|
| + | plus |
| - | minus |
| * | times |
| * | write|read-as-<type> |
| / | div |
| = | write-as-<type> |
| % | mod |
| +x | pos |
| -x | neg |
| ++x | pre-inc-<type> |
| x++ | post-inc-<type> |
| --x | pre-dec-<type> |
| x-- | post-dec-<type> |
| == | eq |
| != | neq |
| < | lt |
| <= | lte |
| > | gt |
| >= | gte |
| && | and |
| || | or |
| ! | not |
| & | bit-and |
| & | addr-of |
| | | bit-or |
| ^ | bit-xor |
| ~ | bit-not |
| << | shift-right |
| >> | shift-left |
| (type casting) | as-<type> |
For assignment operations, we generate the following constructs
⚠️ Partially implemented:
Basic types
In EO, an implementation of at least 8 bytes is used to store floating-point numbers. At the moment, full support for numbers with fewer bytes is not possible. So far, to work with such numbers, we also use 8 bytes for storage.
float b = 5.0; // 4 bytes
write-as-float32 b 5.0 // 8 bytes
At the moment, the largest type in EO is int64, there is no support for uint64 numbers and it crashes with an error at the compilation stage. The current our implementation supports numbers in the range of type uint63 (the first bit is always 0 for correct translation to int).
unsigned long long int c = 18446744073709551615; // max uint64 value unsigned long long int c = 9223372036854775807; // max int64 value
// [COMPILATION EXCEPTION] the number is too high write-as-uint64 c 10223372036854775807 // correct write-as-uint64 d 9223372036854775807
Pointers on function
Source: https://stackoverflow.com/questions/840501/how-do-function-pointers-in-c-work
Let's start with a basic function which we will be pointing to:
int addInt(int n, int m) { return n + m; }
First thing, let's define a pointer to a function which receives 2 ints and returns an int:
int (*functionPtr)(int, int);
Now we can safely point to our function:
In EO we generate special object call with array for storing all function call:
[index param-start param-size] > call at. > @ * <function_name_1> param-start param-size addInt param-start param-size // our function has an index of 1 ... <function_name_n_n> param-start param-size index
Now, if we want to assign the function to a pointer, we replace this expression with a specific index value of this function in our array
write-as-ptr functionPtr 1
Now that we have a pointer to the function, let's use it:
int sum = (*functionPtr)(2, 3); // sum == 5
... // before calling the function, we place its arguments in memory write-as-int32 sum call param-start param-size read-as-ptr functionPtr // return 1
Current development at this stage
Passing the pointer to another function is basically the same:
int add2to3(int (*functionPtr)(int, int)) { return (*functionPtr)(2, 3); }
We can use function pointers in return values as well (try to keep up, it gets messy):
// this is a function called functionFactory which receives parameter n // and returns a pointer to another function which receives two ints // and it returns another int int (*functionFactory(int n))(int, int) { printf("Got parameter %d", n); int (*functionPtr)(int, int) = &addInt; return functionPtr; }
But it's much nicer to use a typedef:
typedef int (*myFuncDef)(int, int); // note that the typedef name is indeed myFuncDef myFuncDef functionFactory(int n) { printf("Got parameter %d", n); myFuncDef functionPtr = &addInt; return functionPtr; }
❌ Not implemented
Goto and labels
The EO language has a goto object that supports transitions similar to the continue and break statements. We also use it to implement multiple return statements to terminate a function.
The goto statements in the C language provides a variety of jumps to the appropriate labels, violating the principles of structured programming. It cannot be implemented for many situations using the goto object that EO has. Replacing goto in C with other statements requires additional effort. It is very difficult to implement based on AST analysis alone. For example:
if (a) { A(); goto L3; } B(); L1: if (b) { L2: C(); L3: D(); goto L1; } else if (c) { E(); goto L2; } F();
stateDiagram-v2
state "if (a)" as if_1
state "if (b)" as if_2
state "else if (c)" as if_3
state "L1:" as L1
state "L2:" as L2
state "L3:" as L3
state "A();" as A
state "B();" as B
state "C();" as C
state "D();" as D
state "E();" as E
state "F();" as F
[*] --> if_1
if_1 --> A: True
A --> L3
if_1 --> B: False
B --> L1
L1 --> if_2
if_2 --> L2: True
L2 --> C
C --> L3
L3 --> D
D --> L1
if_2 --> if_3: False
if_3 --> E: True
E --> L2
if_3 --> F: False
F --> [*]
To solve this problem in the future, we can propose to implement goto-statement and goto-label objects in EO, which directly implement the goto semantics of the C language and are used when transforming from C to EO. The use of these objects in direct EO programming can be disabled. In addition, these constructs can be replaced when performing static analysis of EO programs.
Calling functions with variable number of arguments
Modern C compilers do not have direct support functions with a variable number of arguments. This is due to the use of new standards for passing arguments through registers. Therefore, an additional library is used for implementation, the interface of which is connected via the stdarg.h header file. The library describes such constructs as va_list\verb, va_start, va_arg, va_end and others that are added to a C program at link time. Therefore, the implementation of these functions during transformation is impossible. For example:
#include <stdarg.h> #include <stdio.h> double average(int num,...) { va_list valist; double sum = 0.0; int i; /* initialize valist for num number of arguments */ va_start(valist, num); /* access all the arguments assigned to valist */ for (i = 0; i < num; i++) { sum += va_arg(valist, int); } /* clean memory reserved for valist */ va_end(valist); return sum / num; } int main() { printf("Average of 1, 2, 3, 4 = %f\n", average(4, 1, 2, 3, 4)); printf("Average of 1, 2, 3 = %f\n", average(3, 1, 2, 3)); }
In the further development of the project, it may be to try implementing a linker that combines EO library packages with compilation units obtained during the transformation of C programs into a single program. Another possible, but labor-intensive option could be additional parsing of functions that support a variable number of parameters. This option introduces additional features, but leads to deviations from the standard of C compiler.
Bitwise fields
In the C language, bitwise fields can be formed as structures. They provide access to individual bits of signed and unsigned numbers. Since at the moment EO does not support work at the bit level, the implementation of work with bit fields requires a lot of time and manipulations with logical operations at the byte level. This task was not a priority during the current development.
// memory-optimized date storage structure struct date { unsigned int day: 5; // the maximum value of days is 31, so we need 5 bits for this unsigned int month: 4; // the maximum value of months us 12, so weed 4 bits for this unsigned int year; }; struct date d = {15, 7, 2022}; printf("Date size is: %lu bytes\n", sizeof(d)); // 8 bytes instead 12 printf("Date is %d.%d.%d", d.day, d.month, d.year); // 15.7.2022