Saturday, July 18, 2015

Source code for Unicode encoding/decoding in C with embedded Python

Unicode is important because it is a standard that lets computers consistently represent and manipulate text from any existing writing system.

Here I present C source code that displays non-ASCII characters properly in the console. To display non-ASCII characters, we need to implement a Unicode encoding/decoding process. There are various ways to implement Unicode handling in C; here I have embedded Python to do it.
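To make "encoding" concrete, here is a minimal sketch of how a single Unicode code point is turned into UTF-8 bytes. It is not part of the program below and is not how Python implements it internally; the function name utf8_encode is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns how many were written,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

For example, the character 象 is code point U+8C61, and utf8_encode(0x8C61, buf) fills buf with the three bytes E8 B1 A1. This is why a three-character Chinese string occupies nine bytes in UTF-8.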

To embed Python in C, we need to include the Python.h header in the program. The python-dev package, which contains Python.h, must be installed before building this program. The code also includes glib.h, so the GLib development headers are needed as well.

More detail on embedding Python in C is described here.




Source code
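/* Python.h goes first, as the Python documentation recommends.
   The include path normally comes from python-config --includes. */
#include <Python.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <glib.h>

/* Decode buffer from charset into a Python unicode object, re-encode it as
   UTF-8, and return a malloc'd copy that the caller must free().
   This uses the Python 2 C API (PyString_AsString); under Python 3 the
   equivalent call is PyBytes_AsString. */
char *get_encoded_msg(const char *buffer, const char *charset)
{
    Py_ssize_t ssize = (Py_ssize_t)strlen(buffer);
    PyObject *pyobject_unicode = PyUnicode_Decode(buffer, ssize, charset, "replace");
    if (pyobject_unicode == NULL)
    {
        PyErr_Print();   /* report and clear the pending Python exception */
        printf("decode failed for: %s\n", buffer);
        return NULL;
    }
    PyObject *pystring = PyUnicode_AsUTF8String(pyobject_unicode);
    Py_DECREF(pyobject_unicode);   /* not needed any more, even if encoding failed */
    if (pystring == NULL)
    {
        PyErr_Print();
        printf("UTF-8 encode failed for: %s\n", buffer);
        return NULL;
    }
    /* PyString_AsString returns a pointer into pystring, so copy it
       before dropping the reference. */
    const char *encoded_str = PyString_AsString(pystring);
    char *encoded_str_dup = (encoded_str != NULL) ? strdup(encoded_str) : NULL;
    Py_DECREF(pystring);
    if (encoded_str_dup == NULL)
        return NULL;
    printf("Encoded string: %s\n", encoded_str_dup);

    /* g_utf8_strlen counts characters, not bytes; -1 means "read up to
       the NUL terminator" instead of a hard-coded byte limit. */
    glong new_glength = g_utf8_strlen(encoded_str_dup, -1);
    printf("new length = %ld\n", new_glength);

    const char *test = "laxmi";
    glong new_len = g_utf8_strlen(test, -1);
    printf("new = %ld\n", new_len);

    /* Take the first 3 characters, or fewer if the string is shorter. */
    glong end = (new_glength < 3) ? new_glength : 3;
    gchar *required_message = g_utf8_substring(encoded_str_dup, 0, end);
    printf("final value = %s\n", required_message);
    g_free(required_message);

    return encoded_str_dup;
}

int main(void)
{
    /* Initialize the Python interpreter */
    Py_Initialize();

    /* A mixed input such as "象形字 xiàngxíngzì" can be passed the same way. */
    const char *message = "象形字";
    const char *charset = "UTF-8";

    /* strlen counts bytes, so this prints 9 for the 3-character string. */
    printf("length of unicode string = %zu\n", strlen(message));

    char *encoded_msg = get_encoded_msg(message, charset);
    free(encoded_msg);   /* free(NULL) is a no-op */

    Py_Finalize();
    return 0;
}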
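The program above leans on GLib to count and slice by characters rather than bytes. As a rough sketch of what such helpers do, and not GLib's actual implementation, the same idea can be written in plain C by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The names utf8_count and utf8_prefix are made up for this illustration, and both assume the input is already valid UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in a
 * NUL-terminated, valid UTF-8 string. Every byte that is not a
 * continuation byte starts a new character. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: return a newly allocated copy of the first
 * count characters of s (or all of s if it is shorter).
 * The caller must free() the result. Returns NULL if malloc fails. */
char *utf8_prefix(const char *s, size_t count)
{
    assert(s != NULL);
    const char *p = s;
    size_t seen = 0;
    while (*p != '\0') {
        if (((unsigned char)*p & 0xC0) != 0x80) {
            if (seen == count)
                break;          /* p is at the start of character number count */
            seen++;
        }
        p++;
    }
    size_t nbytes = (size_t)(p - s);
    char *out = malloc(nbytes + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s, nbytes);
    out[nbytes] = '\0';
    return out;
}
```

With these, utf8_count("象形字") gives 3 where strlen gives 9, and utf8_prefix("象形字", 2) returns a six-byte string holding the first two characters. Cutting at a byte offset instead could split a character in the middle and produce garbage on the console.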

