Sunday, January 29, 2017

Effective C++ 50 Items Notes




Effective C++ 50 Items

  1. Prefer const and inline to #define
    1. const
      1. in class
      2. static member
      3. enum trick
    2. macro
      1. use inline function instead
      2. use template together with the inline function
  2. Prefer <iostream> to <stdio.h>
    1. <iostream>: functions are in the std namespace.
    2. <iostream.h> functions are in the global scope.
  3. Prefer new and delete to malloc and free
    1. with malloc/free, constructors and destructors are not involved and this is not good.
  4. Prefer C++ style comments
  5. Use the same form in corresponding uses of new and delete
    1. pay attention to typedef, especially when the array is involved in the definition.
  6. Use delete on pointer members in destructors
    1. adding a pointer member almost always requires each of the following:
      1. initialization of the pointer in each of the constructor.
      2. deletion of the existing memory and assignment of new memory in the assignment operator.
      3. deletion of the pointer in the destructor.
  7. Be prepared for out-of-memory condition
    1. assertions are checked in debug mode when NDEBUG isn't defined.
    2. a well-defined new-handler function must do one of the following
      1. make more memory available
        1. one way to implement this strategy is to allocate a large block of memory at program start-up, then release it in the first time the new handler is invoked.
      2. install a different new-handler
      3. de-install the new-handler
        1. pass the null pointer to set_new_handler
      4. throw an exception
      5. not return
        1. call abort or exit
  8. Adhere to convention when writing operator new and operator delete
    1. operator new actually tries to allocate memory more than once.
    2. only when the pointer to the error-handling function is null does operator new throw an exception.
    3. operator new is inherited by subclasses.
    4. new T[5] calls operator new[](count) where count = sizeof(T) * 5 + overhead.
    5. because the operator new or operator new[] can be called by a subclass, essentially, we cannot know the exact size of the element, it can be either sizeof(base) or the sizeof(derived)
  9. Avoid hiding the "normal" form of new
    1.  write a class-specific operator new that supports the normal invocation form
    2. an alternative is to provide a default parameter value for each additional parameter you add to operator new.
  10. write operator delete if you write operator new
    1. the main reason we want to overload the operator new and operator delete is performance.
    2. the default operator new/delete may need to handle the communication between these two operators and additional header is needed.
    3. we may need to have a virtual function in the class to make the customized operator delete work.
  11. Declare a copy constructor and an assignment operator for classes with dynamically allocated memory.
    1. write your own copy constructor and assignment operator if you have any pointers in your class.
  12. Prefer initialization to assignment in constructors
  13.  List members in an initialization list in the order in which they are declared
    1. class members are initialized in the order of their declaration in the class.
    2. base class data members are initialized before derived class data members.
  14. Make sure base classes have virtual destructor
    1. you have a class that you'd like to be abstract, but you don't happen to have any functions that are pure virtual. In this situation, you can declare a pure virtual destructor.
  15. Have operator= return a reference to *this
  16. Assign to all data members in operator=
    1. be careful about the implementation in the derived class. Most of the time, we need to explicitly call the operator=, copy constructor of the base class.
  17. Check for assignment to self in operator=
  18. Strive for class interfaces that are complete and minimal
  19. Differentiate among member functions, non-member functions and friend functions
    1. The biggest difference between member functions and non-member functions is that member functions can be virtual and non-member functions cannot.
    2. virtual functions must be members
    3. operator>> and operator<< are never members.
    4.  Only non-member functions get type conversions on their left-most argument.
    5. Everything else should be a member function.
  20. Avoid data members in the public interface.
  21. Use const whenever possible
    • mutable keyword
    • const member functions
      • bitwise constness
      • conceptual constness
  22. Prefer pass-by-reference to pass-by-value
  23. Don't try to return a reference when you must return an object
  24. Choose carefully between function overloading and parameter defaulting
  25. Avoid overloading on a pointer and a numerical type
    • Think about the following example
      • void f(int x);
      • void f(string *ps);
      • call f(0) 
    • member function template
  26. Guard against potential ambiguity
  27. Explicitly disallow use of implicitly generated member functions you don't want
    • declare the function private
    • do not implement the function
  28. Partition the global namespace
  29. Avoid returning "handlers" to internal data
    • conversion of types
  30. Avoid member functions that return non-const pointers or reference to members less accessible than themselves.
  31. Never return a reference to a local object or to a dereference pointer initialized by new within the function.
  32. Postpone variable definitions as long as possible
  33.  Use inlining judiciously
    • The inline directive, like register, is a hint to compilers, not a command.
  34. Minimize compilation dependencies between files
    • Avoid using objects when object references and pointers will do
    • use class declarations instead of class definitions whenever you can.
    • don't #include header files in your header files unless your headers won't compile without them
  35. Make sure public inheritance models "is-a"
  36. Differentiate between inheritance of interface and inheritance of implementation.
    • the purpose of declaring a pure virtual function is to have derived classes inherit a function interface only.
    • it is possible to provide a definition for a pure virtual function. But the only way to call it would be to fully specify the call with the class name.
    • The purpose of declaring a simple virtual function is to have derived classes inherit a function interface as well as a default implementation.
    • the purpose of declaring a non-virtual function is to have derived classes inherit a function interface as well as a mandatory implementation. 
  37. Never redefine an inherited non virtual function
  38. Never redefine an inherited default parameter value
    • virtual functions are dynamically bound but default parameter values are statically bound.
    • the default parameters are determined during the compilation time. The reason for this behaviour is efficiency.
  39. Avoid casts down the inheritance hierarchy
  40. Model "has-a" or "is-implemented-in-term-of" through layering.
  41. Differentiate between inheritance and templates
    • Try to answer the question: does the type T affect the behaviour of the class? If T does not affect the behaviour, you can use a template. If T does affect the behaviour, you'll need virtual functions and you'll therefore use inheritance.
  42. Use private inheritance judiciously
    • In contrast to public inheritance, compilers will generally not convert a derived class object into a base class object if the inheritance relationship between the classes is private.
    • Private inheritance means is-implemented-in-terms-of.
    •  Private inheritance is purely an implementation technique.
    • Private inheritance means that implementation only should be inherited; interface should be ignored.
    • Private inheritance means nothing during software design, only during software implementation.
  43. Use multiple inheritance judiciously
  44. Say what you mean; understand what you are saying
    • A common base class means common traits.
    • Public inheritance means "is-a".
    • Private inheritance means "is-implemented-in-terms-of".
    • Layering or composition means "has-a" or "is-implemented-in-terms-of"
    • A pure virtual function means that only the function's interface is inherited.
    • A simple virtual function means that the function's interface plus a default implementation is inherited.
    • A non-virtual function means that the function's interface plus a mandatory implementation is inherited.
  45. Know what functions C++ silently writes and calls
  46. Prefer compile-time and link-time errors to runtime errors
  47. Ensure that non-local static objects are initialized before they're used
    • singleton pattern
    • the order of non-local static objects
    • use function that returns the reference of a static variable
  48. Pay attention to compiler warnings
  49. Familiarize yourself with the standard library
  50. Improve your understanding of C++












Saturday, January 7, 2017

The Algorithm Design Manual - Exercise 3-15



Here is the statement of the exercise 3-15 in the Algorithm Design Manual.

[8] Design a data structure that allows one to search, insert, and delete an integer
X in O(1) time (i.e. , constant time, independent of the total number of integers
stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available,
where m is the maximum number of integers that can be in the table at any one
time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize
either A or B, as that would take O(m) or O(n) operations. This means the arrays
are full of random garbage to begin with, so you must be very careful.


In addition to the original statement, we impose that if an integer already exists in the data structure, it is not allowed to insert it again unless it is deleted.

Following the hint, we will use array B to store the integers and array A to map the integer k to the index of B where it is stored. In other word, B[A[k]] = k.

As we are not allowed to initialize the two arrays, we will use an incremental approach. We keep track of the next available position in array B and update this information every a element is inserted or deleted. Note that we cannot use any space other than array A and array B.

Here is one solution that might work. We can use the array B itself to track the current state of the data structure. We define the following rules:
(1) if B[i] = k > 0, it means that the k is stored in B[i];
(2) if B[i] = j < 0, it means that B[j] is also a available space.
(3) finally, if B[M] = i < 0, it means that the next available position is B[-i]

The code is listed below:


#include<stdio.h>

#define False 0
#define True (!False)

#define N 10
#define M 5

int A[N + 1];
int B[M + 1];

void initialize();
void insert(int k);
int exist(int k);
void delete(int k);
void print_status();



void initialize()
{
 A[1] = 0;
 B[1] = -2; /* the next available position pointed by B[1] is B[2] */
 B[M] = -1; /* initially, the first spot is available. */
 return;
}


void insert(int k)
{
 printf("insert %d\n", k);
 A[k] = -B[M];    /* link A[k] to the next available position. */
 B[M] = B[A[k]];  /* update the next available position. */
 B[A[k]] = k;     /* put k into the current available position. */

 /* test if next available position has a link to the next-next available position */
 if (B[-B[M]] < 0 && B[-B[M]] >= -M &&  ( B[-B[-B[M]]] < 0 || B[-B[-B[M]]] > N || A[B[-B[-B[M]]]] != B[-B[-B[M]]]))
  return;

 /* in this case, the next-next available position is the one right after the -B[M] */
 if (B[M] < 0)
  B[-B[M]] = B[M] - 1;



}

int exist(int k)
{
 printf("query: does %d exist? ", k);
 if (k >= 1 && k <= N && A[k] >= 1 && A[k] <= M && B[A[k]] == k) {
  printf("Yes\n");
  return True;
 } else {
  printf("No\n");
  return False;
 }
}

void delete(int k)
{
 printf("delete %d\n", k);
 if (exist(k)) {

  if (B[M] < 0) {
   /* B is not full.
    * Suppose we have A[k] <-->B[i] and B[M] --> B[j].
    * After the call, we want to have
    * A[k] = 0 and B[M] --> B[i] and B[i] --> B[j]
    */
   B[A[k]] = B[M];
   B[M] = -A[k];
   A[k] = 0;

  } else {    /* B is full*/
   if (A[k] == M) {
    /* k is put in B[M] and we have A[k] <--> B[M].
     * so we remove the link and set next available spot to M,
     *  */
    B[M] = -M;
    A[k] = 0;

   } else {
    /* k is not in B[M].
     * suppose A[k] <--> B[i] and A[s] <-->B[M]
     * what we want to have after this call is
     * A[k] = 0 and A[s] <-->B[i] and B[M] = -M
     */

    A[B[M]] = A[k];
    B[A[k]] = B[M];
    B[M] = -M;
    A[k] = 0;

   }

  }



 } else {
  printf("%d does not exist.", k);
  return;
 }
}

void print_status()
{
 printf("current status:\n");
 printf("A: ");
 for (int i=0; i <= N; ++i) printf("%d ", A[i]);
 printf("\n");

 printf("B: ");
 for (int i=0; i <= M; ++i) printf("%d ", B[i]);
 printf("\n\n");
}


int main()
{

 initialize(); print_status();

 insert(1); print_status();
 insert(2); print_status();
 exist(3);  print_status();
 exist(4);  print_status();
 insert(3); print_status();
 exist(3);  print_status();
 insert(4); print_status();
 delete(2); print_status();
 exist(2);  print_status();
 insert(5); print_status();
 insert(6); print_status();
 delete(5); print_status();
 exist(5);  print_status();
 insert(7); print_status();
 delete(7); print_status();
 insert(8); print_status();
 delete(3); print_status();


}


And the output is listed below:

current status:
A: 0 0 0 0 0 0 0 0 0 0 0
B: 0 -2 0 0 0 -1

insert 1
current status:
A: 0 1 0 0 0 0 0 0 0 0 0
B: 0 1 -3 0 0 -2

insert 2
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

query: does 3 exist? No
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

query: does 4 exist? No
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

insert 3
current status:
A: 0 1 2 3 0 0 0 0 0 0 0
B: 0 1 2 3 -5 -4

query: does 3 exist? Yes
current status:
A: 0 1 2 3 0 0 0 0 0 0 0
B: 0 1 2 3 -5 -4

insert 4
current status:
A: 0 1 2 3 4 0 0 0 0 0 0
B: 0 1 2 3 4 -5

delete 2
query: does 2 exist? Yes
current status:
A: 0 1 0 3 4 0 0 0 0 0 0
B: 0 1 -5 3 4 -2

query: does 2 exist? No
current status:
A: 0 1 0 3 4 0 0 0 0 0 0
B: 0 1 -5 3 4 -2

insert 5
current status:
A: 0 1 0 3 4 2 0 0 0 0 0
B: 0 1 5 3 4 -5

insert 6
current status:
A: 0 1 0 3 4 2 5 0 0 0 0
B: 0 1 5 3 4 6

delete 5
query: does 5 exist? Yes
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

query: does 5 exist? No
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

insert 7
current status:
A: 0 1 0 3 4 0 2 5 0 0 0
B: 0 1 6 3 4 7

delete 7
query: does 7 exist? Yes
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

insert 8
current status:
A: 0 1 0 3 4 0 2 0 5 0 0
B: 0 1 6 3 4 8

delete 3
query: does 3 exist? Yes
current status:
A: 0 1 0 0 4 0 2 0 3 0 0
B: 0 1 6 8 4 -5





--END--

Friday, January 6, 2017

Python - Pandas - How to read compressed SAS data


Pandas provides the function read_sas to read the sas data. It supports two format: (1) ‘xport’ and (2) ‘sas7bdat’. Sometimes the data is really large and is provided in a compressed file. Unfortunately, it seems that pandas does not support reading from the compressed sas data directly. 


In this post, we will introduce a not-quite-save way to read from the zipped sas data. In order to do so, we need three packages:

  1. gzip
  2. io
  3. pandas
If we try the following code, it is not going to work. The reason is the in the compressed gz file, there exists meta information about the original file and the system in the header before the actual data. So when we pass the file object to the pd.read_sas, the function complains that the file is not a SAS file.

f = gzip.GzipFile('your_input_file.sas7bdat.tar.gz', 'rb')
df = pd.read_sas(f, format='sas7bdat')


So how to address this problem? There is a very straightforward way to handle this, but it is not very safe (this point will be explained later). The simple idea is try to skip the meta information header in the gz file.

Here is the code. In part 1, we overwrite the seek function of the gzip.GzipFile. The function takes into account the fact that we skip the header and when the seek function is called in the pd.read_sas function, an additional offset (of the header) is added to the original offset. In part 2, we just guess where the meta header information ends.

As mentioned previously, this method is not safe because when we overwrite the seek function, we do not take into account the whence parameter. Fortunately, this code can work (at least for some small files).


# part 1: change the "seek" behavior of the file object.
class FileObjGZ(gzip.GzipFile):
    def set_gz_offset(self, val):
        self._gz_offset = val

    def seek(self, offset, whence=io.SEEK_SET):
        new_offset = offset + self._gz_offset
        super(FileObjGZ, new_offset).seek(new_offset)


input_file = 'your_input_file.sas7bdat.tar.gz'

# part 2: try to skip the meta header
with FileObjGZ(input_file, 'rb') as f:
    guess_gz_offset = 0
    while True:
        try:
            f.set_gz_offset(guess_gz_offset)
            f.seek(0)
            df = pd.read_sas(f, format='sas7bdat')
            break
        except ValueError as e:
            print(e)
            guess_gz_offset += 1
    print("loading data is completed.")



---END--