NaturallyStupidPotato: January 2017

Sunday, January 29, 2017

Effective C++ 50 Items Notes

Effective C++ 50 Items

Prefer const and inline to #define

const

in class
static member
enum trick

macro

use inline function instead
use template together with the inline function

Prefer <iostream> to <stdio.h>

<iostream>: functions are in the std namespace.
<iostream.h> functions are in the global scope.

Prefer new and delete to malloc and free

with malloc/free, constructors and destructors are not involved and this is not good.

Prefer C++ style comments
Use the same form in corresponding uses of new and delete

pay attention to typedef, especially when the array is involved in the definition.

Use delete on pointer members in destructors

adding a pointer member almost always requires each of the following:

initialization of the pointer in each of the constructor.
deletion of the existing memory and assignment of new memory in the assignment operator.
deletion of the pointer in the destructor.

Be prepared for out-of-memory condition

assertions are checked in debug mode when NDEBUG isn't defined.
a well-defined new-handler function must do one of the following

make more memory available

one way to implement this strategy is to allocate a large block of memory at program start-up, then release it in the first time the new handler is invoked.

install a different new-handler
de-install the new-handler

pass the null pointer to set_new_handler

throw an exception
not return

call abort or exit

Adhere to convention when writing operator new and operator delete

operator new actually tries to allocate memory more than once.
only when the pointer to the error-handling function is null does operator new throw an exception.
operator new is inherited by subclasses.
new T[5] calls operator new[](count) where count = sizeof(T) * 5 + overhead.
because the operator new or operator new[] can be called by a subclass, essentially, we cannot know the exact size of the element, it can be either sizeof(base) or the sizeof(derived)

Avoid hiding the "normal" form of new

write a class-specific operator new that supports the normal invocation form
an alternative is to provide a default parameter value for each additional parameter you add to operator new.

write operator delete if you write operator new

the main reason we want to overload the operator new and operator delete is performance.
the default operator new/delete may need to handle the communication between these two operators and additional header is needed.
we may need to have a virtual function in the class to make the customized operator delete work.

Declare a copy constructor and an assignment operator for classes with dynamically allocated memory.

write your own copy constructor and assignment operator if you have any pointers in your class.

Prefer initialization to assignment in constructors
List members in an initialization list in the order in which they are declared

class members are initialized in the order of their declaration in the class.
base class data members are initialized before derived class data members.

Make sure base classes have virtual destructor

you have a class that you'd like to be abstract, but you don't happen to have any functions that are pure virtual. In this situation, you can declare a pure virtual destructor.

Have operator= return a reference to *this
Assign to all data members in operator=

be careful about the implementation in the derived class. Most of the time, we need to explicitly call the operator=, copy constructor of the base class.

Check for assignment to self in operator=
Strive for class interfaces that are complete and minimal
Differentiate among member functions, non-member functions and friend functions

The biggest difference between member functions and non-member functions is that member functions can be virtual and non-member functions cannot.
virtual functions must be members
operator>> and operator<< are never members.
Only non-member functions get type conversions on their left-most argument.
Everything else should be a member function.

Avoid data members in the public interface.
Use const whenever possible

mutable keyword
const member functions

bitwise constness
conceptual constness

Prefer pass-by-reference to pass-by-value
Don't try to return a reference when you must return an object
Choose carefully between function overloading and parameter defaulting
Avoid overloading on a pointer and a numerical type

Think about the following example

void f(int x);
void f(string *ps);
call f(0)

member function template

Guard against potential ambiguity
Explicitly disallow use of implicitly generated member functions you don't want

declare the function private
do not implement the function

Partition the global namespace
Avoid returning "handlers" to internal data

conversion of types

Avoid member functions that return non-const pointers or reference to members less accessible than themselves.
Never return a reference to a local object or to a dereference pointer initialized by new within the function.
Postpone variable definitions as long as possible
Use inlining judiciously

The inline directive, like register, is a hint to compilers, not a command.

Minimize compilation dependencies between files

Avoid using objects when object references and pointers will do
use class declarations instead of class definitions whenever you can.
don't #include header files in your header files unless your headers won't compile without them

Make sure public inheritance models "is-a"
Differentiate between inheritance of interface and inheritance of implementation.

the purpose of declaring a pure virtual function is to have derived classes inherit a function interface only.
it is possible to provide a definition for a pure virtual function. But the only way to call it would be to fully specify the call with the class name.
The purpose of declaring a simple virtual function is to have derived classes inherit a function interface as well as a default implementation.
the purpose of declaring a non-virtual function is to have derived classes inherit a function interface as well as a mandatory implementation.

Never redefine an inherited non virtual function
Never redefine an inherited default parameter value

virtual functions are dynamically bound but default parameter values are statically bound.
the default parameters are determined during the compilation time. The reason for this behaviour is efficiency.

Avoid casts down the inheritance hierarchy
Model "has-a" or "is-implemented-in-term-of" through layering.
Differentiate between inheritance and templates

Try to answer the question: does the type T affect the behaviour of the class? If T does not affect the behaviour, you can use a template. If T does affect the behaviour, you'll need virtual functions and you'll therefore use inheritance.

Use private inheritance judiciously

In contrast to public inheritance, compilers will generally not convert a derived class object into a base class object if the inheritance relationship between the classes is private.
Private inheritance means is-implemented-in-terms-of.
Private inheritance is purely an implementation technique.
Private inheritance means that implementation only should be inherited; interface should be ignored.
Private inheritance means nothing during software design, only during software implementation.

Use multiple inheritance judiciously
Say what you mean; understand what you are saying

A common base class means common traits.
Public inheritance means "is-a".
Private inheritance means "is-implemented-in-terms-of".
Layering or composition means "has-a" or "is-implemented-in-terms-of"
A pure virtual function means that only the function's interface is inherited.
A simple virtual function means that the function's interface plus a default implementation is inherited.
A non-virtual function means that the function's interface plus a mandatory implementation is inherited.

Know what functions C++ silently writes and calls
Prefer compile-time and link-time errors to runtime errors
Ensure that non-local static objects are initialized before they're used

singleton pattern
the order of non-local static objects
use function that returns the reference of a static variable

Pay attention to compiler warnings
Familiarize yourself with the standard library
Improve your understanding of C++

Saturday, January 7, 2017

The Algorithm Design Manual - Exercise 3-15

Here is the statement of the exercise 3-15 in the Algorithm Design Manual.

[8] Design a data structure that allows one to search, insert, and delete an integer
X in O(1) time (i.e. , constant time, independent of the total number of integers
stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available,
where m is the maximum number of integers that can be in the table at any one
time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize
either A or B, as that would take O(m) or O(n) operations. This means the arrays
are full of random garbage to begin with, so you must be very careful.

In addition to the original statement, we impose that if an integer already exists in the data structure, it is not allowed to insert it again unless it is deleted.

Following the hint, we will use array B to store the integers and array A to map the integer k to the index of B where it is stored. In other word, B[A[k]] = k.

As we are not allowed to initialize the two arrays, we will use an incremental approach. We keep track of the next available position in array B and update this information every a element is inserted or deleted. Note that we cannot use any space other than array A and array B.

Here is one solution that might work. We can use the array B itself to track the current state of the data structure. We define the following rules:
(1) if B[i] = k > 0, it means that the k is stored in B[i];
(2) if B[i] = j < 0, it means that B[j] is also a available space.
(3) finally, if B[M] = i < 0, it means that the next available position is B[-i]

The code is listed below:

#include<stdio.h>

#define False 0
#define True (!False)

#define N 10
#define M 5

int A[N + 1];
int B[M + 1];

void initialize();
void insert(int k);
int exist(int k);
void delete(int k);
void print_status();



void initialize()
{
 A[1] = 0;
 B[1] = -2; /* the next available position pointed by B[1] is B[2] */
 B[M] = -1; /* initially, the first spot is available. */
 return;
}


void insert(int k)
{
 printf("insert %d\n", k);
 A[k] = -B[M];    /* link A[k] to the next available position. */
 B[M] = B[A[k]];  /* update the next available position. */
 B[A[k]] = k;     /* put k into the current available position. */

 /* test if next available position has a link to the next-next available position */
 if (B[-B[M]] < 0 && B[-B[M]] >= -M &&  ( B[-B[-B[M]]] < 0 || B[-B[-B[M]]] > N || A[B[-B[-B[M]]]] != B[-B[-B[M]]]))
  return;

 /* in this case, the next-next available position is the one right after the -B[M] */
 if (B[M] < 0)
  B[-B[M]] = B[M] - 1;



}

int exist(int k)
{
 printf("query: does %d exist? ", k);
 if (k >= 1 && k <= N && A[k] >= 1 && A[k] <= M && B[A[k]] == k) {
  printf("Yes\n");
  return True;
 } else {
  printf("No\n");
  return False;
 }
}

void delete(int k)
{
 printf("delete %d\n", k);
 if (exist(k)) {

  if (B[M] < 0) {
   /* B is not full.
    * Suppose we have A[k] <-->B[i] and B[M] --> B[j].
    * After the call, we want to have
    * A[k] = 0 and B[M] --> B[i] and B[i] --> B[j]
    */
   B[A[k]] = B[M];
   B[M] = -A[k];
   A[k] = 0;

  } else {    /* B is full*/
   if (A[k] == M) {
    /* k is put in B[M] and we have A[k] <--> B[M].
     * so we remove the link and set next available spot to M,
     *  */
    B[M] = -M;
    A[k] = 0;

   } else {
    /* k is not in B[M].
     * suppose A[k] <--> B[i] and A[s] <-->B[M]
     * what we want to have after this call is
     * A[k] = 0 and A[s] <-->B[i] and B[M] = -M
     */

    A[B[M]] = A[k];
    B[A[k]] = B[M];
    B[M] = -M;
    A[k] = 0;

   }

  }



 } else {
  printf("%d does not exist.", k);
  return;
 }
}

void print_status()
{
 printf("current status:\n");
 printf("A: ");
 for (int i=0; i <= N; ++i) printf("%d ", A[i]);
 printf("\n");

 printf("B: ");
 for (int i=0; i <= M; ++i) printf("%d ", B[i]);
 printf("\n\n");
}


int main()
{

 initialize(); print_status();

 insert(1); print_status();
 insert(2); print_status();
 exist(3);  print_status();
 exist(4);  print_status();
 insert(3); print_status();
 exist(3);  print_status();
 insert(4); print_status();
 delete(2); print_status();
 exist(2);  print_status();
 insert(5); print_status();
 insert(6); print_status();
 delete(5); print_status();
 exist(5);  print_status();
 insert(7); print_status();
 delete(7); print_status();
 insert(8); print_status();
 delete(3); print_status();


}

And the output is listed below:

current status:
A: 0 0 0 0 0 0 0 0 0 0 0
B: 0 -2 0 0 0 -1

insert 1
current status:
A: 0 1 0 0 0 0 0 0 0 0 0
B: 0 1 -3 0 0 -2

insert 2
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

query: does 3 exist? No
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

query: does 4 exist? No
current status:
A: 0 1 2 0 0 0 0 0 0 0 0
B: 0 1 2 -4 0 -3

insert 3
current status:
A: 0 1 2 3 0 0 0 0 0 0 0
B: 0 1 2 3 -5 -4

query: does 3 exist? Yes
current status:
A: 0 1 2 3 0 0 0 0 0 0 0
B: 0 1 2 3 -5 -4

insert 4
current status:
A: 0 1 2 3 4 0 0 0 0 0 0
B: 0 1 2 3 4 -5

delete 2
query: does 2 exist? Yes
current status:
A: 0 1 0 3 4 0 0 0 0 0 0
B: 0 1 -5 3 4 -2

query: does 2 exist? No
current status:
A: 0 1 0 3 4 0 0 0 0 0 0
B: 0 1 -5 3 4 -2

insert 5
current status:
A: 0 1 0 3 4 2 0 0 0 0 0
B: 0 1 5 3 4 -5

insert 6
current status:
A: 0 1 0 3 4 2 5 0 0 0 0
B: 0 1 5 3 4 6

delete 5
query: does 5 exist? Yes
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

query: does 5 exist? No
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

insert 7
current status:
A: 0 1 0 3 4 0 2 5 0 0 0
B: 0 1 6 3 4 7

delete 7
query: does 7 exist? Yes
current status:
A: 0 1 0 3 4 0 2 0 0 0 0
B: 0 1 6 3 4 -5

insert 8
current status:
A: 0 1 0 3 4 0 2 0 5 0 0
B: 0 1 6 3 4 8

delete 3
query: does 3 exist? Yes
current status:
A: 0 1 0 0 4 0 2 0 3 0 0
B: 0 1 6 8 4 -5

--END--

Friday, January 6, 2017

Python - Pandas - How to read compressed SAS data

Pandas provides the function read_sas to read the sas data. It supports two format: (1) ‘xport’ and (2) ‘sas7bdat’. Sometimes the data is really large and is provided in a compressed file. Unfortunately, it seems that pandas does not support reading from the compressed sas data directly.

In this post, we will introduce a not-quite-save way to read from the zipped sas data. In order to do so, we need three packages:

gzip
io
pandas

If we try the following code, it is not going to work. The reason is the in the compressed gz file, there exists meta information about the original file and the system in the header before the actual data. So when we pass the file object to the pd.read_sas, the function complains that the file is not a SAS file.

f = gzip.GzipFile('your_input_file.sas7bdat.tar.gz', 'rb')
df = pd.read_sas(f, format='sas7bdat')

So how to address this problem? There is a very straightforward way to handle this, but it is not very safe (this point will be explained later). The simple idea is try to skip the meta information header in the gz file.

Here is the code. In part 1, we overwrite the seek function of the gzip.GzipFile. The function takes into account the fact that we skip the header and when the seek function is called in the pd.read_sas function, an additional offset (of the header) is added to the original offset. In part 2, we just guess where the meta header information ends.

As mentioned previously, this method is not safe because when we overwrite the seek function, we do not take into account the whence parameter. Fortunately, this code can work (at least for some small files).

# part 1: change the "seek" behavior of the file object.
class FileObjGZ(gzip.GzipFile):
    def set_gz_offset(self, val):
        self._gz_offset = val

    def seek(self, offset, whence=io.SEEK_SET):
        new_offset = offset + self._gz_offset
        super(FileObjGZ, new_offset).seek(new_offset)


input_file = 'your_input_file.sas7bdat.tar.gz'

# part 2: try to skip the meta header
with FileObjGZ(input_file, 'rb') as f:
    guess_gz_offset = 0
    while True:
        try:
            f.set_gz_offset(guess_gz_offset)
            f.seek(0)
            df = pd.read_sas(f, format='sas7bdat')

            break
        except ValueError as e:
            print(e)
            guess_gz_offset += 1
    print("loading data is completed.")

---END--

Labels