Effective Java : Serialization

Serialization

Objective: This section explains the dangers of serialization and how to minimize them. Object serialization is Java’s framework for encoding objects as byte streams (serializing) and reconstructing objects from their encodings (deserializing). Once objects have been serialized, they can be sent from one virtual machine to another or stored on disk for later deserialization.

Key topics:

  1. Java Serialization
  2. Custom Serialization
  3. Serialization and Security
  4. Serialization and Instance Control
  5. Serialization Proxy

Estimated time: 15-30 minutes.

Java Serialization

Can you spot a problem when you deserialize the following byte stream?

static byte[] createByteStream() {
    Set<Object> root = new HashSet<>();
    Set<Object> s1 = root;
    Set<Object> s2 = new HashSet<>();
    for (int i = 0; i < 100; i++) {
        Set<Object> t1 = new HashSet<>();
        Set<Object> t2 = new HashSet<>();
        t1.add("beer"); // t1 is NOT equal to t2
        s1.add(t1); s1.add(t2);
        s2.add(t1); s2.add(t2);
        s1 = t1;
        s2 = t2;
    }
    return serialize(root); // Method omitted for brevity
}

Yes
No

It is difficult to spot the problem just by reading the code. The object graph above consists of 202 HashSet instances (root, the initial s2, and 2 more per loop iteration), each of which contains 3 or fewer object references. The entire stream is 5,744 bytes long, yet the sun would burn out before you could deserialize it! To understand why, note that deserializing a HashSet instance requires computing the hash codes of all of its elements. The 2 elements of root are themselves hash sets that contain 2 hash-set elements, each of which contains 2 hash-set elements, and so forth, 100 levels deep. Hence, the hashCode method will be invoked more than 2^100 times to deserialize root. OUCH!

So, you should avoid Java serialization wherever possible and use a cross-platform, structured-data representation such as JSON instead. If you must use Java serialization, never deserialize a byte stream that you don’t trust. If you cannot avoid that either, filter what gets deserialized, using a whitelist of trusted classes rather than a blacklist of untrusted classes (by default, reject every class that is not on the whitelist).
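The whitelist idea can be sketched with the java.io.ObjectInputFilter API that has shipped with the JDK since Java 9. The class names WhitelistDemo and Point, and the helper passesWhitelist, are made up for this illustration; treat it as a minimal sketch, not a complete defense.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashSet;

class WhitelistDemo {
    // Hypothetical trusted class, invented for this sketch.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Whitelist: accept Point, then reject every other class ("!*").
    static final ObjectInputFilter WHITELIST =
            ObjectInputFilter.Config.createFilter(Point.class.getName() + ";!*");

    static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the stream deserializes under the whitelist,
    // false if the filter rejects one of its classes.
    static boolean passesWhitelist(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            in.setObjectInputFilter(WHITELIST); // set before the first read
            in.readObject();
            return true;
        } catch (InvalidClassException e) {
            return false; // raised when the filter reports REJECTED
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(passesWhitelist(serialize(new Point(1, 2))));  // class on the whitelist
        System.out.println(passesWhitelist(serialize(new HashSet<>())));  // class not on the whitelist
    }
}
```

With this filter in place, a stream like the one built by createByteStream should be turned away as soon as the HashSet class descriptor is read, long before any hash code is computed.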

If you choose to implement the Serializable interface anyway, you must be fully aware of what you are doing. Here are some major costs of doing so:

  • It decreases the flexibility to change your class’s implementation once it has been released, because its byte-stream encoding, or serialized form, becomes part of its exported API and you must support it forever.
  • It increases the likelihood of bugs and security holes, because it provides an extralinguistic mechanism for clients to create instances of your class.
  • It increases the testing burden associated with releasing a new version of your class, because of backward compatibility: you must ensure that the new version of your class can deserialize the serialized form of the old version and vice versa.

Here are some recommendations you should consider before you decide to implement the Serializable interface:

  • Classes designed for inheritance should rarely implement Serializable, because you don’t, and actually can’t, know what their subclasses will do.
  • Inner classes should not implement Serializable, because the default serialized form of an inner class is ill-defined. You can, however, implement the interface for a static member class.

Custom Serialization

Which of the following implementations of StringList would be better? Of course, ignore for the moment that you would probably be better off using one of the standard List implementations.

Use default serialized forms

public final class StringList implements Serializable {
    private int size = 0;
    private Entry head = null;

    private static class Entry implements Serializable {
        String data;
        Entry next;
        Entry previous;
    }
    … // Remainder omitted
}

Use custom serialized forms

public final class StringList implements Serializable {
    // transient: excluded from the default serialized form
    private transient int size = 0;
    private transient Entry head = null;

    // No longer Serializable: entries never appear in the stream
    private static class Entry {
        String data;
        Entry next;
        Entry previous;
    }

    public final void add(String s) { … }

    // Writes the logical content: the size, then each string in order
    private void writeObject(ObjectOutputStream s) throws IOException {
        s.defaultWriteObject();
        s.writeInt(size);
        for (Entry e = head; e != null; e = e.next)
            s.writeObject(e.data);
    }

    // Rebuilds the list by re-adding each string read from the stream
    private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
        s.defaultReadObject();
        int numElements = s.readInt();
        for (int i = 0; i < numElements; i++)
            add((String) s.readObject());
    }
    … // Remainder omitted
}

You should use the default serialized form only if an object’s physical representation is identical to its logical content, such as a Person object that contains first and last names. In the above example, you should use the custom serialized form, because logically StringList represents a sequence of strings but physically it is a doubly linked list of entries.

Note that using the default serialized form when an object’s physical representation differs substantially from its logical content has the following disadvantages:

  • It permanently ties the exported API to the current internal representation, which you must then support forever. In the above example, the private StringList.Entry class unnecessarily becomes part of the public API.
  • It can consume excessive space, make serialization slow, and sometimes cause stack overflows, because the default procedure traverses the object graph recursively. In the above example, the links of the doubly linked list take a lot of space and time to serialize and deserialize, again unnecessarily.

Here are some other recommendations:

  • You often must implement the readObject method to ensure invariants and security, even if you decide to use the default serialized form. For example, transient fields (i.e., fields not included in serialization) are initialized to their default values when an instance is deserialized (null for object reference fields, zero for numeric primitive fields, and false for boolean fields). If these values are unacceptable for any transient field, then implementing readObject is a good way to restore them to acceptable values.
  • Before deciding to make a field nontransient (i.e., included in serialization), convince yourself that its value is part of the logical state of the object, so as to avoid complications such as the ones mentioned above.
  • In any case, you should impose any synchronization on object serialization that you would impose on any other method that reads the entire state of the object.
  • In any case, you should declare an explicit private static final long serialVersionUID in every serializable class you write. An explicit value removes the automatically generated one as a source of accidental incompatibility. Change it only if you deliberately want to break compatibility with all existing serialized instances of your class, which is not recommended.
  • Use the @serial tag for fields and the @serialData tag for methods to document your serializable class properly.
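To make the first and fourth recommendations concrete, here is a small sketch of a class that keeps the default serialized form, declares an explicit serialVersionUID, and uses readObject both to re-check an invariant and to rebuild a transient field. The Temperature class, its label cache, and the roundTrip helper are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientDemo {
    // Hypothetical class, invented for this sketch.
    static final class Temperature implements Serializable {
        private static final long serialVersionUID = 1L; // explicit; the value itself is arbitrary

        private final double celsius;    // logical state, written by the default form
        private transient String label;  // derived cache, skipped by serialization

        Temperature(double celsius) {
            if (Double.isNaN(celsius) || celsius < -273.15)
                throw new IllegalArgumentException("invalid temperature: " + celsius);
            this.celsius = celsius;
            this.label = render(celsius);
        }

        private static String render(double c) { return c + " C"; }

        String label() { return label; }

        private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
            s.defaultReadObject();       // restores celsius; label is still null here
            if (Double.isNaN(celsius) || celsius < -273.15)
                throw new InvalidObjectException("invalid temperature: " + celsius);
            label = render(celsius);     // restore the transient field to an acceptable value
        }
    }

    // Serializes an object to memory and reads it straight back.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Temperature copy = roundTrip(new Temperature(21.5));
        System.out.println(copy.label()); // non-null only because readObject rebuilt it
    }
}
```

If you delete the readObject method from this sketch, the deserialized copy comes back with a null label, which is exactly the complication the bullet on transient fields warns about.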

Serialization and Security

The following class protects its invariants and immutability against ordinary clients by defensively copying Date objects in its constructor and accessors:

public final class Period implements Serializable {
    private final Date start;
    private final Date end;

    public Period(Date start, Date end) {
        this.start = new Date(start.getTime());
        this.end = new Date(end.getTime());
        if (this.start.compareTo(this.end) > 0)
            throw new IllegalArgumentException(start + " after " + end);
    }

    public Date start() { return new Date(start.getTime()); }
    public Date end() { return new Date(end.getTime()); }
    … // Remainder omitted
}

Once a class is serializable, which takes nothing more than adding implements Serializable to its declaration, its readObject method is effectively another public constructor.

So, if you rely on the default serialized form, you are in big trouble: the default readObject neither checks the validity of its arguments nor makes defensive copies of parameters where appropriate, which opens the door for an attacker to violate the class’s invariants. Therefore, you must implement a custom readObject. Which of the following implementations would be safe? Why?

Keep it simple

private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
    s.defaultReadObject();
    if (start.compareTo(end) > 0)
        throw new InvalidObjectException(start + " after " + end);
}

Be extremely defensive

private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
    s.defaultReadObject();
    start = new Date(start.getTime());
    end = new Date(end.getTime());
    if (start.compareTo(end) > 0)
        throw new InvalidObjectException(start + " after " + end);
}

Only the second implementation is safe. The first one rejects a stream in which start is after end, but an attacker can still append rogue references to the private Date fields to an otherwise valid stream and then mutate the dates after deserialization. When an object is deserialized, it is crucial to defensively copy any field containing an object reference that a client must not possess; otherwise you open the door for an attacker to violate the class’s invariants. Note that the defensive copy is performed prior to the validity check, so that the check runs on the copies; otherwise another thread could change the originals between the time of the validity check and the time of the defensive copy.

Note also that using Date.clone is not safe, because Date is nonfinal: an attacker can supply an instance of a subclass of Date, and you don’t and can’t know how that subclass’s clone method behaves.

Additionally, to use the second implementation above, you must make start and end nonfinal, which is unfortunate, but it is the lesser of two evils.
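To see why readObject has to be treated like a constructor, the following sketch plays a very crude attacker: it serializes a valid Period, overwrites the bytes of the start time inside the stream so that start ends up after end, and then tries to deserialize the result. The class PeriodTamperDemo, the helpers patchLong and accepts, and the three time values are invented for this illustration. The patching trick also assumes that java.util.Date writes its time to the stream as a plain 8-byte big-endian long, which is how the JDK’s Date is documented to serialize itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Date;

class PeriodTamperDemo {
    static final class Period implements Serializable {
        private static final long serialVersionUID = 1L;
        private Date start; // nonfinal so that readObject can swap in copies
        private Date end;

        Period(Date start, Date end) {
            this.start = new Date(start.getTime());
            this.end = new Date(end.getTime());
            if (this.start.compareTo(this.end) > 0)
                throw new IllegalArgumentException(start + " after " + end);
        }

        private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
            s.defaultReadObject();
            start = new Date(start.getTime()); // defensive copies first
            end = new Date(end.getTime());
            if (start.compareTo(end) > 0)      // then the validity check
                throw new InvalidObjectException(start + " after " + end);
        }
    }

    static byte[] serialize(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a copy of the stream in which the first occurrence of oldValue
    // (as 8 big-endian bytes) is overwritten with newValue.
    static byte[] patchLong(byte[] stream, long oldValue, long newValue) {
        byte[] oldBytes = ByteBuffer.allocate(Long.BYTES).putLong(oldValue).array();
        byte[] newBytes = ByteBuffer.allocate(Long.BYTES).putLong(newValue).array();
        byte[] copy = stream.clone();
        for (int i = 0; i + Long.BYTES <= copy.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(copy, i, i + Long.BYTES), oldBytes)) {
                System.arraycopy(newBytes, 0, copy, i, Long.BYTES);
                return copy;
            }
        }
        throw new IllegalStateException("value not found in stream");
    }

    // Returns true if the stream deserializes, false if readObject rejects it.
    static boolean accepts(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            in.readObject();
            return true;
        } catch (InvalidObjectException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long start = 1_111_111_111_111L, end = 2_222_222_222_222L, forged = 3_333_333_333_333L;
        byte[] valid = serialize(new Period(new Date(start), new Date(end)));
        byte[] tampered = patchLong(valid, start, forged); // start now lies after end
        System.out.println(accepts(valid));
        System.out.println(accepts(tampered));
    }
}
```

The constructor never sees the tampered values; only the check inside readObject stands between the forged bytes and a Period whose start is after its end. This sketch exercises the validity check only. The rogue-reference attack that makes the defensive copies necessary requires a hand-crafted stream and is not reproduced here.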

Serialization and Instance Control

Which of the following implementations would be safe? Why?

Use Enum

public enum MichaelJackson {
    INSTANCE;
    private String[] favoriteSongs = { "Black and White", "Billie Jean", "Beat it" };

    public void printFavorites() { System.out.println(Arrays.toString(favoriteSongs)); }
}

Use readResolve

public class MichaelJackson implements Serializable {
    public static final MichaelJackson INSTANCE = new MichaelJackson();
    private MichaelJackson() { }
    private String[] favoriteSongs = { "Black and White", "Billie Jean", "Beat it" };

    public void printFavorites() { System.out.println(Arrays.toString(favoriteSongs)); }

    private Object readResolve() { return INSTANCE; }
}

The enum version is safe. The second implementation opens the door for an attacker to secure a reference to the deserialized object before its readResolve method is run.

The attack is somewhat complicated, but it is based on a simple idea: if a singleton contains a nontransient object reference field, then the contents of this field will be deserialized before the singleton’s readResolve method is run, regardless of whether you use the default or a custom serialized form.

This property enables the attacker to steal a reference to the originally deserialized singleton at the time the contents of the object reference field are deserialized. Therefore, use an enum type to enforce instance control invariants wherever possible. If you must, however, implement the readResolve method for a serializable and instance-controlled class, then ensure that all of the class’s instance fields are either primitive or transient.
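The instance-control guarantee itself is easy to observe. The sketch below round-trips three singletons through serialization and compares identities: an enum, a class with readResolve and no instance fields, and a class that simply implements Serializable. All class names and the roundTrip helper are invented for this illustration; it shows the ordinary behavior only, not the reference-stealing attack.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SingletonDemo {
    // Enum singleton: the JVM guarantees a single instance, even across serialization.
    enum EnumSingleton { INSTANCE }

    // Singleton guarded by readResolve; it has no instance fields to steal from.
    static final class ResolvedSingleton implements Serializable {
        private static final long serialVersionUID = 1L;
        static final ResolvedSingleton INSTANCE = new ResolvedSingleton();
        private ResolvedSingleton() { }
        private Object readResolve() { return INSTANCE; } // discard the deserialized copy
    }

    // "Singleton" without readResolve: each deserialization creates a new instance.
    static final class UnguardedSingleton implements Serializable {
        private static final long serialVersionUID = 1L;
        static final UnguardedSingleton INSTANCE = new UnguardedSingleton();
        private UnguardedSingleton() { }
    }

    // Serializes an object to memory and reads it straight back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(EnumSingleton.INSTANCE) == EnumSingleton.INSTANCE);           // true
        System.out.println(roundTrip(ResolvedSingleton.INSTANCE) == ResolvedSingleton.INSTANCE);   // true
        System.out.println(roundTrip(UnguardedSingleton.INSTANCE) == UnguardedSingleton.INSTANCE); // false
    }
}
```

The private constructor does not help the unguarded class, because deserialization creates instances without calling it.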

Serialization Proxy

Which of the following implementations would be preferable? Why?

Use serialization proxy

public final class Period implements Serializable {
    private final Date start;
    private final Date end;

    public Period(Date start, Date end) {
        this.start = new Date(start.getTime());
        this.end = new Date(end.getTime());
        if (this.start.compareTo(this.end) > 0)
            throw new IllegalArgumentException(start + " after " + end);
    }

    public Date start() { return new Date(start.getTime()); }
    public Date end() { return new Date(end.getTime()); }

    // Serialize the proxy instead of the Period itself
    private Object writeReplace() { return new SerializationProxy(this); }

    // Reject any stream that claims to contain a Period directly
    private void readObject(ObjectInputStream s) throws InvalidObjectException {
        throw new InvalidObjectException("Proxy required");
    }

    private static class SerializationProxy implements Serializable {
        private static final long serialVersionUID = 982423404852382358L; // Any number would be fine

        private final Date start;
        private final Date end;

        SerializationProxy(Period p) {
            this.start = p.start;
            this.end = p.end;
        }

        // Rebuild the Period through its public constructor
        private Object readResolve() { return new Period(start, end); }
    }
    … // Remainder omitted
}

Be extremely defensive

public final class Period implements Serializable {
    private Date start;
    private Date end;

    public Period(Date start, Date end) {
        this.start = new Date(start.getTime());
        this.end = new Date(end.getTime());
        if (this.start.compareTo(this.end) > 0)
            throw new IllegalArgumentException(start + " after " + end);
    }

    public Date start() { return new Date(start.getTime()); }
    public Date end() { return new Date(end.getTime()); }

    private void readObject(ObjectInputStream s) throws IOException, ClassNotFoundException {
        s.defaultReadObject();
        start = new Date(start.getTime());
        end = new Date(end.getTime());
        if (start.compareTo(end) > 0)
            throw new InvalidObjectException(start + " after " + end);
    }
    … // Remainder omitted
}

The serialization proxy pattern above is reasonably straightforward, and you should consider it whenever you find yourself having to write readObject or writeObject on a class that is not extendable by its clients. It is perhaps the easiest way to robustly serialize objects with nontrivial invariants. Here are some advantages of this pattern:

  • It never generates a serialized instance of the enclosing class and never leaks references to its nontransient fields, which keeps the enclosing class safe.
  • It reuses the existing constructors or static factories of the enclosing class, so deserialized instances pass through the same validity checks as any other instance.
  • The fields of the enclosing class can be declared final.

Note, however, that the pattern has two limitations and one cost:

  • It is not compatible with classes that are extendable by their clients.
  • It is also not compatible with some classes whose object graphs contain circularities: if you attempt to invoke a method on such an object from within its proxy’s readResolve method, you’ll get a ClassCastException, because you have only the object’s serialization proxy and not the object yet.
  • The added power and safety of the proxy come at a small cost: the first implementation above could be 10-20% slower than the second one.
