
JSON in Postgres

Updated on October 5th, 2020

Since Dec 2011 we’ve been storing a small portion of our data as JSON in Postgres.  This blog post gives a quick overview of why we decided to do this, how it works, and what we’ve learned so far.

Why JSON in Postgres?

Redfin’s basic stack includes a Java-based web app on top of a relational database.  We started with MySQL and switched to Postgres a few years ago.

Overall we’re happy with a relational DB, but it does have its pain points.  Perhaps the biggest issue is the difficulty involved in changing schema.  To change the shape of a table, Postgres must take a table lock, which means that nobody can read or write to the table while the schema change is happening.  Some schema changes take little time to process (e.g. if the table is small or you’re adding a nullable column).  But others can take a while, which translates to a lot of downtime.  Various other people have written about this same problem before.  To give just one example, here’s what the folks at FriendFeed had to say about it: http://backchannel.org/blog/friendfeed-schemaless-mysql
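To make this concrete, here’s a hypothetical illustration (the listings table and status column are made up): on Postgres versions of that era, an ALTER like the one below rewrites every row of the table while holding a lock that blocks all reads and writes.

-- Adding a non-nullable column with a default forces a full table
-- rewrite on older Postgres versions, holding an ACCESS EXCLUSIVE
-- lock that blocks all reads and writes until the rewrite finishes.
alter table listings add column status text not null default 'active';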

We had a situation where we wanted to store information about the various service providers in our new Open Book directory.  We expected to iterate on the data model, and we wanted this to be easy.  So we decided to follow a similar approach to the one used by the folks at FriendFeed: to store data in our relational DB but in a semi-structured fashion.

Aside: why not use a NoSQL store?  We considered this.  The main reason we decided against NoSQL (for now) is that it would require a lot more work to deploy, monitor, and maintain.  Things that already work with our Postgres setup (backups, etc.) would need to be set up anew.  Also, for this particular use case we didn’t need the scale that NoSQL offers, and we were hesitant about some of the trade-offs that often come with NoSQL (limitations in transactional support, in the queries you can pose, etc.).

Postgres offers a few ways to manage semi-structured data: hstore (a key/value hashmap type), arrays, and XML.
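As a point of comparison, here’s a minimal hstore sketch (the products_hstore table and its keys are hypothetical); note that hstore is a flat string-to-string map, with no nesting:

-- hstore ships as an extension (Postgres 9.1+).
create extension hstore;

create table products_hstore (
   id    bigserial primary key,
   attrs hstore
);

insert into products_hstore(attrs) values
   ('type=>television, price=>899.99, resolution=>1080p');

-- The -> operator looks up a single key.
select attrs -> 'price' from products_hstore
   where attrs -> 'type' = 'television';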

We were hoping to use JSON because it’s more flexible than hashmaps or arrays, and because it’s familiar – we use it all the time.  Unfortunately, Postgres doesn’t (yet) offer built-in support for JSON.  It appears that one developer, Joey Adams, worked on adding JSON support to Postgres as a Google Summer of Code project back in 2010, but he didn’t quite finish at the time.  Looks like he may finish soon:  http://stackoverflow.com/questions/4995945/optimize-escape-json-in-postgresql-9-0/5448248#5448248.

In the meantime we developed our own strategy to store data as JSON in Postgres, to define indexes on facts inside the JSON, and to query it.

How does it Work?

Let’s take a hypothetical example – you want to store a bunch of products (televisions, shoes…).  They will have some common facts (price) and also some facts that are unique to each type of product (televisions have a resolution, shoes have a size).  You suspect that you’ll want to change the schema frequently, to add/modify both the common facts and the custom facts.

create table products (
   id    bigserial primary key,
   json  text
);

insert into products(json) values
   ('{"type":"television", "price": 899.99, "resolution":"1080p"}'),
   ('{"type":"shoe", "price": 74.99, "size":10.5}')
;

As far as Postgres is concerned, the json column is a plain-old text column.

Clearly you can fetch a product by ID:

select * from products where id=1;

But how do you query for products based on facts inside the JSON field?  Here’s how we do it:

select * from products
   where
   jsonGet(json, '$.type') = 'television'
   and jsonGet(json, '$.resolution') = '1080p';

What’s going on here?  We’re making use of a special function: jsonGet.  This is not a built-in function; it’s a custom one we added to Postgres.  You can write custom Postgres functions in one of several procedural languages; we opted for Perl.  Our function uses JSONPath, which is analogous to XPath for XML – a simple language for identifying a portion of a JSON document.
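For example, you can call it directly (assuming the function, whose source appears at the end of this post, is installed).  Note that it always returns text:

select jsonGet('{"type":"shoe", "price": 74.99, "size":10.5}', '$.size');
-- returns the text value '10.5'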

What about performance?  If we do nothing special, these queries will be very inefficient – Postgres will be forced to do a table scan and evaluate the jsonGet function on each record.  The way to avoid table scans is to add the appropriate indexes.  Here’s how to add an index for all product prices:

create index products_price on products(jsonGet(json,'$.price'));

This is a functional index.  Whenever you create or update a record in the products table, Postgres will run the jsonGet function to extract the price fact from JSON and store just that fact inside the index.  If, later, you query by price using the same jsonGet function, Postgres will use the index and avoid the table scan.  In other words, you’re trading off some performance at create/update-time in order to achieve better performance at query-time.
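You can check that the planner actually picks up the index with EXPLAIN (the exact plan will vary with table size and statistics):

explain select * from products
   where jsonGet(json, '$.price') = '899.99';
-- With the index in place you should see an Index Scan (or Bitmap
-- Index Scan) on products_price rather than a Seq Scan.  Note that
-- jsonGet returns text, so the comparison value is a string.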

As another example, here’s how to add an index for TV resolutions:

create index products_television_resolutions
   on products(jsonGet(json,'$.resolution'))
   where jsonGet(json,'$.type') = 'television';

This is a partial index.  Postgres will only store records in this index for products of type ‘television’.  Conversely, Postgres will only use this index for queries that include both of the jsonGet filters – the one on type and the one on resolution.
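The flip side is worth spelling out: a query that filters on resolution alone can’t use this index, because without the type predicate Postgres can’t prove the partial index covers every matching row.

-- Falls back to a table scan: the partial index only contains
-- televisions, but this query could match other product types too.
select * from products where jsonGet(json,'$.resolution') = '1080p';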

So far I’ve described how this works at the DB level.  There’s more to say about how it works in our Java layer, but in the interest of brevity I’ll leave that discussion out of this blog post (if you’re interested, leave a note in the comments below).

So…does it Work?  Do we like it?

Yes, it does work!  We’ve been using this technique since December to track information about service providers in our Open Book directory.  The quantity of data is still modest (in the low tens of thousands of records) but we haven’t seen any major gotchas.

Is this a silver bullet?  No, I wouldn’t say that.  This strategy has pluses and minuses.

The biggest benefit is that schema changes are generally easier.  Changing the shape of our JSON data doesn’t feel like a big scary thing.  We follow the DRY principle – instead of defining the data model both in SQL and in Java, we define it just once, in Java.  It’s very easy to change the Java code to add a new field.  It’s also quick to change the Java code to modify or remove a field, but those changes are definitely trickier – our app code now needs to be smart enough to deal with records of various shapes, both old and new.

The JSON format gives us the freedom to collapse one-to-many relationships into a single record (denormalize).  This isn’t always a good thing to do, but where appropriate this strategy allows us to skip a join and achieve better performance.
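As a sketch (the reviews field is hypothetical, and array subscripts depend on what your JSONPath implementation supports), a one-to-many relationship like product reviews can be folded right into the product record:

insert into products(json) values
   ('{"type":"television", "price": 899.99,
      "reviews": [{"stars": 5, "text": "great picture"},
                  {"stars": 3, "text": "flaky remote"}]}');

-- Pull a nested fact out without joining to a separate reviews table.
select jsonGet(json, '$.reviews[0].stars')
   from products
   where jsonGet(json, '$.type') = 'television';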

Finally, this solution required less work than deploying a new unfamiliar NoSQL solution, and allowed us to maintain full transactional support.

The minuses are:

1. There’s more infrastructure to manage.  We had to deploy plperl plus the JSON and JSONPath Perl libraries to our DB servers.  It took some tinkering to figure this out (most recently, we ran into issues with plperl on Postgres 9.1 on Windows).

2. This is an unfamiliar technique.  Again, it’s not as radically different as a full-on transition to NoSQL, but it still takes more ramp-up time for other developers to figure out how to query using JSONPath.

3. This strategy doesn’t jibe well with our Java layer.  I didn’t talk much about our Java layer, but the short story is that we use Hibernate.  I can’t say we love everything about Hibernate, but it’s our standard, and it does make it easy to work with tables and relationships.  For example, if you have a customers table and a real_estate_agents table, then Hibernate maps these records to Java objects and gives you a simple way to walk the relationship from one object to another: customer.getRealEstateAgent().  This just doesn’t work when you have rich data stored as JSON inside fat columns – Hibernate doesn’t understand the make-up of these JSON columns.  So you, the developer, are forced to crack open the JSON, find the ID of the related object, and query for it separately.  It’s all doable, but it leads to more verbose code.
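For instance (the customers schema and agent_id field here are hypothetical), where Hibernate would normally hand us customer.getRealEstateAgent(), with a JSON column we end up extracting the ID and issuing a second query ourselves:

-- Step 1: dig the related ID out of the JSON (jsonGet returns text,
-- so cast it).  Step 2: look up the related row separately.
select * from real_estate_agents
   where id = (
      select jsonGet(json, '$.agent_id')::bigint
      from customers
      where id = 42
   );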

What do you think?

Do you use a relational database? Have you struggled with schema changes? Have you tried storing semi-structured data in a relational DB? Did it work out? Or did you choose to switch to a non-relational store? We’re interested to know – please leave your thoughts in the comments below.

Source for jsonGet

In case it’s useful, here’s the source code of our jsonGet function:

create or replace function jsonGet(text,text) returns text as
$$
	use strict;
	use JSON;
	use JSONPath;

	my ($json_text, $json_path_text) = @_;

	# Decode the JSON text into a Perl data structure.
	# allow_nonref lets us round-trip bare scalars as well as
	# objects and arrays.
	my $json = JSON->new->allow_nonref;
	my $json_hash = $json->decode($json_text);

	# Evaluate the JSONPath expression; run() returns an array ref
	# of all matching values.
	my $jpath = JSONPath->new();
	my $result_array = $jpath->run($json_hash, $json_path_text);

	# If there is a single item in the result array,
	# just format that item and return it.
	# Otherwise, format the whole array.
	my $length = @{$result_array};

	my $formatted_result;
	if ($length == 1) {
		$formatted_result = $json->encode($result_array->[0]);
		# Strip the quotes that encode() wraps around strings.
		$formatted_result =~ s/^"(.*)"$/$1/;
	} else {
		$formatted_result = $json->encode($result_array);
	}
	return $formatted_result;
$$
language plperlu immutable;

Kudos

Kudos to masukomi for his JSONPath implementation!

Also kudos to Ben Scheirman for the Happy Elephant photo 🙂
