EDIT: Removing mysqlnd as noted here didn’t actually fix the issue, we will post a follow up shortly.

A few weeks ago we where testing a performance increase by using asynchronous queries to mysql and having the property data required for a result set there waiting by the time the system needed it. To do this we needed to switch to mysqlnd rather than the standard static mysqlibs. The asynchronous queries caused problems with requiring 100’s of persistent connections to the mysql server and needless to say this feature didn’t stay activated for long.

The side effect of this was we never saw the need to remove mysqlnd as everything was  working fine. However, recently we have seen problems caused by mysterious metadata locks during database maintenance which would just lock all queries when there was nothing running. We traced this back to being somehow caused by mysqlnd and tonight we have rolled back to using the standard mysql and mysqli php interfaces.

If you have this problem and are using mysqlnd but dont really need mysqlnd, try switching back to the standard mysql libraries and you may find that the problem cures itself.

From our first beta release 2 years ago almost to the day, we wanted to be the fastest, a good portion of our day is spent tracing results to find any slow downs.

We have been playing with New Relic these last few days and it definitely gives some interesting stats so we thought we would let you know what StormRETS does on the backend.

We handle approx. 150,000 API requests per day.

Average application response time is 41.3ms 

Our Apdex score is 0.99 (out of a maximum of 1)

% of Requests satisfied in under 0.25 seconds is 97.6%

% of Requests satisfied in under 1 second is 99.8%

% of Requests satisfied in over 1 second is 0.03%

In the time that it takes you to blink your eye StormRETS can search millions of property listings 8 times over.

Recently we have had a customer looking to increase the performance and cut down the some latency and if possible number of duplicate API calls that hit the StormRETS backend and we came up with two options, one written in PHP and the second using a local varnish cache.

PHP and Shared Memory

One idea was to use persistent tcp connection in php for whatever reason a running a local proxy could not be done. 

As an experiment to see if it could be done we wrote the following script to act as t synchronous server. The code uses a synchronous background php worker (note this isn’t asynchronous) to handle the persistent connections to the StormRETS API and does interprocess communication via shared memory.

function writeLine($line) {
    echo $line."\r\n";
}

function unchunkHttp11($data) {
    $fp = 0;
    $outData = "";
    while ($fp < strlen($data)) {
        $rawnum = substr($data, $fp, strpos(substr($data, $fp), "\r\n") + 2);
        $num = hexdec(trim($rawnum));
        $fp += strlen($rawnum);
        $chunk = substr($data, $fp, $num);
        $outData .= $chunk;
        $fp += strlen($chunk);
    }
    return $outData;
}

header('Content-Type: text/plain; charset=ISO-8859-1');

ob_start();

writeLine('                                        ');
writeLine('StormRETS API Pipelining Server');
writeLine('Copyright (c) 2010-2011 StormRETS, Inc.');
writeLine('                                        ');
writeLine('This server uses shared memory to pass data between the server and the executing');
writeLine('PHP script utilizing HTTP Keep Alive to cut out API connection time between');
writeLine('page loads.');
writeLine('');
writeLine('This API Pipelining Server is currently experimental.');
writeLine('                                        ');
writeLine('');

$queue = msg_get_queue(100381);
$required_msgtype = 1;
$maxsize = 2048;
$option_receive = MSG_IPC_NOWAIT;
$serialize_needed = true;
$queue_status = msg_stat_queue($queue);
$max_message_size = $queue_status['msg_qbytes'] - 100;
$socket = null;

writeLine('Waiting for Data...');

while (true) {
    
    // Check for Messages in the Queue
    $queue_status = msg_stat_queue($queue);
    if ($queue_status['msg_qnum'] --> 0) {
        
        $data = '';
        if (msg_receive($queue, $required_msgtype, $msgtype, $maxsize, $data, $serialize_needed, $option_receive, $err)===true) {
            
            echo "[-] Received a Message...\n";
            $data = json_decode($data);
            $return_type = $data->rt;
            $url = $data->url;
            
            $header = '';
            
            if ($url && ($data->validity >= time())) {
                
                echo "[-] Message is Valid...\n";
                $url_data = parse_url($url);
                
                $req_path = $url_data['path'];
                if (array_key_exists('query', $url_data) && !empty($url_data['query'])) $req_path .= '?' . $url_data['query'];
                
                $req = "";
                $req .= "GET $req_path HTTP/1.1\r\n";
                $req .= "Host: {$url_data['host']}\r\n";
                $req .= "Connection: Keep-Alive\r\n";
                $req .= "Keep-Alive: 300\r\n";
                $req .= "Accept-Encoding: deflate\r\n";
                $req .= "Accept: */*\r\n";
                $req .= "\r\n\r\n";
                
                # If no socket or socket closed create a socket
                $socket = pfsockopen($url_data['host'], 80);
                
                $transfer_start = microtime(true);
                
                $response = '';
                fputs($socket, $req);
                stream_set_blocking($socket, true); 
                stream_set_timeout($socket, 5);
                $socket_info = stream_get_meta_data($socket); 
                
                do {
                    $header .= fread($socket, 1); 
                }
                while (!preg_match('/\\r\\n\\r\\n$/', $header));
                
                echo $header;
                
                if (!strstr($header, "Transfer-Encoding: chunked")) {
                    while (!feof($socket)) {
                        $response .= fgets($socket, 128);
                    }
                } else {
                    while ($chunk_length = hexdec(fgets($socket)))  {
                        $responseContentChunk = '';
                        
                        $read_length = 0;
                        
                        while ($read_length < $chunk_length) {
                            $responseContentChunk .= fread($socket, $chunk_length - $read_length);
                            $read_length = strlen($responseContentChunk);
                        }
                        
                        $response .= $responseContentChunk;
                        fgets($socket);
                    }
                }
                
                $transfer_end = microtime(true);
                $transfer_time = $transfer_end - $transfer_start;
                echo "[-] Transfer took {$transfer_time} seconds\n";
                
                # Strip Headers
                //$response = trim(substr($response, strpos($response, "\r\n\r\n")));
                
                $return_start = microtime(true);
                
                $ret_queue = msg_get_queue($data->rt);
                $response_arr = str_split($response, $max_message_size);
                
                # Send the Chunks
                $chunk_count = 0;
                foreach ($response_arr as $a) {
                    $retry_count = 0;
                    $message = array(
                        'chunks' => sizeof($response_arr),
                        'chunk' => $chunk_count,
                        'encoding' => (strstr($header, "Content-Encoding")) ? 'deflate' : 'plain',
                        'data' => $a,
                    );
                    while ($retry_count <= 5) {
                        if (@msg_send($ret_queue, 1, $message, true, false, $err)) {
                            $retry_count = 100;
                        } else {
                            $retry_count++;
                            usleep(5000);
                        }
                        if ($retry_count == 5) {
                            echo "[!] Chunk Transfer Failed\n";
                        }
                    }
                    $chunk_count++;
                }
                
                $return_end = microtime(true);
                $return_time = $return_start - $return_end;
                echo "[-] Return to Caller took {$return_time} seconds\n";
                
            }
            
        }
        
    }
    
    # Flush everything
    if (ob_get_length()){            
        @ob_flush();
        @flush();
        @ob_end_flush();
    }
    
    sleep(1);
    
}

And to call in to the API we use the following code.

# Check/Fire the API Pipelining Server
#
$pid = AgentStormSettingCache::get('as_backpid', 0);
if (!$pid || !$this->isProcessRunning($pid)) {
	$pid = $this->runInBackground('/usr/bin/php ' . dirname(__FILE__) . '/agentstorm_keepalive_server.php');
	AgentStormSettingCache::set('as_backpid', $pid);
}

if (!$this->isProcessRunning($pid)) {
	die("Failed to start server");
}

$queue = msg_get_queue(100381);

# Build our message
$message = new stdClass();
$message->url = $url;
$message->rt = rand(110000, 120000);
$message->validity = time()+5;

$msgtype = 1;
$maxsize = 102400;

# Send the Request
#
$m = json_encode($message);
msg_send($queue, $msgtype, $m, true, true, $err);

# Get the Result
#
$ret_queue = msg_get_queue($message->rt);
$chunks = array();
$encoding = 'plain';
if (msg_receive($ret_queue, 1, $rt_memtype, $maxsize, $data, true)===true) {
	$chunks[] = $data;
	$encoding = $data['encoding'];
	for ($i=0; $i <= $data['chunks']-2; $i++ ) {
		$rety_count = 0;
		while ($rety_count <= 5) {
			if (msg_receive($ret_queue, 1, $rt_memtype, $maxsize, $data, true)===true) {
				$chunks[] = $data;
				break;
			}
			$rety_count++;
			usleep(10000);
		}
	}
}
msg_remove_queue($ret_queue);

# Parse the Chunked Data
#
$data = '';
foreach ($chunks as $chunk) {
	$data .= $chunk['data'];
}

# Deflate is needed
#
if ($encoding == 'deflate') {
	$data = gzuncompress($data);
}

We didn’t see any performance degradation while testing the above code, but neither did we see any performance gain either. 

Varnish Cache

Using varnish cache is the best option, if you want to try it you can use the the following default.vcl file with varnish, the default.vcl file below will cache results for up to 5 minutes however depending on your StormRETS import schedule we recommend you cache no longer than your query import schedule to ensure you get the latest data displayed on your web site.

backend stormrets {
  .host = "yoursubdomain.stormrets.com";
  .port = "80";
}

sub vcl_recv {
    set req.http.Host = "yoursubdomain.stormrets.com";
    if (req.url ~ "^/properties.xml") {
        unset req.http.cookie;
    }
    if (req.url ~ "^/properties.json") {
        unset req.http.cookie;
    }
}

sub vcl_fetch {
    if (req.url ~ "^/properties.xml") {
        remove beresp.http.set-cookie;
        set beresp.ttl = 300s;
        set beresp.http.magicmarker = "1";
    }
    if (req.url ~ "^/properties.json") {
        remove beresp.http.set-cookie;
        set beresp.ttl = 300s;
        set beresp.http.magicmarker = "1";
    }
}

sub vcl_deliver {
    if (resp.http.magicmarker) {
        unset resp.http.magicmarker;
        set resp.http.age = "0";
    }
    unset resp.http.X-StormRETS-ServedBy;
    set resp.http.X-StormRETS-ServedBy = "Local-Cache";
}

ETL

Importing Property Data its so easy theres a nice acronym for it… ETL or Export Transform Load.

In reality handling a few million property records is no easy task, many a developers have taken on the challenge without really knowing how much data is involved. For a medium sized statewide mls board the amount of data is something like:

Active Only (<>25,000 listings)

  • 500MB Database Storage
  • 1GB Indexes
  • 50GB Images

`Thats not bad` i hear you say. Sure but no one ever wants only Active listings and when you bring Inactive listings into the mix:

All Statuses (2,000,000 listings)

  1. <> 100GB Database Storage
  2. <> 200GB Indexes
  3. <> 500GB Images in 12,000,000 small files

The first thing that probably stands out is that we’re saying a 2:1 index to data ratio, this is because your going to need a lot of indexes somewhere around 70+ indexes on the main property tables to cover all your searches, some have these will have a very low hit rate. 

One thing to consider if your using Mysql 5.5 is the Barracuda file format and compresses tables. you will generally see tables and indexes being 300% smaller than a non-compressed table, querying a compressed table can also be faster du to less needed IO

On this dataset full table rebuilds (Alter table, Index creation etc) against this many records take between 2-3 hours and cause a read write lock on the table for that duration of time. Of course this is based on the fact you made the trade off of larger data vs complexity. When dealing with Big Data its a good idea to stay away from joins, theres many reasons for this all of which are better explained by others. It may take a little longer to insert or update but thats all happening in a background job anyway right?

On top of the data requirements your next issue is to accurately import every listing from the MLS. Some RETS servers make this easy by giving you a limit equal to the total number of active records in the largest property class. Others will set a limit of 2,000 properties meaning to import every property record from the MLS means running your importer hundreds of times with queries like “(LIST_DATE=1992-01-01-1992-02-01)” and running it a month at a time. The reason for using LIST_DATE and not ModificationTimestamp in the previous example is that some MLS’s sometimes do bulk status updates and what you may find is that on some random day between 23:59:59 and 00:00:00 more records where updated than your able to retrieve in a single query.

Image storage is a tricky topic, most people will just think hey i’ll just dump the images into Amazon S3 and they can stay there. The cost just in PUT operations on Amazon S3 for 12,000,000 files is $120 and a further $70 a month in S3 data storage. Thats going to put a downer on any quote when your quoting $1,000 a year just to host a few images for a client.

Well there you go, managing property data is probably a little more complicated than you initially thought. 

Welcome to the StormRETS tech blog where we will be talking about technical topics relating to StormRETS, general RETS topics, and managing Big Real Estate Data.