Mistakes: Importing Data with MySQL

I spent the better part of today trying to import—in various attempts—between 800 MB and 3.1 GB of data into a local MySQL server. The whole time I was Doing It Wrong™. Now I feel like this:

frustrated

I was using pipe viewer, which obscured the error for most of the day. (If there’s a way to stop pv from eating messages to stderr, please let me know!) But eventually I figured out it was hitting the max_allowed_packet limit and my import was dying. I was using the MySQL docs as my reference point, and they imply that the following should work:

$ mysql --max_allowed_packet=32M db < data.sql

I know now that this does not work. And I feel like a moron. If you read carefully, which I did not, you’ll see that they’re talking about server startup, not client startup, even though the example uses “mysql” instead of “mysqld”.

When I stopped using pipe viewer, I got this series of errors:

$ mysql --max_allowed_packet=16M db < data.sql
ERROR 1153 (08501) at line 175458: Got a packet \
    bigger than 'max_allowed_packet' bytes
$ mysql --max_allowed_packet=128M db < data.sql
ERROR 1153 (08501) at line 175458: Got a packet \
    bigger than 'max_allowed_packet' bytes
$ mysql --max_allowed_packet=256M db < data.sql
ERROR 1153 (08501) at line 175458: Got a packet \
    bigger than 'max_allowed_packet' bytes
$ mysql --max_allowed_packet=1G db < data.sql
ERROR 1153 (08501) at line 175458: Got a packet \
    bigger than 'max_allowed_packet' bytes

Now here’s the thing, data.sql was just shy of 80 MB. So obviously something was wrong.

A quick edit to /etc/my.cnf:

[mysqld]
...
max_allowed_packet=32M
...

Then all I had to do was restart mysqld, and the next time I tried:

mysql db < data.sql

It finished without a problem.

7 hours and a massive facepalm later… I hope this keeps someone from wasting the same amount of time I did on such a dumb mistake.

Responsible SQL: How to Authenticate Users

Most SQL-injection articles set a horrible example for young programmers.

Here is a very typical “bad example” of why you need to escape user data before it goes into SQL queries:

(ed. The symbol « is a line break that’s not in the real code.)

$username = $_POST[‘username’]; // username=admin
$password = $_POST[‘password’]; // password=’ OR 1=1; — ‘

$user = $db->query("SELECT * FROM users WHERE «
           username=’$username’ AND «
           password=’$password’ LIMIT 1;"
);

The point, of course, is that you must sanitize your user input, or else this person would run this query:

$user = $db->query("SELECT * FROM users WHERE «
           username=’admin’ AND «
           password = ” OR 1=1; — ‘ LIMIT 1;"
);

Which grants the sneaky user all your admin privileges. Other versions have nefarious users dropping your users or articles tables.

The problem is: this is the wrong way to authenticate users. These examples are written for beginners to understand the importance of sanitizing input, but they also provide a model to those beginners for how user authentication works. And it’s a very bad model.

This is a long one, more after the break. (more…)

Help Me Scale

I’ve been reading Eran Hammer-Lahav’s intelligent posts on microblog scalability, and now I’m concerned about my own “microblog” site, Picofiction.

Similar to social networks, social updates, social messaging, social… Like many social web sites—amongst our weaponry…—Picofiction lets you “follow” your favorite authors, displaying all their posts along with yours.

I handle this very naïvely: everything is offloaded to the database. There are three tables involved here, one of users, one of posts, and one of follower/followee bindings.

Here’s the basic structure of this query:

SELECT post_id, post_body, post_date, post_type,
  user_name AS author_name, user_id AS author_id
FROM posts
LEFT JOIN users
ON posts.author_id = users.user_id
WHERE author_id = 'CURRENT_USER'
OR author_id IN (
  (SELECT followed_id
   FROM followers
   WHERE following_id = 'CURRENT_USER')
  )
ORDER BY post_date DESC
LIMIT PAGE_START,20;

Here’s where I need help: this works great on a single database, but it does not scale horizontally.

Since this horizontal scalability is such a hot topic right now, I’m asking for ideas. I’d like to put in the infrastructure before there is a need for it.

Eran points out that caching is not as simple a solution as we’d like to think. What do you cache? How do you keep caches in sync?

Does anyone have experience with MySQL Cluster Servers? It seems like the best way of scaling is to make the process as parallelizable as possible. The database then handles the parallelization, so the less I can do in the program the better, right?

MySQL Subqueries

I often find it difficult to find tips and advice for doing relatively simple things in things like MySQL, Ruby, Python, etc. So, starting with this post, I will help fill that niche. Today’s topic is Using Subqueries to Simplify your SQL Queries.

For this article, I’m using PHP and MySQL for examples. There are slightly different implementations of SQL in the various database engines, but this is one thing they all have in common.

SQL is called “structured query language” because it allows subqueries to make complex queries easier and faster. The idea of a subquery is simple: have the database perform one query and insert it into another.

There are dozens of useful ways of using subqueries, but I will concentrate on two: subqueries in the select expression and subqueries in the where clause.

Security Concerns

In most web programming languages, the interface between the script and the database only allows one query per access for security reasons: an injection attack could input something like '; DELETE * FROM users; and do some serious damage to a website. Imagine your SQL query to login looked something like:

SELECT * FROM users WHERE user_name = '$username' AND password = '$password';

If you are not checking and cleaning the input appropriately, someone could type the snippet above into your login form and, if multiple queries were allowed, MySQL would execute the following:

SELECT * FROM users WHERE user_name = ''; DELETE * FROM users; AND password='';

Since the empty string wouldn’t match any rows (hopefully), the first query would be discarded. The second query, the DELETE statement, would run, terminating at the second semicolon. Since the third piece of code is nonsense, MySQL would throw it out with an error.

To solve this problem, languages like PHP cause MySQL to issue an error any time there is more text (except comments) after the line terminator, usually the semicolon. The downside is that situations arise where you need to run multiple queries. The result is either often either a godawfully complicated statement with multiple JOINs, or running several queries, each of which requires communication with your database server and can slow down your applications.

In the examples below, I’ll pretend we’re building a forum that has four tables:

  • users with primary key user_id
  • forums, a list of all the boards, with primary key forum_id
  • threads which links each thread to a forum with forum_id and has primary key thread_id
  • posts which links each post to a thread with thread_id and has primary key post_id

Subqueries in Select Expressions

One way to speed up your queries again is to use subqueries. Subqueries are full SQL queries nested within another query. For example:

SELECT (SELECT * FROM t1);

Obviously it’s a pretty simple example. Notice the parentheses. Subqueries must always be in parentheses, even if they are inside a function, like:

SELECT MAX((SELECT salary FROM employees));

Let’s get to work on our forum. Say that while reading all the threads of a forum you’d like to have both the number of threads and the number of posts in the forum. One way is to run two separate queries:

SELECT COUNT(*) AS threads FROM threads WHERE forum_id='1';
SELECT COUNT(*) AS posts FROM posts LEFT JOIN threads USING(thread_id) WHERE forum_id='1';

That might not be so bad if your SQL server is localhost, but more and more hosts are running dedicated SQL servers, meaning that every query has to run across the internet, be processed, and run back, slowing down your application. But we can run this in one query with two subqueries:

SELECT
(SELECT COUNT(*) FROM threads WHERE forum_id='1') AS threads,
(SELECT COUNT(*) FROM posts LEFT JOIN threads USING(thread_id) WHERE forum_id='1') AS posts;

We can add the above to our query to get the name of the forum and its description, so we can further decrease the number of trips to the database:

SELECT
(SELECT COUNT(*) FROM threads WHERE threads.forum_id=forums.forum_id) AS threads,
(SELECT COUNT(*) FROM posts LEFT JOIN threads USING(thread_id) WHERE threads.forum_id=forums.forum_id) AS posts,
forum_name,
forum_description
FROM forums WHERE forum_id='1';

Notice that we also changed the WHERE clauses to match whatever forum ID we put into the “outer query“.

Subqueries in Where Clauses

Another simple and useful way to use a subquery is in a WHERE clause. Here you must be careful to match the WHERE syntax and the type of data returned by the subquery. For example, in WHERE user_name = (...), the subquery ((...)) must return a single value, while in WHERE post_date IN (...), the subquery can return a list.

In our forum, we might want to search for all posts by a specific user, but we don’t want our visitors to need to know the user ID—or perhaps we want a more descriptive URL, like search.php?user=USER_NAME instead of search.php?user=#ID#. But in our forum, to be efficient, we link posts to their author by the user_id column.

One way to do this is to run a query to find the ID then run another query to find the posts. Another way in this particular case is to use a JOIN statement. But yet another way is to do this:

SELECT * FROM posts WHERE user_id = (SELECT user_id FROM users WHERE user_name = 'foo');

In the case above, a JOIN would also get us the information we want, but in some cases this isn’t true, for example:

SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);

When you need to COUNT or otherwise aggregate one column, you’ll need to use a subquery instead of a JOIN, as well.

Summary

This article only scratched the surface of subqueries. Subqueries can be nested, they can appear in other places and do other things, and they can make your SQL more readable, among others. I don’t claim that the SQL statements above are the world’s most efficient or best way to do things—if you know a better way, let me know! I just want to give an introduction to subqueries, a very basic part of SQL that few people I’ve met seem to understand.