Migration combat: Discourse from PostgreSQL to MySQL to TiDB丨The story behind the AskTUG forum

I believe everyone is familiar with AskTUG.com technical question and answer site, but in addition to the front-end pages that are familiar to everyone, there is an unknown story behind the database that supports its operation. This article is shared by Mr. Wang Xingzong, one of the authors of AskTUG.com, revealing the wonderful story of AskTUG.com, which was born in Discourse, migrated from PostgreSQL to MySQL and finally runs stably on TiDB.

An ad:

AskTUG.com is a gathering place for TiDB Users, Contributors, and partners. Here you can find answers to all TiDB related questions. Welcome everyone to register for the experience~

Link: https://asktug.com/

background

"Through one platform, you can definitely find satisfactory answers to all TiDB questions."

Because of this desire, users, Contributors, and partners in the TiDB ecosystem jointly established the AskTUG.com technical question and answer website, which was officially launched in August 2019. As a "gathering place" for TUG members to learn and share, TiDB users can ask and answer questions, exchange and discuss with each other, and gather the collective wisdom of TiDB users here. Since its launch, AskTUG.com has gradually attracted the attention of more and more users. As of the end of June 2021, AskTUG.com has 7000+ registered users, accumulating 1.6w+ questions and 300+ technical articles.

Many friends have discovered that the back-end program of AskTUG.com is a Discourse program. About Discourse is a new open source forum project launched by Jeff Atwood, the co-founder of Stack Overflow, whose purpose is to change the forum software that has not changed for ten years. When AskTUG.com was established, it was decided to use Discourse from the following perspectives:

is powerful : Discourse is rich in features and highly customizable. It is the WordPress of the forum world. Compared with other traditional forums, Discourse simplifies the classification of traditional forums and replaces them with hot posts. This is a bit like a question and answer. It can prevent users from finding directions after entering traditional forums. This feature is from the AskTUG.com page. It can be seen:

audience : Most popular open source projects choose to use Discourse to build their own communities, including:
- Docker:https://forums.docker.com/
- Github Atom:https://discuss.atom.io/
- Mozilla:https://discourse.mozilla.org/
- TiDB:https://asktug.com/
- Discourse:https://meta.discourse.org/
- Rust:https://users.rust-lang.org/

- More: https://discourse.org/customers
easy to use : Discourse posts are displayed in bubble form, all Ajax loading, computer and mobile version, the forum adopts waterfall flow design, automatically load the next page, no need to manually turn the page, in short, this is one Great system.

Why move

Up to now, Everything is Good, except for one point: Discourse officially only supports PostgreSQL, a database.

As an open source database vendor, we have great enthusiasm and good reason to let AskTUG.com run on our own database TiDB. When we first had this idea, of course, we were looking for a solution that had already ported the Discourse port to MySQL. The result was Many people asked but did not act.

So we decided to do the transformation of the Discourse database ourselves. There are two reasons:

Eat your own dog food to verify the compatibility of TiDB.
Discourse is a typical HTAP application. Its management background has very complex report queries. As the amount of forum data increases, stand-alone PostgreSQL and MySQL are prone to performance bottlenecks. TiDB 5.0 just meets the needs of this application scenario. By introducing TiFlash nodes, some complex statistical analysis queries are processed in parallel to achieve acceleration. And there is no need to modify SQL and complex ETL processes.

migration practice

Earlier we talked about the reasons for doing the AskTUG & Discourse database transformation project. Next, we will talk in detail about the "pits" stepped on by migrating from PostgreSQL to MySQL / TiDB. If you have friends who migrate from PG to MySQL, you can get it. Come for reference.

TiDB is also compatible with the MySQL protocol and ecology, with convenient migration and extremely low operation and maintenance costs. Therefore, the Discourse from PG to TiDB is roughly divided into two steps :

Step 1: Migrate Discourse to MySQL;

Step 2: Adapt TiDB.

Migrate to MySQL 5.7

🌟mini_sql

minisql is a lightweight sql wraper, which is convenient to do some queries that ORM is not good at, and can prevent SQL injection. Previously, only PG and sqlite were supported. Discourse's code relies on minisql in many places, and the workload is huge if it is rewritten. Patch mini_sql to support MySQL is an important step to complete the migration: https://github.com/discourse/mini_sql/pull/5

🌟 schema migration

Rails schema migration is used to maintain DDL, which reflects the change process of the database schema. For migration, it actually increases the workload. The solution is to first generate a final schema.rb file and make changes to the final result. Generate a new migration file. Just delete the migration file generated in the intermediate process.

🌟 character set utf8mb4

database.yml

development:
  prepared_statements: false
   encoding: utf8mb4
   socket: /tmp/mysql.sock
   adapter: mysql2

/etc/mysql/my.cnf

[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

🌟 MySQL can index only the first N chars of a BLOB or TEXT column

All types of PG can be indexed. MySQL cannot index the text type. The solution is to specify the length when indexing:

t.index ["error"], name: "index_incoming_emails_on_error", length: 100

But for the composite index, the situation is more complicated. You can only ignore the text type. Fortunately, the index does not affect the function.

🌟 data migration

pg2mysql can convert the insert statement from pgdump into a form compatible with MySQL syntax, but it is limited to a simple form. Some formats with array and json will be messed up, but this part is correct when processed by Ruby, and is divided into two parts. , First, pg2mysql processes and excludes some conversion error tables, such as user_options, site_settings, etc.:

PGPASSWORD=yourpass pg_dump discourse_development -h localhost 
  --quote-all-identifiers 
  --quote-all-identifiers 
  --inserts 
  --disable-dollar-quoting 
  --column-inserts 
  --exclude-table-data user_options 
  --exclude-table-data user_api_keys 
  --exclude-table-data  reviewable_histories 
  --exclude-table-data  reviewables 
  --exclude-table-data  notifications 
  --exclude-table-data site_settings 
  --exclude-table-data  reviewables  
  --no-acl 
  --no-owner 
  --format p 
  --data-only -f pgfile.sql

The remaining part of the data is migrated using seed_dump:

bundle exec rake db:seed:dump 
  MODELS=UserApiKey,UserOption,ReviewableHistory,
    Reviewable,Notification,SiteSetting 
EXCLUDE=[] IMPORT=true

🌟 distinct on

PG has a distinct on usage, which is equivalent to the effect when the MySQLONLY\_FULL\_GROUP_BY parameter is turned off, but starting from MySQL 5.7, this parameter has been turned on by default. So one solution is to turn off the ONLY\_FULL\_GROUP_BY parameter, and the other is to simulate with GROUP and aggregate functions:

# postgresql
SELECT DISTINCT ON (pr.user_id) pr.user_id, pr.post_id, pr.created_at granted_at
  FROM post_revisions pr
  JOIN badge_posts p on p.id = pr.post_id
  WHERE p.wiki
      AND NOT pr.hidden
      AND (:backfill OR p.id IN (:post_ids))
# mysql
SELECT pr.user_id, MIN(pr.post_id) AS post_id, MIN(pr.created_at) AS granted_at
  FROM post_revisions pr
  JOIN badge_posts p on p.id = pr.post_id
  WHERE p.wiki
      AND NOT pr.hidden
      AND (:backfill OR p.id IN (:post_ids))
  GROUP BY pr.user_id

🌟 returning

PG's UPDATE, DELETE, and INSERT statements can all carry a returning keyword to return the result after modification/insertion. For UPDATE and DELETE statements, MySQL is relatively easy to change. It only needs to be divided into two steps, first find out the primary key, and then update or delete:

update users set updated_at = now() where id = 801 returning id,updated_at ;
 id  |        updated_at
-----+---------------------------
 801 | 2019-12-30 15:43:35.81969

MySQL version :

update users set updated_at = now() where id = 801;
select id, updated_at from users where id = 801;
+-----+---------------------+
| id  | updated_at          |
+-----+---------------------+
| 801 | 2019-12-30 15:45:46 |
+-----+---------------------+

In the case of a single INSERT, the last\_insert\_id() function needs to be used:

PG version :

insert into category_users(user_id, category_id, notification_level) values(100,100,1) returning id, user_id, category_id;
 id | user_id | category_id
----+---------+-------------
 59 |     100 |         100

Change to MySQL version:

insert into category_users(user_id, category_id, notification_level) values(100,100,1);
select id, category_id, user_id from category_users where id = last_insert_id();
+----+-------------+---------+
| id | category_id | user_id |
+----+-------------+---------+
| 48 |         100 |     100 |
+----+-------------+---------+

For batch inserts, you need to change to a single INSERT, and then use the last\_insert\_id() function, because MySQL does not provide the last\_insert\_id() function:

ub_ids = records.map do |ub|
  DB.exec(
    "INSERT IGNORE INTO user_badges(badge_id, user_id, granted_at, granted_by_id, post_id) 
     VALUES (:badge_id, :user_id, :granted_at, :granted_by_id, :post_id)",
     badge_id: badge.id,
     user_id: ub.user_id,
     granted_at: ub.granted_at,
     granted_by_id: -1,
     post_id: ub.post_id
  )
  DB.raw_connection.last_id
end

DB.query("SELECT id, user_id, granted_at FROM user_badges WHERE id IN (:ub_ids)", ub_ids: ub_ids)

🌟 insert into on conflict do nothing

PG 9.5 started to support upsert, MySQL also has the same function, but the writing is inconsistent:

# postgresql
DB.exec(<<~SQL, args)
  INSERT INTO post_timings (topic_id, user_id, post_number, msecs)
  SELECT :topic_id, :user_id, :post_number, :msecs
  ON CONFLICT DO NOTHING
SQL
# MySQL
DB.exec(<<~SQL, args)
  INSERT IGNORE INTO post_timings (topic_id, user_id, post_number, msecs)
  SELECT :topic_id, :user_id, :post_number, :msecs
SQL

🌟 select without from

This syntax is allowed in PG: select 1 where 1=2;

But this is illegal in MySQL, because there is no FROM clause, the solution is very tricky, manually create a table with only one piece of data, specifically to be compatible with this grammar.

execute("create table one_row_table (id int)")
execute("insert into one_row_table values (1)")

MySQL uses:

# MySQL
select 1 from one_row_table where 1=2;

🌟 full outer join

MySQL does not support full outer join, you need to use LEFT JOIN + RIGHT JOIN + UNION to simulate:

# MySQL
SELECT * FROM t1
LEFT JOIN t2 ON t1.id = t2.id
UNION
SELECT * FROM t1
RIGHT JOIN t2 ON t1.id = t2.id

🌟 recursive cte

Before MySQL 8.0, CTE/Recursive CTE is not supported. CTE with simple structure can be directly changed into a subquery. In addition to poor readability, there is no functional impact. Recursive CTE can be simulated with User-defined variables. There is a nested reply query in Discourse:

WITH RECURSIVE breadcrumb(id, level) AS (
    SELECT 8543, 0
    UNION
    SELECT reply_id, level + 1
    FROM post_replies AS r
      JOIN breadcrumb AS b ON (r.post_id = b.id)
    WHERE r.post_id <> r.reply_id
          AND b.level < 1000
  ), breadcrumb_with_count AS (
      SELECT
        id,
        level,
        COUNT(*) AS count
      FROM post_replies AS r
        JOIN breadcrumb AS b ON (r.reply_id = b.id)
      WHERE r.reply_id <> r.post_id
      GROUP BY id, level
  )
  SELECT id, level
  FROM breadcrumb_with_count
  ORDER BY id

Use MySQL 5.7 to be compatible:

# MySQL
SELECT id, level FROM (
  SELECT id, level, count(*) as count FROM (
    SELECT reply_id AS id, length(@pv) - length((replace(@pv, ',', '')))  AS level
      FROM (
             SELECT * FROM post_replies ORDER BY post_id, reply_id) pr,
             (SELECT @pv := 8543) init
     WHERE find_in_set(post_id, @pv)
           AND length(@pv := concat(@pv, ',', reply_id))
  ) tmp GROUP BY id, level
) tmp1
WHERE (count = 1)
ORDER BY id

The cte of PG can be nested, such as this query in Discourse, note that WITH period_actions is nested in flag_count:

 WITH mods AS (
  SELECT
  id AS user_id,
  username_lower AS username,
  uploaded_avatar_id
  FROM users u
  WHERE u.moderator = 'true'
  AND u.id > 0
  ),
  time_read AS (
  SELECT SUM(uv.time_read) AS time_read,
  uv.user_id
  FROM mods m
  JOIN user_visits uv
  ON m.user_id = uv.user_id
  WHERE uv.visited_at >= '#{report.start_date}'
  AND uv.visited_at <= '#{report.end_date}'
  GROUP BY uv.user_id
  ),
  flag_count AS (
      WITH period_actions AS (
      SELECT agreed_by_id,
      disagreed_by_id
      FROM post_actions
      WHERE post_action_type_id IN (#{PostActionType.flag_types_without_custom.values.join(',')})
      AND created_at >= '#{report.start_date}'
      AND created_at <= '#{report.end_date}'
      ),
      agreed_flags AS (
      SELECT pa.agreed_by_id AS user_id,
      COUNT(*) AS flag_count
      FROM mods m
      JOIN period_actions pa
      ON pa.agreed_by_id = m.user_id
      GROUP BY agreed_by_id
      ),
      disagreed_flags AS (
      SELECT pa.disagreed_by_id AS user_id,
      COUNT(*) AS flag_count
      FROM mods m
      JOIN period_actions pa
      ON pa.disagreed_by_id = m.user_id
      GROUP BY disagreed_by_id
   )

This kind of subquery simulation is very complicated. can be compatible with temporary tables. The query part does not require any modification to . You only need to replace the WITH part with a temporary table in the order of dependency:

DB.exec(<<~SQL)
    CREATE TEMPORARY TABLE IF NOT EXISTS mods AS (
      SELECT
      id AS user_id,
      username_lower AS username,
      uploaded_avatar_id
      FROM users u
      WHERE u.moderator = true
      AND u.id > 0
    )
  SQL

  DB.exec(<<~SQL)
    CREATE TEMPORARY TABLE IF NOT EXISTS time_read AS (
      SELECT SUM(uv.time_read) AS time_read,
      uv.user_id
      FROM mods m
      JOIN user_visits uv
      ON m.user_id = uv.user_id
      WHERE uv.visited_at >= '#{report.start_date.to_s(:db)}'
      AND uv.visited_at <= '#{report.end_date.to_s(:db)}'
      GROUP BY uv.user_id
    )
  SQL

🌟 delete & update

The update/delete statement of PG and MySQL are written differently. ORM will be automatically processed, but a large amount of code in Discourse is SQL handwritten using mini_sql, which needs to be replaced one by one.

PG's update statement writing:

# postgresql
UPDATE employees
SET department_name = departments.name
FROM departments
WHERE employees.department_id = departments.id

The MySQL update statement is written:

# MySQL
UPDATE employees
LEFT JOIN departments ON employees.department_id = departments.id
SET department_name = departments.name

The delete statement is similar.

🌟 You can't specify target table xx for update in FROM clause

After migrating from PG to MySQL, many statements will report such an error: You can't specify target table'users' for update in FROM clause.

# MySQL
update users set updated_at = now() where id in (
  select id from users where id < 10
);
# You can't specify target table 'users' for update in FROM clause

The solution is to use derived table in the subquery:

# MySQL
update users set updated_at = now() where id in (
  select id from (select * from users) u where id < 10
);

🌟 MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

Take the above query as an example, if the subquery has LIMIT:

# MySQL
update users set updated_at = now() where id in (
  select id from (select * from users) u where id < 10 limit 10
);
# MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

The simplest solution is to derive it again:

# MySQL
update users set updated_at = now() where id in (
  select id from (
    select id from (select * from users) u where id < 10 limit 10
  ) u1
);

🌟 window function

Before MySQL 8.0, there is no window function, you can use User-Defined Variables instead:

# postgresql
WITH ranked_requests AS (
  SELECT row_number() OVER (ORDER BY count DESC) as row_number, id
    FROM web_crawler_requests
   WHERE date = '#{1.day.ago.strftime("%Y-%m-%d")}'
)
DELETE FROM web_crawler_requests
WHERE id IN (
  SELECT ranked_requests.id
    FROM ranked_requests
   WHERE row_number > 10
)
# MySQL
DELETE FROM web_crawler_requests
WHERE id IN (
  SELECT ranked_requests.id
    FROM (
          SELECT @r := @r + 1 as row_number, id
            FROM web_crawler_requests, (SELECT @r := 0) t
           WHERE date = '#{1.day.ago.strftime("%Y-%m-%d")}'
        ORDER BY count DESC
    ) ranked_requests
   WHERE row_number > 10
)

🌟 swap columns

When MySQL and PG process the update statement, the quoting behavior of the column is inconsistent. PG refers to the original value, while MySQL refers to the updated value. For example:

# postgresql
create table tmp (id integer primary key, c1 varchar(10), c2 varchar(10));

insert into tmp values (1,2,3);
insert into tmp values (2,4,5);

select * from tmp;
 id | c1 | c2
----+----+----
  1 | 3  | 2
  2 | 5  | 4

update tmp set c1=c2,c2=c1;

select * from tmp;
 id | c1 | c2
----+----+----
  1 | 3  | 2
  2 | 5  | 4
# MySQL

create table tmp (id integer primary key, c1 varchar(10), c2 varchar(10));

insert into tmp values (1,2,3);
insert into tmp values (2,4,5);

select * from tmp;
+----+------+------+
| id | c1   | c2   |
+----+------+------+
|  1 | 2    | 3    |
|  2 | 4    | 5    |
+----+------+------+

update tmp set c1=c2,c2=c1;

select * from tmp;
+----+------+------+
| id | c1   | c2   |
+----+------+------+
|  1 | 3    | 3    |
|  2 | 5    | 5    |
+----+------+------+

🌟 function

There will be some inconsistencies in the names and behaviors of some built-in functions of PG and MySQL:

regexp_replace -> replace
pg_sleep -> sleep
ilike -> lower + like
~* -\> regexp
|| -\> concat
set local statementtimeout -> set session statementtimeout
offset a limit b -> limit a offset b
@ -\> ABS
interval -> date_add or datediff
extract epoch from -> unix_timestimp
unnest -> union all
json syntax: json->>'username' to json ->>'$.username'
position in -> locate
generate_series -> union
greatest & least -> greatest/least + coalesce

🌟 type & casting

MySQL uses the cast function, and PG also supports the same syntax, but there are four more commonly used points::, such as SELECT 1::varchar, MySQL's conversion types can only be the following 5 types: CHAR[(N)], DATE, DATETIME, DECIMAL, SIGNED, TIME.

select cast('1' as signed);

In Rails, the string type is mapped to varchar, and MySQL is mapped to varchar(255). The varchar of PG can actually store more than 255. In Discourse, some data using string type will exceed 255 and will be truncated after being converted to MySQL. The solution is to use text type for this part of the column.

🌟 keywords

The keywords lists of MySQL and PG are not exactly the same. For example, read is a keyword in MySQL but not in PG. The SQL produced by ORM has been processed, and some handwritten SQL needs to be quoted by yourself. PG uses "" and MySQL uses".

🌟 expression index

PG supports expression index:

CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));

Some functions in Discourse will add unique constraints to the expression index. MySQL has no direct correspondence, but it can be simulated by using Stored Generated Column. First redundant a Stored Generated Column, and then add unique constraints on it to achieve the same effect.

Rails also supports:

t.virtual "virtual_parent_category_id", type: :string, as: "COALESCE(parent_category_id, '-1')", stored: true 
t.index "virtual_parent_category_id, name", name: "unique_index_categories_on_name", unique: true

🌟 array && json

PG supports array and json types. MySQL 5.7 has JSON. In Discourse, the usage scenarios of ARRAY and JSON are relatively simple. They are used for storage. There is no advanced retrieval requirement. Directly using JSON can replace PG's array and json. But MySQL's JSON and text do not support the default value, and can only be set at the application layer. You can use: https://github.com/FooBarWidget/default_value_for

adapts to TiDB

TiDB supports the MySQL transfer protocol and most of its syntax, but some features cannot be implemented well in a distributed environment, so there are still some differences in the performance of some features from MySQL. For details, see document https:// pingcap.com/docs-cn/stable/reference/mysql-compatibility/, next we will mainly look at some minor issues involved in this migration.

🌟 TiDB reserved keywords

TiDB supports Window Function in the new version (this migration uses v3.0.7), and introduces group , rank , row_number and other functions, but the special thing is that the above function names will be treated as keywords by TiDB, so we are opening the window When the function is used, it is necessary to modify the SQL named similar to the name of the window function, and enclose the relevant keywords in backquotes.

TiDB reserved keywords: https://pingcap.com/docs-cn/stable/reference/sql/language-structure/keywords-and-reserved-words/

TiDB window function: https://pingcap.com/docs-cn/stable/reference/sql/functions-and-operators/window-functions/

🌟 Insert into select syntax is not compatible

TiDB does not support this syntax for the time being, you can use insert into select from dual to bypass:

invalid: insert into t1 (i) select 1;
valid: insert into t1 (i) select 1 from dual;

🌟 nested transactions & savepoint

TiDB does not support nested transactions, nor does it support savepoint. But Rails ActiveRecord uses savepoint to simulate nested transactions when the database is MySQL or PostgreSQL, and uses the requires_new option to control, document: https://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html.

Therefore, after the database is migrated to TiDB, we need to adjust the business code. The original nested transactions is adjusted to a single-tier transaction, and the exception is rolled back. At the same time, the requires_new option is canceled in the discourse.

`TiDB strong compatibility`

TiDB is 100% compatible with MySQL 5.7 protocol. In addition, it also supports MySQL 5.7 commonly used functions and syntax. The system tools (PHPMyAdmin, Navicat, MySQL Workbench, mysqldump, Mydumper/Myloader) and client in the MySQL 5.7 ecosystem are all applicable to TiDB. At the same time, after TiDB 5.0, many new features will be released one after another, such as expression indexes, CTEs, temporary tables, etc. The compatibility of the new version of TiDB is getting better and better, and the migration from MySQL or PostgreSQL to TiDB will become more and more. The easier it is.

`summary`

The project has been 100% completed, and the AskTUG website ( https://asktug.com) has been running smoothly on TiDB (current version: tidb-v5.0.x) for more than a year. Yes, without changing the experience, no one has noticed that the database has been quietly changed ~ proving the feasibility of migrating the business running on PG to TiDB.

The address of the project is: https://github.com/tidb-incubator/discourse/tree/my-2.3.3, you can participate in the improvement and follow the progress of the project Friends from the Ruby community, Ruby On Rails community, and Discourse community come to feel the goodwill from the TiDB community.