How to Batch Import Huge CSV Fast using PHP (Million Records in Seconds)

by Vincy. Last modified on October 29th, 2022.

I know, I know! How can you put PHP and “fast” in the same sentence? If that is your reaction, you are stuck in the past. You should try PHP 7 and experience it yourself. This article will attract a lot of criticism; kindly read through before judging.

LOAD DATA INFILE is the best option to import a huge CSV file into MySQL. But there are cases where you cannot use it, so in this article we will read a CSV file and INSERT its records via a PHP script. This is meant for such special cases.
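For reference, LOAD DATA can be fired from PHP itself when your environment allows it. The sketch below assumes the same connection details and the tbl_lead(id, name, email) table used later in this article, and that local_infile is enabled on the MySQL server; it is only a sketch, not part of the main script.

<?php
// Rough sketch: LOAD DATA LOCAL INFILE through mysqli.
// Assumes local_infile is allowed on the MySQL server and that the
// CSV columns line up with tbl_lead(id, name, email).
$link = mysqli_init();
mysqli_options($link, MYSQLI_OPT_LOCAL_INFILE, true);
mysqli_real_connect($link, "localhost", "root", "pass", "huge-csv");

$sql = "LOAD DATA LOCAL INFILE 'sample-dataset.csv'
    INTO TABLE tbl_lead
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (id, name, email)";
mysqli_query($link, $sql) or die(mysqli_error($link));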

In general, PHP applications will not have such use cases. But you never know when you will run into one; a client requirement can come at any time. When you are a “freelance programmer” (not playing a consultant role), all sorts of crazy requirements will be thrust upon you.


I have written an article about all things CSV in PHP. You should go through that to learn about handling CSV files in PHP.

I am well aware of micro-benchmarking and its pitfalls, so I didn’t go in that direction for this article. That’s why the title says just “seconds”; it’s an approximation. This is not for people who want to measure in micro or nanoseconds.

If you want to insert a million records in a few seconds using a PHP script, this will definitely help. The use case is rare, and so is the solution. You may have to tweak this sample code here and there to fit your case.

<?php

function file_get_contents_chunked($link, $file, $chunk_size, $queryValuePrefix, $callback)
{
    try {
        $handle = fopen($file, "r");
        if ($handle === false) {
            // fopen() does not throw an exception, so bail out explicitly
            trigger_error("file_get_contents_chunked::could not open " . $file, E_USER_NOTICE);
            return false;
        }
        $i = 0;
        while (! feof($handle)) {
            call_user_func_array($callback, array(
                fread($handle, $chunk_size),
                &$handle,
                $i,
                &$queryValuePrefix,
                $link
            ));
            $i ++;
        }
        fclose($handle);
    } catch (Exception $e) {
        trigger_error("file_get_contents_chunked::" . $e->getMessage(), E_USER_NOTICE);
        return false;
    }

    return true;
}
$link = mysqli_connect("localhost", "root", "pass", "huge-csv");
$success = file_get_contents_chunked($link, "sample-dataset.csv", 2048, '', function ($chunk, &$handle, $iteration, &$queryValuePrefix, $link) {
    $TABLENAME = 'tbl_lead';
    // prepend the truncated row carried over from the previous chunk
    $chunk = $queryValuePrefix . $chunk;

    // split the chunk of string by newline. Not using PHP_EOL
    // as it may not match the line endings of content from external sources
    $lineArray = preg_split("/\r\n|\n|\r/", $chunk);

    // the last piece of a chunk is usually a truncated row; store it so it
    // becomes the prefix of the next run. At the end of the file there is
    // nothing left to read, so the last piece is a complete row and is kept.
    $queryValuePrefix = feof($handle) ? '' : array_pop($lineArray);

    $values = array();
    foreach ($lineArray as $line) {
        // skip blank lines, e.g. when a chunk boundary falls on a line ending
        if ($line === '') {
            continue;
        }
        // split a single CSV row into columns
        $colArray = explode(',', $line);
        $values[] = '(' . $colArray[0] . ',"' . $colArray[1] . '","' . $colArray[2] . '")';
    }
    if (empty($values)) {
        return;
    }

    // multi-record insert: one query per chunk instead of one query per row
    $query = 'INSERT INTO ' . $TABLENAME . ' (id, name, email) VALUES ' . implode(',', $values);
    mysqli_query($link, $query) or die(mysqli_error($link));

    /*
     * {$handle} is passed in case you want to seek to different parts of the file.
     * {$iteration} is the chunk number that has just been read, so
     * ($iteration * chunk size) is roughly your current offset within the file.
     */
});

if (! $success) {
    // It Failed
}

Two main things to note are:

  1. Read the file in chunks (batch).
  2. Insert multiple records in a single insert statement.

The above two are the key to speeding up the overall process. Reading line by line and inserting one row per loop iteration will slow the process down. So everything boils down to reading in chunks (batches) and multi-record inserts. A third point worth mentioning is to use PHP native functions wherever possible.
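To make the second point concrete, here is a tiny illustration (placeholder rows and values, not part of the main script) of building one multi-record INSERT instead of firing a query per row:

<?php
// Illustration only: one multi-record INSERT instead of one query per row.
$rows = array(
    array(1, 'Alice', 'alice@example.com'),
    array(2, 'Bob', 'bob@example.com')
);
$values = array();
foreach ($rows as $row) {
    $values[] = '(' . $row[0] . ',"' . $row[1] . '","' . $row[2] . '")';
}
// a single round trip to MySQL now inserts all the rows at once
$query = 'INSERT INTO tbl_lead (id, name, email) VALUES ' . implode(',', $values);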

I have used a regex to split each chunk on newlines. If you have a better option, please suggest it via the comments section below.

Now let me walk through the code.

PHP’s fread allows reading a file in chunks of a given size. Try experimenting with different chunk (batch) sizes; there is no single right size. It depends on many variables: your server configuration, hardware, MySQL setup and a lot more.
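If you want to compare chunk sizes on your own setup, a rough timing harness is enough. The sketch below assumes $link exists and that the callback from the main script has been assigned to a $callback variable; it also truncates tbl_lead between runs so the inserts do not pile up.

<?php
// Rough timing sketch for trying out different chunk sizes.
// Assumes $link and $callback (the callback from the main script) exist.
foreach (array(1024, 2048, 4096, 8192) as $chunkSize) {
    mysqli_query($link, 'TRUNCATE TABLE tbl_lead'); // start each run from an empty table
    $start = microtime(true);
    file_get_contents_chunked($link, "sample-dataset.csv", $chunkSize, '', $callback);
    printf("chunk size %d bytes: %.2f seconds\n", $chunkSize, microtime(true) - $start);
}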

I have used a sample CSV file which I generated myself. I will detail the process below. You can use your own sample or real data and pass that as a parameter.

file_get_contents_chunked does the processing of the CSV file and takes a callback function as its last argument. The callback parses each record by the delimiter (a comma in my example), builds the multi-insert query and performs the actual insert.

You may have to modify the query part to suit your database table format. The overall structure of the script takes care of reading the CSV in chunks (batches) and invoking the callback. Thanks to RobertPitt.
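While adapting it, also keep in mind that the sample builds the query by plain string concatenation. If your real data can contain quotes or other special characters inside a value, escaping each column is safer. Here is a sketch; the column names are just the ones from my sample table.

<?php
// Sketch: escape each column before concatenating it into the multi-insert query.
// $colArray, $values and $link are the variables used inside the callback above.
$id    = (int) $colArray[0];
$name  = mysqli_real_escape_string($link, $colArray[1]);
$email = mysqli_real_escape_string($link, $colArray[2]);
$values[] = '(' . $id . ',"' . $name . '","' . $email . '")';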

One thing worth mentioning: instead of reading line by line, I have processed the chunk as it is, to improve speed. All over the Internet, users suggest fgetcsv, but I have gone ahead with the chunked (batch) read via fread.
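For comparison, a line-by-line version built on fgetcsv would look roughly like the sketch below. It handles quoted fields for free, but you pay for a loop iteration (and usually a query) per row unless you buffer the rows yourself.

<?php
// Line-by-line alternative using fgetcsv: simpler, but slower on huge files.
$handle = fopen('sample-dataset.csv', 'r');
while (($colArray = fgetcsv($handle)) !== false) {
    // $colArray is one parsed CSV row; insert it here, or buffer a few
    // thousand rows and flush them with a single multi-record INSERT.
}
fclose($handle);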

When fread reads a chunk (batch), the chunk will obviously end with a partially truncated CSV row. I store that truncated last record of each chunk in the $queryValuePrefix variable and prepend it to the next chunk. It is retained between calls because it is passed to the callback by reference.

Sample CSV dataset

I wrote a tiny PHP script to generate the required CSV dataset. There are also a lot of public sources, like government census records, the now-popular Covid data, weather data and a lot more.

In this era of Artificial Intelligence, where Data Mining is more popular than Twinkle Twinkle Little Star, getting a huge sample CSV file is just a click away.

But still, for flexibility, I wrote the PHP script myself.

<?php
// writes one million rows of: id, plus two random-looking string columns
$file = fopen('sample-dataset.csv', 'w');

for ($i = 1; $i <= 1000000; $i ++) {
    // echo $i; // uncomment for a (very chatty) progress indicator
    $line = array($i, uniqid(), uniqid());
    fputcsv($file, $line);
}
fclose($file);

This is a ruthless way of generating sample data :-) At this juncture, Faker is worth mentioning. It is a good PHP library for generating nice fake data that looks very close to the real thing.
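If you want the generated rows to look realistic instead of uniqid() noise, a minimal Faker-based version of the generator could be something like the sketch below (assuming the fakerphp/faker package is installed via Composer; it will be noticeably slower than uniqid()).

<?php
// Sketch: the same generator, but with Faker for realistic-looking data.
// Requires: composer require fakerphp/faker
require 'vendor/autoload.php';

$faker = Faker\Factory::create();
$file = fopen('sample-dataset.csv', 'w');
for ($i = 1; $i <= 1000000; $i ++) {
    fputcsv($file, array($i, $faker->name, $faker->email));
}
fclose($file);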

Lastly, for the nerds: this script ran and imported one million (1,000,000) records in 9 seconds. My machine is a MacBook Pro with Catalina, a 2.3 GHz i5, 16 GB RAM and PHP 7.4. This is just to give an idea; please do not make much out of it.

Pour in all your viewpoints and suggestions for improvement in the comments below. Happy parsing of all your CSVs, chunk it out!

Vincy
Written by Vincy, a web developer with 15+ years of experience and a Masters degree in Computer Science. She specializes in building modern, lightweight websites using PHP, JavaScript, React, and related technologies. Phppot helps you in mastering web development through over a decade of publishing quality tutorials.

Comments to “How to Batch Import Huge CSV Fast using PHP (Million Records in Seconds)”

  • Nikunj Jadav says:

    Thank you for sharing.
    We are facing the same issue importing a large CSV file.

  • Ángel says:

    Hi, only a tip: in for loop you have 100.000, not a million…

  • richdotward says:

    Interesting code and it works great with the sample code, thanks. I modded it slightly to add a file upload (from your previous sample code for small CSV uploads).

    I can’t get it to read a more complex set though. Mine is 14 rows across – it seems to error on the second chunk read. When importing a few hundred records it’s great – but on the full 70k set it errors.

    I made a simpler version to test with 3 rows but still get the same error when reading the 70k file.

    You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘””,””),(734,”22361″,”0″),(735,”22368″,”0″),(736,”22382″,”0″),(737,”22390″,”0″),(‘ at line 1

    Any ideas please?

    Thanks

    Rich

  • Riyaz says:

    Is there any solution to check whether the data already exists in the MySQL database with this source code? We have to import a large CSV file while checking for duplicate data in the database. Can you give any idea?

    • Vincy says:

      Hi Riyaz,

      If I understand your requirement, you need to check for an already existing record and then import. This script will support that, but you need to make changes to it. Maybe I will post an update to this article as time permits.

    • Darshana Sangwan says:

      You can try to use “ON DUPLICATE KEY UPDATE”. It will check if the record already exists. But you need to create a unique index (maybe a single column or a combination of columns) to trigger the ON DUPLICATE clause.

      INSERT INTO table_name (column1, column2, column3)
      VALUES (value1, value2, value3)
      ON DUPLICATE KEY UPDATE
      column1 = VALUES(column1);

      The above query will do three things:
      1) Insert new rows
      2) Check existing rows
      3) Update the row if it already exists

      I hope it will help.

  • Brajesh says:

    It works great with your example. But when I try to do it for my table (my first column is varchar), it goes blank. Is there anywhere in the script where int or varchar is denoted that I am missing?

    • Vincy says:

      Thanks Brajesh.

      There is no specific configuration like that. If you can give me more information or post your sheet, I can help you out.

  • vidya says:

    Hi,

    My data format is like this

    35457,Hatinder Kaur-9811130738,Hatinder,Kaur,,9811130738,Active,”2618, B S Lan 9, Chuna Mandi, Paharganj
    New Delhi, 110055″,,New Delhi,Delhi,India,,”2618, B S Lan 9, Chuna Mandi, Paharganj
    New Delhi, 110055″,”Paharganj
    New Delhi, 110055″,New Delhi,Delhi,India

    My single column value is like “2618, B S Lan 9, Chuna Mandi,
    Paharganj
    New Delhi, 110055”. As this has newline characters inside the value, it is treated as the next row when splitting. Is there any way to sort this out? It’s urgent work; please help me ASAP if anyone can.

  • Darshana Sangwan says:

    It works great, and with speed. I just have an issue with the last row of each batch: sometimes it doesn’t read the complete row (just half a row), and it does not skip the header row.

    Rest all is great and Thank you very much for this great post.

  • Yahya says:

    Thank you for sharing. In your code where is step 7 – Clear all memory references?

  • Brad Duncan says:

    This works great, thanks! However, if my CSV file has field names in the first row, how do I NOT include that in the very first query?

    • Vincy says:

      Hi Brad,

      Thanks. You should declare a loop iterator variable before the start of the loop. Then, for the first iteration, skip processing the record. This will help you exclude the first row if it has field names.
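      Something along these lines, an untested sketch that uses the $iteration argument already passed into the callback (place it right after the preg_split call):

      // skip the header row of the CSV in the very first chunk only
      if ($iteration === 0) {
          array_shift($lineArray); // drops the first (header) line
      }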
