Oracle external table with preprocessor

enter image description here

A short while ago I was working on a PL/SQL task where I needed to loop through some particular files in a folder. As always, I aimed for a solution that is simple and requires less effort to get the result. Let’s see what I came up with.

Task description

The task was to find a good solution to list all the required files in a directory. The directory is on the same Linux server where the database server is running. In this specific case I had to list all the csv files.

For testing purposes I created a folder with the following files:

excel_file.xlsx
file_1.csv
file_2.csv
file_3.csv
file_4.csv

First ideas

My first idea was to write a shell script that lists the files into a simple text file (file_list.txt) and then to process it with PL/SQL.

The shell script would have looked like the following:

#!/bin/bash
cd /home/tcs_test/test/files
ls -p *.csv > file_list.txt

Giving the result:

file_1.csv
file_2.csv
file_3.csv
file_4.csv

Then I would have simply looped through the lines of file_list.txt (for now only in script format, not as a package):

declare
    f_file_list     utl_file.file_type;
    v_file_name     varchar2(256);
begin
    --
    /*
    TEST_FILES      - SQL directory pointing to /home/tcs_test/test/files folder
    file_list.txt   - name of the file containing the file list
    r               - opening the file with READ privilege
    */
    f_file_list := utl_file.fopen('TEST_FILES', 'file_list.txt', 'r');
    --
    loop
        begin
            utl_file.get_line(f_file_list, v_file_name);
        exception
            when no_data_found then
                exit;
        end;
        --
        ----Doing the necessary work with the file name.
        ----For now, we just write it on the output.
        dbms_output.put_line(v_file_name);
        --
    end loop;
    --
    utl_file.fclose(f_file_list);
    --
end;
/

Result (DBMS_OUTPUT) is correctly:

file_1.csv
file_2.csv
file_3.csv
file_4.csv

I also had a second idea: keeping the shell script, but using an external table to read file_list.txt, instead of looping through it “manually”.

My biggest concern with both solutions was that we have to make sure that the shell script is always executed before processing the file list. Obviously, the files on the server can change and we want to get the up-to-date result. Of course, we can execute a shell script from PL/SQL, but it just seemed too much hassle for this small task.

The final solution

After some research I discovered that there was a simple and effective way to do all of this, being available from Oracle 11g. We can create an external table with a preprocessor. We can define - among others - a shell script as a preprocessor that will generate the source data for our external table. And of course, the preprocessor always runs upon querying the table.

Let’s go through all the necessary steps to make this work.

First of all, we need to define two SQL directories: one for the preprocessor script and another for the directory containing the files. They could be in the same folder, but I would encourage everyone to keep the preprocessor scripts in a separate folder.

create directory PREPROC_DIR as '/home/tcs_test/test/preprocessors';
grant read, execute on directory PREPROC_DIR to PANEL;

create directory TEST_FILES as '/home/tcs_test/test/files';
grant read, write on directory TEST_FILES to PANEL;

If the directories are created with the admin user, then we need to grant the necessary privileges to the user (schema) that we will work with.
Please note that I only granted the minimum required privileges.

Now let us see how the preprocessor script looks like:

preroc_list_csv_files.sh

#!/bin/bash
cd /home/tcs_test/test/files
/bin/ls -p *.csv

As we can see it is just a simple ls command to list all the csv files. For a certain reason we need to cd into the /home/tcs_test/test/files folder. A few lines later I will talk about a dummy.txt file that is stored there.

The last remaining step is to actually create the external table:

CREATE TABLE panel.preproc_ext_table
(ls_line     VARCHAR2(256))
ORGANIZATION EXTERNAL ( TYPE oracle_loader
 DEFAULT DIRECTORY TEST_FILES 
 ACCESS PARAMETERS (
   RECORDS DELIMITED BY NEWLINE 
   PREPROCESSOR PREPROC_DIR: 'preproc_list_csv_files.sh'
   FIELDS (ls_line POSITION ( 1 : 256) )
 ) LOCATION ( 'dummy.txt' )
) REJECT LIMIT UNLIMITED;

The table has only one column referring to the file name. We need to specify the preprocessor directory (PREPROC_DIR) and the default directory (TEST_FILES).

It is important to highlight: thanks to Oracle, the dummy.txt must exist even if it is not used. In our case the shell script does not have to read anything from that file, however, for more complicated cases we can create a more dynamic preprocessor script by making it read this file. For example, the dummy.txt file could contain a list of directories where the files are to be listed from.
It is important that the Linux Oracle user has the permission to both read and write the dummy.txt file.

That is all there is to it, from now on we may just enjoy the automatic functioning of our work by running the following query:

select * from panel.preproc_ext_table;

The result is as expected:

file_1.csv
file_2.csv
file_3.csv
file_4.csv

Upon querying the external table the preprocessor script always runs, therefore we always get the up-to-date list of files.

It is often said that it is always the first thought that is right, well, I am happy that I didn’t go for it this time …

Comments

  1. Thanks so much for this. It worked like a charm! -Chris Nance OCP

    ReplyDelete

Post a Comment