sql - Efficiently find top-N values from multiple columns independently in Oracle
Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most frequent values for each column independently, with the most elegant SQL possible. For example, if I have

    FirstName LastName FavoriteAnimal FavoriteBook
    --------- -------- -------------- ------------
    Ferris    Freemont Possum         Ubik
    Nancy     Freemont Lemur          Housekeeping
    Nancy     Drew     Penguin        Ubik
    Bill      Ribbits  Lemur          Dhalgren

and I want the top-1, then the result would be:

    FirstName LastName FavoriteAnimal FavoriteBook
    --------- -------- -------------- ------------
    Nancy     Freemont Lemur          Ubik

I can probably think of ways to do this, but I'm not sure whether they are optimal, which matters when there are 30 billion rows; the SQL might also be big and ugly, and it might use too much temp space.

Using Oracle.
This should only do one pass over the table. You can use the analytic version of count() to get the frequency of each value independently:

    select firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln,
        favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
        favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table;

    FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
    ------ ---- -------- ---- ------- ---- ------------ ----
    Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
    Ferris    1 Freemont    2 Possum     1 Ubik            2
    Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
    Nancy     2 Drew        1 Penguin    1 Ubik            2

You can then use that as a CTE (or subquery factoring, I think, in Oracle terminology) and pull only the highest-frequency value from each column:

    with tmp_tab as (
        select /*+ MATERIALIZE */
            firstname, count(*) over (partition by firstname) as c_fn,
            lastname, count(*) over (partition by lastname) as c_ln,
            favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
            favoritebook, count(*) over (partition by favoritebook) as c_fb
        from my_table)
    select (select firstname from (
            select firstname,
                row_number() over (partition by null order by c_fn desc) as r_fn
            from tmp_tab
        ) where r_fn = 1) as firstname,
        (select lastname from (
            select lastname,
                row_number() over (partition by null order by c_ln desc) as r_ln
            from tmp_tab
        ) where r_ln = 1) as lastname,
        (select favoriteanimal from (
            select favoriteanimal,
                row_number() over (partition by null order by c_fa desc) as r_fa
            from tmp_tab
        ) where r_fa = 1) as favoriteanimal,
        (select favoritebook from (
            select favoritebook,
                row_number() over (partition by null order by c_fb desc) as r_fb
            from tmp_tab
        ) where r_fb = 1) as favoritebook
    from dual;

    FIRSTN LASTNAME FAVORIT FAVORITEBOOK
    ------ -------- ------- ------------
    Nancy  Freemont Lemur   Ubik

You're doing one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). You may also want to add to the order by clauses to control what happens if there are ties.

This is similar in concept to what Thilo, ysth and others have suggested, except you're letting Oracle keep track of all the counting.

Edit: Hmm, the explain plan shows it doing four full table scans; I may need to think about this a bit more...

Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it creates a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher, though, at least on this sample data set. I'd be interested in any comments on any downsides to doing this.
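The query above only returns the top-1 value per column, and it leaves ties to whatever order Oracle happens to produce. One possible way to extend it to top-N with an explicit tie-break is sketched below. This is an untested variant under stated assumptions, not part of the original answer: it assumes N = 2, reuses `my_table` and the column names from the question, assumes all four columns are character types, breaks ties alphabetically, and returns one row per (column, value) pair instead of a single wide row. The aliases `col_name`, `val`, `cnt` and `rnk` are made up for illustration.

```sql
-- Sketch only: top-N (here N = 2) per column, ties broken alphabetically.
-- tmp_tab still holds one row per source row, so the same value repeats;
-- dense_rank() gives those repeats the same rank and DISTINCT collapses them.
with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln,
        favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
        favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table)
select distinct 'firstname' as col_name, firstname as val, c_fn as cnt, rnk
from (
    select firstname, c_fn,
        dense_rank() over (order by c_fn desc, firstname) as rnk
    from tmp_tab
) where rnk <= 2
union all
select distinct 'lastname', lastname, c_ln, rnk
from (
    select lastname, c_ln,
        dense_rank() over (order by c_ln desc, lastname) as rnk
    from tmp_tab
) where rnk <= 2
union all
select distinct 'favoriteanimal', favoriteanimal, c_fa, rnk
from (
    select favoriteanimal, c_fa,
        dense_rank() over (order by c_fa desc, favoriteanimal) as rnk
    from tmp_tab
) where rnk <= 2
union all
select distinct 'favoritebook', favoritebook, c_fb, rnk
from (
    select favoritebook, c_fb,
        dense_rank() over (order by c_fb desc, favoritebook) as rnk
    from tmp_tab
) where rnk <= 2;
```

Changing `rnk <= 2` to another limit gives a different N. Each branch still ranks the full materialized result, so on a table of this size it would be worth checking the explain plan before relying on it.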