sql - Efficiently find top-N values from multiple columns independently in Oracle -

- April 15, 2015

Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most frequent values for each column independently, and with the most elegant SQL possible. For example, if I have:

  FirstName LastName FavoriteAnimal FavoriteBook
  --------- -------- -------------- ------------
  Ferris    Freemont Possum         Ubik
  Nancy     Freemont Lemur          Housekeeping
  Nancy     Drew     Penguin        Ubik
  Bill      Ribbits  Lemur          Dhalgren

and I want the top 1, then the result would be:

  FirstName LastName FavoriteAnimal FavoriteBook
  --------- -------- -------------- ------------
  Nancy     Freemont Lemur          Ubik

I can probably think of ways to do this, but I am not sure whether they are optimal, which matters when there are 30 billion rows; the SQL might also be big and ugly, and it might use too much temp space.

Using Oracle.
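To experiment with the example, the four sample rows can be loaded into a small table. This is only a sketch: the table name my_table matches the one used in the answer's queries, but the column names without spaces and the varchar2(30) types are assumptions, not something the question specifies.

```sql
-- Hypothetical test table for the sample data in the question.
-- Column types and lengths are assumptions.
create table my_table (
  firstname      varchar2(30),
  lastname       varchar2(30),
  favoriteanimal varchar2(30),
  favoritebook   varchar2(30)
);

insert into my_table values ('Ferris', 'Freemont', 'Possum',  'Ubik');
insert into my_table values ('Nancy',  'Freemont', 'Lemur',   'Housekeeping');
insert into my_table values ('Nancy',  'Drew',     'Penguin', 'Ubik');
insert into my_table values ('Bill',   'Ribbits',  'Lemur',   'Dhalgren');

commit;
```

Four rows say nothing about performance at 30 billion rows, so this is useful only for checking that a query returns the expected values.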
 
This should only do one pass over the table. You can use the analytic version of count() to get the frequency of each value independently:

  select firstname, count(*) over (partition by firstname) as c_fn,
    lastname, count(*) over (partition by lastname) as c_ln,
    favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
    favoritebook, count(*) over (partition by favoritebook) as c_fb
  from my_table;

  FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
  ------ ---- -------- ---- ------- ---- ------------ ----
  Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
  Ferris    1 Freemont    2 Possum     1 Ubik            2
  Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
  Nancy     2 Drew        1 Penguin    1 Ubik            2

You can then use that as a CTE (or subquery factoring, I think, in Oracle terminology) and pull only the highest-frequency value from each column:
  with tmp_tab as (
    select /*+ MATERIALIZE */
      firstname, count(*) over (partition by firstname) as c_fn,
      lastname, count(*) over (partition by lastname) as c_ln,
      favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
      favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table
  )
  select (select firstname from (
      select firstname,
        row_number() over (partition by null order by c_fn desc) as r_fn
      from tmp_tab
    ) where r_fn = 1) as firstname,
    (select lastname from (
      select lastname,
        row_number() over (partition by null order by c_ln desc) as r_ln
      from tmp_tab
    ) where r_ln = 1) as lastname,
    (select favoriteanimal from (
      select favoriteanimal,
        row_number() over (partition by null order by c_fa desc) as r_fa
      from tmp_tab
    ) where r_fa = 1) as favoriteanimal,
    (select favoritebook from (
      select favoritebook,
        row_number() over (partition by null order by c_fb desc) as r_fb
      from tmp_tab
    ) where r_fb = 1) as favoritebook
  from dual;

  FIRSTN LASTNAME FAVORIT FAVORITEBOOK
  ------ -------- ------- ------------
  Nancy  Freemont Lemur   Ubik

You are doing one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). You may also want to add to the order by clauses to control what happens when there are ties.

This is similar in concept to what Thilo, ysth and others have suggested, except that you are letting Oracle keep track of all the counting.
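The query above returns only the top 1, while the question asks for the top N. One possible way to extend it, offered as an untuned sketch rather than as part of the original answer, is to rank the distinct values of each column and line the ranks up side by side. The CTE names top_fn, top_ln, top_fa, top_fb and ranks are made up for this illustration, the literal 2 stands in for N, and breaking ties by the value itself is an arbitrary choice.

```sql
-- Sketch only: top-2 per column; change the literal 2 to the N you need.
with tmp_tab as (
  select /*+ MATERIALIZE */
         firstname,      count(*) over (partition by firstname)      as c_fn,
         lastname,       count(*) over (partition by lastname)       as c_ln,
         favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
         favoritebook,   count(*) over (partition by favoritebook)   as c_fb
  from my_table
),
-- One row per distinct value, ranked by frequency. The value itself is
-- the tie-breaker, so the ordering is deterministic.
top_fn as (
  select firstname,
         row_number() over (order by c_fn desc, firstname) as rn
  from (select distinct firstname, c_fn from tmp_tab)
),
top_ln as (
  select lastname,
         row_number() over (order by c_ln desc, lastname) as rn
  from (select distinct lastname, c_ln from tmp_tab)
),
top_fa as (
  select favoriteanimal,
         row_number() over (order by c_fa desc, favoriteanimal) as rn
  from (select distinct favoriteanimal, c_fa from tmp_tab)
),
top_fb as (
  select favoritebook,
         row_number() over (order by c_fb desc, favoritebook) as rn
  from (select distinct favoritebook, c_fb from tmp_tab)
),
-- Row generator for ranks 1..N.
ranks as (
  select level as rn from dual connect by level <= 2
)
select r.rn, f.firstname, l.lastname, a.favoriteanimal, b.favoritebook
from ranks r
left join top_fn f on f.rn = r.rn
left join top_ln l on l.rn = r.rn
left join top_fa a on a.rn = r.rn
left join top_fb b on b.rn = r.rn
order by r.rn;
```

The left joins leave a column null at a given rank when it has fewer than N distinct values. Whether the select distinct steps are acceptable at 30 billion rows is something only the actual plan and timings on real data can tell.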
Edit: Hmm, explain plan shows it doing four full table scans; I may need to think about this a bit more...

Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it creates a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher, though, at least on this sample data set. I would be interested in any comments on downsides to doing this.
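The number of full table scans can be checked with EXPLAIN PLAN and DBMS_XPLAN.DISPLAY. The following is a cut-down sketch using two columns only, assuming the same my_table as above; the exact plan output depends on the Oracle version and statistics, so none is shown here.

```sql
-- Sketch: generate the plan for a two-column version of the query.
explain plan for
with tmp_tab as (
  select /*+ MATERIALIZE */
         firstname, count(*) over (partition by firstname) as c_fn,
         lastname,  count(*) over (partition by lastname)  as c_ln
  from my_table
)
select (select firstname from (
          select firstname,
                 row_number() over (order by c_fn desc) as r_fn
          from tmp_tab)
        where r_fn = 1) as firstname,
       (select lastname from (
          select lastname,
                 row_number() over (order by c_ln desc) as r_ln
          from tmp_tab)
        where r_ln = 1) as lastname
from dual;

-- Display the plan that was just generated.
select * from table(dbms_xplan.display);
```

When the CTE is materialized, the plan would typically contain a TEMP TABLE TRANSFORMATION step and a single TABLE ACCESS FULL of MY_TABLE; removing the hint and rerunning both statements makes it possible to compare the two plans.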

 




  


















